2025-05-30

Title: Training Language Models to Generate Quality Code with Program Analysis Feedback

Authors: Feng Yao, Zilong Wang, Liyuan Liu, Junxia Cui, Li Zhong, Xiaohan Fu, Haohui Mai, Vish Krishnan, Jianfeng Gao, Jingbo Shang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22704
Pdf URL: https://arxiv.org/pdf/2505.22704
Copy Paste: [[2505.22704]] Training Language Models to Generate Quality Code with Program Analysis Feedback(https://arxiv.org/abs/2505.22704)
Keywords: language model, llm, prompt
Abstract: Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.
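
Note: the two automated signals described in the abstract can be pictured as one scalar reward. The sketch below is only an illustration under stated assumptions: the static check (type-annotation coverage via Python's `ast` module), the function names, and the mixing weight `alpha` are all hypothetical and are not taken from the REAL paper, whose reward design the abstract does not spell out.

```python
import ast
from typing import Callable, Dict, List

# A test receives the namespace produced by running the candidate code.
Test = Callable[[Dict[str, object]], bool]


def annotation_coverage(source: str) -> float:
    """Toy program-analysis signal: share of parameters and returns with type hints."""
    tree = ast.parse(source)
    total = annotated = 0
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            params = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
            for param in params:
                total += 1
                annotated += param.annotation is not None
            total += 1  # the return annotation counts as one more slot
            annotated += node.returns is not None
    return 1.0 if total == 0 else annotated / total


def unit_test_pass_rate(source: str, tests: List[Test]) -> float:
    """Functional signal: fraction of tests that pass on the executed candidate."""
    namespace: Dict[str, object] = {}
    try:
        # Assumption: a real pipeline would run untrusted code in a sandbox.
        exec(source, namespace)
    except Exception:
        return 0.0
    passed = 0
    for test in tests:
        try:
            passed += bool(test(namespace))
        except Exception:
            pass  # a crashing test counts as a failure
    return passed / len(tests) if tests else 0.0


def hybrid_reward(source: str, tests: List[Test], alpha: float = 0.5) -> float:
    """Hypothetical mix of the two signals; alpha weights functional correctness."""
    try:
        quality = annotation_coverage(source)
    except SyntaxError:
        return 0.0  # unparsable code earns no reward
    return alpha * unit_test_pass_rate(source, tests) + (1 - alpha) * quality
```

Because both signals are computed from the generated code alone, a reward of this shape needs no reference solution, which is the property the abstract calls reference-free.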

Title: Climate Finance Bench

Authors: Rafik Mankour, Yassine Chafai, Hamada Saleh, Ghassen Ben Hassine, Thibaud Barreau, Peter Tankov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22752
Pdf URL: https://arxiv.org/pdf/2505.22752
Copy Paste: [[2505.22752]] Climate Finance Bench(https://arxiv.org/abs/2505.22752)
Keywords: language model, retrieval-augmented generation
Abstract: Climate Finance Bench introduces an open benchmark that targets question-answering over corporate climate disclosures using Large Language Models. We curate 33 recent sustainability reports in English drawn from companies across all 11 GICS sectors and annotate 330 expert-validated question-answer pairs that span pure extraction, numerical reasoning, and logical reasoning. Building on this dataset, we propose a comparison of RAG (retrieval-augmented generation) approaches. We show that the retriever's ability to locate passages that actually contain the answer is the chief performance bottleneck. We further argue for transparent carbon reporting in AI-for-climate applications, highlighting advantages of techniques such as Weight Quantization.
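
Note: the claim that the retriever is the bottleneck can be checked by measuring how often the top-k retrieved passages actually contain the gold answer. The sketch below assumes a toy word-overlap retriever and exact substring matching; the function names and sample data are hypothetical and are not from the benchmark.

```python
from typing import List, Set, Tuple


def tokenize(text: str) -> Set[str]:
    """Lowercased whitespace tokens; deliberately crude."""
    return set(text.lower().split())


def retrieve(question: str, passages: List[str], k: int) -> List[str]:
    """Rank passages by word overlap with the question and keep the top k."""
    query = tokenize(question)
    ranked = sorted(passages, key=lambda p: len(query & tokenize(p)), reverse=True)
    return ranked[:k]


def hit_rate_at_k(examples: List[Tuple[str, str]], passages: List[str], k: int) -> float:
    """Fraction of (question, answer) pairs whose answer string appears in a top-k passage."""
    if not examples:
        return 0.0
    hits = 0
    for question, answer in examples:
        top = retrieve(question, passages, k)
        hits += any(answer.lower() in passage.lower() for passage in top)
    return hits / len(examples)
```

A low hit rate means the generator never sees the evidence, so no amount of prompt or model tuning downstream can recover the answer.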

Title: Pre-Training Curriculum for Multi-Token Prediction in Language Models

Authors: Ansar Aynetdinov, Alan Akbik
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22757
Pdf URL: https://arxiv.org/pdf/2505.22757
Copy Paste: [[2505.22757]] Pre-Training Curriculum for Multi-Token Prediction in Language Models(https://arxiv.org/abs/2505.22757)
Keywords: language model
Abstract: Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token (NTP), MTP predicts the next $k$ tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.
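
Note: one way to picture the forward and reverse curricula is as a schedule over the number of prediction heads that contribute to the loss. The sketch below assumes equal-length stages and an averaged loss over the active heads; both choices, and all names, are assumptions for illustration, since the abstract does not give the actual schedule.

```python
from typing import List


def active_heads(step: int, total_steps: int, k_max: int, reverse: bool = False) -> int:
    """Number of heads trained at `step`: 1 is plain NTP, k_max is full MTP."""
    if not 0 <= step < total_steps:
        raise ValueError("step must lie in [0, total_steps)")
    stage = step * k_max // total_steps  # equal-length stages 0 .. k_max - 1
    # Forward: 1 -> k_max over training. Reverse: k_max -> 1.
    return k_max - stage if reverse else stage + 1


def curriculum_loss(per_head_losses: List[float], k: int) -> float:
    """Average the losses of the first k heads; head 0 is the next-token head."""
    return sum(per_head_losses[:k]) / k
```

Under this picture the reverse curriculum ends training with only the next-token head active, which is consistent with the abstract's report that it gives up the self-speculative decoding benefit.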

Title: StressTest: Can YOUR Speech LM Handle the Stress?

Authors: Iddo Yosha, Gallil Maimon, Yossi Adi
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.22765
Pdf URL: https://arxiv.org/pdf/2505.22765
Copy Paste: [[2505.22765]] StressTest: Can YOUR Speech LM Handle the Stress?(https://arxiv.org/abs/2505.22765)
Keywords: language model
Abstract: Sentence stress refers to emphasis placed on specific words within a spoken utterance to highlight or contrast an idea, or to introduce new information. It is often used to imply an underlying intention that is not explicitly stated. Recent advances in speech-aware language models (SLMs) have enabled direct processing of audio, allowing models to bypass transcription, access the full richness of the speech signal, and perform audio reasoning tasks such as spoken question answering. Despite the crucial role of sentence stress in shaping meaning and speaker intent, it remains largely overlooked in the evaluation and development of such models. In this work, we address this gap by introducing StressTest, a benchmark specifically designed to evaluate a model's ability to distinguish between interpretations of spoken sentences based on the stress pattern. We assess the performance of several leading SLMs and find that, despite their overall capabilities, they perform poorly on such tasks. To overcome this limitation, we propose a novel synthetic data generation pipeline and create Stress17k, a training set that simulates the change of meaning implied by stress variation. Then, we empirically show that optimizing models with this synthetic dataset aligns well with real-world recordings and enables effective finetuning of SLMs. Results suggest that our finetuned model, StresSLM, significantly outperforms existing models on both sentence stress reasoning and detection tasks. Code, models, data, and audio samples are available at this http URL.

Title: Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems

Authors: Christopher Ormerod
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22771
Pdf URL: https://arxiv.org/pdf/2505.22771
Copy Paste: [[2505.22771]] Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems(https://arxiv.org/abs/2505.22771)
Keywords: language model, llm
Abstract: This study illustrates how incorporating feedback-oriented annotations into the scoring pipeline can enhance the accuracy of automated essay scoring (AES). This approach is demonstrated with the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus. We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate annotations -- a generative language model used for spell-correction and an encoder-based token classifier trained to identify and mark argumentative elements. By incorporating annotations into the scoring process, we demonstrate improvements in performance using encoder-based large language models fine-tuned as classifiers.

Title: MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators

Authors: John Mendonça, Alon Lavie, Isabel Trancoso
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22777
Pdf URL: https://arxiv.org/pdf/2505.22777
Copy Paste: [[2505.22777]] MEDAL: A Framework for Benchmarking LLMs as Multilingual Open-Domain Chatbots and Dialogue Evaluators(https://arxiv.org/abs/2505.22777)
Keywords: gpt, llm, chat, agent
Abstract: As the capabilities of chatbots and their underlying LLMs continue to dramatically improve, evaluating their performance has increasingly become a major blocker to their further development. A major challenge is the available benchmarking datasets, which are largely static, outdated, and lacking in multilingual coverage, limiting their ability to capture subtle linguistic and cultural variations. This paper introduces MEDAL, an automated multi-agent framework for generating, evaluating, and curating more representative and diverse open-domain dialogue evaluation benchmarks. Our approach leverages several state-of-the-art LLMs to generate user-chatbot multilingual dialogues, conditioned on varied seed contexts. A strong LLM (GPT-4.1) is then used for a multidimensional analysis of the performance of the chatbots, uncovering noticeable cross-lingual performance differences. Guided by this large-scale evaluation, we curate a new meta-evaluation multilingual benchmark and human-annotate samples with nuanced quality judgments. This benchmark is then used to assess the ability of several reasoning and non-reasoning LLMs to act as evaluators of open-domain dialogues. We find that current LLMs struggle to detect nuanced issues, particularly those involving empathy and reasoning.

Title: Can Large Language Models Match the Conclusions of Systematic Reviews?

Authors: Christopher Polzak, Alejandro Lozano, Min Woo Sun, James Burgess, Yuhui Zhang, Kevin Wu, Serena Yeung-Levy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22787
Pdf URL: https://arxiv.org/pdf/2505.22787
Copy Paste: [[2505.22787]] Can Large Language Models Match the Conclusions of Systematic Reviews?(https://arxiv.org/abs/2505.22787)
Keywords: language model, llm
Abstract: Systematic reviews (SR), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning, non-reasoning, and medical specialist models across varying sizes (from 7B to 700B). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and, contrary to human experts, all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians. We release our codebase and benchmark to the broader research community to further investigate LLM-based SR systems.

Title: First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay

Authors: Andrew Zhu, Evan Osgood, Chris Callison-Burch
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.22809
Pdf URL: https://arxiv.org/pdf/2505.22809
Copy Paste: [[2505.22809]] First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay(https://arxiv.org/abs/2505.22809)
Keywords: language model, llm, agent
Abstract: Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call "overhearing agents". These overhearing agents do not actively participate in conversation -- instead, they "listen in" on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at this https URL.

Title: Self-Critique and Refinement for Faithful Natural Language Explanations

Authors: Yingming Wang, Pepa Atanasova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22823
Pdf URL: https://arxiv.org/pdf/2505.22823
Copy Paste: [[2505.22823]] Self-Critique and Refinement for Faithful Natural Language Explanations(https://arxiv.org/abs/2505.22823)
Keywords: language model, llm
Abstract: With the rapid development of large language models (LLMs), natural language explanations (NLEs) have become increasingly important for understanding model predictions. However, these explanations often fail to faithfully represent the model's actual reasoning process. While existing work has demonstrated that LLMs can self-critique and refine their initial outputs for various tasks, this capability remains unexplored for improving explanation faithfulness. To address this gap, we introduce Self-critique and Refinement for Natural Language Explanations (SR-NLE), a framework that enables models to improve the faithfulness of their own explanations -- specifically, post-hoc NLEs -- through an iterative critique and refinement process without external supervision. Our framework leverages different feedback mechanisms to guide the refinement process, including natural language self-feedback and, notably, a novel feedback approach based on feature attribution that highlights important input words. Our experiments across three datasets and four state-of-the-art LLMs demonstrate that SR-NLE significantly reduces unfaithfulness rates, with our best method achieving an average unfaithfulness rate of 36.02%, compared to 54.81% for baseline -- an absolute reduction of 18.79 percentage points. These findings reveal that the investigated LLMs can indeed refine their explanations to better reflect their actual reasoning process, requiring only appropriate guidance through feedback without additional training or fine-tuning.
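
Note: the reported improvement is a difference between two rates, so it is measured in percentage points. The worked arithmetic below uses only the two figures given in the abstract; the relative reduction is derived here and is not a number the authors report.

```latex
\begin{align*}
\Delta_{\text{abs}} &= 54.81\% - 36.02\% = 18.79 \text{ percentage points} \\
\Delta_{\text{rel}} &= \frac{18.79}{54.81} \approx 34.3\%
\end{align*}
```

In other words, roughly one third of the baseline's unfaithful explanations are eliminated on average.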

Title: What Has Been Lost with Synthetic Evaluation?

Authors: Alexander Gill, Abhilasha Ravichander, Ana Marasović
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22830
Pdf URL: https://arxiv.org/pdf/2505.22830
Copy Paste: [[2505.22830]] What Has Been Lost with Synthetic Evaluation?(https://arxiv.org/abs/2505.22830)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used for data generation. However, creating evaluation benchmarks raises the bar for this emerging paradigm. Benchmarks must target specific phenomena, penalize exploiting shortcuts, and be challenging. Through two case studies, we investigate whether LLMs can meet these demands by generating reasoning-over-text benchmarks and comparing them to those created through careful crowdsourcing. Specifically, we evaluate both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA, which evaluates reasoning about negation, and DROP, which targets reasoning about quantities. We find that prompting LLMs can produce variants of these datasets that are often valid according to the annotation guidelines, at a fraction of the cost of the original crowdsourcing effort. However, we show that they are less challenging for LLMs than their human-authored counterparts. This finding sheds light on what may have been lost by generating evaluation data with LLMs, and calls for critically reassessing the immediate use of this increasingly prevalent approach to benchmark creation.

Title: Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation

Authors: Arthur S. Bianchessi, Rodrigo C. Barros, Lucas S. Kupssinskü
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22842
Pdf URL: https://arxiv.org/pdf/2505.22842
Copy Paste: [[2505.22842]] Bayesian Attention Mechanism: A Probabilistic Framework for Positional Encoding and Context Length Extrapolation(https://arxiv.org/abs/2505.22842)
Keywords: language model, long context
Abstract: Transformer-based language models rely on positional encoding (PE) to handle token order and support context length extrapolation. However, existing PE methods lack theoretical clarity and rely on limited evaluation metrics to substantiate their extrapolation claims. We propose the Bayesian Attention Mechanism (BAM), a theoretical framework that formulates positional encoding as a prior within a probabilistic model. BAM unifies existing methods (e.g., NoPE and ALiBi) and motivates a new Generalized Gaussian positional prior that substantially improves long-context generalization. Empirically, BAM enables accurate information retrieval at $500\times$ the training context length, outperforming previous state-of-the-art context length generalization in long context retrieval accuracy while maintaining comparable perplexity and introducing minimal additional parameters.
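
Note: treating positional encoding as a prior amounts to adding a log-prior over token distance to the attention logits before the softmax. The sketch below uses the textbook generalized Gaussian shape exp(-(|d - mu| / alpha)^beta); the parameter names and this exact parameterization are assumptions, since the abstract does not state how BAM defines them. With `beta = 1` the bias is linear in distance, the same form as ALiBi's penalty, and a constant log-prior adds no positional bias at all, as in NoPE.

```python
import math
from typing import List


def log_prior(distance: float, alpha: float, beta: float, mu: float = 0.0) -> float:
    """Unnormalized log of a generalized Gaussian over query-key distance."""
    return -((abs(distance - mu) / alpha) ** beta)


def attention_weights(scores: List[float], query_pos: int, alpha: float, beta: float) -> List[float]:
    """Softmax over content scores plus the positional log-prior.

    scores[j] is the content score between the query and the key at position j.
    """
    logits = [s + log_prior(query_pos - j, alpha, beta) for j, s in enumerate(scores)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Changing `beta` changes how quickly distant tokens are discounted, which is the kind of knob a prior-based view exposes for long-context behavior.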

Title: GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification

Authors: Iknoor Singh, Carolina Scarton, Kalina Bontcheva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22867
Pdf URL: https://arxiv.org/pdf/2505.22867
Copy Paste: [[2505.22867]] GateNLP at SemEval-2025 Task 10: Hierarchical Three-Step Prompting for Multilingual Narrative Classification(https://arxiv.org/abs/2505.22867)
Keywords: language model, llm, prompt
Abstract: The proliferation of online news and the increasing spread of misinformation necessitate robust methods for automatic data analysis. Narrative classification is emerging as an important task, since identifying what is being said online is critical for fact-checkers, policy makers and other professionals working on information studies. This paper presents our approach to SemEval 2025 Task 10 Subtask 2, which aims to classify news articles into a pre-defined two-level taxonomy of main narratives and sub-narratives across multiple languages. We propose Hierarchical Three-Step Prompting (H3Prompt) for multilingual narrative classification. Our methodology follows a three-step Large Language Model (LLM) prompting strategy, where the model first categorises an article into one of two domains (Ukraine-Russia War or Climate Change), then identifies the most relevant main narratives, and finally assigns sub-narratives. Our approach secured the top position on the English test set among 28 competing teams worldwide. The code is available at this https URL.
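
Note: the three prompting steps can be sketched as a small pipeline in which each step narrows the label set offered to the next. Everything below is illustrative: `llm` stands for any text-in, text-out callable, the prompt wording is invented, and the taxonomy labels are placeholders instead of the task's real labels.

```python
from typing import Callable, Dict, List, Optional, Tuple

Taxonomy = Dict[str, Dict[str, List[str]]]

# Placeholder labels only; the real SemEval taxonomy is not reproduced here.
TOY_TAXONOMY: Taxonomy = {
    "Ukraine-Russia War": {"war-main-A": ["war-sub-A1"]},
    "Climate Change": {
        "climate-main-A": ["climate-sub-A1", "climate-sub-A2"],
        "climate-main-B": ["climate-sub-B1"],
    },
}


def select(response: str, options: List[str]) -> List[str]:
    """Keep only the allowed labels that the model's reply mentions."""
    return [option for option in options if option.lower() in response.lower()]


def classify(
    article: str, llm: Callable[[str], str], taxonomy: Taxonomy
) -> Tuple[Optional[str], Dict[str, List[str]]]:
    # Step 1: choose the domain.
    domains = list(taxonomy)
    picked = select(llm(f"Step 1. Pick one domain from {domains}.\n\n{article}"), domains)
    if not picked:
        return None, {}
    domain = picked[0]

    # Step 2: choose main narratives within that domain.
    mains = list(taxonomy[domain])
    chosen = select(llm(f"Step 2. Pick main narratives from {mains}.\n\n{article}"), mains)

    # Step 3: choose sub-narratives under each selected main narrative.
    result: Dict[str, List[str]] = {}
    for main in chosen:
        subs = taxonomy[domain][main]
        reply = llm(f"Step 3. For '{main}', pick sub-narratives from {subs}.\n\n{article}")
        result[main] = select(reply, subs)
    return domain, result
```

Filtering each reply against the allowed labels keeps the output inside the two-level taxonomy even when the model answers in free text.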

Title: When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy

Authors: Jirui Qi, Shan Chen, Zidi Xiong, Raquel Fernández, Danielle S. Bitterman, Arianna Bisazza
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22888
Pdf URL: https://arxiv.org/pdf/2505.22888
Copy Paste: [[2505.22888]] When Models Reason in Your Language: Controlling Thinking Trace Language Comes at the Cost of Accuracy(https://arxiv.org/abs/2505.22888)
Keywords: prompt
Abstract: Recent Large Reasoning Models (LRMs) with thinking traces have shown strong performance on English reasoning tasks. However, their ability to think in other languages is less studied. This capability is as important as answer accuracy for real-world applications because users may find the reasoning trace useful for oversight only when it is expressed in their own language. We comprehensively evaluate two leading families of LRMs on our XReasoning benchmark and find that even the most advanced models often revert to English or produce fragmented reasoning in other languages, revealing a substantial gap in multilingual reasoning. Prompt-based interventions that force models to reason in the user's language improve readability and oversight but reduce answer accuracy, exposing an important trade-off. We further show that targeted post-training on just 100 examples mitigates this mismatch, though some accuracy loss remains. Our results highlight the limited multilingual reasoning capabilities of current LRMs and outline directions for future work. Code and data are available at this https URL.

Title: VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models

Authors: Chahat Raj, Bowen Wei, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22897
Pdf URL: https://arxiv.org/pdf/2505.22897
Copy Paste: [[2505.22897]] VIGNETTE: Socially Grounded Bias Evaluation for Vision-Language Models(https://arxiv.org/abs/2505.22897)
Keywords: language model, llm
Abstract: While bias in large language models (LLMs) is well-studied, similar concerns in vision-language models (VLMs) have received comparatively less attention. Existing VLM bias studies often focus on portrait-style images and gender-occupation associations, overlooking broader and more complex social stereotypes and their implied harm. This work introduces VIGNETTE, a large-scale VQA benchmark with 30M+ images for evaluating bias in VLMs through a question-answering framework spanning four directions: factuality, perception, stereotyping, and decision making. Beyond narrowly-centered studies, we assess how VLMs interpret identities in contextualized settings, revealing how models make trait and capability assumptions and exhibit patterns of discrimination. Drawing from social psychology, we examine how VLMs connect visual identity cues to trait and role-based inferences, encoding social hierarchies, through biased selections. Our findings uncover subtle, multifaceted, and surprising stereotypical patterns, offering insights into how VLMs construct social meaning from inputs.

Title: Talent or Luck? Evaluating Attribution Bias in Large Language Models

Authors: Chahat Raj, Mahika Banerjee, Aylin Caliskan, Antonios Anastasopoulos, Ziwei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22910
Pdf URL: https://arxiv.org/pdf/2505.22910
Copy Paste: [[2505.22910]] Talent or Luck? Evaluating Attribution Bias in Large Language Models(https://arxiv.org/abs/2505.22910)
Keywords: language model, llm
Abstract: When a student fails an exam, do we tend to blame their effort or the test's difficulty? Attribution, defined as how reasons are assigned to event outcomes, shapes perceptions, reinforces stereotypes, and influences decisions. Attribution Theory in social psychology explains how humans assign responsibility for events using implicit cognition, attributing causes to internal (e.g., effort, ability) or external (e.g., task difficulty, luck) factors. LLMs' attribution of event outcomes based on demographics carries important fairness implications. Most works exploring social biases in LLMs focus on surface-level associations or isolated stereotypes. This work proposes a cognitively grounded bias evaluation framework to identify how models' reasoning disparities channelize biases toward demographic groups.

Title: ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room

Authors: Nikita Mehandru, Niloufar Golchini, David Bamman, Travis Zack, Melanie F. Molina, Ahmed Alaa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22919
Pdf URL: https://arxiv.org/pdf/2505.22919
Copy Paste: [[2505.22919]] ER-REASON: A Benchmark Dataset for LLM-Based Clinical Reasoning in the Emergency Room(https://arxiv.org/abs/2505.22919)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been extensively evaluated on medical question answering tasks based on licensing exams. However, real-world evaluations often depend on costly human annotators, and existing benchmarks tend to focus on isolated tasks that rarely capture the clinical reasoning or full workflow underlying medical decisions. In this paper, we introduce ER-Reason, a benchmark designed to evaluate LLM-based clinical reasoning and decision-making in the emergency room (ER)--a high-stakes setting where clinicians make rapid, consequential decisions across diverse patient presentations and medical specialties under time pressure. ER-Reason includes data from 3,984 patients, encompassing 25,174 de-identified longitudinal clinical notes spanning discharge summaries, progress notes, history and physical exams, consults, echocardiography reports, imaging notes, and ER provider documentation. The benchmark includes evaluation tasks that span key stages of the ER workflow: triage intake, initial assessment, treatment selection, disposition planning, and final diagnosis--each structured to reflect core clinical reasoning processes such as differential diagnosis via rule-out reasoning. We also collected 72 full physician-authored rationales explaining reasoning processes that mimic the teaching process used in residency training, and are typically absent from ER documentation. Evaluations of state-of-the-art LLMs on ER-Reason reveal a gap between LLM-generated and clinician-authored clinical reasoning for ER decisions, highlighting the need for future research to bridge this divide.

Title: Structured Memory Mechanisms for Stable Context Representation in Large Language Models

Authors: Yue Xing, Tao Yang, Yijiashun Qi, Minggu Wei, Yu Cheng, Honghui Xin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22921
Pdf URL: https://arxiv.org/pdf/2505.22921
Copy Paste: [[2505.22921]] Structured Memory Mechanisms for Stable Context Representation in Large Language Models(https://arxiv.org/abs/2505.22921)
Keywords: language model
Abstract: This paper addresses the limitations of large language models in understanding long-term context. It proposes a model architecture equipped with a long-term memory mechanism to improve the retention and retrieval of semantic information across paragraphs and dialogue turns. The model integrates explicit memory units, gated writing mechanisms, and attention-based reading modules. A forgetting function is introduced to enable dynamic updates of memory content, enhancing the model's ability to manage historical information. To further improve the effectiveness of memory operations, the study designs a joint training objective. This combines the main task loss with constraints on memory writing and forgetting. It guides the model to learn better memory strategies during task execution. Systematic evaluation across multiple subtasks shows that the model achieves clear advantages in text generation consistency, stability in multi-turn question answering, and accuracy in cross-context reasoning. In particular, the model demonstrates strong semantic retention and contextual coherence in long-text tasks and complex question answering scenarios. It effectively mitigates the context loss and semantic drift problems commonly faced by traditional language models when handling long-term dependencies. The experiments also include analysis of different memory structures, capacity sizes, and control strategies. These results further confirm the critical role of memory mechanisms in language understanding. They demonstrate the feasibility and effectiveness of the proposed approach in both architectural design and performance outcomes.
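
The three memory operations named in the abstract (gated writing, forgetting, attention-based reading) can be illustrated with a small sketch. The slot layout, the scalar gates, the update rule, and all function names below are assumptions made for illustration; the abstract does not give the actual equations.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read(memory, query):
    """Attention-based read: weight each slot by its dot product with the query."""
    weights = softmax([sum(q * v for q, v in zip(query, slot)) for slot in memory])
    dim = len(query)
    return [sum(w * slot[i] for w, slot in zip(weights, memory)) for i in range(dim)]

def update(memory, slot_index, candidate, write_gate, forget_gate):
    """Gated write with forgetting (hypothetical rule):
    new_slot = (1 - forget_gate) * old_slot + write_gate * candidate."""
    old = memory[slot_index]
    new = [(1.0 - forget_gate) * o + write_gate * c for o, c in zip(old, candidate)]
    return memory[:slot_index] + [new] + memory[slot_index + 1:]

# Two memory slots of dimension 2; overwrite slot 0 completely (forget_gate = 1).
memory = [[1.0, 0.0], [0.0, 1.0]]
memory = update(memory, 0, candidate=[0.0, 2.0], write_gate=0.5, forget_gate=1.0)
context = read(memory, query=[0.0, 1.0])
```

In a trained model the gates would be produced by learned functions of the input rather than passed in as constants.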

Title: Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging

Authors: Haobo Zhang, Jiayu Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22934
Pdf URL: https://arxiv.org/pdf/2505.22934
Copy Paste: [[2505.22934]] Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging(https://arxiv.org/abs/2505.22934)
Keywords: language model
Abstract: Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training. However, existing merging methods often fail for models fine-tuned with low-rank adaptation (LoRA), due to significant performance degradation. In this paper, we show that this issue arises from a previously overlooked interplay between model parameters and data distributions. We propose Orthogonal Subspaces for Robust model Merging (OSRM) to constrain the LoRA subspace *prior* to fine-tuning, ensuring that updates relevant to one task do not adversely shift outputs for others. Our approach can seamlessly integrate with most existing merging algorithms, reducing the unintended interference among tasks. Extensive experiments on eight datasets, tested with three widely used LMs and two large LMs, demonstrate that our method not only boosts merging performance but also preserves single-task accuracy. Furthermore, our approach exhibits greater robustness to the hyperparameters of merging. These results highlight the importance of data-parameter interaction in model merging and offer a plug-and-play solution for merging LoRA models.

Title: WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning

Authors: Yuchen Zhuang, Di Jin, Jiaao Chen, Wenqi Shi, Hanrui Wang, Chao Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22942
Pdf URL: https://arxiv.org/pdf/2505.22942
Copy Paste: [[2505.22942]] WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning(https://arxiv.org/abs/2505.22942)
Keywords: language model, gpt, llm, agent
Abstract: Large language model (LLM)-empowered web agents enable the automation of complex, real-time web navigation tasks in enterprise environments. However, existing web agents relying on supervised fine-tuning (SFT) often struggle with generalization and robustness due to insufficient reasoning capabilities when handling the inherently dynamic nature of web interactions. In this study, we introduce WorkForceAgent-R1, an LLM-based web agent trained using a rule-based R1-style reinforcement learning framework designed explicitly to enhance single-step reasoning and planning for business-oriented web navigation tasks. We employ a structured reward function that evaluates both adherence to output formats and correctness of actions, enabling WorkForceAgent-R1 to implicitly learn robust intermediate reasoning without explicit annotations or extensive expert demonstrations. Extensive experiments on the WorkArena benchmark demonstrate that WorkForceAgent-R1 substantially outperforms SFT baselines by 10.26-16.59%, achieving competitive performance relative to proprietary LLM-based agents (gpt-4o) in workplace-oriented web navigation tasks.
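
A rule-based reward that checks both output format and action correctness could look roughly like the sketch below. The `<think>`/`<action>` tag format, the exact-match comparison, and the 0.2/0.8 weights are hypothetical choices for illustration and are not taken from the abstract.

```python
import re

# Hypothetical output format: reasoning in <think>...</think>,
# followed by exactly one action in <action>...</action>.
PATTERN = re.compile(r"^<think>(.+?)</think>\s*<action>(.+?)</action>$", re.DOTALL)

def reward(response: str, gold_action: str,
           format_weight: float = 0.2, action_weight: float = 0.8) -> float:
    """Return a format reward plus an action reward (both weights are assumptions)."""
    match = PATTERN.match(response.strip())
    if match is None:
        # Malformed output earns nothing, regardless of its content.
        return 0.0
    predicted = match.group(2).strip()
    score = format_weight
    if predicted == gold_action.strip():
        score += action_weight
    return score

good = reward("<think>The form is complete.</think><action>click('submit')</action>",
              "click('submit')")
wrong = reward("<think>Maybe cancel.</think><action>click('cancel')</action>",
               "click('submit')")
malformed = reward("click('submit')", "click('submit')")
```

Because the score depends only on parseable structure and the final action, no annotation of the intermediate reasoning is needed, which is the property the abstract emphasizes.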

Title: Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

Authors: Jaewoo Ahn, Heeseung Yun, Dayoon Ko, Gunhee Kim
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.SD
Abstract URL: https://arxiv.org/abs/2505.22943
Pdf URL: https://arxiv.org/pdf/2505.22943
Copy Paste: [[2505.22943]] Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates(https://arxiv.org/abs/2505.22943)
Keywords: language model, llm
Abstract: While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, video, and audio.
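
The two evaluation quantities mentioned in the abstract can be sketched as follows. The abstract does not define how the entropy is computed, so this example assumes a Shannon entropy over the unigram token distribution of a group of generated texts; the function names are illustrative.

```python
import math
from collections import Counter

def attack_success_rate(outcomes):
    """Sample-wise rate: fraction of samples whose attack succeeded."""
    return sum(outcomes) / len(outcomes)

def token_entropy(texts):
    """Group-wise diversity (assumed definition): Shannon entropy, in bits,
    of the whitespace-token distribution pooled over all texts in the group."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

asr = attack_success_rate([True, False, True, True])
repetitive = token_entropy(["a dog", "a dog"])
varied = token_entropy(["a dog", "two cats"])
```

A group of near-identical attacks scores low on the entropy measure even if each attack succeeds, which is why the two metrics are reported together.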

Title: OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature

Authors: Alisha Srivastava, Emir Korukluoglu, Minh Nhat Le, Duyen Tran, Chau Minh Pham, Marzena Karpinska, Mohit Iyyer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.22945
Pdf URL: https://arxiv.org/pdf/2505.22945
Copy Paste: [[2505.22945]] OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature(https://arxiv.org/abs/2505.22945)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented in translation. To do so, we introduce OWL, a dataset of 31.5K aligned excerpts from 20 books in ten languages, including English originals, official translations (Vietnamese, Spanish, Turkish), and new translations in six low-resource languages (Sesotho, Yoruba, Maithili, Malagasy, Setswana, Tahitian). We evaluate memorization across model families and sizes through three tasks: (1) direct probing, which asks the model to identify a book's title and author; (2) name cloze, which requires predicting masked character names; and (3) prefix probing, which involves generating continuations. We find that LLMs consistently recall content across languages, even for texts without direct translation in pretraining data. GPT-4o, for example, identifies authors and titles 69% of the time and masked entities 6% of the time in newly translated excerpts. Perturbations (e.g., masking characters, shuffling words) modestly reduce direct probing accuracy (7% drop for shuffled official translations). Our results highlight the extent of cross-lingual memorization and provide insights on the differences between the models.

Title: NegVQA: Can Vision Language Models Understand Negation?

Authors: Yuhui Zhang, Yuchang Su, Yiming Liu, Serena Yeung-Levy
Subjects: cs.CL, cs.AI, cs.CV, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22946
Pdf URL: https://arxiv.org/pdf/2505.22946
Copy Paste: [[2505.22946]] NegVQA: Can Vision Language Models Understand Negation?(https://arxiv.org/abs/2505.22946)
Keywords: language model
Abstract: Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at this https URL.

Title: StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs

Authors: Haohan Yuan, Sukhwa Hong, Haopeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22950
Pdf URL: https://arxiv.org/pdf/2505.22950
Copy Paste: [[2505.22950]] StrucSum: Graph-Structured Reasoning for Long Document Extractive Summarization with LLMs(https://arxiv.org/abs/2505.22950)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have shown strong performance in zero-shot summarization, but often struggle to model document structure and identify salient information in long texts. In this work, we introduce StrucSum, a training-free prompting framework that enhances LLM reasoning through sentence-level graph structures. StrucSum injects structural signals into prompts via three targeted strategies: Neighbor-Aware Prompting (NAP) for local context, Centrality-Aware Prompting (CAP) for importance estimation, and Centrality-Guided Masking (CGM) for efficient input reduction. Experiments on ArXiv, PubMed, and Multi-News demonstrate that StrucSum consistently improves both summary quality and factual consistency over unsupervised baselines and vanilla prompting. Notably, on ArXiv, it boosts FactCC and SummaC by 19.2 and 9.7 points, indicating stronger alignment between summaries and source content. These findings suggest that structure-aware prompting is a simple yet effective approach for zero-shot extractive summarization with LLMs, without any training or task-specific tuning.
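
To make the idea of sentence-level graph signals concrete, here is a minimal sketch of a centrality score and a centrality-guided filter. The abstract does not say how the graph is built, so this example assumes edges weighted by Jaccard word overlap and a weighted-degree centrality; the threshold and function names are hypothetical.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0

def centrality(sentences, threshold=0.2):
    """Weighted degree of each sentence in the overlap graph (assumed construction)."""
    scores = [0.0] * len(sentences)
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            weight = jaccard(sentences[i], sentences[j])
            if weight >= threshold:
                scores[i] += weight
                scores[j] += weight
    return scores

def keep_most_central(sentences, keep):
    """Drop low-centrality sentences, preserving the original order of the rest."""
    scores = centrality(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:keep])]

doc = ["the cat sat", "the cat ran", "dogs bark loudly"]
scores = centrality(doc)
reduced = keep_most_central(doc, keep=2)
```

The scores could be written into the prompt next to each sentence, and the reduced list could stand in for the full document when the input is too long.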

Title: LLMs for Argument Mining: Detection, Extraction, and Relationship Classification of pre-defined Arguments in Online Comments

Authors: Matteo Guida, Yulia Otmakhova, Eduard Hovy, Lea Frermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22956
Pdf URL: https://arxiv.org/pdf/2505.22956
Copy Paste: [[2505.22956]] LLMs for Argument Mining: Detection, Extraction, and Relationship Classification of pre-defined Arguments in Online Comments(https://arxiv.org/abs/2505.22956)
Keywords: language model, llm
Abstract: Automated large-scale analysis of public discussions around contested issues like abortion requires detecting and understanding the use of arguments. While Large Language Models (LLMs) have shown promise in language processing tasks, their performance in mining topic-specific, pre-defined arguments in online comments remains underexplored. We evaluate four state-of-the-art LLMs on three argument mining tasks using datasets comprising over 2,000 opinion comments across six polarizing topics. Quantitative evaluation suggests an overall strong performance across the three tasks, especially for large and fine-tuned LLMs, albeit at a significant environmental cost. However, a detailed error analysis revealed systematic shortcomings on long and nuanced comments and emotionally charged language, raising concerns for downstream applications like content moderation or opinion analysis. Our results highlight both the promise and current limitations of LLMs for automated argument analysis in online comments.

Title: LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements

Authors: Jianwei Wang, Mengqi Wang, Yinsi Zhou, Zhenchang Xing, Qing Liu, Xiwei Xu, Wenjie Zhang, Liming Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.22959
Pdf URL: https://arxiv.org/pdf/2505.22959
Copy Paste: [[2505.22959]] LLM-based HSE Compliance Assessment: Benchmark, Performance, and Advancements(https://arxiv.org/abs/2505.22959)
Keywords: language model, llm, prompt
Abstract: Health, Safety, and Environment (HSE) compliance assessment demands dynamic real-time decision-making under complicated regulations and complex human-machine-environment interactions. While large language models (LLMs) hold significant potential for decision intelligence and contextual dialogue, their capacity for domain-specific knowledge in HSE and structured legal reasoning remains underexplored. We introduce HSE-Bench, the first benchmark dataset designed to evaluate the HSE compliance assessment capabilities of LLMs. HSE-Bench comprises over 1,000 manually curated questions drawn from regulations, court cases, safety exams, and fieldwork videos, and integrates a reasoning flow based on Issue spotting, rule Recall, rule Application, and rule Conclusion (IRAC) to assess the holistic reasoning pipeline. We conduct extensive evaluations on different prompting strategies and more than 10 LLMs, including foundation models, reasoning models and multimodal vision models. The results show that, although current LLMs achieve good performance, their capabilities largely rely on semantic matching rather than principled reasoning grounded in the underlying HSE compliance context. Moreover, their native reasoning traces lack the systematic legal reasoning required for rigorous HSE compliance assessment. To address these shortcomings, we propose a new prompting technique, Reasoning of Expert (RoE), which guides LLMs to simulate the reasoning process of different experts for compliance assessment and reach a more accurate unified decision. We hope our study highlights reasoning gaps in LLMs for HSE compliance and inspires further research on related tasks.

Title: ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind

Authors: Peixuan Han, Zijia Liu, Jiaxuan You
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22961
Pdf URL: https://arxiv.org/pdf/2505.22961
Copy Paste: [[2505.22961]] ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind(https://arxiv.org/abs/2505.22961)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory-of-mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent's current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader to learn how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents. Code is available at: this https URL.

Title: Exploring Scaling Laws for EHR Foundation Models

Authors: Sheng Zhang, Qin Liu, Naoto Usuyama, Cliff Wong, Tristan Naumann, Hoifung Poon
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.22964
Pdf URL: https://arxiv.org/pdf/2505.22964
Copy Paste: [[2505.22964]] Exploring Scaling Laws for EHR Foundation Models(https://arxiv.org/abs/2505.22964)
Keywords: language model, llm
Abstract: The emergence of scaling laws has profoundly shaped the development of large language models (LLMs), enabling predictable performance gains through systematic increases in model size, dataset volume, and compute. Yet, these principles remain largely unexplored in the context of electronic health records (EHRs) -- a rich, sequential, and globally abundant data source that differs structurally from natural language. In this work, we present the first empirical investigation of scaling laws for EHR foundation models. By training transformer architectures on patient timeline data from the MIMIC-IV database across varying model sizes and compute budgets, we identify consistent scaling patterns, including parabolic IsoFLOPs curves and power-law relationships between compute, model parameters, data size, and clinical utility. These findings demonstrate that EHR models exhibit scaling behavior analogous to LLMs, offering predictive insights into resource-efficient training strategies. Our results lay the groundwork for developing powerful EHR foundation models capable of transforming clinical prediction tasks and advancing personalized healthcare.
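
A parabolic IsoFLOP curve implies that, at a fixed compute budget, loss plotted against the logarithm of model size has a single minimum. The sketch below locates that minimum from three points. The loss values come from a synthetic toy function, not from the paper, and the helper names are illustrative.

```python
def parabola_vertex(p1, p2, p3):
    """Return the x-coordinate of the vertex of the parabola through three (x, y) points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    num = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    den = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    if den == 0:
        raise ValueError("points are collinear; no parabola vertex")
    return x2 - 0.5 * num / den

def toy_loss(log10_params):
    # Synthetic stand-in for validation loss at one compute budget (not real data).
    return 2.0 + 0.1 * (log10_params - 8.0) ** 2

# Three hypothetical model sizes, expressed as log10(parameter count).
points = [(x, toy_loss(x)) for x in (7.0, 8.5, 9.0)]
best_log10_params = parabola_vertex(*points)
compute_optimal_params = 10 ** best_log10_params
```

Repeating this for several compute budgets and regressing the optimal size against compute is the usual way such curves are turned into a power-law relationship; with more than three runs per budget, a least-squares quadratic fit would replace the exact three-point formula.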

Title: Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation

Authors: Hoang Pham, Thanh-Do Nguyen, Khac-Hoai Nam Bui
Subjects: cs.CL, cs.AI, cs.DB, cs.IR
Abstract URL: https://arxiv.org/abs/2505.22993
Pdf URL: https://arxiv.org/pdf/2505.22993
Copy Paste: [[2505.22993]] Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation(https://arxiv.org/abs/2505.22993)
Keywords: language model, llm, agent
Abstract: Claim verification is a long-standing and challenging task that demands not only high accuracy but also explainability of the verification process. This task has become an emerging research issue in the era of large language models (LLMs) since real-world claims are often complex, featuring intricate semantic structures or obfuscated entities. Traditional approaches typically address this by decomposing claims into sub-claims and querying a knowledge base to resolve hidden or ambiguous entities. However, the absence of effective disambiguation strategies for these entities can compromise the entire verification process. To address these challenges, we propose Verify-in-the-Graph (VeGraph), a novel framework leveraging the reasoning and comprehension abilities of LLM agents. VeGraph operates in three phases: (1) Graph Representation - an input claim is decomposed into structured triplets, forming a graph-based representation that integrates both structured and unstructured information; (2) Entity Disambiguation - VeGraph iteratively interacts with the knowledge base to resolve ambiguous entities within the graph for deeper sub-claim verification; and (3) Verification - remaining triplets are verified to complete the fact-checking process. Experiments using Meta-Llama-3-70B (instruct version) show that VeGraph achieves competitive performance compared to baselines on two benchmarks, HoVer and FEVEROUS, effectively addressing claim verification challenges. Our source code and data are available for further exploitation.

Title: DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors

Authors: Yize Cheng, Wenxiao Wang, Mazda Moayeri, Soheil Feizi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23001
Pdf URL: https://arxiv.org/pdf/2505.23001
Copy Paste: [[2505.23001]] DyePack: Provably Flagging Test Set Contamination in LLMs Using Backdoors(https://arxiv.org/abs/2505.23001)
Keywords: language model, llm
Abstract: Open benchmarks are essential for evaluating and advancing large language models, offering reproducibility and transparency. However, their accessibility makes them likely targets of test set contamination. In this work, we introduce DyePack, a framework that leverages backdoor attacks to identify models that used benchmark test sets during training, without requiring access to the loss, logits, or any internal details of the model. Like how banks mix dye packs with their money to mark robbers, DyePack mixes backdoor samples with the test data to flag models that trained on it. We propose a principled design incorporating multiple backdoors with stochastic targets, enabling exact false positive rate (FPR) computation when flagging every model. This provably prevents false accusations while providing strong evidence for every detected case of contamination. We evaluate DyePack on five models across three datasets, covering both multiple-choice and open-ended generation tasks. For multiple-choice questions, it successfully detects all contaminated models with guaranteed FPRs as low as 0.000073% on MMLU-Pro and 0.000017% on Big-Bench-Hard using eight backdoors. For open-ended generation tasks, it generalizes well and identifies all contaminated models on Alpaca with a guaranteed false positive rate of just 0.127% using six backdoors.
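
The abstract states that the false positive rate can be computed exactly but does not give the formula. One natural reading, assumed here, is that an uncontaminated model matches each backdoor's randomly chosen target independently with probability 1/T, so the number of matches follows a binomial distribution and the FPR is its upper tail. The function name and the parameter values in the example are illustrative assumptions.

```python
from math import comb

def false_positive_rate(num_backdoors: int, num_targets: int, threshold: int) -> float:
    """Probability that a clean model matches at least `threshold` of the
    `num_backdoors` backdoor targets by chance, assuming each match is an
    independent event with probability 1 / num_targets."""
    p = 1.0 / num_targets
    return sum(
        comb(num_backdoors, k) * p ** k * (1.0 - p) ** (num_backdoors - k)
        for k in range(threshold, num_backdoors + 1)
    )

# Illustrative settings: 8 backdoors, 10 possible targets, flag at >= 7 matches.
fpr_eight = false_positive_rate(num_backdoors=8, num_targets=10, threshold=7)
# Illustrative settings: 6 backdoors, 10 possible targets, flag at >= 4 matches.
fpr_six = false_positive_rate(num_backdoors=6, num_targets=10, threshold=4)
```

Under these assumptions `fpr_eight` is about 7.3e-7 and `fpr_six` is about 1.27e-3. Adding backdoors or raising the threshold drives the bound down quickly, which is why a handful of backdoors suffices for very small false positive rates.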

Title: A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs

Authors: Chiwan Park, Wonjun Jang, Daeryong Kim, Aelim Ahn, Kichang Yang, Woosung Hwang, Jihyeon Roh, Hyerin Park, Hyosun Wang, Min Seok Kim, Jihoon Kang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23006
Pdf URL: https://arxiv.org/pdf/2505.23006
Copy Paste: [[2505.23006]] A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs(https://arxiv.org/abs/2505.23006)
Keywords: language model, llm, chat, agent
Abstract: The advancement of Large Language Models (LLMs) has led to significant improvements in various service domains, including search, recommendation, and chatbot applications. However, applying state-of-the-art (SOTA) research to industrial settings presents challenges, as it requires maintaining flexible conversational abilities while also strictly complying with service-specific constraints. This can be seen as two conflicting requirements due to the probabilistic nature of LLMs. In this paper, we propose our approach to addressing this challenge and detail the strategies we employed to overcome their inherent limitations in real-world applications. We conduct a practical case study of a conversational agent designed for the e-commerce domain, detailing our implementation workflow and optimizations. Our findings provide insights into bridging the gap between academic research and real-world application, introducing a framework for developing scalable, controllable, and reliable AI-driven agents.

Title: Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models

Authors: Jinwen Chen, Hainan Zhang, Fei Sun, Qinnan Zhang, Sijia Wen, Ziwei Wang, Zhiming Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23015
Pdf URL: https://arxiv.org/pdf/2505.23015
Copy Paste: [[2505.23015]] Detecting Stealthy Backdoor Samples based on Intra-class Distance for Large Language Models(https://arxiv.org/abs/2505.23015)
Keywords: language model, llm
Abstract: Fine-tuning LLMs with datasets containing stealthy backdoors from publishers poses security risks to downstream applications. Mainstream detection methods either identify poisoned samples by analyzing the prediction probability of poisoned classification models or rely on the rewriting model to eliminate the stealthy triggers. However, the former cannot be applied to generation tasks, while the latter may degrade generation performance and introduce new triggers. Therefore, efficiently eliminating stealthy poisoned samples for LLMs remains an urgent problem. We observe that after applying TF-IDF clustering to the sample response, there are notable differences in the intra-class distances between clean and poisoned samples. Poisoned samples tend to cluster closely because of their specific malicious outputs, whereas clean samples are more scattered due to their more varied responses. Thus, in this paper, we propose a stealthy backdoor sample detection method based on Reference-Filtration and Tfidf-Clustering mechanisms (RFTC). Specifically, we first compare the sample response with the reference model's outputs and consider the sample suspicious if there is a significant discrepancy. We then perform TF-IDF clustering on these suspicious samples to identify the true poisoned samples based on the intra-class distance. Experiments on two machine translation datasets and one QA dataset demonstrate that RFTC outperforms baselines in backdoor detection and model performance. Further analysis of different reference models also confirms the effectiveness of our Reference-Filtration.
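
The observation about intra-class distance can be illustrated with a small standard-library sketch. The TF-IDF weighting, the cosine distance, the toy responses, and the hand-assigned groups are all assumptions for illustration; the abstract does not specify the clustering algorithm or the distance used.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dicts) with a smoothed IDF; weighting is an assumed choice."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(tok for toks in tokenized for tok in set(toks))
    n = len(docs)
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine_distance(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

def mean_intra_class_distance(vectors):
    """Average pairwise distance inside one group of responses."""
    pairs = [(i, j) for i in range(len(vectors)) for j in range(i + 1, len(vectors))]
    if not pairs:
        return 0.0
    return sum(cosine_distance(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)

# Toy responses: the first three share a malicious payload, the last three do not.
responses = ["visit evil site now", "visit evil site today", "visit evil site soon",
             "the weather is sunny", "stock prices fell sharply", "she plays the violin well"]
vectors = tfidf_vectors(responses)
poisoned_spread = mean_intra_class_distance(vectors[:3])
clean_spread = mean_intra_class_distance(vectors[3:])
```

In this toy setup the group sharing a payload has a much smaller spread than the clean group, so a threshold on the intra-class distance would separate them.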
摘要：使用包含出版商隐身后门的数据集的微调LLM对下游应用程序构成安全风险。主流检测方法要么通过分析有毒分类模型的预测概率来识别中毒样本，要么依靠重写模型来消除隐形触发器。但是，前者不能应用于生成任务，而后者可能会降低生成性能并引入新的触发器。因此，有效消除LLMS的隐秘中毒样品仍然是一个紧迫的问题。我们观察到，将TF-IDF聚类应用于样品响应后，清洁和有毒样品之间的阶层距离存在显着差异。由于其特定的恶意输出，中毒样品往往会紧密聚集，而由于它们的反应越来越多，干净的样品更加分散。因此，在本文中，我们提出了一种基于参考过滤和TFIDF群集机制（RFTC）的隐形后门样品检测方法。具体来说，我们首先将样本响应与参考模型的输出进行比较，并在存在明显差异的情况下考虑样本可疑。然后，我们在这些可疑样品上执行TF-IDF聚类，以基于类内距离识别真正的中毒样品。在两个机器翻译数据集和一个QA数据集上进行的实验表明，RFTC在后门检测和模型性能中的表现优于基准。对不同参考模型的进一步分析也证实了我们参考过滤的有效性。
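
The intra-class distance observation in the abstract above can be illustrated with a minimal sketch. It is not the RFTC implementation: the reference-filtration and clustering steps are skipped, group membership is assumed to be given, and the function names (tfidf_vectors, cosine_distance, mean_intra_class_distance) and the smoothed IDF formula are assumptions made for illustration only.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per whitespace-tokenized document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency (assumed variant).
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine_distance(a, b):
    """Cosine distance between two sparse vectors; 1.0 if either vector is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_intra_class_distance(vectors):
    """Average pairwise cosine distance inside one group of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)
```

Under these assumptions, a group of responses that share a fixed malicious phrase should yield a smaller mean intra-class distance than a group of unrelated clean responses, which is the signal the abstract describes.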

Title: Context Robust Knowledge Editing for Language Models

Authors: Haewon Park, Gyubin Choi, Minjun Kim, Yohan Jo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23026
Pdf URL: https://arxiv.org/pdf/2505.23026
Copy Paste: [[2505.23026]] Context Robust Knowledge Editing for Language Models(https://arxiv.org/abs/2505.23026)
Keywords: language model
Abstract: Knowledge editing (KE) methods offer an efficient way to modify knowledge in large language models. Current KE evaluations typically assess editing success by considering only the edited knowledge without any preceding contexts. In real-world applications, however, preceding contexts often trigger the retrieval of the original knowledge and undermine the intended edit. To address this issue, we develop CHED -- a benchmark designed to evaluate the context robustness of KE methods. Evaluations on CHED show that they often fail when preceding contexts are present. To mitigate this shortcoming, we introduce CoRE, a KE method designed to strengthen context robustness by minimizing context-sensitive variance in hidden states of the model for edited knowledge. This method not only improves the editing success rate in situations where a preceding context is present but also preserves the overall capabilities of the model. We provide an in-depth analysis of the differing impacts of preceding contexts when introduced as user utterances versus assistant responses, and we dissect attention-score patterns to assess how specific tokens influence editing success.
摘要：知识编辑（KE）方法提供了一种有效的方法来修改大语言模型中的知识。当前的KE评估通常通过仅考虑编辑的知识而没有任何前面的上下文来评估成功。但是，在实际应用程序中，前面的上下文通常会触发原始知识的检索并破坏预期的编辑。为了解决这个问题，我们开发了CHED - 一种旨在评估KE方法上下文鲁棒性的基准。对CHED的评估表明，在存在之前的上下文时，它们通常会失败。为了减轻这一缺点，我们引入了Core，这是一种旨在通过最大程度地减少模型隐藏状态中上下文敏感的差异来增强上下文鲁棒性的方法，以进行编辑知识。此方法不仅在存在前面上下文的情况下提高了编辑成功率，而且还可以保留模型的整体功能。我们对作为用户话语与助手响应引入上下文的不同影响进行了深入的分析，并剖析了注意分数模式，以评估特定代币如何影响编辑成功。

Title: Machine-Facing English: Defining a Hybrid Register Shaped by Human-AI Discourse

Authors: Hyunwoo Kim, Hanau Yi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23035
Pdf URL: https://arxiv.org/pdf/2505.23035
Copy Paste: [[2505.23035]] Machine-Facing English: Defining a Hybrid Register Shaped by Human-AI Discourse(https://arxiv.org/abs/2505.23035)
Keywords: prompt
Abstract: Machine-Facing English (MFE) is an emergent register shaped by the adaptation of everyday language to the expanding presence of AI interlocutors. Drawing on register theory (Halliday 1985, 2006), enregisterment (Agha 2003), audience design (Bell 1984), and interactional pragmatics (Giles & Ogay 2007), this study traces how sustained human-AI interaction normalizes syntactic rigidity, pragmatic simplification, and hyper-explicit phrasing - features that enhance machine parseability at the expense of natural fluency. Our analysis is grounded in qualitative observations from bilingual (Korean/English) voice- and text-based product testing sessions, with reflexive drafting conducted using Natural Language Declarative Prompting (NLD-P) under human curation. Thematic analysis identifies five recurrent traits - redundant clarity, directive syntax, controlled vocabulary, flattened prosody, and single-intent structuring - that improve execution accuracy but compress expressive range. MFE's evolution highlights a persistent tension between communicative efficiency and linguistic richness, raising design challenges for conversational interfaces and pedagogical considerations for multilingual users. We conclude by underscoring the need for comprehensive methodological exposition and future empirical validation.
摘要：面向机器的英语（MFE）是一个新兴的寄存器，该寄存器由日常语言改编为扩大AI对话者的存在。 Drawing on register theory (Halliday 1985, 2006), enregisterment (Agha 2003), audience design (Bell 1984), and interactional pragmatics (Giles & Ogay 2007), this study traces how sustained human-AI interaction normalizes syntactic rigidity, pragmatic simplification, and hyper-explicit phrasing - features that enhance machine parseability at the expense of natural fluency.我们的分析基于双语（韩语/英语）语音和基于文本的产品测试的定性观察，并在人类策划下使用自然语言声明提示（NLD-P）进行了反身起草。主题分析确定了五个经常性特征 - 冗余清晰度，指令语法，受控词汇，扁平化的韵律和单意en结构 - 可提高执行精度，但压缩表达范围。 MFE的进化强调了交流效率和语言丰富性之间的持续张力，为对话界面带来了设计挑战，并为多语言用户带来了教学方面的考虑。我们通过强调需要进行全面的方法论和未来的经验验证的必要性。

Title: Improving Multilingual Social Media Insights: Aspect-based Comment Analysis

Authors: Longyin Zhang, Bowei Zou, Ai Ti Aw
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23037
Pdf URL: https://arxiv.org/pdf/2505.23037
Copy Paste: [[2505.23037]] Improving Multilingual Social Media Insights: Aspect-based Comment Analysis(https://arxiv.org/abs/2505.23037)
Keywords: language model, llm
Abstract: The inherent nature of social media posts, characterized by the freedom of language use with a disjointed array of diverse opinions and topics, poses significant challenges to downstream NLP tasks such as comment clustering, comment summarization, and social media opinion analysis. To address this, we propose identifying and generating aspect terms at a granular level from individual comments to guide model attention. Specifically, we leverage multilingual large language models with supervised fine-tuning for comment aspect term generation (CAT-G), further aligning the model's predictions with human expectations through DPO. We demonstrate the effectiveness of our method in enhancing the comprehension of social media discourse on two NLP tasks. Moreover, this paper contributes the first multilingual CAT-G test set on English, Chinese, Malay, and Bahasa Indonesian. As LLM capabilities vary among languages, this test set allows for a comparative analysis of performance across languages with varying levels of LLM proficiency.
摘要：社交媒体帖子的固有性质，其特征是语言使用自由以及各种各样的不同意见和主题，对下游NLP任务（例如评论集群，评论摘要和社交媒体意见分析）提出了重大挑战。为了解决这个问题，我们提出了从单个评论中识别和生成方面术语的颗粒状水平，以指导模型注意力。具体来说，我们利用具有监督的微调的多语言大语言模型来发表评论学期生成（CAT-G），从而进一步使模型的预测与通过DPO的人类期望保持一致。我们证明了我们方法在增强社交媒体对两个NLP任务的理解的有效性。此外，本文为英语，中文，马来语和巴哈萨印尼人提供了第一个多语言CAT-G测试。随着LLM功能在语言之间有所不同，该测试集可以对具有不同LLM熟练程度的语言进行比较分析。

Title: EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models

Authors: Yuzhen Xiao, Jiahe Song, Yongxin Xu, Ruizhe Zhang, Yiqi Xiao, Xin Lu, Runchuan Zhu, Bowen Jiang, Junfeng Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23038
Pdf URL: https://arxiv.org/pdf/2505.23038
Copy Paste: [[2505.23038]] EL4NER: Ensemble Learning for Named Entity Recognition via Multiple Small-Parameter Large Language Models(https://arxiv.org/abs/2505.23038)
Keywords: language model, llm
Abstract: The In-Context Learning (ICL) technique based on Large Language Models (LLMs) has gained prominence in Named Entity Recognition (NER) tasks for its lower computing resource consumption, lower manual labeling overhead, and stronger generalizability. Nevertheless, most ICL-based NER methods depend on large-parameter LLMs: the open-source models demand substantial computational resources for deployment and inference, while the closed-source ones incur high API costs, raise data-privacy concerns, and hinder community collaboration. To address this issue, we propose an Ensemble Learning Method for Named Entity Recognition (EL4NER), which aggregates the ICL outputs of multiple open-source, small-parameter LLMs to enhance overall performance in NER tasks at lower deployment and inference cost. Specifically, our method comprises three key components. First, we design a task decomposition-based pipeline that facilitates deep, multi-stage ensemble learning. Second, we introduce a novel span-level sentence similarity algorithm to establish an ICL demonstration retrieval mechanism better suited for NER tasks. Third, we incorporate a self-validation mechanism to mitigate the noise introduced during the ensemble process. We evaluated EL4NER on multiple widely adopted NER datasets from diverse domains. Our experimental results indicate that EL4NER surpasses most closed-source, large-parameter LLM-based methods at a lower parameter cost and even attains state-of-the-art (SOTA) performance among ICL-based methods on certain datasets. These results show the parameter efficiency of EL4NER and underscore the feasibility of employing open-source, small-parameter LLMs within the ICL paradigm for NER tasks.
摘要：基于大语言模型（LLM）的内部文化学习（ICL）技术在较低的计算资源消耗，较少的手动标记开销和更强的推广性方面已获得了命名实体识别（NER）任务的突出性。然而，大多数基于ICL的NER方法都取决于大参数LLM：开源模型需要大量的计算资源来进行部署和推理，而封闭源的One则需要高度的API成本，提高数据私人关系，并阻碍社区的协作。为了解决这个问题，我们为指定实体识别（EL4NER）提出了一种合奏学习方法，该方法旨在汇总多个开源的小参数LLMS的ICL输出，以在较小的部署和推理成本下提高NER任务中的整体性能。具体而言，我们的方法包括三个关键组件。首先，我们设计了一个基于任务分解的管道，该管道促进了深度多阶段的集合学习。其次，我们引入了一种新颖的跨度句子相似性算法，以建立一个更适合NER任务的ICL演示检索机制。第三，我们结合了一种自然验证机制，以减轻整体过程中引入的噪声。我们评估了来自不同领域的多个广泛采用的NER数据集的EL4NER。我们的实验结果表明，EL4NER以较低的参数成本超过了最封闭的大参数LLM方法，甚至达到了某些数据集中ICL方法中最先进的（SOTA）性能。这些结果表明，EL4NER的参数效率，并强调了在ICL范式内采用开源的小参数LLM来完成NER任务的可行性。

Title: Query Routing for Retrieval-Augmented Language Models

Authors: Jiarui Zhang, Xiangyu Liu, Yong Hu, Chaoyue Niu, Fan Wu, Guihai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23052
Pdf URL: https://arxiv.org/pdf/2505.23052
Copy Paste: [[2505.23052]] Query Routing for Retrieval-Augmented Language Models(https://arxiv.org/abs/2505.23052)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) significantly improves the performance of Large Language Models (LLMs) on knowledge-intensive tasks. However, varying response quality across LLMs under RAG necessitates intelligent routing mechanisms, which select the most suitable model for each query from multiple retrieval-augmented LLMs via a dedicated router model. We observe that external documents dynamically affect LLMs' ability to answer queries, while existing routing methods, which rely on static parametric knowledge representations, exhibit suboptimal performance in RAG scenarios. To address this, we formally define the new retrieval-augmented LLM routing problem, incorporating the influence of retrieved documents into the routing framework. We propose RAGRouter, a RAG-aware routing design, which leverages document embeddings and RAG capability embeddings with contrastive learning to capture knowledge representation shifts and enable informed routing decisions. Extensive experiments on diverse knowledge-intensive tasks and retrieval settings show that RAGRouter outperforms the best individual LLM by 3.61% on average and existing routing methods by 3.29%-9.33%. With an extended score-threshold-based mechanism, it also achieves strong performance-efficiency trade-offs under low-latency constraints.
摘要：检索增强的生成（RAG）可显着提高大语模型（LLMS）在知识密集型任务上的性能。但是，在抹布下，LLM的响应质量各不相同，需要智能路由机制，该机制通过专用的路由器模型从多个检索型LLM中选择了最合适的每个查询模型。我们观察到，外部文档动态地影响了LLMS回答查询的能力，而现有的路由方法依赖于静态参数知识表示，在破布场景中表现出次优性能。为了解决这个问题，我们正式定义了新的检索型LLM路由问题，并将检索文档的影响纳入路由框架。我们提出了RAG感知的路由设计Ragrouter，它利用文档嵌入和RAG功能嵌入具有对比度学习来捕获知识表示的变化并启用知情路由决策。关于各种知识密集任务和检索设置的广泛实验表明，Ragrouter的平均劳格（平均LLM）平均比最佳的单个LLM优于3.61％，现有路由方法的表现为3.29％-9.33％。凭借基于阈值的扩展机制，它还在低延节限制下实现了强大的性能效率权衡。

Title: Self-Correcting Code Generation Using Small Language Models

Authors: Jeonghun Cho, Deokhyung Kang, Hyounghun Kim, Gary Geunbae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23060
Pdf URL: https://arxiv.org/pdf/2505.23060
Copy Paste: [[2505.23060]] Self-Correcting Code Generation Using Small Language Models(https://arxiv.org/abs/2505.23060)
Keywords: language model, prompt
Abstract: Self-correction has demonstrated potential in code generation by allowing language models to revise and improve their outputs through successive refinement. Recent studies have explored prompting-based strategies that incorporate verification or feedback loops using proprietary models, as well as training-based methods that leverage their strong reasoning capabilities. However, whether smaller models possess the capacity to effectively guide their outputs through self-reflection remains unexplored. Our findings reveal that smaller models struggle to exhibit reflective revision behavior across both self-correction paradigms. In response, we introduce CoCoS, an approach designed to enhance the ability of small language models for multi-turn code correction. Specifically, we propose an online reinforcement learning objective that trains the model to confidently maintain correct outputs while progressively correcting incorrect outputs as turns proceed. Our approach features an accumulated reward function that aggregates rewards across the entire trajectory and a fine-grained reward better suited to multi-turn correction scenarios. This helps the model enhance initial response quality while achieving substantial improvements through self-correction. With 1B-scale models, CoCoS achieves improvements of 35.8% on MBPP and 27.7% on HumanEval compared to the baselines.
摘要：自我纠正通过允许语言模型通过连续的改进来修改和改善其产出，从而在代码生成中表现出了潜力。最近的研究探讨了基于促进的策略，这些策略使用专有模型结合了验证或反馈循环，以及利用其强大推理能力的基于培训的方法。但是，较小的模型是否具有通过自我反射有效指导其产量的能力，尚未得到探索。我们的发现表明，较小的模型难以在两个自我纠正范式中表现出反思性修订行为。作为回应，我们介绍了Cocos，这种方法旨在增强小语言模型进行多转化代码校正的能力。具体来说，我们提出了一个在线增强学习目标，该目标训练模型以确保正确的输出，同时随着转弯的进行逐步纠正不正确的产出。我们的方法具有累积的奖励功能，可以在整个轨迹中汇总奖励，并且可以更好地适合多转弯校正方案。这有助于该模型增强初始响应质量，同时通过自我纠正实现实质性改进。与基准相比，COCOS借助1B尺度的模型，在MBPP上的提高了35.8％，人类事件的提高为27.7％。

Title: SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services

Authors: Hongcheng Guo, Zheyong Xie, Shaosheng Cao, Boyang Wang, Weiting Liu, Anjie Le, Lei Li, Zhoujun Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23065
Pdf URL: https://arxiv.org/pdf/2505.23065
Copy Paste: [[2505.23065]] SNS-Bench-VL: Benchmarking Multimodal Large Language Models in Social Networking Services(https://arxiv.org/abs/2505.23065)
Keywords: language model, llm
Abstract: With the increasing integration of visual and textual content in Social Networking Services (SNS), evaluating the multimodal capabilities of Large Language Models (LLMs) is crucial for enhancing user experience, content understanding, and platform intelligence. Existing benchmarks primarily focus on text-centric tasks, lacking coverage of the multimodal contexts prevalent in modern SNS ecosystems. In this paper, we introduce SNS-Bench-VL, a comprehensive multimodal benchmark designed to assess the performance of Vision-Language LLMs in real-world social media scenarios. SNS-Bench-VL incorporates images and text across 8 multimodal tasks, including note comprehension, user engagement analysis, information retrieval, and personalized recommendation. It comprises 4,001 carefully curated multimodal question-answer pairs, covering single-choice, multiple-choice, and open-ended tasks. We evaluate over 25 state-of-the-art multimodal LLMs, analyzing their performance across tasks. Our findings highlight persistent challenges in multimodal social context comprehension. We hope SNS-Bench-VL will inspire future research towards robust, context-aware, and human-aligned multimodal intelligence for next-generation social networking services.
摘要：随着社交网络服务（SNS）中视觉和文本内容的越来越多的集成，评估大语言模型（LLMS）的多模式能力（LLMS）对于增强用户体验，内容理解和平台智能至关重要。现有的基准主要集中于以文本为中心的任务，缺乏对现代SNS生态系统中普遍存在的多模式上下文的报道。在本文中，我们介绍了SNS-Bench-VL，这是一种综合的多模式基准测试，旨在评估现实世界社交媒体场景中视觉LLM的性能。 SNS-Bench-VL在8个多模式任务中结合了图像和文本，包括注释理解，用户参与分析，信息检索和个性化建议。它包括4,001个经过精心策划的多模式提问对，涵盖单选择，多项选择和开放式任务。我们评估了超过25个最先进的多模式LLM，并分析了他们在任务中的绩效。我们的发现突出了多模式社会环境理解中的持续挑战。我们希望SNS基础VL能够激发未来的研究，以实现下一代社交网络服务的强大，背景感和人类一致的多模式智能。

Title: Generating Diverse Training Samples for Relation Extraction with Large Language Models

Authors: Zexuan Li, Hongliang Dai, Piji Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23108
Pdf URL: https://arxiv.org/pdf/2505.23108
Copy Paste: [[2505.23108]] Generating Diverse Training Samples for Relation Extraction with Large Language Models(https://arxiv.org/abs/2505.23108)
Keywords: language model, llm, prompt
Abstract: Using Large Language Models (LLMs) to generate training data can potentially be a preferable way to improve zero- or few-shot NLP tasks. However, many problems in this direction remain to be investigated. For the task of Relation Extraction (RE), we find that samples generated by directly prompting LLMs may easily have high structural similarities with each other. They tend to use a limited variety of phrasing while expressing the relation between a pair of entities. Therefore, in this paper, we study how to effectively improve the diversity of the training samples generated with LLMs for RE, while also maintaining their correctness. We first try to make the LLMs produce dissimilar samples by directly giving instructions in In-Context Learning (ICL) prompts. Then, we propose an approach to fine-tune LLMs for diverse training sample generation through Direct Preference Optimization (DPO). Our experiments on commonly used RE datasets show that both attempts can improve the quality of the generated training data. We also find that, compared with directly performing RE with an LLM, training a non-LLM RE model with its generated samples may lead to better performance.
摘要：使用大型语言模型（LLMS）生成培训数据可能是改善零或几次射击NLP任务的最佳方法。但是，对于这个方向，仍有许多问题待研究。对于关系提取的任务（RE），我们发现直接提示LLM生成的样品很容易彼此具有很高的结构相似性。他们倾向于在表达一对实体之间的关系时使用有限的措辞。因此，在本文中，我们研究了如何有效地改善LLMS生成的RE的培训样本的多样性，同时还保持其正确性。我们首先尝试通过直接在内在学习（ICL）提示中提供指令来制作不同的样本。然后，我们提出了一种通过直接偏好优化（DPO）来生成多样性训练样本的微调LLM的方法。我们对常用RE数据集的实验表明，这两种尝试都可以提高生成的培训数据的质量。我们还发现，与LLM直接执行RE进行比较，训练具有其生成样品的非LLLM RE模型可能会导致更好的性能。

Title: Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data

Authors: Seohyeong Lee, Eunwon Kim, Hwaran Lee, Buru Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23114
Pdf URL: https://arxiv.org/pdf/2505.23114
Copy Paste: [[2505.23114]] Dataset Cartography for Large Language Model Alignment: Mapping and Diagnosing Preference Data(https://arxiv.org/abs/2505.23114)
Keywords: language model, gpt, llm
Abstract: Human preference data plays a critical role in aligning large language models (LLMs) with human values. However, collecting such data is often expensive and inefficient, posing a significant scalability challenge. To address this, we introduce Alignment Data Map, a GPT-4o-assisted tool for analyzing and diagnosing preference data. Using GPT-4o as a proxy for LLM alignment, we compute alignment scores for LLM-generated responses to instructions from existing preference datasets. These scores are then used to construct an Alignment Data Map based on their mean and variance. Our experiments show that using only 33 percent of the data, specifically samples in the high-mean, low-variance region, achieves performance comparable to or better than using the entire dataset. This finding suggests that the Alignment Data Map can significantly improve data collection efficiency by identifying high-quality samples for LLM alignment without requiring explicit annotations. Moreover, the Alignment Data Map can diagnose existing preference datasets. Our analysis shows that it effectively detects low-impact or potentially misannotated samples. Source code is available online.
摘要：人类的偏好数据在使大语言模型（LLM）与人类价值观对齐中起着至关重要的作用。但是，收集此类数据通常是昂贵且效率低下的，带来了重大的可扩展性挑战。为了解决这个问题，我们介绍了Alignment数据图，这是一种用于分析和诊断首选项数据的GPT-4O辅助工具。使用GPT-4O作为LLM对齐的代理，我们计算对LLM生成的响应对现有优先数据集的说明的响应的对齐得分。然后，这些分数用于根据其平均值和方差构建对齐数据图。我们的实验表明，与使用整个数据集相比，仅使用33％的数据，特别是在高均值低变化区域中的样本，可以实现与或更好的性能。这一发现表明，对齐数据图可以通过识别LLM对齐的高质量样本而无需明确注释来显着提高数据收集效率。此外，对齐数据图可以诊断现有的首选项数据集。我们的分析表明，它有效地检测到了低影响或潜在误导的样品。源代码可在线提供。
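
The mean-and-variance selection described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' tool: the scores are assumed to be already computed (for example by a judge model), each sample is assumed to have several scores, and the function names and the thresholds min_mean and max_variance are hypothetical.

```python
from statistics import mean, pvariance


def build_data_map(scores_by_sample):
    """Map each sample id to the (mean, variance) of its alignment scores."""
    return {
        sample_id: (mean(scores), pvariance(scores))
        for sample_id, scores in scores_by_sample.items()
    }


def select_high_mean_low_variance(data_map, min_mean, max_variance):
    """Return ids of samples in the high-mean, low-variance region, sorted by id."""
    return sorted(
        sample_id
        for sample_id, (score_mean, score_variance) in data_map.items()
        if score_mean >= min_mean and score_variance <= max_variance
    )
```

In this sketch the region boundaries are fixed thresholds chosen by the caller; the abstract does not state how the region is delimited, so any concrete cut-off is an assumption.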

Title: ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations

Authors: Yiming Lei, Zhizheng Yang, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23121
Pdf URL: https://arxiv.org/pdf/2505.23121
Copy Paste: [[2505.23121]] ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations(https://arxiv.org/abs/2505.23121)
Keywords: language model, long context
Abstract: Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, existing open-source multi-modal models have weak multi-turn interaction capability, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports research on multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog, and experimental results illustrate that ContextQFormer achieves an improvement of 2%-4% in available rate over baselines.
摘要：多模式的大型语言模型表现出了显着的零击功能和强大的图像知识能力。但是，现有的开源多模式模型遭受了多转交互作用的能力较弱，尤其是对于长篇小说。为了解决该问题，我们首先引入上下文建模模块，称为上下文Qformer，该模块利用内存块来增强上下文信息的表示。此外，为了促进进一步的研究，我们仔细构建了一个新的多型多模式对话数据集（TMDIAGOG），以进行预培训，指导调查和评估，这将是开源的。与其他多模式对话数据集相比，TMDialog包含更长的对话，该对话支持多转变多模式对话的研究。此外，将ContextQformer与TMDIAGOG上的三个基准进行了比较，实验结果表明，ContextQformer的可用率比基线相比可提高2％-4％。

Title: PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics

Authors: Atharva Naik, Darsh Agrawal, Manav Kapadnis, Yuwei An, Yash Mathur, Carolyn Rose, David Mortensen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23126
Pdf URL: https://arxiv.org/pdf/2505.23126
Copy Paste: [[2505.23126]] PBEBench: A Multi-Step Programming by Examples Reasoning Benchmark inspired by Historical Linguistics(https://arxiv.org/abs/2505.23126)
Keywords: language model, llm
Abstract: Recently, long chain-of-thought (LCoT) Large Language Models (LLMs) have taken the machine learning world by storm with their breathtaking reasoning capabilities. However, are the abstract reasoning abilities of these models general enough for problems of practical importance? Unlike past work, which has focused mainly on math, coding, and data wrangling, we focus on a historical linguistics-inspired inductive reasoning problem, formulated as Programming by Examples. We develop a fully automated pipeline for dynamically generating a benchmark for this task with controllable difficulty in order to tackle scalability and contamination issues to which many reasoning benchmarks are subject. Using our pipeline, we generate a test set with nearly 1k instances that is challenging for all state-of-the-art reasoning LLMs, with the best model (Claude-3.7-Sonnet) achieving a mere 54% pass rate, demonstrating that LCoT LLMs still struggle with a class of reasoning that is ubiquitous in historical linguistics as well as many other domains.
摘要：最近，长长的思想链（LCOT），大型语言模型（LLMS）以其令人叹为观止的推理能力席卷了机器学习世界。但是，这些模型的抽象推理能力是否足够一般地解决实际重要性问题？与过去主要关注数学，编码和数据争吵的工作不同，我们专注于历史语言学启发性的归纳推理问题，该问题以示例为编程。我们开发了一条全自动管道，以动态为该任务动态生成基准，以控制难度，以解决许多推理基准受到的可扩展性和污染问题。使用管道，我们生成了一个具有近1K实例的测试集，这对于所有最先进的推理LLM都充满挑战，最佳模型（Claude-3.7-Sonnet）仅实现了54％的通过率，这表明LCOT LLMS仍在班级或推理中挣扎在历史语言语言上的班级或推理，以及其他许多其他Domains。

Title: Enhancing Large Language Models' Machine Translation via Dynamic Focus Anchoring

Authors: Qiuyu Ding, Zhiqiang Cao, Hailong Cao, Tiejun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23140
Pdf URL: https://arxiv.org/pdf/2505.23140
Copy Paste: [[2505.23140]] Enhancing Large Language Models' Machine Translation via Dynamic Focus Anchoring(https://arxiv.org/abs/2505.23140)
Keywords: language model, llm
Abstract: Large language models have demonstrated exceptional performance across multiple crosslingual NLP tasks, including machine translation (MT). However, persistent challenges remain in addressing context-sensitive units (CSUs), such as polysemous words. These CSUs not only affect the local translation accuracy of LLMs, but also affect LLMs' understanding of sentences and tasks, and can even lead to translation failure. To address this problem, we propose a simple but effective method to enhance LLMs' MT capabilities by acquiring CSUs and applying semantic focus. Specifically, we dynamically analyze and identify translation challenges, then incorporate them into LLMs in a structured manner to mitigate mistranslations or misunderstandings of CSUs caused by information flattening. In this way, the method efficiently activates LLMs to identify and apply relevant knowledge from their vast data pool, ensuring more accurate translation of difficult terms. On a benchmark MT dataset, our proposed method achieved competitive performance compared to multiple existing open-sourced MT baseline models. It demonstrates effectiveness and robustness across multiple language pairs, including both similar language pairs and distant language pairs. Notably, the proposed method requires no additional model training and enhances LLMs' performance across multiple NLP tasks with minimal resource consumption.
摘要：大型语言模型已经在多个跨语言NLP任务（包括机器翻译（MT））中表现出了出色的性能。但是，持续的挑战仍然在解决上下文敏感单元（CSU）（例如多义单词）中。这些CSU不仅会影响LLM的局部翻译准确性，还影响LLMS对句子和任务的理解能力，甚至导致翻译失败。为了解决这个问题，我们提出了一种简单但有效的方法，以通过获取CSU和应用语义重点来增强LLM的MT功能。具体而言，我们动态分析和识别翻译挑战，然后以结构化的方式将其纳入LLM，以减轻信息扁平引起的CSU的误导或误解。有效地激活LLM，以这种方式从其庞大的数据库中识别和应用相关知识，从而确保更准确的翻译以翻译困难术语。在MT的基准数据集上，与多种现有开源MT基线模型相比，我们提出的方法实现了竞争性能。它表现出多种语言对的有效性和鲁棒性，包括类似的语言对和遥远的语言对。值得注意的是，所提出的方法不需要其他模型培训，并且可以在最少的资源消耗中提高LLMS在多个NLP任务中的性能。

Title: Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models

Authors: Qiuyu Ding, Zhiqiang Cao, Hailong Cao, Tiejun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23146
Pdf URL: https://arxiv.org/pdf/2505.23146
Copy Paste: [[2505.23146]] Cross-Domain Bilingual Lexicon Induction via Pretrained Language Models(https://arxiv.org/abs/2505.23146)
Keywords: language model
Abstract: Bilingual Lexicon Induction (BLI) generally relies on common-domain data to obtain monolingual word embeddings, which are then aligned to produce the cross-lingual embeddings used to extract word translation pairs. It is yet to be seen whether these methods are suitable for bilingual lexicon extraction in professional fields. As we can see in table 1, the classic and efficient BLI approaches, Muse and Vecmap, perform much worse on the Medical dataset than on the Wiki dataset. On one hand, a specialized-domain dataset is generally smaller than a generic-domain dataset, and specialized words have a lower frequency, which directly affects the translation quality of bilingual dictionaries. On the other hand, static word embeddings are widely used for BLI; however, in some specific fields the meaning of words is greatly influenced by context, so using only static word embeddings may lead to greater bias. In this paper, we propose a new BLI task: using monolingual corpora from the general domain and the target domain to extract domain-specific bilingual dictionaries. Motivated by the ability of pre-trained models, we propose a method that builds on recent work on BLI to obtain better word embeddings. To this end, we introduce Code Switch (Qin et al., 2020) to the cross-domain BLI task for the first time, which can match different strategies in different contexts, making the model more suitable for this task. Experimental results show that our method improves over robust BLI baselines on three specific domains by an average of 0.78 points.
摘要：双语词典感应（BLI）通常基于共同的域数据，以获取单语单词嵌入，并通过对齐单语词嵌入以获取用于获取单词翻译对的跨语性嵌入。在本文中，我们提出了一项新的BLI任务，该任务是使用通用域和目标域的单语语料库来提取特定领域的双语词典。受预训练模型的能力的激励，我们提出了一种方法，以获取基于BLI最近工作的更好单词嵌入。这样，我们首先在跨域BLI任务中介绍了代码开关（Qin等，2020），尚待匹配的差异尚待观察，这些方法是否适用于专业领域的双语词典提取。正如我们在表1中所看到的，经典，高效的BLI方法，Muse和Vecmap在医疗数据集上的表现要比Wiki数据集差得多。一方面，与一般的通用域数据集相比，专门的域数据集相对较小，并且专业单词的频率较低，这将直接影响双语词典的翻译质量。另一方面，静态词嵌入被广泛用于BLI，但是，在某些特定字段中，单词的含义受到上下文的极大影响，在这种情况下，仅使用静态单词嵌入可能会导致更大的偏见。在不同情况下的策略，使该模型更适合此任务。实验结果表明，我们的方法可以通过平均提高0.78点来改善三个特定领域的鲁棒BLI基准的性能。

Title: Tell, Don't Show: Leveraging Language Models' Abstractive Retellings to Model Literary Themes

Authors: Li Lucy, Camilla Griffiths, Sarah Levine, Jennifer L. Eberhardt, Dorottya Demszky, David Bamman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23166
Pdf URL: https://arxiv.org/pdf/2505.23166
Copy Paste: [[2505.23166]] Tell, Don't Show: Leveraging Language Models' Abstractive Retellings to Model Literary Themes(https://arxiv.org/abs/2505.23166)
Keywords: language model, prompt
Abstract: Conventional bag-of-words approaches for topic modeling, like latent Dirichlet allocation (LDA), struggle with literary text. Literature challenges lexical methods because narrative language focuses on immersive sensory details instead of abstractive description or exposition: writers are advised to "show, don't tell." We propose Retell, a simple, accessible topic modeling approach for literature. Here, we prompt resource-efficient, generative language models (LMs) to tell what passages show, thereby translating narratives' surface forms into higher-level concepts and themes. By running LDA on LMs' retellings of passages, we can obtain more precise and informative topics than by running LDA alone or by directly asking LMs to list topics. To investigate the potential of our method for cultural analytics, we compare our method's outputs to expert-guided annotations in a case study on racial/cultural identity in high school English language arts books.
摘要：传统的单词袋方法用于主题建模，例如潜在的Dirichlet分配（LDA），与文学文本斗争。文学挑战词汇方法，因为叙事语言专注于沉浸感的感官细节，而不是抽象的描述或论述：建议作家“展示，不要说”。我们提出了Retell，这是一种简单，可访问的主题建模方法。在这里，我们促使资源有效的生成语言模型（LMS）讲述了哪些段落，从而将叙事的表面形式转化为更高级别的概念和主题。通过在LMS的重述段落上运行LDA，与单独运行LDA或直接要求LMS列出主题相比，我们可以获得更精确且内容丰富的主题。为了调查我们的文化分析方法的潜力，我们将方法的输出与专家指导的注释进行了比较，在高中英语语言艺术书籍中的种族/文化身份案例研究中。

Title: Map&Make: Schema Guided Text to Table Generation

Authors: Naman Ahuja, Fenil Bardoliya, Chitta Baral, Vivek Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23174
Pdf URL: https://arxiv.org/pdf/2505.23174
Copy Paste: [[2505.23174]] Map&Make: Schema Guided Text to Table Generation(https://arxiv.org/abs/2505.23174)
Keywords: hallucination
Abstract: Transforming dense, detailed, unstructured text into an interpretable and summarised table, also colloquially known as Text-to-Table generation, is an essential task for information retrieval. Current methods, however, miss out on how and what complex information to extract; they also lack the ability to infer data from the text. In this paper, we introduce a versatile approach, Map&Make, which "dissects" text into propositional atomic statements. This facilitates granular decomposition to extract the latent schema. The schema is then used to populate the tables that capture the qualitative nuances and the quantitative facts in the original text. Our approach is tested against two challenging datasets: Rotowire, renowned for its complex and multi-table schema, and Livesum, which demands numerical aggregation. By carefully identifying and correcting hallucination errors in Rotowire, we aim to achieve a cleaner and more reliable benchmark. We evaluate our method rigorously on a comprehensive suite of comparative and referenceless metrics. Our findings demonstrate significant improvements across both datasets, along with better interpretability in Text-to-Table generation. Moreover, through detailed ablation studies and analyses, we investigate the factors contributing to superior performance and validate the practicality of our framework in structured summarization tasks.
摘要：将密集，详细的，非结构化的文本转换为一个可解释的和摘要的表，也称为文本到桌子，是信息检索的必不可少的任务。但是，当前的方法错过了如何和哪些复杂信息提取的方法；他们还缺乏从文本中推断数据的能力。在本文中，我们介绍了一种多功能方法，地图和制作，该方法将文本“剖析”到命题原子语句中。这有助于颗粒分解以提取潜在的模式。然后，该架构用于填充表格，以捕获原始文本中的定性细微差别和定量事实。我们的方法对两个具有挑战性的数据集进行了测试，即Rotowire，该数据集以其复杂而多桌的模式和Livesum而闻名，这需要数值聚合。通过仔细识别和纠正Rotowire中的幻觉错误，我们旨在实现更清洁，更可靠的基准测试。我们严格评估我们的方法，以一套比较和无参考指标的综合套件。我们的发现表明，在两个数据集中都有显着改进的结果，并且在文本到餐桌上可以更好地解释性。此外，通过详细的消融研究和分析，我们研究了有助于卓越性能的因素，并验证我们在结构化摘要任务中的实用性。

Title: Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification

Authors: Wenjing Xing, Wenke Lu, Yeheng Duan, Bing Zhao, Zhenghui kang, Yaolong Wang, Kai Gao, Lei Qiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23177
Pdf URL: https://arxiv.org/pdf/2505.23177
Copy Paste: [[2505.23177]] Infinite-Instruct: Synthesizing Scaling Code instruction Data with Bidirectional Synthesis and Static Verification(https://arxiv.org/abs/2505.23177)
Keywords: language model, llm
Abstract: Traditional code instruction data synthesis methods suffer from limited diversity and poor logic. We introduce Infinite-Instruct, an automated framework for synthesizing high-quality question-answer pairs, designed to enhance the code generation capabilities of large language models (LLMs). The framework focuses on improving the internal logic of synthesized problems and the quality of synthesized code. First, "Reverse Construction" transforms code snippets into diverse programming problems. Then, through "Backfeeding Construction," keywords in programming problems are structured into a knowledge graph to reconstruct them into programming problems with stronger internal logic. Finally, a cross-lingual static code analysis pipeline filters invalid samples to ensure data quality. Experiments show that on mainstream code generation benchmarks, our fine-tuned models achieve an average performance improvement of 21.70% on 7B-parameter models and 36.95% on 32B-parameter models. Using less than one-tenth of the instruction fine-tuning data, we achieved performance comparable to Qwen-2.5-Coder-Instruct. Infinite-Instruct provides a scalable solution for LLM training in programming. We open-source the datasets used in the experiments, including both unfiltered versions and versions filtered via static analysis. The data are available at this https URL.
摘要：传统的代码指导数据综合方法的多样性有限和逻辑差。我们介绍了Infinite-Instruct，这是一个自动化的框架，用于合成高质量的问题解答对，旨在增强大语言模型（LLMS）的代码生成功能。该框架着重于改善合成问题的内部逻辑和合成代码的质量。首先，“反向结构”将代码片段转化为各种编程问题。然后，通过“反馈结构”，编程问题中的关键字构成了知识图，以将其重构为具有更强内部逻辑的编程问题。最后，跨语言静态代码分析管道过滤了无效的样品，以确保数据质量。实验表明，在主流代码生成基准测试中，我们的微调模型在7B参数模型上的平均性能提高了21.70％，在32B参数模型上的平均性能提高了36.95％。使用少于十分之一的指令微调数据，我们实现了与QWEN-2.5-CODER-INSTRUCTION相当的性能。 Infinite-Instruct为LLM编程提供了可扩展的解决方案。我们开放实验中使用的数据集，包括未过滤版本和通过静态分析过滤版本。数据可在此HTTPS URL上找到

Title: Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement

Authors: Gabriele Sarti, Vilém Zouhar, Malvina Nissim, Arianna Bisazza
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.23183
Pdf URL: https://arxiv.org/pdf/2505.23183
Copy Paste: [[2505.23183]] Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement(https://arxiv.org/abs/2505.23183)
Keywords: language model, prompt
Abstract: Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.
摘要：单词级质量估计（WQE）旨在自动识别机器翻译的输出中的细粒度错误跨度，并发现了许多用途，包括在编辑后进行了协助翻译人员。现代WQE技术通常很昂贵，涉及大量人类标记数据的大型语言模型或临时培训。在这项工作中，我们研究了利用语言模型可解释性和不确定性量化的最新进展的有效替代方案，以确定翻译模型内部工作的翻译错误。在跨越12个翻译方向的14个指标的评估中，我们通过使用多组人类标签来量化人类标签变化对度量性能的影响。我们的结果突出了无监督指标的未开发潜力，面对标签不确定性时的监督方法的缺点以及单人通知者评估实践的脆弱性。

Title: Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration

Authors: Yilong Li, Chen Qian, Yu Xia, Ruijie Shi, Yufan Dang, Zihao Xie, Ziming You, Weize Chen, Cheng Yang, Weichuan Liu, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.23187
Pdf URL: https://arxiv.org/pdf/2505.23187
Copy Paste: [[2505.23187]] Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration(https://arxiv.org/abs/2505.23187)
Keywords: language model, llm, agent
Abstract: Large Language Model-based multi-agent systems (MAS) have shown remarkable progress in solving complex tasks through collaborative reasoning and inter-agent critique. However, existing approaches typically treat each task in isolation, resulting in redundant computations and limited generalization across structurally similar tasks. To address this, we introduce multi-agent cross-task experiential learning (MAEL), a novel framework that endows LLM-driven agents with explicit cross-task learning and experience accumulation. We model the task-solving workflow on a graph-structured multi-agent collaboration network, where agents propagate information and coordinate via explicit connectivity. During the experiential learning phase, we quantify the quality of each step in the task-solving workflow and store the resulting rewards along with the corresponding inputs and outputs in each agent's individual experience pool. During inference, agents retrieve high-reward, task-relevant experiences as few-shot examples to enhance the effectiveness of each reasoning step, thereby enabling more accurate and efficient multi-agent collaboration. Experimental results on diverse datasets demonstrate that MAEL empowers agents to learn from prior task experiences effectively, achieving faster convergence and producing higher-quality solutions on current tasks.
摘要：大型基于语言模型的多代理系统（MAS）在通过协作推理和机构间评论解决复杂任务方面表现出了显着的进步。但是，现有方法通常会孤立地对待每个任务，从而导致冗余计算和在结构相似的任务中的概括有限。为了解决这个问题，我们介绍了多代理交叉任务体验学习（MAEL），这是一个新型框架，它赋予LLM驱动的代理具有明确的交叉任务学习和经验积累。我们在图形结构化的多代理协作网络上对任务解决的工作流进行建模，在该网络中，代理传播信息并通过显式连接进行协调。在体验式学习阶段，我们量化了解决任务解决工作流程中每个步骤的质量，并将所得的奖励以及相应的输入和输出存储到每个代理人的个人体验池中。在推断期间，代理商将高回报，与任务相关的经验作为少数示例，以提高每个推理步骤的有效性，从而实现更准确，更有效的多代理协作。各种数据集的实验结果表明，Mael赋予了代理能力从先前的任务体验中学习，从而有效地获得更快的收敛性并在当前任务上产生更高质量的解决方案。
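
The experience-pool idea in the abstract above (store each step's input, output, and reward, then retrieve high-reward, task-relevant entries as few-shot examples) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: the class and function names, the reward threshold, and the use of token-overlap (Jaccard) similarity as the relevance measure.

```python
from dataclasses import dataclass, field


@dataclass
class Experience:
    task_input: str
    output: str
    reward: float


@dataclass
class ExperiencePool:
    """Hypothetical per-agent store of (input, output, reward) records."""
    items: list = field(default_factory=list)

    def add(self, task_input, output, reward):
        self.items.append(Experience(task_input, output, reward))

    def retrieve(self, query, k=2, min_reward=0.5):
        """Return up to k experiences with reward >= min_reward, most similar first."""
        query_tokens = set(query.lower().split())

        def overlap(exp):
            # Jaccard similarity on whitespace tokens (an assumed relevance measure).
            tokens = set(exp.task_input.lower().split())
            union = query_tokens | tokens
            return len(query_tokens & tokens) / len(union) if union else 0.0

        candidates = [e for e in self.items if e.reward >= min_reward]
        # Rank by similarity first, then by reward as a tie-breaker.
        candidates.sort(key=lambda e: (overlap(e), e.reward), reverse=True)
        return candidates[:k]


def build_prompt(pool, query, k=2):
    """Format retrieved experiences as few-shot examples ahead of the new query."""
    shots = pool.retrieve(query, k=k)
    blocks = [f"Input: {e.task_input}\nOutput: {e.output}" for e in shots]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)
```

A real system would likely use embedding-based retrieval rather than token overlap; the sketch only shows the filter-by-reward, rank-by-relevance pattern.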

Title: ExpeTrans: LLMs Are Experiential Transfer Learners

Authors: Jinglong Gao, Xiao Ding, Lingxiao Zou, Bibo Cai, Bing Qin, Ting Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23191
Pdf URL: https://arxiv.org/pdf/2505.23191
Copy Paste: [[2505.23191]] ExpeTrans: LLMs Are Experiential Transfer Learners(https://arxiv.org/abs/2505.23191)
Keywords: language model, llm, prompt
Abstract: Recent studies provide large language models (LLMs) with textual task-solving experiences via prompts to improve their performance. However, previous methods rely on substantial human labor or time to gather such experiences for each task, which is impractical given the growing variety of task types in user queries to LLMs. To address this issue, we design an autonomous experience transfer framework to explore whether LLMs can mimic human cognitive intelligence to autonomously transfer experience from existing source tasks to newly encountered target tasks. This not only allows the acquisition of experience without extensive costs of previous methods, but also offers a novel path for the generalization of LLMs. Experimental results on 13 datasets demonstrate that our framework effectively improves the performance of LLMs. Furthermore, we provide a detailed analysis of each module in the framework.
摘要：最近的研究为大型语言模型（LLM）提供了通过提示提高其性能的文本解决经验。但是，以前的方法依靠大量的人工或时间来为每项任务收集这种经验，这是不切实际的，因为用户查询越来越多的任务类型。为了解决这个问题，我们设计了一个自主体验转移框架，以探讨LLM是否可以模仿人类认知智能，以自主将经验从现有源任务转移到新遇到的目标任务。这不仅允许在没有以前方法的大量成本的情况下获得经验，而且还为LLM的概括提供了新的途径。 13个数据集的实验结果表明，我们的框架有效地提高了LLM的性能。此外，我们对框架中的每个模块提供了详细的分析。

Title: MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration

Authors: Zhitao He, Sandeep Polisetty, Zhiyuan Fan, Yuchen Huang, Shujin Wu, Yi R. (May) Fung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23224
Pdf URL: https://arxiv.org/pdf/2505.23224
Copy Paste: [[2505.23224]] MMBoundary: Advancing MLLM Knowledge Boundary Awareness through Reasoning Step Confidence Calibration(https://arxiv.org/abs/2505.23224)
Keywords: language model, llm, hallucination
Abstract: In recent years, multimodal large language models (MLLMs) have made significant progress but continue to face inherent challenges in multimodal reasoning, which requires multi-level (e.g., perception, reasoning) and multi-granular (e.g., multi-step reasoning chain) advanced inferencing. Prior work on estimating model confidence tends to focus on the overall response for training and calibration, but fails to assess confidence in each reasoning step, leading to undesirable hallucination snowballing. In this work, we present MMBoundary, a novel framework that advances the knowledge boundary awareness of MLLMs through reasoning step confidence calibration. To achieve this, we propose to incorporate complementary textual and cross-modal self-rewarding signals to estimate confidence at each step of the MLLM reasoning process. In addition to supervised fine-tuning of the MLLM on this set of self-rewarded confidence estimation signals for initial confidence expression warm-up, we introduce a reinforcement learning stage with multiple reward functions for further aligning model knowledge and calibrating confidence at each reasoning step, enhancing reasoning chain self-correction. Empirical results show that MMBoundary significantly outperforms existing methods across diverse domain datasets and metrics, achieving an average of 7.5% reduction in multimodal confidence calibration errors and up to 8.3% improvement in task performance.
摘要：近年来，多模式的大语言模型（MLLM）取得了重大进展，但在多模式推理中继续面临固有的挑战，这需要多层次（例如，感知，推理）和多粒度（例如，多步电推理链）。估计模型置信度的先前工作倾向于集中于训练和校准的整体响应，但未能评估对每个推理步骤的信心，从而导致幻觉不足。在这项工作中，我们提出了MMBoundary，这是一个新颖的框架，通过推理置信度校准来提高MLLM的知识边界意识。为了实现这一目标，我们建议将互补的文本和跨模式自我奖励信号纳入，以估计MLLM推理过程的每个步骤的信心。除了在这组自我回报的置信度估计信号上进行监督的微调MLLM以进行初始置信度表达热身之外，我们还引入了具有多个奖励功能的强化学习阶段，以进一步对齐模型知识并在每个推理步骤中校准置信度，从而增强了推理链自我纠正。经验结果表明，MMBOUNDARY在不同域数据集和指标上的现有方法显着优于现有方法，在多模式置信度校准误差中平均降低了7.5％，任务绩效提高了8.3％。

Title: MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration

Authors: Hao Lu, Yanchi Gu, Haoyuan Huang, Yulin Zhou, Ningxin Zhu, Chen Li
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.23229
Pdf URL: https://arxiv.org/pdf/2505.23229
Copy Paste: [[2505.23229]] MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration(https://arxiv.org/abs/2505.23229)
Keywords: language model, llm, prompt
Abstract: The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like empathetic engagement, ethical adherence, and alignment with human preferences, for which strict "correctness" criteria are ill-defined. Existing result-oriented MCTS approaches can therefore produce misaligned responses. To address this, we introduce MCTSr-Zero, an MCTS framework designed for open-ended, human-centric dialogues. Its core innovation is "domain alignment", which shifts the MCTS search objective from predefined end-states towards conversational trajectories that conform to target domain principles (e.g., empathy in counseling). Furthermore, MCTSr-Zero incorporates "Regeneration" and "Meta-Prompt Adaptation" mechanisms to substantially broaden exploration by allowing the MCTS to consider fundamentally different initial dialogue strategies. We evaluate MCTSr-Zero in psychological counseling by generating multi-turn dialogue data, which is used to fine-tune an LLM, PsyLLM. We also introduce PsyEval, a benchmark for assessing multi-turn psychological counseling dialogues. Experiments demonstrate that PsyLLM achieves state-of-the-art performance on PsyEval and other relevant metrics, validating MCTSr-Zero's effectiveness in generating high-quality, principle-aligned conversational data for human-centric domains and addressing the LLM challenge of consistently adhering to complex psychological standards.
摘要：蒙特卡洛树搜索（MCT）与大语言模型（LLMS）的集成在结构化的，面向问题的任务中取得了重大成功。但是，将这些方法应用于心理咨询等开放式对话中，带来了独特的挑战。与具有客观正确性的任务不同，治疗对话的成功取决于主观因素，例如善解人意的参与，道德依从性以及与人类偏好的一致性，因为这些因素是严格的“正确性”标准不明显。因此，现有的面向结果的MCT方法可以产生未对准的反应。为了解决这个问题，我们介绍了MCTSR-Zero，这是一个专为开放式，以人为中心的对话而设计的MCTS框架。它的核心创新是“域名”，它将MCT搜索目标从预定义的最终国家转移到符合目标领域原则（例如，咨询中的同理心）的对话轨迹。此外，MCTSR-Zero通过允许MCT来考虑从根本上不同的初始对话策略，将“再生”和“ Meta Prompt适应”机制结合起来，从而大大拓宽了探索。我们通过生成多转对话数据来评估心理咨询中的MCTSR-Zero，该数据用于微调LLM，Psyllm。我们还介绍了Psyeval，这是评估多转变心理咨询对话的基准。实验表明，Psyllm在Psyeval和其他相关指标上实现了最先进的性能，从而验证了MCTSR-Zero在生成以人为中心领域的高质量，原理对话数据的有效性，并解决了始终如一地挑战复杂的心理标准的LLM挑战。

Title: ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering

Authors: Jingxuan Wei, Nan Xu, Junnan Zhu, Yanni Hao, Gaowei Wu, Bihui Yu, Lei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23242
Pdf URL: https://arxiv.org/pdf/2505.23242
Copy Paste: [[2505.23242]] ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering(https://arxiv.org/abs/2505.23242)
Keywords: language model, llm, chain-of-thought
Abstract: Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. While early approaches have shown promising performance by focusing on visual features or leveraging large-scale pre-training, most existing evaluations rely on rigid output formats and objective metrics, thus ignoring the complex, real-world demands of practical chart analysis. In this paper, we introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. ChartMind covers seven task categories, incorporates multilingual contexts, supports open-domain textual outputs, and accommodates diverse chart formats, bridging the gap between real-world applications and traditional academic benchmarks. Furthermore, we propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements, reducing noise, and enhancing the reasoning accuracy of multimodal large language models. Extensive evaluations on ChartMind and three representative public benchmarks with 14 mainstream multimodal models show our framework significantly outperforms the previous three common CQA paradigms: instruction-following, OCR-enhanced, and chain-of-thought, highlighting the importance of flexible chart understanding for real-world CQA. These findings suggest new directions for developing more robust chart reasoning in future research.
摘要：图表问题回答（CQA）已成为评估视觉模型的推理功能的关键多模式任务。尽管早期方法通过专注于视觉特征或利用大规模预训练表现出有希望的性能，但大多数现有的评估依赖于刚性输出格式和客观指标，从而忽略了对实用图表分析的复杂，现实世界中的需求。在本文中，我们介绍了ChartMind，这是一种新的基准测试，旨在用于现实世界中的复杂CQA任务。 ChartMind涵盖了七个任务类别，结合了多语言上下文，支持开放域的文本输出，并适应各种图表格式，弥合了现实世界应用和传统学术基准之间的差距。此外，我们提出了一个上下文感知到的模型敏锐性框架Chartllm，该框架着重于提取关键上下文元素，减少噪声并增强多模式大型语言模型的推理准确性。对图表和三个代表性的公共基准的广泛评估和14个主流多模式模型表明，我们的框架大大胜过前三个常见的CQA范式：指导联系，OCR增强和思想链，突显了对现实世界中CQA的灵活图表理解的重要性。这些发现提出了新的方向，用于在未来的研究中开发更强大的图表推理。

Title: The Arabic AI Fingerprint: Stylometric Analysis and Detection of Large Language Models Text

Authors: Maged S. Al-Shaibani, Moataz Ahmed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23276
Pdf URL: https://arxiv.org/pdf/2505.23276
Copy Paste: [[2505.23276]] The Arabic AI Fingerprint: Stylometric Analysis and Detection of Large Language Models Text(https://arxiv.org/abs/2505.23276)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) have achieved unprecedented capabilities in generating human-like text, posing subtle yet significant challenges for information integrity across critical domains, including education, social media, and academia, enabling sophisticated misinformation campaigns, compromising healthcare guidance, and facilitating targeted propaganda. This challenge becomes severe, particularly in under-explored and low-resource languages like Arabic. This paper presents a comprehensive investigation of Arabic machine-generated text, examining multiple generation strategies (generation from the title only, content-aware generation, and text refinement) across diverse model architectures (ALLaM, Jais, Llama, and GPT-4) in academic and social media domains. Our stylometric analysis reveals distinctive linguistic patterns differentiating human-written from machine-generated Arabic text across these varied contexts. Despite their human-like qualities, we demonstrate that LLMs produce detectable signatures in their Arabic outputs, with domain-specific characteristics that vary significantly between different contexts. Based on these insights, we developed BERT-based detection models that achieved exceptional performance in formal contexts (up to 99.9% F1-score) with strong precision across model architectures. Our cross-domain analysis confirms generalization challenges previously reported in the literature. To the best of our knowledge, this work represents the most comprehensive investigation of Arabic machine-generated text to date, uniquely combining multiple prompt generation methods, diverse model architectures, and in-depth stylometric analysis across varied textual domains, establishing a foundation for developing robust, linguistically-informed detection systems essential for preserving information integrity in Arabic-language contexts.
摘要：大型语言模型（LLM）在产生类似人类的文本方面已经实现了前所未有的能力，对包括教育，社交媒体和学术界在内的关键领域的信息完整性构成了微妙而又重大的挑战，从而促进了复杂的误导性运动，损害了医疗保健指南，并促进了目标宣传。这一挑战变得严重，尤其是在像阿拉伯语（如阿拉伯语）（如阿拉伯语）的探索不足和低资源语言中。本文介绍了对阿拉伯机器生成的文本的全面调查，研究了各种模型体系结构（Allam，Jais，Llama和GPT-4）在学术和社交媒体领域中进行了多种一代策略（仅从标题中产生的，内容吸引的生成和文本完善）。我们的造型分析揭示了在这些多样化的环境中，人文编写与机器生成的阿拉伯文本区分开的独特语言模式。尽管具有类似人类的品质，但我们证明了LLM在其阿拉伯语产量中产生可检测的特征，其特异性特征在不同的环境之间差异很大。基于这些见解，我们开发了基于BERT的检测模型，这些模型在正式环境（高达99.9 \％F1得分）中具有出色的性能，在模型体系结构之间具有很高的精度。我们的跨域分析证实了文献中先前报道的概括挑战。据我们所知，这项工作代表了迄今为止对阿拉伯语机器生成的文本的最全面调查，独特地结合了多种迅速生成方法，多样化的模型体系结构以及跨多种文本领域的深入的样式分析，为在阿拉伯语言中的维护中确立了可靠，语言上有力的检测系统的基础，以开发强大的，语言上有形的可检测系统。

Title: Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective

Authors: Yong Zhang, Yanwen Huang, Ning Cheng, Yang Guo, Yun Zhu, Yanmeng Wang, Shaojun Wang, Jing Xiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23277
Pdf URL: https://arxiv.org/pdf/2505.23277
Copy Paste: [[2505.23277]] Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective(https://arxiv.org/abs/2505.23277)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5× compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code available at: this https URL.
摘要：检索增强的生成（RAG）增强了具有外部上下文的大型语言模型（LLM），但是检索的段落通常是冗长，嘈杂或超过输入限制的。现有的压缩方法通常需要监督专用压缩模型的培训，增加成本并降低便携性。我们提出了Sentinel，这是一种轻巧的句子级压缩框架，将上下文过滤重新设计为基于注意力的理解任务。 Sentinel没有使用轻量级分类器来识别句子相关性，而不是训练压缩模型，而是从现成的0.5B代理LLM探测了解码器的注意。从经验上讲，我们发现Query-Context相关性估计在模型尺度上是一致的，而0.5B代理与较大模型的行为紧密匹配。在Longbench Benchmark上，Sentinel在匹配7B级压缩系统的QA性能的同时，最多可实现5 $ \ times $压缩。我们的结果表明，探测本地注意力信号可以实现快速，有效和提问的上下文压缩。代码可用：此HTTPS URL。
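
Once each sentence has a relevance score, the sentence-level filtering step described in the abstract above reduces to a budgeted selection. The sketch below covers only that last step. The scores are taken as given inputs (in the paper they come from a classifier over proxy-model attention, which is not reproduced here), and the function name, the whitespace token count, and the greedy strategy are illustrative assumptions.

```python
def compress_context(sentences, scores, budget_tokens):
    """Keep the highest-scoring sentences that fit a whitespace-token budget.

    `scores` are hypothetical per-sentence relevance scores supplied by the caller.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    # Visit sentences from most to least relevant.
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept, used = [], 0
    for i in order:
        cost = len(sentences[i].split())
        if used + cost <= budget_tokens:
            kept.append(i)
            used += cost
    # Restore document order so the compressed context stays readable.
    return [sentences[i] for i in sorted(kept)]
```

Greedy selection is simple but not optimal for every budget; it is used here only to keep the example short.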

Title: ScEdit: Script-based Assessment of Knowledge Editing

Authors: Xinye Li, Zunwen Zheng, Qian Zhang, Dekai Zhuang, Jiabao Kang, Liyan Xu, Qingbin Liu, Xi Chen, Zhiying Tu, Dianhui Chu, Dianbo Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23291
Pdf URL: https://arxiv.org/pdf/2505.23291
Copy Paste: [[2505.23291]] ScEdit: Script-based Assessment of Knowledge Editing(https://arxiv.org/abs/2505.23291)
Keywords: llm, agent
Abstract: Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce a novel script-based benchmark -- ScEdit (Script-based Knowledge Editing Benchmark) -- which encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based ("What"-type question) evaluation to action-based ("How"-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating a challenging task. Our benchmark is available at this https URL.
摘要：知识编辑（KE）引起了越来越多的关注，但是当前的KE任务仍然相对简单。在当前的评估框架下，许多编辑方法达到了异常高的分数，有时几乎是完美的。但是，很少有研究将KE纳入现实世界应用方案（例如，最近对LLM-As-As-agent的兴趣）。为了支持我们的分析，我们介绍了一种基于脚本的基于新颖的基准-Scedit（基于脚本的知识编辑基准） - 涵盖了反事实和时间编辑。我们整合了令牌级别和文本级别的评估方法，全面分析了现有的KE技术。基准测试将基于事实的传统事实（“什么”型问题）评估扩展到基于动作的（“如何” - 型问题）评估。我们观察到，所有KE方法在既定指标上都表现出绩效下降，并且在文本级指标上面临挑战，这表明一项艰巨的任务。我们的基准标准可在此HTTPS URL上找到。

Title: How Does Response Length Affect Long-Form Factuality

Authors: James Xu Zhao, Jimmy Z.J. Liu, Bryan Hooi, See-Kiong Ng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23295
Pdf URL: https://arxiv.org/pdf/2505.23295
Copy Paste: [[2505.23295]] How Does Response Length Affect Long-Form Factuality(https://arxiv.org/abs/2505.23295)
Keywords: language model, llm, long context
Abstract: Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.
摘要：大型语言模型（LLM）广泛用于长篇文本生成。但是，响应中的事实错误将破坏其可靠性。尽管人们对LLM事实的关注日益增加，但响应长度对事实的影响仍未得到充实。在这项工作中，我们通过首先引入自动和双层长期事实评估框架来系统地研究这种关系，该框架与人类注释达到了高度的一致性，同时具有成本效益。使用此框架，我们进行了受控的实验，发现较长的响应表现出较低的事实精度，从而确认了长度偏差的存在。为了解释这一现象，我们从经验上检查了三个假设：错误传播，长篇小说和事实疲惫。我们的结果表明，该模型逐渐消耗更多可靠知识的事实是事实降级的主要原因，而不是其他两个假设。

Title: EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian

Authors: Daryna Dementieva, Nikolay Babakov, Alexander Fraser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23297
Pdf URL: https://arxiv.org/pdf/2505.23297
Copy Paste: [[2505.23297]] EmoBench-UA: A Benchmark Dataset for Emotion Detection in Ukrainian(https://arxiv.org/abs/2505.23297)
Keywords: language model, llm
Abstract: While Ukrainian NLP has seen progress in many text processing tasks, emotion classification remains an underexplored area with no publicly available benchmark to date. In this work, we introduce EmoBench-UA, the first annotated dataset for emotion detection in Ukrainian texts. Our annotation schema is adapted from the guidelines of previous English-centric works on emotion detection (Mohammad et al., 2018; Mohammad, 2022). The dataset was created through crowdsourcing using the this http URL platform, ensuring a high-quality annotation process. Then, we evaluate a range of approaches on the collected dataset, from linguistic-based baselines and synthetic data translated from English to large language models (LLMs). Our findings highlight the challenges of emotion classification in non-mainstream languages like Ukrainian and emphasize the need for further development of Ukrainian-specific models and training resources.
摘要：尽管乌克兰NLP在许多文本处理任务中都取得了进步，但情绪分类仍然是一个未经震惊的领域，迄今为止尚无公开可用的基准。在这项工作中，我们介绍了Emobench-UA，这是乌克兰文本中的第一个注释数据集。我们的注释模式改编自先前以英语为中心的情感检测作品（Mohammad等，2018； Mohammad，2022）指南。该数据集是通过使用此HTTP URL平台来确保注释过程的高质量的众包创建的。然后，我们评估了收集的数据集上的一系列方法，从基于语言的基准开始，从英语转换为大语模型（LLMS）。我们的发现强调了在乌克兰等非主流语言中情绪分类的挑战，并强调需要进一步发展乌克兰特定的模型和培训资源。

Title: Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs

Authors: Julia Belikova, Konstantin Polev, Rauf Parchiev, Dmitry Simakov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23299
Pdf URL: https://arxiv.org/pdf/2505.23299
Copy Paste: [[2505.23299]] Data-efficient Meta-models for Evaluation of Context-based Questions and Answers in LLMs(https://arxiv.org/abs/2505.23299)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) systems are increasingly deployed in industry applications, yet their reliability remains hampered by challenges in detecting hallucinations. While supervised state-of-the-art (SOTA) methods that leverage LLM hidden states -- such as activation tracing and representation analysis -- show promise, their dependence on extensively annotated datasets limits scalability in real-world applications. This paper addresses the critical bottleneck of data annotation by investigating the feasibility of reducing training data requirements for two SOTA hallucination detection frameworks: Lookback Lens, which analyzes attention head dynamics, and probing-based approaches, which decode internal model representations. We propose a methodology combining efficient classification algorithms with dimensionality reduction techniques to minimize sample size demands while maintaining competitive performance. Evaluations on standardized question-answering RAG benchmarks show that our approach achieves performance comparable to strong proprietary LLM-based baselines with only 250 training samples. These results highlight the potential of lightweight, data-efficient paradigms for industrial deployment, particularly in annotation-constrained scenarios.
摘要：大型语言模型（LLMS）和检索功能增强的发电（RAG）系统越来越多地部署在行业应用中，但是它们的可靠性仍然受到检测幻觉的挑战的阻碍。尽管有监督的最新方法（SOTA）方法利用LLM隐藏状态（例如激活跟踪和表示分析）显示了承诺，但它们对广泛注释的数据集的依赖限制了现实世界应用程序中的可伸缩性。本文通过研究减少两个SOTA幻觉检测框架的培训数据要求的可行性来解决数据注释的关键瓶颈：回顾镜头，该镜头分析了注意力头动力学和基于探测的方法，该方法解码了内部模型表示。我们提出了一种方法，将有效的分类算法与降低降低技术相结合，以最大程度地减少样本量需求，同时保持竞争性能。对标准化问题的评估索问题基准的评估表明，我们的方法可以使用仅250个培训样本的强大专有基于LLM的基准可比性。这些结果突出了轻巧，数据有效的工业部署范式的潜力，尤其是在注释约束的情况下。

Title: Generalized Category Discovery in Event-Centric Contexts: Latent Pattern Mining with LLMs

Authors: Yi Luo, Qiwen Wang, Junqi Yang, Luyao Tang, Zhenghao Lin, Zhenzhe Ying, Weiqiang Wang, Chen Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23304
Pdf URL: https://arxiv.org/pdf/2505.23304
Copy Paste: [[2505.23304]] Generalized Category Discovery in Event-Centric Contexts: Latent Pattern Mining with LLMs(https://arxiv.org/abs/2505.23304)
Keywords: llm
Abstract: Generalized Category Discovery (GCD) aims to classify both known and novel categories using partially labeled data that contains only known classes. Despite achieving strong performance on existing benchmarks, current textual GCD methods lack sufficient validation in realistic settings. We introduce Event-Centric GCD (EC-GCD), characterized by long, complex narratives and highly imbalanced class distributions, posing two main challenges: (1) divergent clustering versus classification groupings caused by subjective criteria, and (2) unfair alignment for minority classes. To tackle these, we propose PaMA, a framework leveraging LLMs to extract and refine event patterns for improved cluster-class alignment. Additionally, a ranking-filtering-mining pipeline ensures balanced representation of prototypes across imbalanced categories. Evaluations on two EC-GCD benchmarks, including a newly constructed Scam Report dataset, demonstrate that PaMA outperforms prior methods with up to 12.58% H-score gains, while maintaining strong generalization on base GCD datasets.
摘要：广义类别发现（GCD）旨在使用仅包含已知类别的部分标记数据对已知类别和新颖类别进行分类。尽管在现有基准上取得了强大的性能，但当前的文本GCD方法在现实设置中仍缺乏足够的验证。我们介绍了以事件为中心的GCD（EC-GCD），其特征是长，复杂的叙述和高度不平衡的班级分布，提出了两个主要挑战：（1）分歧的聚类与由主观标准引起的分类组，以及（2）少数群体的不同步。为了解决这些问题，我们提出了PAMA，这是一个利用LLM的框架来提取和完善事件模式，以改善集群级对齐。此外，排名过滤的管道可确保在不平衡类别中平衡原型的平衡表示。对包括新构建的SCAM报告数据集在内的两个EC-GCD基准的评估表明，PAMA的表现优于先前的H-SCORE增益高达12.58％的先验方法，同时在基本GCD数据集上保持了强大的概括。

Title: Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO

Authors: Kaiyang Guo, Yinchuan Li, Zhitang Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23316
Pdf URL: https://arxiv.org/pdf/2505.23316
Copy Paste: [[2505.23316]] Proximalized Preference Optimization for Diverse Feedback Types: A Decomposed Perspective on DPO(https://arxiv.org/abs/2505.23316)
Keywords: language model, llm
Abstract: Direct alignment methods typically optimize large language models (LLMs) by contrasting the likelihoods of preferred versus dispreferred responses. While effective in steering LLMs to match relative preference, these methods are frequently noted for decreasing the absolute likelihoods of example responses. As a result, aligned models tend to generate outputs that deviate from the expected patterns, exhibiting a reward-hacking effect even without a reward model. This undesired consequence exposes a fundamental limitation in contrastive alignment, which we characterize as likelihood underdetermination. In this work, we revisit direct preference optimization (DPO) -- the seminal direct alignment method -- and demonstrate that its loss theoretically admits a decomposed reformulation. The reformulated loss not only broadens applicability to a wider range of feedback types, but also provides novel insights into the underlying cause of likelihood underdetermination. Specifically, the standard DPO implementation implicitly oversimplifies a regularizer in the reformulated loss, and reinstating its complete version effectively resolves the underdetermination issue. Leveraging these findings, we introduce PRoximalized PReference Optimization (PRO), a unified method to align with diverse feedback types, eliminating likelihood underdetermination through an efficient approximation of the complete regularizer. Comprehensive experiments show the superiority of PRO over existing methods in scenarios involving pairwise, binary and scalar feedback.
摘要：直接对齐方法通常通过对比优先响应与分配响应的可能性来优化大型语言模型（LLM）。虽然有效地转向LLM匹配相对偏好，但经常注意到这些方法可降低示例响应的绝对可能性。结果，对齐模型倾向于产生偏离预期模式的输出，即使没有奖励模型，也会表现出奖励效果。这种不受欢迎的后果暴露了对比对准的基本限制，我们认为这可能是可能性不足的。在这项工作中，我们重新访问直接偏好优化（DPO） - 开创性的直接对准方法 - 并证明其损失理论上承认了分解的重新印象。重新制定的损失不仅扩大了对更广泛的反馈类型的适用性，而且还为可能性不足的根本原因提供了新的见解。具体而言，标准的DPO实施隐含地简化了重新计算的损失中的正规化程序，并恢复其完整版本可以有效地解决不确定的问题。利用这些发现，我们引入了邻近的偏好优化（PRO），这是一种统一的方法，可以与各种收费类型保持一致，从而通过有效的完整正常制度来消除了可能性不足的可能性。全面的实验表明，在涉及成对，二进制和标量反馈的情况下，Pro优于现有方法。
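
As background for the abstract above, the standard pairwise DPO loss can be written directly in terms of sequence log-probabilities. The sketch below implements only that standard loss, not the PRO method or the paper's reformulation. The function name, argument names, and default `beta` are illustrative assumptions. Because the loss depends only on the difference between the chosen and rejected log-ratios, lowering both policy log-probabilities by the same amount leaves it unchanged. That is one simple way to see why a purely contrastive objective does not pin down absolute likelihoods.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probs."""
    # Log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Shifting both policy log-probs by the same constant does not change the loss.
base = dpo_loss(-10.0, -12.0, -11.0, -11.0)
shifted = dpo_loss(-15.0, -17.0, -11.0, -11.0)
print(abs(base - shifted) < 1e-12)
```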

Title: Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors

Authors: Harish Tayyar Madabushi, Melissa Torgbi, Claire Bonial
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23323
Pdf URL: https://arxiv.org/pdf/2505.23323
Copy Paste: [[2505.23323]] Neither Stochastic Parroting nor AGI: LLMs Solve Tasks through Context-Directed Extrapolation from Training Data Priors(https://arxiv.org/abs/2505.23323)
Keywords: llm
Abstract: In this position paper we raise critical awareness of a realistic view of LLM capabilities that eschews extreme alternative views that LLMs are either "stochastic parrots" or in possession of "emergent" advanced reasoning capabilities, which, due to their unpredictable emergence, constitute an existential threat. Our middle-ground view is that LLMs extrapolate from priors from their training data, and that a mechanism akin to in-context learning enables the targeting of the appropriate information from which to extrapolate. We call this "context-directed extrapolation." Under this view, substantiated through existing literature, while reasoning capabilities go well beyond stochastic parroting, such capabilities are predictable, controllable, not indicative of advanced reasoning akin to high-level cognitive capabilities in humans, and not infinitely scalable with additional training. As a result, fears of uncontrollable emergence of agency are allayed, while research advances are appropriately refocused on the processes of context-directed extrapolation and how this interacts with training data to produce valuable capabilities in LLMs. Future work can therefore explore alternative augmenting techniques that do not rely on inherent advanced reasoning in LLMs.

Title: Discriminative Policy Optimization for Token-Level Reward Models

Authors: Hongzhan Chen, Tao Yang, Shiping Gao, Ruijun Chen, Xiaojun Quan, Hongtao Tian, Ting Yao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23363
Pdf URL: https://arxiv.org/pdf/2505.23363
Copy Paste: [[2505.23363]] Discriminative Policy Optimization for Token-Level Reward Models(https://arxiv.org/abs/2505.23363)
Keywords: language model, llm
Abstract: Process reward models (PRMs) provide more nuanced supervision compared to outcome reward models (ORMs) for optimizing policy models, positioning them as a promising approach to enhancing the capabilities of LLMs in complex reasoning tasks. Recent efforts have advanced PRMs from step-level to token-level granularity by integrating reward modeling into the training of generative models, with reward scores derived from token generation probabilities. However, the conflict between generative language modeling and reward modeling may introduce instability and lead to inaccurate credit assignments. To address this challenge, we revisit token-level reward assignment by decoupling reward modeling from language generation and derive a token-level reward model through the optimization of a discriminative policy, termed the Q-function Reward Model (Q-RM). We theoretically demonstrate that Q-RM explicitly learns token-level Q-functions from preference data without relying on fine-grained annotations. In our experiments, Q-RM consistently outperforms all baseline methods across various benchmarks. For example, when integrated into PPO/REINFORCE algorithms, Q-RM enhances the average Pass@1 score by 5.85/4.70 points on mathematical reasoning tasks compared to the ORM baseline, and by 4.56/5.73 points compared to the token-level PRM counterpart. Moreover, reinforcement learning with Q-RM significantly enhances training efficiency, achieving convergence 12 times faster than ORM on GSM8K and 11 times faster than step-level PRM on MATH. Code and data are available at this https URL.

Title: Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation

Authors: Beiduo Chen, Yang Janet Liu, Anna Korhonen, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23368
Pdf URL: https://arxiv.org/pdf/2505.23368
Copy Paste: [[2505.23368]] Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation(https://arxiv.org/abs/2505.23368)
Keywords: language model, llm, chain-of-thought
Abstract: The recent rise of reasoning-tuned Large Language Models (LLMs)--which generate chains of thought (CoTs) before giving the final answer--has attracted significant attention and offers new opportunities for gaining insights into human label variation, which refers to plausible differences in how multiple annotators label the same data instance. Prior work has shown that LLM-generated explanations can help align model predictions with human label distributions, but typically adopt a reverse paradigm: producing explanations based on given answers. In contrast, CoTs provide a forward reasoning path that may implicitly embed rationales for each answer option, before generating the answers. We thus propose a novel LLM-based pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option from CoTs with improved accuracy. We also propose a rank-based HLV evaluation framework that prioritizes the ranking of answers over exact scores, which instead favor direct comparison of label distributions. Our method outperforms a direct generation method as well as baselines on three datasets, and shows better alignment of ranking methods with humans, highlighting the effectiveness of our approach.

Title: Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models

Authors: Mingyu Yu, Wei Wang, Yanjie Wei, Sujuan Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23404
Pdf URL: https://arxiv.org/pdf/2505.23404
Copy Paste: [[2505.23404]] Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models(https://arxiv.org/abs/2505.23404)
Keywords: language model, gpt, llm
Abstract: Adversarial attacks on Large Language Models (LLMs) via jailbreaking techniques -- methods that circumvent their built-in safety and ethical constraints -- have emerged as a critical challenge in AI security. These attacks compromise the reliability of LLMs by exploiting inherent weaknesses in their comprehension capabilities. This paper investigates the efficacy of jailbreaking strategies that are specifically adapted to the diverse levels of understanding exhibited by different LLMs. We propose the Adaptive Jailbreaking Strategies Based on the Semantic Understanding Capabilities of Large Language Models, a novel framework that classifies LLMs into Type I and Type II categories according to their semantic comprehension abilities. For each category, we design tailored jailbreaking strategies aimed at leveraging their vulnerabilities to facilitate successful attacks. Extensive experiments conducted on multiple LLMs demonstrate that our adaptive strategy markedly improves the success rate of jailbreaking. Notably, our approach achieves an exceptional 98.9% success rate in jailbreaking GPT-4o (29 May 2025 release).

Title: From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs

Authors: Xuan Gong, Hanbo Huang, Shiyu Liang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23410
Pdf URL: https://arxiv.org/pdf/2505.23410
Copy Paste: [[2505.23410]] From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs(https://arxiv.org/abs/2505.23410)
Keywords: language model, llm, prompt
Abstract: Factual knowledge extraction aims to explicitly extract knowledge parameterized in pre-trained language models for application in downstream tasks. While prior work has been investigating the impact of supervised fine-tuning data on the factuality of large language models (LLMs), its mechanism remains poorly understood. We revisit this impact through systematic experiments, with a particular focus on the factuality gap that arises when fine-tuning on known versus unknown knowledge. Our findings show that this gap can be mitigated at the inference stage, either under out-of-distribution (OOD) settings or by using appropriate in-context learning (ICL) prompts (i.e., few-shot learning and Chain of Thought (CoT)). We prove this phenomenon theoretically from the perspective of knowledge graphs, showing that the test-time prompt may diminish or even overshadow the impact of fine-tuning data and play a dominant role in knowledge extraction. Ultimately, our results shed light on the interaction between finetuning data and test-time prompt, demonstrating that ICL can effectively compensate for shortcomings in fine-tuning data, and highlighting the need to reconsider the use of ICL prompting as a means to evaluate the effectiveness of fine-tuning data selection methods.

Title: UAQFact: Evaluating Factual Knowledge Utilization of LLMs on Unanswerable Questions

Authors: Chuanyuan Tan, Wenbiao Shao, Hao Xiong, Tong Zhu, Zhenhua Liu, Kai Shi, Wenliang Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23461
Pdf URL: https://arxiv.org/pdf/2505.23461
Copy Paste: [[2505.23461]] UAQFact: Evaluating Factual Knowledge Utilization of LLMs on Unanswerable Questions(https://arxiv.org/abs/2505.23461)
Keywords: llm
Abstract: Handling unanswerable questions (UAQ) is crucial for LLMs, as it helps prevent misleading responses in complex situations. While previous studies have built several datasets to assess LLMs' performance on UAQ, these datasets lack factual knowledge support, which limits the evaluation of LLMs' ability to utilize their factual knowledge when handling UAQ. To address the limitation, we introduce a new unanswerable question dataset UAQFact, a bilingual dataset with auxiliary factual knowledge created from a Knowledge Graph. Based on UAQFact, we further define two new tasks to measure LLMs' ability to utilize internal and external factual knowledge, respectively. Our experimental results across multiple LLM series show that UAQFact presents significant challenges, as LLMs do not consistently perform well even when they have factual knowledge stored. Additionally, we find that incorporating external knowledge may enhance performance, but LLMs still cannot make full use of the knowledge which may result in incorrect responses.

Title: Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons

Authors: Krithik Vishwanath, Anton Alyakin, Mrigayu Ghosh, Jin Vivian Lee, Daniel Alexander Alber, Karl L. Sangwon, Douglas Kondziolka, Eric Karl Oermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23477
Pdf URL: https://arxiv.org/pdf/2505.23477
Copy Paste: [[2505.23477]] Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons(https://arxiv.org/abs/2505.23477)
Keywords: language model, llm
Abstract: The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models' (LLMs) neurosurgical knowledge. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS. Additionally, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in non-clinical contexts to determine the extent to which such distractions degrade model performance on standard medical benchmarks. Six of the 28 tested LLMs achieved board-passing outcomes, with the top-performing models scoring over 15.7% above the passing threshold. When exposed to distractions, accuracy across various model architectures was significantly reduced -- by as much as 20.4% -- with one model that had previously passed now failing. Both general-purpose and medical open-source models experienced greater performance declines compared to proprietary variants when subjected to the added distractors. While current LLMs demonstrate an impressive ability to answer neurosurgery board-like exam questions, their performance is markedly vulnerable to extraneous, distracting information. These findings underscore the critical need for developing novel mitigation strategies aimed at bolstering LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.

Title: Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt

Authors: Keqin Peng, Liang Ding, Yuanxin Ouyang, Meng Fang, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23480
Pdf URL: https://arxiv.org/pdf/2505.23480
Copy Paste: [[2505.23480]] Revisiting Overthinking in Long Chain-of-Thought from the Perspective of Self-Doubt(https://arxiv.org/abs/2505.23480)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Reasoning Large Language Models (RLLMs) have demonstrated impressive performance on complex tasks, largely due to the adoption of Long Chain-of-Thought (Long CoT) reasoning. However, they often exhibit overthinking -- performing unnecessary reasoning steps even after arriving at the correct answer. Prior work has largely focused on qualitative analyses of overthinking through sample-based observations of long CoTs. In contrast, we present a quantitative analysis of overthinking from the perspective of self-doubt, characterized by excessive token usage devoted to re-verifying already-correct answers. We find that self-doubt significantly contributes to overthinking. In response, we introduce a simple and effective prompting method to reduce the model's over-reliance on input questions, thereby avoiding self-doubt. Specifically, we first prompt the model to question the validity of the input question, and then respond concisely based on the outcome of that evaluation. Experiments on three mathematical reasoning tasks and four datasets with missing premises demonstrate that our method substantially reduces answer length and yields significant improvements across nearly all datasets upon 4 widely-used RLLMs. Further analysis demonstrates that our method effectively minimizes the number of reasoning steps and reduces self-doubt.

Title: Spoken Language Modeling with Duration-Penalized Self-Supervised Units

Authors: Nicol Visser, Herman Kamper
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2505.23494
Pdf URL: https://arxiv.org/pdf/2505.23494
Copy Paste: [[2505.23494]] Spoken Language Modeling with Duration-Penalized Self-Supervised Units(https://arxiv.org/abs/2505.23494)
Keywords: language model
Abstract: Spoken language models (SLMs) operate on acoustic units obtained by discretizing self-supervised speech representations. Although the characteristics of these units directly affect performance, the interaction between codebook size and unit coarseness (i.e., duration) remains unexplored. We investigate SLM performance as we vary codebook size and unit coarseness using the simple duration-penalized dynamic programming (DPDP) method. New analyses are performed across different linguistic levels. At the phone and word levels, coarseness provides little benefit, as long as the codebook size is chosen appropriately. However, when producing whole sentences in a resynthesis task, SLMs perform better with coarser units. In lexical and syntactic language modeling tasks, coarser units also give higher accuracies at lower bitrates. We therefore show that coarser units aren't always better, but that DPDP is a simple and efficient way to obtain coarser units for the tasks where they are beneficial.

Title: Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking

Authors: Liangliang Zhang, Zhuorui Jiang, Hongliang Chi, Haoyang Chen, Mohammed Elkoumy, Fali Wang, Qiong Wu, Zhengyi Zhou, Shirui Pan, Suhang Wang, Yao Ma
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23495
Pdf URL: https://arxiv.org/pdf/2505.23495
Copy Paste: [[2505.23495]] Diagnosing and Addressing Pitfalls in KG-RAG Datasets: Toward More Reliable Benchmarking(https://arxiv.org/abs/2505.23495)
Keywords: llm
Abstract: Knowledge Graph Question Answering (KGQA) systems rely on high-quality benchmarks to evaluate complex multi-hop reasoning. However, despite their widespread use, popular datasets such as WebQSP and CWQ suffer from critical quality issues, including inaccurate or incomplete ground-truth annotations, poorly constructed questions that are ambiguous, trivial, or unanswerable, and outdated or inconsistent knowledge. Through a manual audit of 16 popular KGQA datasets, including WebQSP and CWQ, we find that the average factual correctness rate is only 57%. To address these issues, we introduce KGQAGen, an LLM-in-the-loop framework that systematically resolves these pitfalls. KGQAGen combines structured knowledge grounding, LLM-guided generation, and symbolic verification to produce challenging and verifiable QA instances. Using KGQAGen, we construct KGQAGen-10k, a ten-thousand scale benchmark grounded in Wikidata, and evaluate a diverse set of KG-RAG models. Experimental results demonstrate that even state-of-the-art systems struggle on this benchmark, highlighting its ability to expose limitations of existing models. Our findings advocate for more rigorous benchmark construction and position KGQAGen as a scalable framework for advancing KGQA evaluation.

Title: Probability-Consistent Preference Optimization for Enhanced LLM Reasoning

Authors: Yunqiao Yang, Houxing Ren, Zimu Lu, Ke Wang, Weikang Shi, Aojun Zhou, Junting Pan, Mingjie Zhan, Hongsheng Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23540
Pdf URL: https://arxiv.org/pdf/2505.23540
Copy Paste: [[2505.23540]] Probability-Consistent Preference Optimization for Enhanced LLM Reasoning(https://arxiv.org/abs/2505.23540)
Keywords: language model, llm
Abstract: Recent advances in preference optimization have demonstrated significant potential for improving mathematical reasoning capabilities in large language models (LLMs). While current approaches leverage high-quality pairwise preference data through outcome-based criteria like answer correctness or consistency, they fundamentally neglect the internal logical coherence of responses. To overcome this, we propose Probability-Consistent Preference Optimization (PCPO), a novel framework that establishes dual quantitative metrics for preference selection: (1) surface-level answer correctness and (2) intrinsic token-level probability consistency across responses. Extensive experiments show that our PCPO consistently outperforms existing outcome-only criterion approaches across a diverse range of LLMs and benchmarks. Our code is publicly available at this https URL.

Title: Translation in the Wild

Authors: Yuri Balashov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23548
Pdf URL: https://arxiv.org/pdf/2505.23548
Copy Paste: [[2505.23548]] Translation in the Wild(https://arxiv.org/abs/2505.23548)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel in translation among other things, demonstrating competitive performance for many language pairs in zero- and few-shot settings. But unlike dedicated neural machine translation models, LLMs are not trained on any translation-related objective. What explains their remarkable translation abilities? Are these abilities grounded in "incidental bilingualism" (Briakou et al. 2023) in training data? Does instruction tuning contribute to it? Are LLMs capable of aligning and leveraging semantically identical or similar monolingual contents from different corners of the internet that are unlikely to fit in a single context window? I offer some reflections on this topic, informed by recent studies and growing user experience. My working hypothesis is that LLMs' translation abilities originate in two different types of pre-training data that may be internalized by the models in different ways. I discuss the prospects for testing the "duality" hypothesis empirically and its implications for reconceptualizing translation, human and machine, in the age of deep learning.

Title: Understanding Refusal in Language Models with Sparse Autoencoders

Authors: Wei Jie Yeo, Nirmalendu Prakash, Clement Neo, Roy Ka-Wei Lee, Erik Cambria, Ranjan Satapathy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23556
Pdf URL: https://arxiv.org/pdf/2505.23556
Copy Paste: [[2505.23556]] Understanding Refusal in Language Models with Sparse Autoencoders(https://arxiv.org/abs/2505.23556)
Keywords: language model, llm, chat
Abstract: Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating upstream-downstream latent relationship and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing generalization for linear probes to out-of-distribution adversarial samples in classification tasks. We open source our code in this https URL.

Title: Evaluating AI capabilities in detecting conspiracy theories on YouTube

Authors: Leonardo La Rocca, Francesco Corso, Francesco Pierri
Subjects: cs.CL, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2505.23570
Pdf URL: https://arxiv.org/pdf/2505.23570
Copy Paste: [[2505.23570]] Evaluating AI capabilities in detecting conspiracy theories on YouTube(https://arxiv.org/abs/2505.23570)
Keywords: language model, llm
Abstract: As a leading online platform with a vast global audience, YouTube's extensive reach also makes it susceptible to hosting harmful content, including disinformation and conspiracy theories. This study explores the use of open-weight Large Language Models (LLMs), both text-only and multimodal, for identifying conspiracy theory videos shared on YouTube. Leveraging a labeled dataset of thousands of videos, we evaluate a variety of LLMs in a zero-shot setting and compare their performance to a fine-tuned RoBERTa baseline. Results show that text-based LLMs achieve high recall but lower precision, leading to increased false positives. Multimodal models lag behind their text-only counterparts, indicating limited benefits from visual data integration. To assess real-world applicability, we evaluate the most accurate models on an unlabeled dataset, finding that RoBERTa achieves performance close to LLMs with a larger number of parameters. Our work highlights the strengths and limitations of current LLM-based approaches for online harmful content detection, emphasizing the need for more precise and robust systems.
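
The "high recall but lower precision" pattern reported above can be made concrete with a small calculation. The confusion counts below are invented for illustration and are not results from the study; the helper name `precision_recall` is likewise an assumption.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive
    and false-negative counts, guarding against empty denominators."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical classifier that flags most conspiracy videos (few misses)
# but also flags many benign ones (many false positives).
precision, recall = precision_recall(tp=90, fp=60, fn=10)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

With these assumed counts the classifier finds 90 of 100 true cases, yet 60 of its 150 flags are wrong, which is the trade-off the abstract attributes to text-based LLMs in the zero-shot setting.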

Title: Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Authors: Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong, Chuang Gan
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2505.23604
Pdf URL: https://arxiv.org/pdf/2505.23604
Copy Paste: [[2505.23604]] Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering(https://arxiv.org/abs/2505.23604)
Keywords: language model
Abstract: Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when model parameters are less than 100B. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead from repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using a few samples. Code, data, and models will be fully open-sourced.
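
The selection-and-mutation loop described above can be sketched as a generic evolutionary search. This is a toy illustration under stated assumptions, not the EvoScale implementation: candidates are integers rather than code patches, `score` stands in for a verifier, `mutate` stands in for model-driven refinement, and all names and parameters are hypothetical.

```python
import random


def evolve(candidates, score, mutate, rounds=5, keep=2, rng=None):
    """Iteratively keep the top-`keep` candidates and refill the pool
    with mutated copies of them. Returns the best candidate found."""
    rng = rng or random.Random(0)
    pool = list(candidates)
    for _ in range(rounds):
        pool.sort(key=score, reverse=True)
        elites = pool[:keep]  # selection: highest-scoring outputs survive
        children = [mutate(rng.choice(elites), rng)  # mutation: refine a survivor
                    for _ in range(len(pool) - keep)]
        pool = elites + children
    return max(pool, key=score)


# Toy stand-ins (assumptions): the "verifier" rewards closeness to 42.
def toy_score(x):
    return -abs(x - 42)


def toy_mutate(x, rng):
    return x + rng.choice([-3, -1, 1, 3])


initial = [0, 10, 20, 30]
best = evolve(initial, toy_score, toy_mutate, rounds=10)
print(best, toy_score(best))
```

Because the elites are carried over unchanged each round, the best score never decreases, which mirrors the abstract's description of shifting the output distribution toward higher-scoring regions with a fixed sample budget.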

Title: Table-R1: Inference-Time Scaling for Table Reasoning

Authors: Zheyuan Yang, Lyuhao Chen, Arman Cohan, Yilun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23621
Pdf URL: https://arxiv.org/pdf/2505.23621
Copy Paste: [[2505.23621]] Table-R1: Inference-Time Scaling for Table Reasoning(https://arxiv.org/abs/2505.23621)
Keywords: gpt, llm
Abstract: In this work, we present the first study to explore inference-time scaling on table reasoning tasks. We develop and evaluate two post-training strategies to enable inference-time scaling: distillation from frontier model reasoning traces and reinforcement learning with verifiable rewards (RLVR). For distillation, we introduce a large-scale dataset of reasoning traces generated by DeepSeek-R1, which we use to fine-tune LLMs into the Table-R1-SFT model. For RLVR, we propose task-specific verifiable reward functions and apply the GRPO algorithm to obtain the Table-R1-Zero model. We evaluate our Table-R1-series models across diverse table reasoning tasks, including short-form QA, fact verification, and free-form QA. Notably, the Table-R1-Zero model matches or exceeds the performance of GPT-4.1 and DeepSeek-R1, while using only a 7B-parameter LLM. It also demonstrates strong generalization to out-of-domain datasets. Extensive ablation and qualitative analyses reveal the benefits of instruction tuning, model architecture choices, and cross-task generalization, as well as emergence of essential table reasoning skills during RL training.
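
The RLVR setup mentioned above pairs a verifiable reward with GRPO, whose core step is normalizing rewards within a group of responses sampled for the same prompt. The sketch below is a simplified illustration, not the Table-R1 code: `exact_match_reward` is a hypothetical stand-in for the paper's task-specific reward functions, and the use of the population standard deviation and the `eps` constant are assumptions (implementations differ on both).

```python
from statistics import mean, pstdev


def exact_match_reward(prediction, gold):
    """Hypothetical verifiable reward: 1.0 if the normalized answers match."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its
    own sampling group, as in group-relative policy optimization."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one table question (illustrative data).
gold = "Paris"
samples = ["Paris", "Lyon", "paris ", "Nice"]
rewards = [exact_match_reward(s, gold) for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards, advantages)
```

Correct answers receive positive advantages and incorrect ones negative, with no learned value model involved. If every sample in a group gets the same reward, all advantages are zero and that group contributes no learning signal.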

Title: Characterizing the Expressivity of Transformer Language Models

Authors: Jiaoda Li, Ryan Cotterell
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23623
Pdf URL: https://arxiv.org/pdf/2505.23623
Copy Paste: [[2505.23623]] Characterizing the Expressivity of Transformer Language Models(https://arxiv.org/abs/2505.23623)
Keywords: language model
Abstract: Transformer-based language models (LMs) have achieved widespread empirical success, but their theoretical expressive power remains only partially understood. Prior work often relies on idealized models with assumptions -- such as arbitrary numerical precision and hard attention -- that diverge from real-world transformers. In this work, we provide an exact characterization of fixed-precision transformers with strict future masking and soft attention, an idealization that more closely mirrors practical implementations. We show that these models are precisely as expressive as a specific fragment of linear temporal logic that includes only a single temporal operator: the past operator. We further relate this logic to established classes in formal language theory, automata theory, and algebra, yielding a rich and unified theoretical framework for understanding transformer expressivity. Finally, we present empirical results that align closely with our theory: transformers trained on languages within their theoretical capacity generalize perfectly over lengths, while they consistently fail to generalize on languages beyond it.
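To make the idea of a temporal logic with a single past operator concrete, here is a minimal evaluator over finite strings. The tuple encoding of formulas, the strict reading of "past" (some earlier position, excluding the current one), and the convention of evaluating at the last position are assumptions for illustration; the paper's precise semantics are not given in the abstract.

```python
def holds(formula, word, i):
    """Evaluate a formula at position i of word (hypothetical tuple encoding)."""
    op = formula[0]
    if op == "sym":
        return word[i] == formula[1]
    if op == "not":
        return not holds(formula[1], word, i)
    if op == "and":
        return holds(formula[1], word, i) and holds(formula[2], word, i)
    if op == "past":
        # Strict past: some earlier position j < i satisfies the subformula.
        return any(holds(formula[1], word, j) for j in range(i))
    raise ValueError(f"unknown operator: {op}")

def accepts(formula, word):
    """Acceptance convention (an assumption): evaluate at the last position."""
    return len(word) > 0 and holds(formula, word, len(word) - 1)

# "The last symbol is b, and some a occurred strictly before it."
phi = ("and", ("sym", "b"), ("past", ("sym", "a")))
```

For instance, `accepts(phi, "aab")` is true, while `accepts(phi, "ba")` is false because the final symbol is not `b`.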

Title: AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

Authors: Jiaxin Bai, Wei Fan, Qi Hu, Qing Zong, Chunyang Li, Hong Ting Tsang, Hongyu Luo, Yauwai Yim, Haoyu Huang, Xiao Zhou, Feng Qin, Tianshi Zheng, Xi Peng, Xin Yao, Huiwen Yang, Leijie Wu, Yi Ji, Gong Zhang, Renhai Chen, Yangqiu Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23628
Pdf URL: https://arxiv.org/pdf/2505.23628
Copy Paste: [[2505.23628]] AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora(https://arxiv.org/abs/2505.23628)
Keywords: language model, llm
Abstract: We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.

Title: GeNRe: A French Gender-Neutral Rewriting System Using Collective Nouns

Authors: Enzo Doyen, Amalia Todirascu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23630
Pdf URL: https://arxiv.org/pdf/2505.23630
Copy Paste: [[2505.23630]] GeNRe: A French Gender-Neutral Rewriting System Using Collective Nouns(https://arxiv.org/abs/2505.23630)
Keywords: language model
Abstract: A significant portion of the textual data used in the field of Natural Language Processing (NLP) exhibits gender biases, particularly due to the use of masculine generics (masculine words that are supposed to refer to mixed groups of men and women), which can perpetuate and amplify stereotypes. Gender rewriting, an NLP task that involves automatically detecting and replacing gendered forms with neutral or opposite forms (e.g., from masculine to feminine), can be employed to mitigate these biases. While such systems have been developed in a number of languages (English, Arabic, Portuguese, German, French), automatic use of gender neutralization techniques (as opposed to inclusive or gender-switching techniques) has only been studied for English. This paper presents GeNRe, the very first French gender-neutral rewriting system using collective nouns, which are gender-fixed in French. We introduce a rule-based system (RBS) tailored for the French language alongside two fine-tuned language models trained on data generated by our RBS. We also explore the use of instruct-based models to enhance the performance of our other systems and find that Claude 3 Opus combined with our dictionary achieves results close to our RBS. Through this contribution, we hope to promote the advancement of gender bias mitigation techniques in NLP for French.

Title: Are Reasoning Models More Prone to Hallucination?

Authors: Zijun Yao, Yantao Liu, Yanxu Chen, Jianhui Chen, Junfeng Fang, Lei Hou, Juanzi Li, Tat-Seng Chua
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23646
Pdf URL: https://arxiv.org/pdf/2505.23646
Copy Paste: [[2505.23646]] Are Reasoning Models More Prone to Hallucination?(https://arxiv.org/abs/2505.23646)
Keywords: hallucination, chain-of-thought
Abstract: Recently evolved large reasoning models (LRMs) show powerful performance in solving complex tasks with long chain-of-thought (CoT) reasoning capability. As these LRMs are mostly developed by post-training on formal reasoning tasks, whether they generalize the reasoning capability to help reduce hallucination in fact-seeking tasks remains unclear and debated. For instance, DeepSeek-R1 reports increased performance on SimpleQA, a fact-seeking benchmark, while OpenAI-o3 observes even more severe hallucination. This discrepancy naturally raises the following research question: Are reasoning models more prone to hallucination? This paper addresses the question from three perspectives. (1) We first conduct a holistic evaluation of hallucination in LRMs. Our analysis reveals that LRMs that undergo a full post-training pipeline with cold-start supervised fine-tuning (SFT) and verifiable-reward RL generally alleviate their hallucination. In contrast, both distillation alone and RL training without cold-start fine-tuning introduce more nuanced hallucinations. (2) To explore why different post-training pipelines alter the impact on hallucination in LRMs, we conduct behavior analysis. We characterize two critical cognitive behaviors that directly affect the factuality of an LRM: Flaw Repetition, where the surface-level reasoning attempts repeatedly follow the same underlying flawed logic, and Think-Answer Mismatch, where the final answer fails to faithfully match the previous CoT process. (3) Further, we investigate the mechanism behind the hallucination of LRMs from the perspective of model uncertainty. We find that increased hallucination of LRMs is usually associated with the misalignment between model uncertainty and factual accuracy. Our work provides an initial understanding of the hallucination in LRMs.

Title: ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs

Authors: Mohamed Elaraby, Diane Litman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23654
Pdf URL: https://arxiv.org/pdf/2505.23654
Copy Paste: [[2505.23654]] ARC: Argument Representation and Coverage Analysis for Zero-Shot Long Document Summarization with Instruction Following LLMs(https://arxiv.org/abs/2505.23654)
Keywords: language model, llm
Abstract: Integrating structured information has long improved the quality of abstractive summarization, particularly in retaining salient content. In this work, we focus on a specific form of structure: argument roles, which are crucial for summarizing documents in high-stakes domains such as law. We investigate whether instruction-tuned large language models (LLMs) adequately preserve this information. To this end, we introduce Argument Representation Coverage (ARC), a framework for measuring how well LLM-generated summaries capture salient arguments. Using ARC, we analyze summaries produced by three open-weight LLMs in two domains where argument roles are central: long legal opinions and scientific articles. Our results show that while LLMs cover salient argument roles to some extent, critical information is often omitted in generated summaries, particularly when arguments are sparsely distributed throughout the input. Further, we use ARC to uncover behavioral patterns -- specifically, how the positional bias of LLM context windows and role-specific preferences impact the coverage of key arguments in generated summaries, emphasizing the need for more argument-aware summarization strategies.

Title: Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation

Authors: Hongxiang Zhang, Hao Chen, Tianyi Zhang, Muhao Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23657
Pdf URL: https://arxiv.org/pdf/2505.23657
Copy Paste: [[2505.23657]] Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation(https://arxiv.org/abs/2505.23657)
Keywords: language model, llm, hallucination
Abstract: Recent decoding methods improve the factuality of large language models (LLMs) by refining how the next token is selected during generation. These methods typically operate at the token level, leveraging internal representations to suppress superficial patterns. Nevertheless, LLMs remain prone to hallucinations, especially over longer contexts. In this paper, we propose Active Layer-Contrastive Decoding (ActLCD), a novel decoding strategy that actively decides when to apply contrasting layers during generation. By casting decoding as a sequential decision-making problem, ActLCD employs a reinforcement learning policy guided by a reward-aware classifier to optimize factuality beyond the token level. Our experiments demonstrate that ActLCD surpasses state-of-the-art methods across five benchmarks, showcasing its effectiveness in mitigating hallucinations in diverse generation scenarios.
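The following sketch shows one common way to contrast layers at a single decoding step, with a flag standing in for the decision of when to apply it. The scoring rule (difference of log-probabilities between a final and an earlier layer), the toy logits, and the boolean `apply_contrast` gate are assumptions for illustration; in the paper that decision is made by a learned policy whose details are not given in the abstract.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def next_token(final_logits, early_logits, apply_contrast):
    """Greedy token choice, optionally contrasting final vs. early layer."""
    final_lp = log_softmax(final_logits)
    if not apply_contrast:
        scores = final_lp
    else:
        early_lp = log_softmax(early_logits)
        # Favor tokens whose probability grew between the early and final layer.
        scores = [f - e for f, e in zip(final_lp, early_lp)]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical logits over a 3-token vocabulary.
final_logits = [2.0, 1.8, 0.1]
early_logits = [2.0, 0.5, 0.1]
plain_choice = next_token(final_logits, early_logits, apply_contrast=False)
contrast_choice = next_token(final_logits, early_logits, apply_contrast=True)
```

With these toy numbers the plain choice is token 0, while the contrastive choice is token 1, the only token whose log-probability rose between the two layers.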

Title: ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions

Authors: Beong-woo Kwak, Minju Kim, Dongha Lim, Hyungjoo Chae, Dongjin Kang, Sunghwan Kim, Dongil Yang, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23662
Pdf URL: https://arxiv.org/pdf/2505.23662
Copy Paste: [[2505.23662]] ToolHaystack: Stress-Testing Tool-Augmented Language Models in Realistic Long-Term Interactions(https://arxiv.org/abs/2505.23662)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated strong capabilities in using external tools to address user inquiries. However, most existing evaluations assume tool use in short contexts, offering limited insight into model behavior during realistic long-term interactions. To fill this gap, we introduce ToolHaystack, a benchmark for testing tool-use capabilities in long-term interactions. Each test instance in ToolHaystack includes multiple task execution contexts and realistic noise within a continuous conversation, enabling assessment of how well models maintain context and handle various disruptions. By applying this benchmark to 14 state-of-the-art LLMs, we find that while current models perform well in standard multi-turn settings, they often struggle significantly in ToolHaystack, highlighting critical gaps in their long-term robustness not revealed by previous tool benchmarks.

Title: LoLA: Low-Rank Linear Attention With Sparse Caching

Authors: Luke McDermott, Robert W. Heath Jr., Rahul Parhi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23666
Pdf URL: https://arxiv.org/pdf/2505.23666
Copy Paste: [[2505.23666]] LoLA: Low-Rank Linear Attention With Sparse Caching(https://arxiv.org/abs/2505.23666)
Keywords: language model, long context
Abstract: Transformer-based large language models suffer from quadratic complexity at inference on long sequences. Linear attention methods are efficient alternatives; however, they fail to provide an accurate approximation of softmax attention. By additionally incorporating sliding window attention into each linear attention head, this gap can be closed for short context-length tasks. Unfortunately, these approaches cannot recall important information from long contexts due to "memory collisions". In this paper, we propose LoLA: Low-rank Linear Attention with sparse caching. LoLA separately stores additional key-value pairs that would otherwise interfere with past associative memories. Moreover, LoLA further closes the gap between linear attention models and transformers by distributing past key-value pairs into three forms of memory: (i) recent pairs in a local sliding window; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. As an inference-only strategy, LoLA enables pass-key retrieval on up to 8K context lengths on needle-in-a-haystack tasks from RULER. It boosts the accuracy of the base subquadratic model from 0.6% to 97.4% at 4K context lengths, with a 4.6x smaller cache than that of Llama-3.1 8B. LoLA demonstrates strong performance on zero-shot commonsense reasoning tasks among 1B and 8B parameter subquadratic models. Finally, LoLA is an extremely lightweight approach: Nearly all of our results can be reproduced on a single consumer GPU.

Title: Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models

Authors: Francesca Padovani, Jaap Jumelet, Yevgen Matusevych, Arianna Bisazza
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23689
Pdf URL: https://arxiv.org/pdf/2505.23689
Copy Paste: [[2505.23689]] Child-Directed Language Does Not Consistently Boost Syntax Learning in Language Models(https://arxiv.org/abs/2505.23689)
Keywords: language model
Abstract: Seminal work by Huebner et al. (2021) showed that language models (LMs) trained on English Child-Directed Language (CDL) can reach similar syntactic abilities as LMs trained on much larger amounts of adult-directed written text, suggesting that CDL could provide more effective LM training material than the commonly used internet-crawled data. However, the generalizability of these results across languages, model types, and evaluation settings remains unclear. We test this by comparing models trained on CDL vs. Wikipedia across two LM objectives (masked and causal), three languages (English, French, German), and three syntactic minimal-pair benchmarks. Our results on these benchmarks show inconsistent benefits of CDL, which in most cases is outperformed by Wikipedia models. We then identify various shortcomings in previous benchmarks, and introduce a novel testing methodology, FIT-CLAMS, which uses a frequency-controlled design to enable balanced comparisons across training corpora. Through minimal pair evaluations and regression analysis we show that training on CDL does not yield stronger generalizations for acquiring syntax and highlight the importance of controlling for frequency effects when evaluating syntactic ability.

Title: Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

Authors: Ziling Cheng, Meng Cao, Leila Pishdad, Yanshuai Cao, Jackie Chi Kit Cheung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23701
Pdf URL: https://arxiv.org/pdf/2505.23701
Copy Paste: [[2505.23701]] Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation(https://arxiv.org/abs/2505.23701)
Keywords: language model, llm
Abstract: Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.
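A small sketch may help show how final-answer accuracy can be split into the two sub-skills. It assumes the model is asked to output both an arithmetic expression and a numeric answer; the word problem, the helper names, and the scoring rules are hypothetical and not the paper's evaluation protocol. Judging a formulation by whether its expression evaluates to the gold answer is itself a proxy.

```python
import ast
import operator

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr: str) -> float:
    """Safely evaluate an expression built from numbers and + - * /."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval"))

def disentangled_scores(model_expr, model_answer, gold_answer, tol=1e-9):
    """Score formulation and computation separately (illustrative rules)."""
    value = evaluate(model_expr)
    return {
        # Formulation: does the expression, computed exactly, give the gold answer?
        "formulation_correct": abs(value - gold_answer) <= tol,
        # Computation: did the model evaluate its own expression correctly?
        "computation_correct": abs(model_answer - value) <= tol,
        "final_answer_correct": abs(model_answer - gold_answer) <= tol,
    }

# Hypothetical problem: "Tom has 3 boxes of 12 apples and eats 5. How many remain?"
scores = disentangled_scores("3 * 12 - 5", model_answer=30, gold_answer=31)
```

Here the final answer is wrong even though the formulation is right, so a final-answer metric alone would hide that the error is purely arithmetic.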

Title: SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models

Authors: Zixiang Xu, Yanbo Wang, Yue Huang, Jiayi Ye, Haomin Zhuang, Zirui Song, Lang Gao, Chenxi Wang, Zhaorun Chen, Yujun Zhou, Sixian Li, Wang Pan, Yue Zhao, Jieyu Zhao, Xiangliang Zhang, Xiuying Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23713
Pdf URL: https://arxiv.org/pdf/2505.23713
Copy Paste: [[2505.23713]] SocialMaze: A Benchmark for Evaluating Social Reasoning in Large Language Models(https://arxiv.org/abs/2505.23713)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) are increasingly applied to socially grounded tasks, such as online community moderation, media content analysis, and social reasoning games. Success in these contexts depends on a model's social reasoning ability - the capacity to interpret social contexts, infer others' mental states, and assess the truthfulness of presented information. However, there is currently no systematic evaluation framework that comprehensively assesses the social reasoning capabilities of LLMs. Existing efforts often oversimplify real-world scenarios and consist of tasks that are too basic to challenge advanced models. To address this gap, we introduce SocialMaze, a new benchmark specifically designed to evaluate social reasoning. SocialMaze systematically incorporates three core challenges: deep reasoning, dynamic interaction, and information uncertainty. It provides six diverse tasks across three key settings: social reasoning games, daily-life interactions, and digital community platforms. Both automated and human validation are used to ensure data quality. Our evaluation reveals several key insights: models vary substantially in their ability to handle dynamic interactions and integrate temporally evolving information; models with strong chain-of-thought reasoning perform better on tasks requiring deeper inference beyond surface-level cues; and model reasoning degrades significantly under uncertainty. Furthermore, we show that targeted fine-tuning on curated reasoning examples can greatly improve model performance in complex social scenarios. The dataset is publicly available at: this https URL

Title: Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models

Authors: Jinzhe Li, Gengxu Li, Yi Chang, Yuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23715
Pdf URL: https://arxiv.org/pdf/2505.23715
Copy Paste: [[2505.23715]] Don't Take the Premise for Granted: Evaluating the Premise Critique Ability of Large Language Models(https://arxiv.org/abs/2505.23715)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have witnessed rapid advancements, demonstrating remarkable capabilities. However, a notable vulnerability persists: LLMs often uncritically accept flawed or contradictory premises, leading to inefficient reasoning and unreliable outputs. This emphasizes the significance of possessing the Premise Critique Ability for LLMs, defined as the capacity to proactively identify and articulate errors in input premises. Most existing studies assess LLMs' reasoning ability in ideal settings, largely ignoring their vulnerabilities when faced with flawed premises. Thus, we introduce the Premise Critique Bench (PCBench), designed by incorporating four error types across three difficulty levels, paired with multi-faceted evaluation metrics. We conducted systematic evaluations of 15 representative LLMs. Our findings reveal: (1) Most models rely heavily on explicit prompts to detect errors, with limited autonomous critique; (2) Premise critique ability depends on question difficulty and error type, with direct contradictions being easier to detect than complex or procedural errors; (3) Reasoning ability does not consistently correlate with the premise critique ability; (4) Flawed premises trigger overthinking in reasoning models, markedly lengthening responses due to repeated attempts at resolving conflicts. These insights underscore the urgent need to enhance LLMs' proactive evaluation of input validity, positioning premise critique as a foundational capability for developing reliable, human-centric systems. The code is available at this https URL.

Title: Label-Guided In-Context Learning for Named Entity Recognition

Authors: Fan Bai, Hamid Hassanzadeh, Ardavan Saeedi, Mark Dredze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.23722
Pdf URL: https://arxiv.org/pdf/2505.23722
Copy Paste: [[2505.23722]] Label-Guided In-Context Learning for Named Entity Recognition(https://arxiv.org/abs/2505.23722)
Keywords: language model, llm, prompt
Abstract: In-context learning (ICL) enables large language models (LLMs) to perform new tasks using only a few demonstrations. In Named Entity Recognition (NER), demonstrations are typically selected based on semantic similarity to the test instance, ignoring training labels and resulting in suboptimal performance. We introduce DEER, a new method that leverages training labels through token-level statistics to improve ICL performance. DEER first enhances example selection with a label-guided, token-based retriever that prioritizes tokens most informative for entity recognition. It then prompts the LLM to revisit error-prone tokens, which are also identified using label statistics, and make targeted corrections. Evaluated on five NER datasets using four different LLMs, DEER consistently outperforms existing ICL methods and approaches the performance of supervised fine-tuning. Further analysis shows its effectiveness on both seen and unseen entities and its robustness in low-resource settings.
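One simple kind of token-level label statistic can be sketched as follows. The specific statistic (the fraction of a token's training occurrences that fall inside an entity), the BIO tag scheme, and the toy sentences are assumptions for illustration; the abstract does not state which statistics DEER actually computes.

```python
from collections import Counter

def entity_token_rates(labeled_sentences):
    """For each token, the fraction of its training occurrences inside an entity."""
    total, inside = Counter(), Counter()
    for tokens, tags in labeled_sentences:
        for token, tag in zip(tokens, tags):
            key = token.lower()
            total[key] += 1
            if tag != "O":  # any B-/I- tag counts as "inside an entity"
                inside[key] += 1
    return {tok: inside[tok] / total[tok] for tok in total}

# Hypothetical training data with BIO tags.
train = [
    (["Apple", "hired", "Tim", "Cook"], ["B-ORG", "O", "B-PER", "I-PER"]),
    (["I", "ate", "an", "apple"], ["O", "O", "O", "O"]),
    (["Cook", "joined", "Apple"], ["B-PER", "O", "B-ORG"]),
]
rates = entity_token_rates(train)

# Tokens labeled inconsistently in training are candidates for a second look.
ambiguous = sorted(t for t, r in rates.items() if 0 < r < 1)
```

In this toy data only `apple` is ambiguous, since it appears both as an organization and as an ordinary noun; such tokens are one plausible reading of "error-prone".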

Title: ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

Authors: Zexi Liu, Jingyi Chai, Xinyu Zhu, Shuo Tang, Rui Ye, Bo Zhang, Lei Bai, Siheng Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23723
Pdf URL: https://arxiv.org/pdf/2505.23723
Copy Paste: [[2505.23723]] ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering(https://arxiv.org/abs/2505.23723)
Keywords: language model, llm, prompt, agent
Abstract: The emergence of large language model (LLM)-based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, most existing approaches rely heavily on manual prompt engineering, failing to adapt and optimize based on diverse experimental experiences. Focusing on this, for the first time, we explore the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). To realize this, we propose a novel agentic ML training framework with three key components: (1) exploration-enriched fine-tuning, which enables LLM agents to generate diverse actions for enhanced RL exploration; (2) step-wise RL, which enables training on a single action step, accelerating experience collection and improving training efficiency; (3) an agentic ML-specific reward module, which unifies varied ML feedback signals into consistent rewards for RL optimization. Leveraging this framework, we train ML-Agent, driven by a 7B-sized Qwen-2.5 LLM for autonomous ML. Remarkably, despite being trained on merely 9 ML tasks, our 7B-sized ML-Agent outperforms the 671B-sized DeepSeek-R1 agent. Furthermore, it achieves continuous performance improvements and demonstrates exceptional cross-task generalization capabilities.

Title: Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

Authors: Mohamad Chehade, Soumya Suvra Ghosal, Souradip Chakraborty, Avinash Reddy, Dinesh Manocha, Hao Zhu, Amrit Singh Bedi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23729
Pdf URL: https://arxiv.org/pdf/2505.23729
Copy Paste: [[2505.23729]] Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time(https://arxiv.org/abs/2505.23729)
Keywords: language model, gpt, llm
Abstract: Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign: an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness.
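The satisficing rule itself (maximize one objective subject to thresholds on the others) can be sketched as a selection over a fixed set of candidate responses. This is only an illustration of the decision rule: the candidate dictionaries, score values, and function names are hypothetical, and the paper's actual inference-time procedure is not described in the abstract.

```python
def satisficing_select(candidates, primary, secondary, thresholds):
    """Pick the candidate maximizing `primary` among those meeting every threshold."""
    feasible = [
        c for c in candidates
        if all(secondary[name](c) >= thresholds[name] for name in thresholds)
    ]
    if not feasible:
        return None  # no candidate satisfies the constraints
    return max(feasible, key=primary)

# Hypothetical candidates with pre-computed reward scores.
candidates = [
    {"text": "A", "helpfulness": 0.9, "harmlessness": 0.2},
    {"text": "B", "helpfulness": 0.7, "harmlessness": 0.8},
    {"text": "C", "helpfulness": 0.5, "harmlessness": 0.9},
]
chosen = satisficing_select(
    candidates,
    primary=lambda c: c["helpfulness"],
    secondary={"harmlessness": lambda c: c["harmlessness"]},
    thresholds={"harmlessness": 0.5},
)
```

Candidate A is the most helpful but fails the harmlessness threshold, so B is chosen. A weighted sum of the two scores could instead pick A or C depending on the weights, which is the contrast with multi-objective scalarization.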
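Abstract (translated from Chinese): Aligning large language models with humans is challenging because preference feedback is inherently multifaceted. Although existing approaches usually treat this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies: optimizing primary objectives while ensuring that the others meet acceptable thresholds. To bridge this gap and operationalize satisficing alignment, we propose SITAlign, an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insight by deriving sub-optimality bounds for our satisficing-based inference alignment approach, and we validate SITAlign empirically through extensive experiments on multiple benchmarks. For example, on the PKU-SafeRLHF dataset, where the primary objective is to maximize helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in GPT-4 win-tie rate for helpfulness reward while adhering to the harmlessness threshold.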
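
Illustration (not from the paper): the satisficing objective described above, maximizing a primary score subject to a threshold on a secondary score, can be sketched as a simple selection over a fixed set of candidate responses. This is only a toy, best-of-N style reading of the objective and is not SITAlign's decoding procedure; the function name `satisficing_select`, the candidate labels, and the score tables are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def satisficing_select(
    candidates: Sequence[str],
    primary: Callable[[str], float],
    secondary: Callable[[str], float],
    threshold: float,
) -> Optional[str]:
    """Pick the candidate with the highest primary score among those whose
    secondary score meets the threshold. Returns None if none qualifies."""
    # Keep only candidates that satisfy the secondary constraint.
    feasible = [c for c in candidates if secondary(c) >= threshold]
    if not feasible:
        return None
    # Among feasible candidates, maximize the primary objective.
    return max(feasible, key=primary)


# Hypothetical toy scores (assumed for illustration only).
helpfulness = {"a": 0.9, "b": 0.7, "c": 0.4}
harmlessness = {"a": 0.2, "b": 0.8, "c": 0.95}

best = satisficing_select(
    list(helpfulness), helpfulness.get, harmlessness.get, threshold=0.5
)
print(best)
```

In this toy setting, candidate "a" has the highest helpfulness but is excluded because it misses the harmlessness threshold, so the selection falls to the most helpful candidate that remains.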

Title: ATLAS: Learning to Optimally Memorize the Context at Test Time

Authors: Ali Behrouz, Zeman Li, Praneeth Kacham, Majid Daliri, Yuan Deng, Peilin Zhong, Meisam Razaviyayn, Vahab Mirrokni
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23735
Pdf URL: https://arxiv.org/pdf/2505.23735
Copy Paste: [[2505.23735]] ATLAS: Learning to Optimally Memorize the Context at Test Time(https://arxiv.org/abs/2505.23735)
Keywords: language model, long context
Abstract: Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, bound their applicability to longer sequences and so have motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite their recent success in diverse downstream tasks, they struggle in tasks that require long-context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects of their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) online nature of update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To enhance all three aspects, we present ATLAS, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long-context performance of Titans, achieving +80% accuracy at 10M context length on the BABILong benchmark.
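Abstract (translated from Chinese): Transformers have become the most popular backbones in sequence modeling, mainly because of their effectiveness in in-context retrieval tasks and their ability to learn at scale. Their quadratic memory and time complexity, however, limits their applicability to longer sequences, which has motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite their recent success in diverse downstream tasks, these models struggle in tasks that require long-context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects of their design: (1) limited memory capacity, bounded by the memory architecture and the feature mapping of the input; (2) the online nature of the update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To improve all three aspects, we present ATLAS, a high-capacity long-term memory module that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long-context performance of Titans, achieving +80% accuracy at 10M context length on the BABILong benchmark.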

Title: DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning

Authors: Ziyin Zhang, Jiahao Xu, Zhiwei He, Tian Liang, Qiuzhi Liu, Yansi Li, Linfeng Song, Zhengwen Liang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, Dong Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.23754
Pdf URL: https://arxiv.org/pdf/2505.23754
Copy Paste: [[2505.23754]] DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning(https://arxiv.org/abs/2505.23754)
Keywords: language model, llm
Abstract: Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that poorly align with LLMs' strength derived from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework exploiting natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference. Additionally, we propose comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps. Extensive experimental analyses demonstrate DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem's potential to fundamentally advance automated informal theorem proving and mathematical exploration.
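Abstract (translated from Chinese): Theorem proving is a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that align poorly with LLMs' strength, which derives from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework that exploits natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, and accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, which leverages the verified theorem variants to incentivize robust mathematical inference. In addition, we propose comprehensive outcome and process evaluation metrics that examine proof correctness and the quality of reasoning steps. Extensive experimental analyses show that DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem's potential to fundamentally advance automated informal theorem proving and mathematical exploration.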

Title: Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint

Authors: Heekyung Lee, Jiaxin Ge, Tsung-Han Wu, Minwoo Kang, Trevor Darrell, David M. Chan
Subjects: cs.CL, cs.AI, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23759
Pdf URL: https://arxiv.org/pdf/2505.23759
Copy Paste: [[2505.23759]] Puzzled by Puzzles: When Vision-Language Models Can't Take a Hint(https://arxiv.org/abs/2505.23759)
Keywords: language model
Abstract: Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, rebus solving requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings reveal that while VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and understanding visual metaphors.
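Abstract (translated from Chinese): Rebus puzzles, visual riddles that encode language through imagery, spatial arrangement, and symbolic substitution, pose a unique challenge to current vision-language models (VLMs). Unlike traditional image captioning or question answering tasks, solving a rebus requires multi-modal abstraction, symbolic reasoning, and a grasp of cultural, phonetic, and linguistic puns. In this paper, we investigate the capacity of contemporary VLMs to interpret and solve rebus puzzles by constructing a hand-generated and annotated benchmark of diverse English-language rebus puzzles, ranging from simple pictographic substitutions to spatially-dependent cues ("head" over "heels"). We analyze how different VLMs perform, and our findings show that although VLMs exhibit some surprising capabilities in decoding simple visual clues, they struggle significantly with tasks requiring abstract reasoning, lateral thinking, and an understanding of visual metaphors.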

Title: From Chat Logs to Collective Insights: Aggregative Question Answering

Authors: Wentao Zhang, Woojeong Kim, Yuntian Deng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.23765
Pdf URL: https://arxiv.org/pdf/2505.23765
Copy Paste: [[2505.23765]] From Chat Logs to Collective Insights: Aggregative Question Answering(https://arxiv.org/abs/2505.23765)
Keywords: language model, llm, chat, agent
Abstract: Conversational agents powered by large language models (LLMs) are rapidly becoming integral to our daily interactions, generating unprecedented amounts of conversational data. Such datasets offer a powerful lens into societal interests, trending topics, and collective concerns. Yet, existing approaches typically treat these interactions as independent and miss critical insights that could emerge from aggregating and reasoning across large-scale conversation logs. In this paper, we introduce Aggregative Question Answering, a novel task requiring models to reason explicitly over thousands of user-chatbot interactions to answer aggregative queries, such as identifying emerging concerns among specific demographics. To enable research in this direction, we construct a benchmark, WildChat-AQA, comprising 6,027 aggregative questions derived from 182,330 real-world chatbot conversations. Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs, underscoring the need for new approaches capable of extracting collective insights from large-scale conversational data.
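Abstract (translated from Chinese): Conversational agents powered by large language models (LLMs) are rapidly becoming integral to our daily interactions, generating unprecedented amounts of conversational data. Such datasets offer a powerful lens into societal interests, trending topics, and collective concerns. Yet existing approaches typically treat these interactions as independent and miss critical insights that could emerge from aggregating and reasoning across large-scale conversation logs. In this paper, we introduce Aggregative Question Answering, a novel task that requires models to reason explicitly over thousands of user-chatbot interactions to answer aggregative queries, such as identifying emerging concerns among specific demographics. To enable research in this direction, we construct a benchmark, WildChat-AQA, comprising 6,027 aggregative questions derived from 182,330 real-world chatbot conversations. Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs, underscoring the need for new approaches capable of extracting collective insights from large-scale conversational data.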