2025-10-10

Title: Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation

Authors: Mufei Li, Dongqi Fu, Limei Wang, Si Zhang, Hanqing Zeng, Kaan Sancak, Ruizhong Qiu, Haoyu Wang, Xiaoxin He, Xavier Bresson, Yinglong Xia, Chonglin Sun, Pan Li
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.07414
Pdf URL: https://arxiv.org/pdf/2510.07414
Copy Paste: [[2510.07414]] Haystack Engineering: Context Engineering for Heterogeneous and Agentic Long-Context Evaluation(https://arxiv.org/abs/2510.07414)
Keywords: language model, gpt, llm, long context, agent
Abstract: Modern long-context large language models (LLMs) perform well on synthetic "needle-in-a-haystack" (NIAH) benchmarks, but such tests overlook how noisy contexts arise from biased retrieval and agentic workflows. We argue that haystack engineering is necessary to construct noisy long contexts that faithfully capture key real-world factors -- distraction from heterogeneous biased retrievers and cascading errors in agentic workflows -- to test models' long-context robustness. We instantiate it through HaystackCraft, a new NIAH benchmark built on the full English Wikipedia hyperlink network with multi-hop questions. HaystackCraft evaluates how heterogeneous retrieval strategies (e.g., sparse, dense, hybrid, and graph-based) affect distractor composition, haystack ordering, and downstream LLM performance. HaystackCraft further extends NIAH to dynamic, LLM-dependent settings that simulate agentic operations, where models refine queries, reflect on their past reasonings, and decide when to stop. Experiments with 15 long-context models show that (1) while stronger dense retrievers can introduce more challenging distractors, graph-based reranking simultaneously improves retrieval effectiveness and mitigates more harmful distractors; (2) in agentic tests, even advanced models like Gemini 2.5 Pro and GPT-5 suffer cascading failures from self-generated distractors or struggle to perform early stops. These results highlight persistent challenges in agentic long-context reasoning and establish HaystackCraft as a valuable testbed for future progress.
摘要：现代长上下文大语言模型 (LLM) 在综合“大海捞针”(NIAH) 基准上表现良好，但此类测试忽略了有偏见的检索和代理工作流程如何产生嘈杂的上下文。我们认为，干草堆工程对于构建嘈杂的长上下文是必要的，这些长上下文可以忠实地捕获关键的现实世界因素——对异构偏见检索器的干扰和代理工作流程中的级联错误——以测试模型的长上下文稳健性。我们通过 HaystackCraft 实例化它，HaystackCraft 是一个新的 NIAH 基准测试，构建在完整的英文维基百科超链接网络上，具有多跳问题。 HaystackCraft 评估异构检索策略（例如稀疏、密集、混合和基于图形）如何影响干扰项组成、haystack 排序和下游 LLM 性能。 HaystackCraft 进一步将 NIAH 扩展到动态的、依赖于 LLM 的设置，模拟代理操作，模型在其中细化查询、反思过去的推理并决定何时停止。对 15 个长上下文模型的实验表明，(1) 虽然更强的密集检索器可能会引入更具挑战性的干扰因素，但基于图的重新排名同时提高了检索效率并减少了更多有害的干扰因素； (2) 在代理测试中，即使像 Gemini 2.5 Pro 和 GPT-5 这样的高级模型也会因自身产生的干扰因素而遭受级联故障，或者难以执行早期停止。这些结果凸显了代理长上下文推理中持续存在的挑战，并将 HaystackCraft 确立为未来进步的宝贵测试平台。

Title: Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data

Authors: Olia Toporkov, Alan Akbik, Rodrigo Agerri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07434
Pdf URL: https://arxiv.org/pdf/2510.07434
Copy Paste: [[2510.07434]] Lemma Dilemma: On Lemma Generation Without Domain- or Language-Specific Training Data(https://arxiv.org/abs/2510.07434)
Keywords: language model, llm
Abstract: Lemmatization is the task of transforming all words in a given text to their dictionary forms. While large language models (LLMs) have demonstrated their ability to achieve competitive results across a wide range of NLP tasks, there is no prior evidence of how effective they are in the contextual lemmatization task. In this paper, we empirically investigate the capacity of the latest generation of LLMs to perform in-context lemmatization, comparing it to the traditional fully supervised approach. In particular, we consider the setting in which supervised training data is not available for a target domain or language, comparing (i) encoder-only supervised approaches, fine-tuned out-of-domain, and (ii) cross-lingual methods, against direct in-context lemma generation with LLMs. Our experimental investigation across 12 languages of different morphological complexity finds that, while encoders remain competitive in out-of-domain settings when fine-tuned on gold data, current LLMs reach state-of-the-art results for most languages by directly generating lemmas in-context without prior fine-tuning, provided just with a few examples. Data and code available upon publication: this https URL
摘要：词形还原是将给定文本中的所有单词转换为其字典形式的任务。虽然大型语言模型 (LLM) 已经证明了它们在各种 NLP 任务中取得有竞争力的结果的能力，但没有事先证据表明它们在上下文词形还原任务中的有效性。在本文中，我们实证研究了最新一代法学硕士执行上下文词形还原的能力，并将其与传统的完全监督方法进行比较。特别是，我们考虑了监督训练数据不可用于目标域或语言的情况，将（i）仅编码器监督方法、域外微调方法和（ii）跨语言方法与法学硕士直接上下文引理生成进行比较。我们对 12 种不同形态复杂性的语言进行的实验调查发现，虽然编码器在黄金数据上进行微调时在域外设置中仍然具有竞争力，但当前的法学硕士通过直接在上下文中生成引理而无需事先微调，就可以达到大多数语言的最先进结果（仅提供几个示例）。发布后可获得的数据和代码：此 https URL

Title: LASER: An LLM-based ASR Scoring and Evaluation Rubric

Authors: Amruta Parulekar, Preethi Jyothi
Subjects: cs.CL, cs.AI, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2510.07437
Pdf URL: https://arxiv.org/pdf/2510.07437
Copy Paste: [[2510.07437]] LASER: An LLM-based ASR Scoring and Evaluation Rubric(https://arxiv.org/abs/2510.07437)
Keywords: llm, prompt
Abstract: Standard ASR evaluation metrics like Word Error Rate (WER) tend to unfairly penalize morphological and syntactic nuances that do not significantly alter sentence semantics. We introduce an LLM-based scoring rubric LASER that leverages state-of-the-art LLMs' in-context learning abilities to learn from prompts with detailed examples. Hindi LASER scores using Gemini 2.5 Pro achieved a very high correlation score of 94% with human annotations. Hindi examples in the prompt were also effective in analyzing errors in other Indian languages such as Marathi, Kannada and Malayalam. We also demonstrate how a smaller LLM like Llama 3 can be finetuned on word-pair examples derived from reference and ASR predictions to predict what kind of penalty should be applied with close to 89% accuracy.
摘要：标准 ASR 评估指标（例如单词错误率 (WER)）往往会不公平地惩罚不会显着改变句子语义的形态和句法细微差别。我们引入了基于法学硕士的评分标准 LASER，它利用最先进的法学硕士的情境学习能力，从带有详细示例的提示中学习。使用 Gemini 2.5 Pro 的印地语 LASER 评分与人工注释的相关性高达 94%。提示中的印地语示例对于分析马拉地语、卡纳达语和马拉雅拉姆语等其他印度语言的错误也很有效。我们还演示了如何对像 Llama 3 这样的小型 LLM 进行微调，以从参考和 ASR 预测中导出的词对示例来预测应该应用哪种惩罚，准确率接近 89%。

Title: Populism Meets AI: Advancing Populism Research with LLMs

Authors: Eduardo Ryô Tamaki (German Institute for Global and Area Studies), Yujin J. Jung (Mount St. Mary's University), Julia Chatterley (Princeton University), Grant Mitchell (University of California, Los Angeles), Semir Dzebo (University of Oxford), Cristóbal Sandoval (Diego Portales University), Levente Littvay (ELTE Centre for Social Sciences), Kirk A. Hawkins (Brigham Young University)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07458
Pdf URL: https://arxiv.org/pdf/2510.07458
Copy Paste: [[2510.07458]] Populism Meets AI: Advancing Populism Research with LLMs(https://arxiv.org/abs/2510.07458)
Keywords: llm, prompt
Abstract: Measuring the ideational content of populism remains a challenge. Traditional strategies based on textual analysis have been critical for building the field's foundations and providing a valid, objective indicator of populist framing. Yet these approaches are costly, time consuming, and difficult to scale across languages, contexts, and large corpora. Here we present the results from a rubric and anchor guided chain of thought (CoT) prompting approach that mirrors human coder training. By leveraging the Global Populism Database (GPD), a comprehensive dataset of global leaders' speeches annotated for degrees of populism, we replicate the process used to train human coders by prompting the LLM with an adapted version of the same documentation to guide the model's reasoning. We then test multiple proprietary and open weight models by replicating scores in the GPD. Our findings reveal that this domain specific prompting strategy enables the LLM to achieve classification accuracy on par with expert human coders, demonstrating its ability to navigate the nuanced, context sensitive aspects of populism.
摘要：衡量民粹主义的理念内容仍然是一个挑战。基于文本分析的传统策略对于建立该领域的基础和提供民粹主义框架的有效、客观指标至关重要。然而，这些方法成本高昂、耗时，并且难以跨语言、上下文和大型语料库扩展。在这里，我们展示了反映人类编码员训练的标题和锚引导思想链（CoT）提示方法的结果。通过利用全球民粹主义数据库 (GPD)（一个注释了民粹主义程度的全球领导人演讲的综合数据集），我们通过提示法学硕士使用同一文档的改编版本来指导模型的推理，从而复制了用于培训人类编码员的过程。然后，我们通过复制 GPD 中的分数来测试多个专有和开放权重模型。我们的研究结果表明，这种针对特定领域的提示策略使法学硕士能够达到与人类编码专家相当的分类准确性，展示了其驾驭民粹主义的微妙、上下文敏感方面的能力。

Title: MAPRO: Recasting Multi-Agent Prompt Optimization as Maximum a Posteriori Inference

Authors: Zheyuan Zhang, Lin Ge, Hongjiang Li, Weicheng Zhu, Chuxu Zhang, Yanfang Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07475
Pdf URL: https://arxiv.org/pdf/2510.07475
Copy Paste: [[2510.07475]] MAPRO: Recasting Multi-Agent Prompt Optimization as Maximum a Posteriori Inference(https://arxiv.org/abs/2510.07475)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, and LLM-based agents further extend these abilities to various practical workflows. While recent progress shows that multi-agent systems (MAS) can outperform single agents by coordinating specialized roles, designing effective MAS remains difficult due to prompt sensitivity and the compounded instability MAS creates. To cope with the challenge, recent efforts in automated prompt design have reduced manual effort. However, multi-agent prompt optimization remains largely unexplored. Challenges like exponentially expanding search space and ambiguous credit assignment together make systematic design intractable without principled methods. Therefore, we introduce M}ulti-Agent PRompt Optimization (MAPRO), a four-stage framework that first formulates MAS prompt optimization as a Maximum a Posteriori (MAP) inference problem and solves it using a language-guided variant of max-product belief propagation algorithm. To address credit assignment and updates the system iteratively, MAPRO employs a topology-aware refinement mechanism that integrates execution feedback and downstream blames to selectively update agent prompts. Through this process, MAPRO progressively converges to a coordinated set of agent-specific prompt policies. Across benchmarks in various tasks, MAPRO achieves state-of-the-art performance, consistently surpassing manually engineered baselines and recent automated alternatives. Beyond performance, our MAP-based formulation also delivers general guidelines for building more reliable and principled multi-agent systems in the future
摘要：大型语言模型 (LLM) 在不同的任务中表现出了卓越的能力，而基于 LLM 的代理进一步将这些能力扩展到各种实际工作流程。虽然最近的进展表明多智能体系统 (MAS) 可以通过协调专门的角色来超越单一智能体，但由于快速的敏感性和 MAS 产生的复合不稳定性，设计有效的 MAS 仍然很困难。为了应对这一挑战，最近在自动提示设计方面的努力减少了手动工作。然而，多代理提示优化在很大程度上仍未得到探索。指数级扩展的搜索空间和模糊的信用分配等挑战使得系统设计在没有原则性方法的情况下变得棘手。因此，我们引入了多代理提示优化 (MAPRO)，这是一个四阶段框架，首先将 MAS 提示优化表述为最大后验 (MAP) 推理问题，并使用最大乘积置信传播算法的语言引导变体来解决它。为了解决信用分配和迭代更新系统的问题，MAPRO 采用了拓扑感知的细化机制，该机制集成了执行反馈和下游责备，以有选择地更新代理提示。通过这个过程，MAPRO 逐渐收敛到一组协调的特定于代理的提示策略。在各种任务的基准测试中，MAPRO 实现了最先进的性能，始终超越手动设计的基准和最新的自动化替代方案。除了性能之外，我们基于 MAP 的公式还为未来构建更可靠、更有原则的多智能体系统提供了一般准则

Title: AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding

Authors: Shuqing Luo, Yilin Guan, Pingzhi Li, Hanrui Wang, Tianlong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07486
Pdf URL: https://arxiv.org/pdf/2510.07486
Copy Paste: [[2510.07486]] AsyncSpade: Efficient Test-Time Scaling with Asynchronous Sparse Decoding(https://arxiv.org/abs/2510.07486)
Keywords: llm, chain-of-thought
Abstract: Test-time scaling (TTS) boosts LLM reasoning via long chain-of-thought (CoT), but the linear KV-cache growth amplifies the memory-bound bottleneck of LLM decoding. Query-aware page-level sparse decoding can achieve state-of-the-art performance under constrained FLOPs budgets, but is limited by both sequential-dependent page filtering and coarse-grained token selection, hampering serving efficiency and model performance on TTS tasks under high concurrency and long CoT scenarios (consuming even higher runtime than the forward pipeline itself). In this paper, we first find that the current-step query state can be accurately approximated in a unified manner from a short window of recent queries, enabling training-free query-aware sparsity without waiting in the decoding loop. We propose AsyncSpade, an asynchronous framework for efficient TTS built on two core components: (1) a novel light-weight temporal-regressive module that predicts the next-token query state; (2) an asynchronous and disaggregated framework that decouples the KV cache filtering from the auto-regressive decoding loop, overlapping the token-level KV selection with the forward inference computation through asynchronism. To our knowledge, AsyncSpade is the first to eliminate the sequential dependence without sacrificing model performance. We validate the effectiveness of AsyncSpade on common LLM serving setups with an A100 node, where AsyncSpade fully overlaps KV-cache operations with the inference pipeline, achieving theoretical optimal time-per-output-token (TPOT). Specifically, AsyncSpade delivers over 20% reduction on TPOT compared to SoTA baseline (i.e. Quest) and at least 50% TPOT reduction compared to full attention on Qwen3-8B and Qwen3-32B models, while matching or surpassing their accuracy on various TTS benchmarks (AIME-24/25, GPQA-Diamond, MATH-500).
摘要：测试时间扩展 (TTS) 通过长思想链 (CoT) 增强了 LLM 推理，但线性 KV 缓存增长放大了 LLM 解码的内存限制瓶颈。查询感知的页面级稀疏解码可以在有限的 FLOP 预算下实现最先进的性能，但受到顺序相关的页面过滤和粗粒度令牌选择的限制，在高并发和长 CoT 场景下（消耗的运行时间甚至比前向管道本身更高）阻碍了 TTS 任务的服务效率和模型性能。在本文中，我们首先发现可以从最近查询的短窗口中以统一的方式准确地近似当前步查询状态，从而实现免训练的查询感知稀疏性，而无需在解码循环中等待。我们提出了 AsyncSpade，这是一个基于两个核心组件构建的高效 TTS 异步框架：(1) 一个新颖的轻量级时间回归模块，可以预测下一个令牌查询状态；（2）异步分解框架，将KV缓存过滤与自回归解码循环解耦，通过异步将令牌级KV选择与前向推理计算重叠。据我们所知，AsyncSpade 是第一个在不牺牲模型性能的情况下消除顺序依赖的方法。我们通过 A100 节点验证 AsyncSpade 在常见 LLM 服务设置上的有效性，其中 AsyncSpade 将 KV 缓存操作与推理管道完全重叠，从而实现理论上的最佳每个输出令牌时间 (TPOT)。具体而言，与 SoTA 基线（即 Quest）相比，AsyncSpade 的 TPOT 减少了 20% 以上，与 Qwen3-8B 和 Qwen3-32B 模型的完全注意力相比，TPOT 减少了至少 50%，同时在各种 TTS 基准（AIME-24/25、GPQA-Diamond、MATH-500）上匹配或超过了它们的准确性。

Title: Can Lessons From Human Teams Be Applied to Multi-Agent Systems? The Role of Structure, Diversity, and Interaction Dynamics

Authors: Rasika Muralidharan, Jaewoon Kwak, Jisun An
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07488
Pdf URL: https://arxiv.org/pdf/2510.07488
Copy Paste: [[2510.07488]] Can Lessons From Human Teams Be Applied to Multi-Agent Systems? The Role of Structure, Diversity, and Interaction Dynamics(https://arxiv.org/abs/2510.07488)
Keywords: language model, llm, agent
Abstract: Multi-Agent Systems (MAS) with Large Language Model (LLM)-powered agents are gaining attention, yet fewer studies explore their team dynamics. Inspired by human team science, we propose a multi-agent framework to examine core aspects of team science: structure, diversity, and interaction dynamics. We evaluate team performance across four tasks: CommonsenseQA, StrategyQA, Social IQa, and Latent Implicit Hate, spanning commonsense and social reasoning. Our results show that flat teams tend to perform better than hierarchical ones, while diversity has a nuanced impact. Interviews suggest agents are overconfident about their team performance, yet post-task reflections reveal both appreciation for collaboration and challenges in integration, including limited conversational coordination.
摘要：具有大型语言模型 (LLM) 支持的代理的多代理系统 (MAS) 正在引起人们的关注，但探索其团队动态的研究却很少。受人类团队科学的启发，我们提出了一个多智能体框架来检查团队科学的核心方面：结构、多样性和交互动态。我们通过四项任务评估团队绩效：CommonsenseQA、StrategyQA、Social IQa 和潜在隐性仇恨，涵盖常识和社会推理。我们的结果表明，扁平化团队往往比层级化团队表现更好，而多样性则有着微妙的影响。访谈表明，座席对自己的团队绩效过于自信，但任务后的反思却显示出对协作的赞赏和整合方面的挑战，包括有限的对话协调。

Title: Can Speech LLMs Think while Listening?

Authors: Yi-Jen Shih, Desh Raj, Chunyang Wu, Wei Zhou, SK Bong, Yashesh Gaur, Jay Mahadeokar, Ozlem Kalinli, Mike Seltzer
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2510.07497
Pdf URL: https://arxiv.org/pdf/2510.07497
Copy Paste: [[2510.07497]] Can Speech LLMs Think while Listening?(https://arxiv.org/abs/2510.07497)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: Recent advances in speech large language models (speech LLMs) have enabled seamless spoken interactions, but these systems still struggle with complex reasoning tasks. Previously, chain-of-thought (CoT) prompting or fine-tuning has been to shown to significantly improve the reasoning abilities of text-based LLMs. In this work, we investigate the effect of CoT fine-tuning for multi-stream speech LLMs, demonstrating that reasoning in text space improves the accuracy of speech LLMs by 2.4x, on average, over a suite of spoken reasoning tasks. Beyond accuracy, the latency of the spoken response is a crucial factor for interacting with voice-based agents. Inspired by the human behavior of "thinking while listening," we propose methods to reduce the additional latency from reasoning by allowing the model to start reasoning before the user query has ended. To achieve this, we introduce an entropy-based metric, "question completeness," which acts as an indicator to guide the model on the optimal time to start reasoning. This method provides greater control over the accuracy-latency trade-off compared with heuristic-based approaches and, under equivalent latency conditions, yields a 4% accuracy gain on ARC-Easy. Finally, we use Direct Preference Optimization (DPO) on preference data created using rejection sampling to push the accuracy-latency pareto frontier further, resulting in a 70% reduction in latency without loss in accuracy.
摘要：语音大语言模型（语音 LLM）的最新进展已经实现了无缝的口语交互，但这些系统仍然难以应对复杂的推理任务。此前，思想链（CoT）提示或微调已被证明可以显着提高基于文本的法学硕士的推理能力。在这项工作中，我们研究了 CoT 微调对多流语音 LLM 的影响，证明文本空间中的推理比一组口语推理任务平均将语音 LLM 的准确性提高了 2.4 倍。除了准确性之外，语音响应的延迟是与基于语音的代理交互的一个关键因素。受到“边听边思考”的人类行为的启发，我们提出了通过允许模型在用户查询结束之前开始推理来减少推理的额外延迟的方法。为了实现这一目标，我们引入了一种基于熵的指标“问题完整性”，它充当指导模型开始推理的最佳时间的指标。与基于启发式的方法相比，该方法可以更好地控制准确性与延迟的权衡，并且在同等延迟条件下，ARC-Easy 的准确性提高了 4%。最后，我们对使用拒绝采样创建的偏好数据使用直接偏好优化 (DPO)，进一步推动准确性-延迟帕累托前沿，从而在不损失准确性的情况下减少 70% 的延迟。

Title: When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs

Authors: Soyeong Jeong, Taehee Jung, Sung Ju Hwang, Joo-Kyung Kim, Dongyeop Kang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.07499
Pdf URL: https://arxiv.org/pdf/2510.07499
Copy Paste: [[2510.07499]] When Thoughts Meet Facts: Reusable Reasoning for Long-Context LMs(https://arxiv.org/abs/2510.07499)
Keywords: language model, prompt
Abstract: Recent Long-Context Language Models (LCLMs) can process hundreds of thousands of tokens in a single prompt, enabling new opportunities for knowledge-intensive multi-hop reasoning by integrating large sets of retrieved documents or, in some cases, directly all necessary information. However, simply feeding more documents into the context window fails to capture how evidence should be connected. We address this gap with thought templates, which recast reasoning as reusable thought caches, derived from prior problem solving traces, structuring how evidence is combined and guiding multi-hop inference with factual documents. To keep these templates effective, we propose an update strategy that iteratively refines templates derived from training data through natural-language feedback. Across diverse benchmarks and LCLM families, our approach delivers consistent gains over strong baselines in both retrieval-based and retrieval-free settings. Furthermore, we show that optimized templates can be distilled into smaller open-source models, demonstrating its broad applicability and transparent reasoning reuse. We refer to our framework as Thought Template Augmented LCLMs (ToTAL).
摘要：最近的长上下文语言模型（LCLM）可以在单个提示中处理数十万个标记，通过集成大量检索到的文档或在某些情况下直接集成所有必要信息，为知识密集型多跳推理提供新的机会。然而，仅仅将更多文档输入上下文窗口无法捕获证据应如何连接。我们通过思维模板来解决这一差距，思维模板将推理重新构建为可重用的思维缓存，源自先前的问题解决轨迹，构建证据的组合方式并指导事实文档的多跳推理。为了保持这些模板的有效性，我们提出了一种更新策略，通过自然语言反馈迭代地细化从训练数据派生的模板。在不同的基准和 LCLM 系列中，我们的方法在基于检索和无检索的设置中都比强大的基线提供了一致的收益。此外，我们表明优化的模板可以被提炼成更小的开源模型，展示其广泛的适用性和透明的推理重用。我们将我们的框架称为思想模板增强 LCLM (ToTAL)。

Title: OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs

Authors: Jaeseong Lee, seung-won hwang, Aurick Qiao, Gabriele Oliaro, Ye Wang, Samyam Rajbhandari
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07535
Pdf URL: https://arxiv.org/pdf/2510.07535
Copy Paste: [[2510.07535]] OWL: Overcoming Window Length-Dependence in Speculative Decoding for Long-Context Inputs(https://arxiv.org/abs/2510.07535)
Keywords: language model, llm, long context
Abstract: Speculative decoding promises faster inference for large language models (LLMs), yet existing methods fail to generalize to real-world settings. Benchmarks typically assume short contexts (e.g., 2K tokens), whereas practical workloads involve long contexts. We find current approaches degrade severely with long contexts; for instance, EAGLE3 even slows down the generation speed by 0.81x. We address these limitations by releasing a new long-context benchmark (LongSpecBench) and introducing a novel model (OWL). OWL achieves about 5x higher acceptance length than EAGLE3 on long-context inputs through three innovations: (1) an LSTM-based drafter conditioned only on the last-token state, making it generalize to various lengths, (2) a special token [SPEC] in the verifier that produces richer representation for drafter, and (3) a hybrid algorithm combining both tree and non-tree decoding methods. We release all code and datasets to advance future research.
摘要：推测性解码有望为大型语言模型 (LLM) 提供更快的推理，但现有方法无法推广到现实世界的设置。基准通常假设短上下文（例如 2K 令牌），而实际工作负载涉及长上下文。我们发现当前的方法在长上下文中会严重退化；例如，EAGLE3甚至将生成速度降低了0.81倍。我们通过发布新的长上下文基准测试 (LongSpecBench) 并引入新颖的模型 (OWL) 来解决这些限制。 OWL 通过三项创新，在长上下文输入上实现了比 EAGLE3 高约 5 倍的接受长度：(1) 基于 LSTM 的绘图器仅以最后一个标记状态为条件，使其泛化到各种长度，(2) 验证器中的特殊标记 [SPEC]，可为绘图器生成更丰富的表示，以及 (3) 结合树和非树解码方法的混合算法。我们发布所有代码和数据集以推进未来的研究。

Title: Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices

Authors: Md Tahmid Rahman Laskar, Mohammed Saidul Islam, Ridwan Mahbub, Mizanur Rahman, Amran Bhuiyan, Israt Jahan, Mir Tafseer Nayeem, Shafiq Joty, Enamul Hoque, Jimmy Huang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.07545
Pdf URL: https://arxiv.org/pdf/2510.07545
Copy Paste: [[2510.07545]] Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices(https://arxiv.org/abs/2510.07545)
Keywords: language model, prompt
Abstract: Large Vision-Language Models (LVLMs) with only 7B parameters have shown promise as automated judges in chart comprehension tasks. However, tiny models (<=2B parameters) still perform poorly as judges, limiting their real-world use in resource-constrained settings. To address this, we propose two approaches to ensure cost-efficient evaluation: (i) multi-criteria prompting, which combines separate evaluation criteria into a single query, and (ii) domain-adaptive transfer learning, in which we fine-tune a 2B-parameter LVLM on synthetic judgments in a chart dataset to create the ChartJudge. Experiments show that multi-criteria prompting exposes robustness gaps, which led to a huge drop in performance for 7B models, including specialized LVLM judges like LLaVA-Critic. In addition, we find that our tiny LVLM (ChartJudge) can effectively transfer knowledge from one dataset to another to make it a more specialized model. Our fine-grained analysis across chart types and query complexities offers actionable insights into trade-offs between model size, prompt design, and transferability, enabling scalable, low-cost evaluation for chart reasoning tasks. Our code and the data will be made publicly available.
摘要：仅具有 7B 参数的大型视觉语言模型 (LVLM) 在图表理解任务中显示出作为自动判断的前景。然而，微小模型（<=2B 参数）作为法官的表现仍然很差，限制了它们在资源有限的环境中的实际使用。为了解决这个问题，我们提出了两种方法来确保经济高效的评估：（i）多标准提示，将单独的评估标准组合到单个查询中，以及（ii）领域自适应迁移学习，其中我们根据图表数据集中的综合判断微调 2B 参数 LVLM 以创建 ChartJudge。实验表明，多标准提示暴露了鲁棒性差距，导致 7B 模型的性能大幅下降，包括 LLaVA-Critic 等专门的 LVLM 判断器。此外，我们发现我们的小型 LVLM (ChartJudge) 可以有效地将知识从一个数据集转移到另一个数据集，使其成为更专业的模型。我们对图表类型和查询复杂性进行细粒度分析，为模型大小、提示设计和可转移性之间的权衡提供了可操作的见解，从而实现了图表推理任务的可扩展、低成本评估。我们的代码和数据将公开。

Title: IASC: Interactive Agentic System for ConLangs

Authors: Chihiro Taguchi, Richard Sproat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07591
Pdf URL: https://arxiv.org/pdf/2510.07591
Copy Paste: [[2510.07591]] IASC: Interactive Agentic System for ConLangs(https://arxiv.org/abs/2510.07591)
Keywords: llm, agent
Abstract: We present a system that uses LLMs as a tool in the development of Constructed Languages. The system is modular in that one first creates a target phonology for the language using an agentic approach that refines its output at each step with commentary feedback on its previous attempt. Next, a set of sentences is 'translated' from their English original into a morphosyntactic markup that reflects the word order and morphosyntactic feature specifications of the desired target language, with affixes represented as morphosyntactic feature bundles. From this translated corpus, a lexicon is constructed using the phonological model and the set of morphemes (stems and affixes) extracted from the 'translated' sentences. The system is then instructed to provide an orthography for the language, using an existing script such as Latin or Cyrillic. Finally, the system writes a brief grammatical handbook of the language. The system can also translate further sentences into the target language. Our goal is twofold. First, we hope that these tools will be fun to use for creating artificially constructed languages. Second, we are interested in exploring what LLMs 'know' about language-not what they know about any particular language or linguistic phenomenon, but how much they know about and understand language and linguistic concepts. As we shall see, there is a fairly wide gulf in capabilities both among different LLMs and among different linguistic specifications, with it being notably easier for systems to deal with more common patterns than rarer ones. An additional avenue that we explore is the application of our approach to translating from high-resource into low-resource languages. While the results so far are mostly negative, we provide some evidence that an improved version of the present system could afford some real gains in such tasks. this https URL
摘要：我们提出了一个使用法学硕士作为构建语言开发工具的系统。该系统是模块化的，因为首先使用代理方法为该语言创建目标音系，该方法通过对之前尝试的评论反馈来完善每一步的输出。接下来，将一组句子从其英语原文“翻译”为形态句法标记，该标记反映了所需目标语言的词序和形态句法特征规范，其中词缀表示为形态句法特征束。根据这个翻译语料库，使用语音模型和从“翻译”句子中提取的语素（词干和词缀）集构建词典。然后系统被指示使用现有的脚本（例如拉丁语或西里尔语）提供该语言的正字法。最后，系统编写了该语言的简短语法手册。该系统还可以将更多句子翻译成目标语言。我们的目标是双重的。首先，我们希望这些工具在创建人工构建的语言时会很有趣。其次，我们有兴趣探索法学硕士对语言的“了解”，不是他们对任何特定语言或语言现象了解多少，而是他们对语言和语言概念的了解和理解程度。正如我们将看到的，不同的法学硕士和不同的语言规范之间的能力存在相当大的差距，系统处理更常见的模式比处理罕见的模式要容易得多。我们探索的另一个途径是应用我们的方法将高资源语言翻译成低资源语言。虽然到目前为止的结果大多是负面的，但我们提供了一些证据，表明现有系统的改进版本可以在此类任务中带来一些真正的收益。这个 https 网址

Title: Vocabulary embeddings organize linguistic structure early in language model training

Authors: Isabel Papadimitriou, Jacob Prince
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07613
Pdf URL: https://arxiv.org/pdf/2510.07613
Copy Paste: [[2510.07613]] Vocabulary embeddings organize linguistic structure early in language model training(https://arxiv.org/abs/2510.07613)
Keywords: language model, llm
Abstract: Large language models (LLMs) work by manipulating the geometry of input embedding vectors over multiple layers. Here, we ask: how are the input vocabulary representations of language models structured, and how and when does this structure evolve over training? To answer this question, we use representational similarity analysis, running a suite of experiments that correlate the geometric structure of the input embeddings and output embeddings of two open-source models (Pythia 12B and OLMo 7B) with semantic, syntactic, and frequency-based metrics over the course of training. Our key findings are as follows: 1) During training, the vocabulary embedding geometry quickly converges to high correlations with a suite of semantic and syntactic features; 2) Embeddings of high-frequency and function words (e.g., "the," "of") converge to their final vectors faster than lexical and low-frequency words, which retain some alignment with the bias in their random initializations. These findings help map the dynamic trajectory by which input embeddings organize around linguistic structure, revealing distinct roles for word frequency and function. Our findings motivate a deeper study of how the evolution of vocabulary geometry may facilitate specific capability gains during model training.
摘要：大型语言模型 (LLM) 通过操纵多个层上的输入嵌入向量的几何形状来工作。在这里，我们要问：语言模型的输入词汇表示是如何构造的，以及这种结构如何以及何时在训练中演变？为了回答这个问题，我们使用表征相似性分析，运行一系列实验，将两个开源模型（Pythia 12B 和 OLMo 7B）的输入嵌入和输出嵌入的几何结构与训练过程中的语义、句法和基于频率的指标相关联。我们的主要发现如下：1）在训练过程中，词汇嵌入几何快速收敛到与一组语义和句法特征的高度相关性； 2）高频词和功能词（例如“the”、“of”）的嵌入比词汇词和低频词更快地收敛到最终向量，这些词在随机初始化中保留了一些与偏差的对齐。这些发现有助于绘制输入嵌入围绕语言结构组织的动态轨迹，揭示词频和功能的不同作用。我们的发现激发了对词汇几何的演变如何促进模型训练期间特定能力增益的更深入研究。

Title: Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation

Authors: Zhangdie Yuan, Han-Chin Shing, Mitch Strong, Chaitanya Shivade
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07629
Pdf URL: https://arxiv.org/pdf/2510.07629
Copy Paste: [[2510.07629]] Toward Reliable Clinical Coding with Language Models: Verification and Lightweight Adaptation(https://arxiv.org/abs/2510.07629)
Keywords: language model, llm, prompt
Abstract: Accurate clinical coding is essential for healthcare documentation, billing, and decision-making. While prior work shows that off-the-shelf LLMs struggle with this task, evaluations based on exact match metrics often overlook errors where predicted codes are hierarchically close but incorrect. Our analysis reveals that such hierarchical misalignments account for a substantial portion of LLM failures. We show that lightweight interventions, including prompt engineering and small-scale fine-tuning, can improve accuracy without the computational overhead of search-based methods. To address hierarchically near-miss errors, we introduce clinical code verification as both a standalone task and a pipeline component. To mitigate the limitations in existing datasets, such as incomplete evidence and inpatient bias in MIMIC, we release an expert double-annotated benchmark of outpatient clinical notes with ICD-10 codes. Our results highlight verification as an effective and reliable step toward improving LLM-based medical coding.
摘要：准确的临床编码对于医疗保健文档、计费和决策至关重要。虽然之前的工作表明现成的法学硕士很难完成这项任务，但基于精确匹配指标的评估通常会忽略预测代码层次结构接近但不正确的错误。我们的分析表明，这种层级错位是法学硕士失败的很大一部分原因。我们证明，轻量级干预（包括即时工程和小规模微调）可以提高准确性，而无需基于搜索的方法的计算开销。为了解决分层的未遂错误，我们引入临床代码验证作为独立任务和管道组件。为了缓解现有数据集的局限性，例如 MIMIC 中证据不完整和住院患者偏差，我们发布了带有 ICD-10 代码的专家双注释门诊临床记录基准。我们的结果强调验证是改进基于法学硕士的医学编码的有效且可靠的一步。

Title: Role-Conditioned Refusals: Evaluating Access Control Reasoning in Large Language Models

Authors: Đorđe Klisura, Joseph Khoury, Ashish Kundu, Ram Krishnan, Anthony Rios
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07642
Pdf URL: https://arxiv.org/pdf/2510.07642
Copy Paste: [[2510.07642]] Role-Conditioned Refusals: Evaluating Access Control Reasoning in Large Language Models(https://arxiv.org/abs/2510.07642)
Keywords: language model, llm, prompt
Abstract: Access control is a cornerstone of secure computing, yet large language models often blur role boundaries by producing unrestricted responses. We study role-conditioned refusals, focusing on the LLM's ability to adhere to access control policies by answering when authorized and refusing when not. To evaluate this behavior, we created a novel dataset that extends the Spider and BIRD text-to-SQL datasets, both of which have been modified with realistic PostgreSQL role-based policies at the table and column levels. We compare three designs: (i) zero or few-shot prompting, (ii) a two-step generator-verifier pipeline that checks SQL against policy, and (iii) LoRA fine-tuned models that learn permission awareness directly. Across multiple model families, explicit verification (the two-step framework) improves refusal precision and lowers false permits. At the same time, fine-tuning achieves a stronger balance between safety and utility (i.e., when considering execution accuracy). Longer and more complex policies consistently reduce the reliability of all systems. We release RBAC-augmented datasets and code.
摘要：访问控制是安全计算的基石，但大型语言模型通常会通过产生不受限制的响应来模糊角色边界。我们研究角色条件拒绝，重点关注法学硕士通过在授权时回答和在未授权时拒绝来遵守访问控制策略的能力。为了评估这种行为，我们创建了一个新颖的数据集，该数据集扩展了 Spider 和 BIRD 文本到 SQL 数据集，这两个数据集都已在表和列级别使用基于实际 PostgreSQL 角色的策略进行了修改。我们比较了三种设计：(i) 零次或几次提示，(ii) 根据策略检查 SQL 的两步生成器验证器管道，以及 (iii) 直接学习权限感知的 LoRA 微调模型。在多个模型系列中，显式验证（两步框架）提高了拒绝精度并降低了错误许可。同时，微调在安全性和实用性之间实现了更强的平衡（即，在考虑执行准确性时）。更长、更复杂的策略会不断降低所有系统的可靠性。我们发布了 RBAC 增强数据集和代码。

Title: Banking Done Right: Redefining Retail Banking with Language-Centric AI

Authors: Xin Jie Chua, Jeraelyn Ming Li Tan, Jia Xuan Tan, Soon Chang Poh, Yi Xian Goh, Debbie Hui Tian Choong, Chee Mun Foong, Sze Jue Yang, Chee Seng Chan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07645
Pdf URL: https://arxiv.org/pdf/2510.07645
Copy Paste: [[2510.07645]] Banking Done Right: Redefining Retail Banking with Language-Centric AI(https://arxiv.org/abs/2510.07645)
Keywords: llm, agent
Abstract: This paper presents Ryt AI, an LLM-native agentic framework that powers Ryt Bank to enable customers to execute core financial transactions through natural language conversation. This represents the first global regulator-approved deployment worldwide where conversational AI functions as the primary banking interface, in contrast to prior assistants that have been limited to advisory or support roles. Built entirely in-house, Ryt AI is powered by ILMU, a closed-source LLM developed internally, and replaces rigid multi-screen workflows with a single dialogue orchestrated by four LLM-powered agents (Guardrails, Intent, Payment, and FAQ). Each agent attaches a task-specific LoRA adapter to ILMU, which is hosted within the bank's infrastructure to ensure consistent behavior with minimal overhead. Deterministic guardrails, human-in-the-loop confirmation, and a stateless audit architecture provide defense-in-depth for security and compliance. The result is Banking Done Right: demonstrating that regulator-approved natural-language interfaces can reliably support core financial operations under strict governance.
摘要：本文介绍了 Ryt AI，这是一种 LLM 原生代理框架，为 Ryt Bank 提供支持，使客户能够通过自然语言对话执行核心金融交易。这是全球首个获得全球监管机构批准的部署，其中对话式人工智能作为主要银行界面，而之前的助手仅限于咨询或支持角色。 Ryt AI 完全由内部构建，由内部开发的闭源法学硕士 ILMU 提供支持，并用由四个法学硕士支持的代理（Guardrails、意图、支付和常见问题解答）精心编排的单一对话取代了严格的多屏幕工作流程。每个代理将一个特定于任务的 LoRA 适配器连接到 ILMU，该适配器托管在银行的基础设施内，以确保以最小的开销实现一致的行为。确定性护栏、人机交互确认和无状态审计架构为安全性和合规性提供深度防御。其结果是银行业做得正确：证明监管机构批准的自然语言界面可以在严格治理下可靠地支持核心金融运营。

Title: OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference

Authors: Yuzhe Gu, Xiyu Liang, Jiaojiao Zhao, Enmao Diao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07651
Pdf URL: https://arxiv.org/pdf/2510.07651
Copy Paste: [[2510.07651]] OBCache: Optimal Brain KV Cache Pruning for Efficient Long-Context LLM Inference(https://arxiv.org/abs/2510.07651)
Keywords: language model, llm
Abstract: Large language models (LLMs) with extended context windows enable powerful downstream applications but impose significant memory overhead, as caching all key-value (KV) states scales linearly with sequence length and batch size. Existing cache eviction methods address this by exploiting attention sparsity, yet they typically rank tokens heuristically using accumulated attention weights without considering their true impact on attention outputs. We propose Optimal Brain Cache (OBCache), a principled framework that formulates cache eviction as a layer-wise structured pruning problem. Building upon the Optimal Brain Damage (OBD) theory, OBCache quantifies token saliency by measuring the perturbation in attention outputs induced by pruning tokens, with closed-form scores derived for isolated keys, isolated values, and joint key-value pairs. Our scores account not only for attention weights but also for information from value states and attention outputs, thereby enhancing existing eviction strategies with output-aware signals. Experiments on LLaMA and Qwen models demonstrate that replacing the heuristic scores in existing works, which estimate token saliency across different query positions, with OBCache's output-aware scores consistently improves long-context accuracy.
摘要：具有扩展上下文窗口的大型语言模型 (LLM) 可实现强大的下游应用程序，但会带来巨大的内存开销，因为缓存所有键值 (KV) 状态会随序列长度和批量大小线性扩展。现有的缓存驱逐方法通过利用注意力稀疏性来解决这个问题，但它们通常使用累积的注意力权重启发式地对令牌进行排名，而不考虑它们对注意力输出的真正影响。我们提出了最佳大脑缓存（OBCache），这是一个原则框架，它将缓存驱逐制定为分层结构化修剪问题。基于最佳脑损伤 (OBD) 理论，OBCache 通过测量修剪标记引起的注意力输出扰动来量化标记显着性，并为孤立键、孤立值和联合键值对导出闭合形式分数。我们的分数不仅考虑了注意力权重，还考虑了来自价值状态和注意力输出的信息，从而通过输出感知信号增强了现有的驱逐策略。 LLaMA 和 Qwen 模型的实验表明，用 OBCache 的输出感知分数替换现有作品中的启发式分数（估计不同查询位置的标记显着性）可以持续提高长上下文准确性。

Title: Textual Entailment and Token Probability as Bias Evaluation Metrics

Authors: Virginia K. Felkner, Allison Lim, Jonathan May
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2510.07662
Pdf URL: https://arxiv.org/pdf/2510.07662
Copy Paste: [[2510.07662]] Textual Entailment and Token Probability as Bias Evaluation Metrics(https://arxiv.org/abs/2510.07662)
Keywords: language model
Abstract: Measurement of social bias in language models is typically by token probability (TP) metrics, which are broadly applicable but have been criticized for their distance from real-world langugage model use cases and harms. In this work, we test natural language inference (NLI) as a more realistic alternative bias metric. We show that, curiously, NLI and TP bias evaluation behave substantially differently, with very low correlation among different NLI metrics and between NLI and TP metrics. We find that NLI metrics are more likely to detect "underdebiased" cases. However, NLI metrics seem to be more brittle and sensitive to wording of counterstereotypical sentences than TP approaches. We conclude that neither token probability nor natural language inference is a "better" bias metric in all cases, and we recommend a combination of TP, NLI, and downstream bias evaluations to ensure comprehensive evaluation of language models. Content Warning: This paper contains examples of anti-LGBTQ+ stereotypes.
摘要：语言模型中社会偏见的衡量通常是通过令牌概率（TP）指标来衡量的，这些指标广泛适用，但因其与现实世界语言模型用例的距离和危害而受到批评。在这项工作中，我们测试自然语言推理（NLI）作为更现实的替代偏差指标。奇怪的是，我们发现，NLI 和 TP 偏差评估的表现截然不同，不同 NLI 指标之间以及 NLI 和 TP 指标之间的相关性非常低。我们发现 NLI 指标更有可能检测到“偏差不足”的案例。然而，与 TP 方法相比，NLI 指标似乎对反刻板句子的措辞更加脆弱和敏感。我们的结论是，在所有情况下，标记概率和自然语言推理都不是“更好”的偏差度量，我们建议结合 TP、NLI 和下游偏差评估来确保对语言模型的综合评估。内容警告：本文包含反 LGBTQ+ 刻板印象的示例。

Title: Stress-Testing Model Specs Reveals Character Differences among Language Models

Authors: Jifan Zhang, Henry Sleight, Andi Peng, John Schulman, Esin Durmus
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07686
Pdf URL: https://arxiv.org/pdf/2510.07686
Copy Paste: [[2510.07686]] Stress-Testing Model Specs Reveals Character Differences among Language Models(https://arxiv.org/abs/2510.07686)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy we generate diverse value tradeoff scenarios where models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we provide numerous example issues in current model specs such as direct contradiction and interpretive ambiguities of several principles. Additionally, our generated dataset also reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study. Lastly, we also provide value prioritization patterns and differences of these models.
摘要：大型语言模型 (LLM) 越来越多地根据人工智能章程和模型规范进行训练，这些规范和规范建立了行为准则和道德原则。然而，这些规范面临着严峻的挑战，包括原则之间的内部冲突以及对细微场景的覆盖不足。我们提出了一种对模型特征规范进行压力测试的系统方法，自动识别当前模型规范中存在大量原则矛盾和解释模糊的情况。我们通过生成迫使基于价值的竞争原则之间进行明确权衡的场景来对当前模型规范进行压力测试。使用全面的分类法，我们生成了不同的价值权衡场景，其中模型必须在无法同时满足的合法原则对之间进行选择。我们评估主要提供商（Anthropic、OpenAI、Google、xAI）的 12 名前沿法学硕士的回答，并通过价值分类分数衡量行为分歧。在这些场景中，我们发现了超过 70,000 个表现出显着行为差异的案例。根据经验，我们表明模型行为的这种高度差异强烈地预测了模型规范中的潜在问题。通过定性分析，我们提供了当前模型规范中的大量示例问题，例如几个原则的直接矛盾和解释模糊性。此外，我们生成的数据集还揭示了我们研究的所有前沿模型中明显的错位案例和误报拒绝。最后，我们还提供了价值优先模式以及这些模型的差异。

Title: Large Language Models Meet Virtual Cell: A Survey

Authors: Krinos Li, Xianglu Xiao, Shenglong Deng, Lucas He, Zijun Zhong, Yuanjie Zou, Zhonghao Zhan, Zheng Hui, Weiye Bao, Guang Yang
Subjects: cs.CL, cs.CE, cs.LG, q-bio.CB
Abstract URL: https://arxiv.org/abs/2510.07706
Pdf URL: https://arxiv.org/pdf/2510.07706
Copy Paste: [[2510.07706]] Large Language Models Meet Virtual Cell: A Survey(https://arxiv.org/abs/2510.07706)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are transforming cellular biology by enabling the development of "virtual cells"--computational systems that represent, predict, and reason about cellular states and behaviors. This work provides a comprehensive review of LLMs for virtual cell modeling. We propose a unified taxonomy that organizes existing methods into two paradigms: LLMs as Oracles, for direct cellular modeling, and LLMs as Agents, for orchestrating complex scientific tasks. We identify three core tasks--cellular representation, perturbation prediction, and gene regulation inference--and review their associated models, datasets, evaluation benchmarks, as well as the critical challenges in scalability, generalizability, and interpretability.
摘要：大语言模型 (LLM) 正在通过开发“虚拟细胞”（表示、预测和推理细胞状态和行为的计算系统）来改变细胞生物学。这项工作对虚拟细胞建模的法学硕士进行了全面的回顾。我们提出了一个统一的分类法，将现有方法组织成两个范式：LLM作为Oracle，用于直接细胞建模，LLM作为代理，用于编排复杂的科学任务。我们确定了三个核心任务——细胞表示、扰动预测和基因调控推理——并回顾了它们的相关模型、数据集、评估基准，以及可扩展性、普遍性和可解释性方面的关键挑战。

Title: MemWeaver: A Hierarchical Memory from Textual Interactive Behaviors for Personalized Generation

Authors: Shuo Yu, Mingyue Cheng, Daoyu Wang, Qi Liu, Zirui Liu, Ze Guo, Xiaoyu Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07713
Pdf URL: https://arxiv.org/pdf/2510.07713
Copy Paste: [[2510.07713]] MemWeaver: A Hierarchical Memory from Textual Interactive Behaviors for Personalized Generation(https://arxiv.org/abs/2510.07713)
Keywords: language model, llm
Abstract: The primary form of user-internet engagement is shifting from leveraging implicit feedback signals, such as browsing and clicks, to harnessing the rich explicit feedback provided by textual interactive behaviors. This shift unlocks a rich source of user textual history, presenting a profound opportunity for a deeper form of personalization. However, prevailing approaches offer only a shallow form of personalization, as they treat user history as a flat list of texts for retrieval and fail to model the rich temporal and semantic structures reflecting dynamic nature of user interests. In this work, we propose \textbf{MemWeaver}, a framework that weaves the user's entire textual history into a hierarchical memory to power deeply personalized generation. The core innovation of our memory lies in its ability to capture both the temporal evolution of interests and the semantic relationships between different activities. To achieve this, MemWeaver builds two complementary memory components that both integrate temporal and semantic information, but at different levels of abstraction: behavioral memory, which captures specific user actions, and cognitive memory, which represents long-term preferences. This dual-component memory serves as a unified representation of the user, allowing large language models (LLMs) to reason over both concrete behaviors and abstracted traits. Experiments on the Language Model Personalization (LaMP) benchmark validate the efficacy of MemWeaver. Our code is available\footnote{this https URL}.
摘要：用户互联网参与的主要形式正在从利用隐式反馈信号（例如浏览和点击）转变为利用文本交互行为提供的丰富的显式反馈。这种转变释放了丰富的用户文本历史来源，为更深层次的个性化提供了巨大的机会。然而，流行的方法仅提供浅层的个性化形式，因为它们将用户历史视为用于检索的平面文本列表，并且无法对反映用户兴趣的动态性质的丰富的时间和语义结构进行建模。在这项工作中，我们提出了 \textbf{MemWeaver}，这是一个将用户的整个文本历史记录编织到分层内存中以支持深度个性化生成的框架。我们记忆的核心创新在于它能够捕捉兴趣的时间演变和不同活动之间的语义关系。为了实现这一目标，MemWeaver 构建了两个互补的记忆组件，它们都集成了时间和语义信息，但处于不同的抽象级别：行为记忆（捕获特定的用户操作）和认知记忆（代表长期偏好）。这种双组件记忆作为用户的统一表示，允许大型语言模型（LLM）对具体行为和抽象特征进行推理。语言模型个性化 (LaMP) 基准测试的实验验证了 MemWeaver 的功效。我们的代码可用\footnote{此 https URL}。

Title: SUBQRAG: sub-question driven dynamic graph rag

Authors: Jiaoyang Li, Junhao Ruan, Shengwei Tang, Saihan Chen, Kaiyan Chang, Yuan Ge, Tong Xiao, Jingbo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07718
Pdf URL: https://arxiv.org/pdf/2510.07718
Copy Paste: [[2510.07718]] SUBQRAG: sub-question driven dynamic graph rag(https://arxiv.org/abs/2510.07718)
Keywords: retrieval-augmented generation
Abstract: Graph Retrieval-Augmented Generation (Graph RAG) effectively builds a knowledge graph (KG) to connect disparate facts across a large document corpus. However, this broad-view approach often lacks the deep structured reasoning needed for complex multi-hop question answering (QA), leading to incomplete evidence and error accumulation. To address these limitations, we propose SubQRAG, a sub-question-driven framework that enhances reasoning depth. SubQRAG decomposes a complex question into an ordered chain of verifiable sub-questions. For each sub-question, it retrieves relevant triples from the graph. When the existing graph is insufficient, the system dynamically expands it by extracting new triples from source documents in real time. All triples used in the reasoning process are aggregated into a "graph memory," forming a structured and traceable evidence path for final answer generation. Experiments on three multi-hop QA benchmarks demonstrate that SubQRAG achieves consistent and significant improvements, especially in Exact Match scores.
摘要：图检索增强生成（Graph RAG）有效地构建知识图（KG）来连接大型文档语料库中的不同事实。然而，这种宽视角方法往往缺乏复杂多跳问答（QA）所需的深层结构化推理，导致证据不完整和错误积累。为了解决这些限制，我们提出了 SubQRAG，一个子问题驱动的框架，可以增强推理深度。 SubQRAG 将复杂问题分解为可验证子问题的有序链。对于每个子问题，它从图中检索相关的三元组。当现有图不够时，系统通过实时从源文档中提取新的三元组来动态扩展它。推理过程中使用的所有三元组都聚合到“图形存储器”中，形成最终答案生成的结构化且可追踪的证据路径。对三个多跳 QA 基准的实验表明，SubQRAG 实现了一致且显着的改进，尤其是在精确匹配分数方面。

Title: Multilingual Knowledge Graph Completion via Efficient Multilingual Knowledge Sharing

Authors: Cunli Mao, Xiaofei Gao, Ran Song, Shizhu He, Shengxiang Gao, Kang Liu, Zhengtao Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07736
Pdf URL: https://arxiv.org/pdf/2510.07736
Copy Paste: [[2510.07736]] Multilingual Knowledge Graph Completion via Efficient Multilingual Knowledge Sharing(https://arxiv.org/abs/2510.07736)
Keywords: language model, llm
Abstract: Large language models (LLMs) based Multilingual Knowledge Graph Completion (MKGC) aim to predict missing facts by leveraging LLMs' multilingual understanding capabilities, improving the completeness of multilingual knowledge graphs (KGs). However, existing MKGC research underutilizes the multilingual capabilities of LLMs and ignores the shareability of cross-lingual knowledge. In this paper, we propose a novel MKGC framework that leverages multilingual shared knowledge to significantly enhance performance through two components: Knowledge-level Grouped Mixture of Experts (KL-GMoE) and Iterative Entity Reranking (IER). KL-GMoE efficiently models shared knowledge, while IER significantly enhances its utilization. To evaluate our framework, we constructed a mKG dataset containing 5 languages and conducted comprehensive comparative experiments with existing state-of-the-art (SOTA) MKGC method. The experimental results demonstrate that our framework achieves improvements of 5.47%, 3.27%, and 1.01% in the Hits@1, Hits@3, and Hits@10 metrics, respectively, compared with SOTA MKGC method. Further experimental analysis revealed the properties of knowledge sharing in settings of unseen and unbalanced languages. We have released the dataset and code for our work on this https URL.
摘要：基于大语言模型（LLM）的多语言知识图谱补全（MKGC）旨在利用LLM的多语言理解能力来预测缺失的事实，提高多语言知识图谱（KG）的完整性。然而，现有的MKGC研究并未充分利用法学硕士的多语言能力，并忽视了跨语言知识的共享性。在本文中，我们提出了一种新颖的 MKGC 框架，该框架利用多语言共享知识，通过两个组件显着提高性能：知识级专家分组混合 (KL-GMoE) 和迭代实体重排序 (IER)。 KL-GMoE 有效地对共享知识进行建模，而 IER 则显着提高了其利用率。为了评估我们的框架，我们构建了一个包含 5 种语言的 mKG 数据集，并与现有最先进的 (SOTA) MKGC 方法进行了全面的比较实验。实验结果表明，与 SOTA MKGC 方法相比，我们的框架在 Hits@1、Hits@3 和 Hits@10 指标上分别实现了 5.47%、3.27% 和 1.01% 的改进。进一步的实验分析揭示了在看不见的和不平衡的语言环境中知识共享的特性。我们已经在此 https URL 上发布了我们工作的数据集和代码。

Title: ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs

Authors: Fu Chen, Peng Wang, Xiyin Li, Wen Li, Shichi Lei, Dongdong Xiang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.07737
Pdf URL: https://arxiv.org/pdf/2510.07737
Copy Paste: [[2510.07737]] ToolExpander: Extending the Frontiers of Tool-Using Reinforcement Learning to Weak LLMs(https://arxiv.org/abs/2510.07737)
Keywords: language model, llm
Abstract: Training Large Language Models (LLMs) with Group Relative Policy Optimization (GRPO) encounters a significant challenge: models often fail to produce accurate responses, particularly in small-scale architectures. This limitation not only diminishes performance improvements and undermines the potential of GRPO but also frequently leads to mid-training collapse, adversely affecting stability and final efficacy. To address these issues, we propose ToolExpander, a novel framework that advances tool-oriented reinforcement learning for resource-constrained LLMs through two key innovations:(1) Dynamic Multi-Round Hard Sampling, which dynamically substitutes challenging samples(those without correct outputs over 10 rollouts) with high-quality few-shot demonstrations during training, coupled with an exponential learning rate decay strategy to mitigate oscillations;(2) Self-Exemplifying Thinking, an enhanced GRPO framework that eliminates KL divergence and incorporates adjusted clipping coefficients, encouraging models to autonomously generate and analyze few-shot examples via a minimal additional reward (0.01).Experimental results demonstrate that ToolExpander significantly enhances tool-using capabilities in LLMs, especially in weaker small-scale models, improving both training stability and overall performance.
摘要：使用组相对策略优化 (GRPO) 训练大型语言模型 (LLM) 遇到了重大挑战：模型通常无法产生准确的响应，特别是在小规模架构中。这种限制不仅削弱了性能的提高并削弱了 GRPO 的潜力，而且还经常导致训练中期崩溃，对稳定性和最终功效产生不利影响。为了解决这些问题，我们提出了 ToolExpander，这是一种新颖的框架，通过两项关键创新，为资源有限的法学硕士推进面向工具的强化学习：(1) 动态多轮硬采样，在训练期间用高质量的几次演示动态替换具有挑战性的样本（那些在 10 次展示中没有正确输出的样本），并结合指数学习率衰减策略来缓解（2）Self-Exemplifying Thinking，一种增强的GRPO框架，消除了KL散度并纳入了调整后的裁剪系数，通过最小的额外奖励（0.01）鼓励模型自主生成和分析少数样本。实验结果表明，ToolExpander显着增强了LLM中的工具使用能力，特别是在较弱的小规模模型中，改善了训练稳定性和整体性能。

Title: OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment

Authors: Tianci Liu, Ran Xu, Tony Yu, Ilgee Hong, Carl Yang, Tuo Zhao, Haoyu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07743
Pdf URL: https://arxiv.org/pdf/2510.07743
Copy Paste: [[2510.07743]] OpenRubrics: Towards Scalable Synthetic Rubric Generation for Reward Modeling and LLM Alignment(https://arxiv.org/abs/2510.07743)
Keywords: llm, prompt
Abstract: Reward modeling lies at the core of reinforcement learning from human feedback (RLHF), yet most existing reward models rely on scalar or pairwise judgments that fail to capture the multifaceted nature of human preferences. Recent studies have explored rubrics-as-rewards (RaR) that uses structured natural language criteria that capture multiple dimensions of response quality. However, producing rubrics that are both reliable and scalable remains a key challenge. In this work, we introduce OpenRubrics, a diverse, large-scale collection of (prompt, rubric) pairs for training rubric-generation and rubric-based reward models. To elicit discriminative and comprehensive evaluation signals, we introduce Contrastive Rubric Generation (CRG), which derives both hard rules (explicit constraints) and principles (implicit qualities) by contrasting preferred and rejected responses. We further improve reliability by enforcing preference-label consistency via rejection sampling to remove noisy rubrics. Across multiple reward-modeling benchmarks, our rubric-based reward model, Rubric-RM, surpasses strong size-matched baselines by 6.8%. These gains transfer to policy models on instruction-following and biomedical benchmarks. Our results show that rubrics provide scalable alignment signals that narrow the gap between costly human evaluation and automated reward modeling, enabling a new principle-driven paradigm for LLM alignment.
摘要：奖励建模是人类反馈强化学习 (RLHF) 的核心，但大多数现有奖励模型依赖于标量或成对判断，无法捕捉人类偏好的多方面性质。最近的研究探索了奖励规则（RaR），它使用结构化的自然语言标准来捕获响应质量的多个维度。然而，生成既可靠又可扩展的评估标准仍然是一个关键挑战。在这项工作中，我们介绍了 OpenRubrics，这是一个多样化的大规模（提示、评分标准）对集合，用于训练评分标准生成和基于评分标准的奖励模型。为了得出区分性和综合性的评估信号，我们引入了对比量规生成（CRG），它通过对比首选和拒绝的响应来导出硬规则（显性约束）和原则（隐性质量）。我们通过拒绝抽样来消除嘈杂的规则来强制偏好标签的一致性，从而进一步提高可靠性。在多个奖励建模基准中，我们基于 rubric 的奖励模型 Rubric-RM 超出了强大的规模匹配基线 6.8%。这些成果转移到了关于遵循指令和生物医学基准的政策模型。我们的结果表明，评分细则提供了可扩展的对齐信号，缩小了昂贵的人工评估和自动奖励建模之间的差距，为法学硕士对齐提供了一种新的原则驱动范例。

Title: Parallel Test-Time Scaling for Latent Reasoning Models

Authors: Runyang You, Yongqi Li, Meng Liu, Wenjie Wang, Liqiang Nie, Wenjie Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.07745
Pdf URL: https://arxiv.org/pdf/2510.07745
Copy Paste: [[2510.07745]] Parallel Test-Time Scaling for Latent Reasoning Models(https://arxiv.org/abs/2510.07745)
Keywords: language model, llm, chain-of-thought
Abstract: Parallel test-time scaling (TTS) is a pivotal approach for enhancing large language models (LLMs), typically by sampling multiple token-based chains-of-thought in parallel and aggregating outcomes through voting or search. Recent advances in latent reasoning, where intermediate reasoning unfolds in continuous vector spaces, offer a more efficient alternative to explicit Chain-of-Thought, yet whether such latent models can similarly benefit from parallel TTS remains open, mainly due to the absence of sampling mechanisms in continuous space, and the lack of probabilistic signals for advanced trajectory aggregation. \ This work enables parallel TTS for latent reasoning models by addressing the above issues. For sampling, we introduce two uncertainty-inspired stochastic strategies: Monte Carlo Dropout and Additive Gaussian Noise. For aggregation, we design a Latent Reward Model (LatentRM) trained with step-wise contrastive objective to score and guide latent reasoning. Extensive experiments and visualization analyses show that both sampling strategies scale effectively with compute and exhibit distinct exploration dynamics, while LatentRM enables effective trajectory selection. Together, our explorations open a new direction for scalable inference in continuous spaces. Code released at this https URL.
摘要：并行测试时间扩展 (TTS) 是增强大型语言模型 (LLM) 的关键方法，通常通过并行采样多个基于令牌的思想链并通过投票或搜索聚合结果。潜在推理的最新进展（其中中间推理在连续向量空间中展开）为显式思维链提供了更有效的替代方案，但此类潜在模型是否能够同样从并行 TTS 中受益仍然悬而未决，这主要是由于连续空间中缺乏采样机制，并且缺乏用于高级轨迹聚合的概率信号。这项工作通过解决上述问题，为潜在推理模型实现了并行 TTS。对于采样，我们引入了两种不确定性随机策略：蒙特卡罗辍学和加性高斯噪声。对于聚合，我们设计了一个潜在奖励模型（LatentRM），通过逐步对比目标进行训练，以评分和指导潜在推理。大量的实验和可视化分析表明，两种采样策略都可以通过计算有效扩展，并表现出独特的探索动态，而 LatentRM 可以实现有效的轨迹选择。我们的探索共同为连续空间中的可扩展推理开辟了新方向。在此 https URL 发布代码。

Title: Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers

Authors: Nishant Balepur, Atrey Desai, Rachel Rudinger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07761
Pdf URL: https://arxiv.org/pdf/2510.07761
Copy Paste: [[2510.07761]] Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers(https://arxiv.org/abs/2510.07761)
Keywords: language model, llm
Abstract: Large language models (LLMs) now give reasoning before answering, excelling in tasks like multiple-choice question answering (MCQA). Yet, a concern is that LLMs do not solve MCQs as intended, as work finds LLMs sans reasoning succeed in MCQA without using the question, i.e., choices-only. Such partial-input success is often deemed problematic, but reasoning traces could reveal if these strategies are truly shallow in choices-only settings. To study these strategies, reasoning LLMs solve MCQs in full and choices-only inputs; test-time reasoning often boosts accuracy on full and in choices-only half the time. While possibly due to shallow shortcuts, choices-only success is barely affected by the length of reasoning traces, and after finding traces pass faithfulness tests, we show they use less problematic strategies like inferring missing questions. In all, we challenge claims that partial-input success is always a flaw, so we discuss how reasoning traces could separate problematic data from less problematic reasoning.
摘要：大型语言模型 (LLM) 现在可以在回答之前进行推理，在多项选择题回答 (MCQA) 等任务中表现出色。然而，令人担忧的是，法学硕士并没有按预期解决 MCQ，因为研究发现，没有推理的法学硕士在不使用问题（即仅选择）的 MCQA 中取得了成功。这种部分输入的成功通常被认为是有问题的，但推理痕迹可以揭示这些策略在仅选择的环境中是否真的很肤浅。为了研究这些策略，推理法学硕士需要以完整且仅选择的输入来解决 MCQ；测试时推理通常可以提高全部的准确性，但只有一半的情况下可以提高选择的准确性。虽然可能是由于浅薄的捷径，但仅选择的成功几乎不受推理痕迹长度的影响，并且在发现痕迹通过忠实性测试后，我们表明他们使用了问题较少的策略，例如推断缺失的问题。总而言之，我们对“部分输入成功始终是一个缺陷”的说法提出质疑，因此我们讨论推理痕迹如何将有问题的数据与问题较少的推理分开。

Title: ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning

Authors: Murong Yue, Zhiwei Liu, Liangwei Yang, Jianguo Zhang, Zuxin Liu, Haolin Chen, Ziyu Yao, Silvio Savarese, Caiming Xiong, Shelby Heinecke, Huan Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07768
Pdf URL: https://arxiv.org/pdf/2510.07768
Copy Paste: [[2510.07768]] ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning(https://arxiv.org/abs/2510.07768)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Large Language Models (LLMs) equipped with external tools have demonstrated enhanced performance on complex reasoning tasks. The widespread adoption of this tool-augmented reasoning is hindered by the scarcity of domain-specific tools. For instance, in domains such as physics question answering, suitable and specialized tools are often missing. Recent work has explored automating tool creation by extracting reusable functions from Chain-of-Thought (CoT) reasoning traces; however, these approaches face a critical scalability bottleneck. As the number of generated tools grows, storing them in an unstructured collection leads to significant retrieval challenges, including an expanding search space and ambiguity between function-related tools. To address this, we propose a systematic approach to automatically refactor an unstructured collection of tools into a structured tool library. Our system first generates discrete, task-specific tools and clusters them into semantically coherent topics. Within each cluster, we introduce a multi-agent framework to consolidate scattered functionalities: a code agent refactors code to extract shared logic and creates versatile, aggregated tools, while a reviewing agent ensures that these aggregated tools maintain the complete functional capabilities of the original set. This process transforms numerous question-specific tools into a smaller set of powerful, aggregated tools without loss of functionality. Experimental results demonstrate that our approach significantly improves tool retrieval accuracy and overall reasoning performance across multiple reasoning tasks. Furthermore, our method shows enhanced scalability compared with baselines as the number of question-specific increases.
摘要：配备外部工具的大型语言模型（LLM）已在复杂推理任务中表现出增强的性能。由于特定领域工具的缺乏，阻碍了这种工具增强推理的广泛采用。例如，在物理问答等领域，经常缺少合适且专门的工具。最近的工作通过从思想链（CoT）推理轨迹中提取可重用的函数来探索自动化工具创建；然而，这些方法面临着关键的可扩展性瓶颈。随着生成的工具数量的增长，将它们存储在非结构化集合中会带来重大的检索挑战，包括搜索空间的扩大以及功能相关工具之间的模糊性。为了解决这个问题，我们提出了一种系统方法，可以将非结构化工具集合自动重构为结构化工具库。我们的系统首先生成离散的、特定于任务的工具，并将它们聚集成语义一致的主题。在每个集群中，我们引入了一个多代理框架来整合分散的功能：代码代理重构代码以提取共享逻辑并创建通用的聚合工具，而审查代理则确保这些聚合工具保持原始集的完整功能。此过程将众多针对特定问题的工具转换为较小的一组强大的聚合工具，而不会损失功能。实验结果表明，我们的方法显着提高了工具检索的准确性和跨多个推理任务的整体推理性能。此外，随着特定问题数量的增加，我们的方法显示出与基线相比增强的可扩展性。

Title: Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards

Authors: Youliang Yuan, Qiuyang Mang, Jingbang Chen, Hong Wan, Xiaoyuan Liu, Junjielong Xu, Jen-tse Huang, Wenxuan Wang, Wenxiang Jiao, Pinjia He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07774
Pdf URL: https://arxiv.org/pdf/2510.07774
Copy Paste: [[2510.07774]] Curing Miracle Steps in LLM Mathematical Reasoning with Rubric Rewards(https://arxiv.org/abs/2510.07774)
Keywords: language model, llm
Abstract: Large language models for mathematical reasoning are typically trained with outcome-based rewards, which credit only the final answer. In our experiments, we observe that this paradigm is highly susceptible to reward hacking, leading to a substantial overestimation of a model's reasoning ability. This is evidenced by a high incidence of false positives - solutions that reach the correct final answer through an unsound reasoning process. Through a systematic analysis with human verification, we establish a taxonomy of these failure modes, identifying patterns like Miracle Steps - abrupt jumps to a correct output without a valid preceding derivation. Probing experiments suggest a strong association between these Miracle Steps and memorization, where the model appears to recall the answer directly rather than deriving it. To mitigate this systemic issue, we introduce the Rubric Reward Model (RRM), a process-oriented reward function that evaluates the entire reasoning trajectory against problem-specific rubrics. The generative RRM provides fine-grained, calibrated rewards (0-1) that explicitly penalize logical flaws and encourage rigorous deduction. When integrated into a reinforcement learning pipeline, RRM-based training consistently outperforms outcome-only supervision across four math benchmarks. Notably, it boosts Verified Pass@1024 on AIME2024 from 26.7% to 62.6% and reduces the incidence of Miracle Steps by 71%. Our work demonstrates that rewarding the solution process is crucial for building models that are not only more accurate but also more reliable.
摘要：用于数学推理的大型语言模型通常是通过基于结果的奖励进行训练的，该奖励仅归功于最终答案。在我们的实验中，我们观察到这种范式非常容易受到奖励黑客攻击，从而导致模型推理能力的大幅高估。假阳性的高发生率证明了这一点——通过不合理的推理过程得出正确的最终答案的解决方案。通过系统分析和人工验证，我们建立了这些故障模式的分类，识别奇迹步骤之类的模式 - 在没有有效的先前推导的情况下突然跳转到正确的输出。探索实验表明这些奇迹步骤与记忆之间存在很强的关联，模型似乎直接回忆答案而不是推导出答案。为了缓解这个系统性问题，我们引入了Rubric奖励模型（RRM），这是一种面向过程的奖励函数，可以根据特定问题的规则评估整个推理轨迹。生成式 RRM 提供细粒度、校准的奖励 (0-1)，明确惩罚逻辑缺陷并鼓励严格的推论。当集成到强化学习管道中时，基于 RRM 的训练在四个数学基准上始终优于仅结果监督。值得注意的是，它将 AIME2024 上的 Verified Pass@1024 从 26.7% 提高到 62.6%，并将 Miracle Steps 的发生率降低了 71%。我们的工作表明，奖励解决过程对于构建不仅更准确而且更可靠的模型至关重要。

Title: The Unintended Trade-off of AI Alignment:Balancing Hallucination Mitigation and Safety in LLMs

Authors: Omar Mahmoud, Ali Khalil, Buddhika Laknath Semage, Thommen George Karimpanal, Santu Rana
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07775
Pdf URL: https://arxiv.org/pdf/2510.07775
Copy Paste: [[2510.07775]] The Unintended Trade-off of AI Alignment:Balancing Hallucination Mitigation and Safety in LLMs(https://arxiv.org/abs/2510.07775)
Keywords: language model, llm, hallucination
Abstract: Hallucination in large language models (LLMs) has been widely studied in recent years, with progress in both detection and mitigation aimed at improving truthfulness. Yet, a critical side effect remains largely overlooked: enhancing truthfulness can negatively impact safety alignment. In this paper, we investigate this trade-off and show that increasing factual accuracy often comes at the cost of weakened refusal behavior. Our analysis reveals that this arises from overlapping components in the model that simultaneously encode hallucination and refusal information, leading alignment methods to suppress factual knowledge unintentionally. We further examine how fine-tuning on benign datasets, even when curated for safety, can degrade alignment for the same reason. To address this, we propose a method that disentangles refusal-related features from hallucination features using sparse autoencoders, and preserves refusal behavior during fine-tuning through subspace orthogonalization. This approach prevents hallucinations from increasing while maintaining safety this http URL evaluate our method on commonsense reasoning tasks and harmful benchmarks (AdvBench and StrongReject). Results demonstrate that our approach preserves refusal behavior and task utility, mitigating the trade-off between truthfulness and safety.
摘要：近年来，大语言模型（LLM）中的幻觉得到了广泛的研究，在旨在提高真实性的检测和缓解方面取得了进展。然而，一个关键的副作用仍然在很大程度上被忽视：增强真实性可能会对安全一致性产生负面影响。在本文中，我们研究了这种权衡，并表明提高事实准确性往往是以削弱拒绝行为为代价的。我们的分析表明，这是由于模型中同时编码幻觉和拒绝信息的重叠组件造成的，导致对齐方法无意中抑制了事实知识。我们进一步研究了对良性数据集的微调，即使是出于安全考虑，也会因同样的原因而降低对齐效果。为了解决这个问题，我们提出了一种方法，使用稀疏自动编码器将拒绝相关特征与幻觉特征分开，并在通过子空间正交化进行微调期间保留拒绝行为。这种方法可以防止幻觉增加，同时保持安全性。该 http URL 评估我们在常识推理任务和有害基准（AdvBench 和 StrongReject）上的方法。结果表明，我们的方法保留了拒绝行为和任务效用，减轻了真实性和安全性之间的权衡。

Title: Drift No More? Context Equilibria in Multi-Turn LLM Interactions

Authors: Vardhan Dongre, Ryan A. Rossi, Viet Dac Lai, David Seunghyun Yoon, Dilek Hakkani-Tür, Trung Bui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07777
Pdf URL: https://arxiv.org/pdf/2510.07777
Copy Paste: [[2510.07777]] Drift No More? Context Equilibria in Multi-Turn LLM Interactions(https://arxiv.org/abs/2510.07777)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) excel at single-turn tasks such as instruction following and summarization, yet real-world deployments require sustained multi-turn interactions where user goals and conversational context persist and evolve. A recurring challenge in this setting is context drift: the gradual divergence of a model's outputs from goal-consistent behavior across turns. Unlike single-turn errors, drift unfolds temporally and is poorly captured by static evaluation metrics. In this work, we present a study of context drift in multi-turn interactions and propose a simple dynamical framework to interpret its behavior. We formalize drift as the turn-wise KL divergence between the token-level predictive distributions of the test model and a goal-consistent reference model, and propose a recurrence model that interprets its evolution as a bounded stochastic process with restoring forces and controllable interventions. We instantiate this framework in both synthetic long-horizon rewriting tasks and realistic user-agent simulations such as in $\tau$-Bench, measuring drift for several open-weight LLMs that are used as user simulators. Our experiments consistently reveal stable, noise-limited equilibria rather than runaway degradation, and demonstrate that simple reminder interventions reliably reduce divergence in line with theoretical predictions. Together, these results suggest that multi-turn drift can be understood as a controllable equilibrium phenomenon rather than as inevitable decay, providing a foundation for studying and mitigating context drift in extended interactions.
摘要：大型语言模型 (LLM) 擅长单轮任务，例如指令跟踪和总结，但现实世界的部署需要持续的多轮交互，其中用户目标和对话上下文持续存在并不断发展。在这种情况下，一个反复出现的挑战是上下文漂移：模型的输出逐渐偏离目标一致的行为。与单圈误差不同，漂移是暂时展开的，静态评估指标很难捕捉到。在这项工作中，我们提出了多轮交互中的上下文漂移的研究，并提出了一个简单的动态框架来解释其行为。我们将漂移形式化为测试模型的标记级预测分布与目标一致的参考模型之间的循环 KL 散度，并提出了一种递归模型，将其演化解释为具有恢复力和可控干预的有界随机过程。我们在合成的长范围重写任务和现实的用户代理模拟（例如$\tau$-Bench）中实例化了这个框架，测量了用作用户模拟器的几个开放权重LLM的漂移。我们的实验始终揭示了稳定的、噪声限制的平衡，而不是失控的退化，并证明简单的提醒干预措施可以可靠地减少分歧，与理论预测一致。总之，这些结果表明，多轮漂移可以被理解为一种可控的平衡现象，而不是不可避免的衰减，为研究和减轻扩展交互中的上下文漂移提供了基础。

Title: RCPU: Rotation-Constrained Error Compensation for Structured Pruning of a Large Language Model

Authors: Shuichiro Haruta, Kazunori Matsumoto, Zhi Li, Yanan Wang, Mori Kurokawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07782
Pdf URL: https://arxiv.org/pdf/2510.07782
Copy Paste: [[2510.07782]] RCPU: Rotation-Constrained Error Compensation for Structured Pruning of a Large Language Model(https://arxiv.org/abs/2510.07782)
Keywords: language model, llm
Abstract: In this paper, we propose a rotation-constrained compensation method to address the errors introduced by structured pruning of large language models (LLMs). LLMs are trained on massive datasets and accumulate rich semantic knowledge in their representation space. In contrast, pruning is typically carried out with only a small amount of calibration data, which makes output mismatches unavoidable. Although direct least-squares fitting can reduce such errors, it tends to overfit to the limited calibration set, destructively modifying pretrained weights. To overcome this difficulty, we update the pruned parameters under a rotation constraint. This constrained update preserves the geometry of output representations (i.e., norms and inner products) and simultaneously re-aligns the pruned subspace with the original outputs. Furthermore, in rotation-constrained compensation, removing components that strongly contribute to the principal directions of the output makes error recovery difficult. Since input dimensions with large variance strongly affect these principal directions, we design a variance-aware importance score that ensures such dimensions are preferentially kept in the pruned model. By combining this scoring rule with rotation-constrained updates, the proposed method effectively compensates errors while retaining the components likely to be more important in a geometry-preserving manner. In the experiments, we apply the proposed method to LLaMA-7B and evaluate it on WikiText-2 and multiple language understanding benchmarks. The results demonstrate consistently better perplexity and task accuracy compared with existing baselines.
摘要：在本文中，我们提出了一种旋转约束补偿方法来解决大型语言模型（LLM）的结构化剪枝引入的错误。法学硕士接受海量数据集的训练，并在其表示空间中积累丰富的语义知识。相反，修剪通常仅使用少量校准数据进行，这使得输出不匹配不可避免。尽管直接最小二乘拟合可以减少此类误差，但它往往会过度拟合有限的校准集，从而破坏性地修改预训练的权重。为了克服这个困难，我们在旋转约束下更新修剪后的参数。这种约束更新保留了输出表示的几何形状（即范数和内积），并同时将修剪后的子空间与原始输出重新对齐。此外，在旋转约束补偿中，去除对输出主方向有很大贡献的分量使得错误恢复变得困难。由于具有大方差的输入维度强烈影响这些主方向，因此我们设计了一个方差感知重要性得分，以确保这些维度优先保留在剪枝模型中。通过将此评分规则与旋转约束更新相结合，所提出的方法有效地补偿了误差，同时以几何保留的方式保留了可能更重要的组件。在实验中，我们将所提出的方法应用于LLaMA-7B，并在WikiText-2和多种语言理解基准上对其进行评估。结果表明，与现有基线相比，困惑度和任务准确性始终更高。

Title: LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology

Authors: Sajib Acharjee Dip, Adrika Zafor, Bikash Kumar Paul, Uddip Acharjee Shuvo, Muhit Islam Emon, Xuan Wang, Liqing Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07793
Pdf URL: https://arxiv.org/pdf/2510.07793
Copy Paste: [[2510.07793]] LLM4Cell: A Survey of Large Language and Agentic Models for Single-Cell Biology(https://arxiv.org/abs/2510.07793)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) and emerging agentic frameworks are beginning to transform single-cell biology by enabling natural-language reasoning, generative annotation, and multimodal data integration. However, progress remains fragmented across data modalities, architectures, and evaluation standards. LLM4Cell presents the first unified survey of 58 foundation and agentic models developed for single-cell research, spanning RNA, ATAC, multi-omic, and spatial modalities. We categorize these methods into five families-foundation, text-bridge, spatial, multimodal, epigenomic, and agentic-and map them to eight key analytical tasks including annotation, trajectory and perturbation modeling, and drug-response prediction. Drawing on over 40 public datasets, we analyze benchmark suitability, data diversity, and ethical or scalability constraints, and evaluate models across 10 domain dimensions covering biological grounding, multi-omics alignment, fairness, privacy, and explainability. By linking datasets, models, and evaluation domains, LLM4Cell provides the first integrated view of language-driven single-cell intelligence and outlines open challenges in interpretability, standardization, and trustworthy model development.
摘要：大语言模型 (LLM) 和新兴的代理框架开始通过支持自然语言推理、生成注释和多模式数据集成来改变单细胞生物学。然而，数据模式、架构和评估标准方面的进展仍然分散。 LLM4Cell 首次对为单细胞研究开发的 58 个基础和代理模型进行了统一调查，涵盖 RNA、ATAC、多组学和空间模式。我们将这些方法分为五个系列——基础、文本桥、空间、多模态、表观基因组和代理——并将它们映射到八个关键分析任务，包括注释、轨迹和扰动建模以及药物反应预测。我们利用 40 多个公共数据集，分析基准的适用性、数据多样性以及道德或可扩展性约束，并评估涵盖生物基础、多组学对齐、公平性、隐私和可解释性等 10 个领域维度的模型。通过链接数据集、模型和评估领域，LLM4Cell 提供了语言驱动的单细胞智能的第一个集成视图，并概述了可解释性、标准化和可信模型开发方面的开放挑战。

Title: HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation

Authors: Peilin Wu, Mian Zhang, Kun Wan, Wentian Zhao, Kaiyu He, Xinya Du, Zhiyu Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.07794
Pdf URL: https://arxiv.org/pdf/2510.07794
Copy Paste: [[2510.07794]] HiPRAG: Hierarchical Process Rewards for Efficient Agentic Retrieval Augmented Generation(https://arxiv.org/abs/2510.07794)
Keywords: llm, retrieval augmented generation, agent
Abstract: Agentic RAG is a powerful technique for incorporating external information that LLMs lack, enabling better problem solving and question answering. However, suboptimal search behaviors exist widely, such as over-search (retrieving information already known) and under-search (failing to search when necessary), which leads to unnecessary overhead and unreliable outputs. Current training methods, which typically rely on outcome-based rewards in a RL framework, lack the fine-grained control needed to address these inefficiencies. To overcome this, we introduce Hierarchical Process Rewards for Efficient agentic RAG (HiPRAG), a training methodology that incorporates a fine-grained, knowledge-grounded process reward into the RL training. Our approach evaluates the necessity of each search decision on-the-fly by decomposing the agent's reasoning trajectory into discrete, parsable steps. We then apply a hierarchical reward function that provides an additional bonus based on the proportion of optimal search and non-search steps, on top of commonly used outcome and format rewards. Experiments on the Qwen2.5 and Llama-3.2 models across seven diverse QA benchmarks show that our method achieves average accuracies of 65.4% (3B) and 67.2% (7B). This is accomplished while improving search efficiency, reducing the over-search rate to just 2.3% and concurrently lowering the under-search rate. These results demonstrate the efficacy of optimizing the reasoning process itself, not just the final outcome. Further experiments and analysis demonstrate that HiPRAG shows good generalizability across a wide range of RL algorithms, model families, sizes, and types. This work demonstrates the importance and potential of fine-grained control through RL, for improving the efficiency and optimality of reasoning for search agents.
摘要：Agentic RAG 是一种强大的技术，可以整合法学硕士缺乏的外部信息，从而更好地解决问题和回答问题。然而，次优搜索行为广泛存在，例如过度搜索（检索已知信息）和搜索不足（必要时未能搜索），这会导致不必要的开销和不可靠的输出。当前的训练方法通常依赖于强化学习框架中基于结果的奖励，缺乏解决这些低效率问题所需的细粒度控制。为了克服这个问题，我们引入了高效代理 RAG 的分层过程奖励（HiPRAG），这是一种将细粒度、以知识为基础的过程奖励融入到 RL 训练中的训练方法。我们的方法通过将代理的推理轨迹分解为离散的、可解析的步骤来动态评估每个搜索决策的必要性。然后，我们应用分层奖励函数，除了常用的结果和格式奖励之外，该函数还根据最佳搜索和非搜索步骤的比例提供额外的奖励。在七个不同 QA 基准的 Qwen2.5 和 Llama-3.2 模型上进行的实验表明，我们的方法实现了 65.4% (3B) 和 67.2% (7B) 的平均准确率。实现这一目标的同时提高了搜索效率，将过度搜索率降低至仅 2.3%，同时降低了搜索不足率。这些结果证明了优化推理过程本身的有效性，而不仅仅是最终结果。进一步的实验和分析表明，HiPRAG 在各种 RL 算法、模型系列、大小和类型中表现出良好的通用性。这项工作展示了通过强化学习进行细粒度控制的重要性和潜力，以提高搜索代理推理的效率和优化性。

Title: Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models

Authors: Eric Hanchen Jiang, Guancheng Wan, Sophia Yin, Mengting Li, Yuchen Wu, Xiao Liang, Xinfeng Li, Yizhou Sun, Wei Wang, Kai-Wei Chang, Ying Nian Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07799
Pdf URL: https://arxiv.org/pdf/2510.07799
Copy Paste: [[2510.07799]] Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models(https://arxiv.org/abs/2510.07799)
Keywords: language model, llm, agent
Abstract: The efficiency of multi-agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non-trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand-crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called \textit{Guided Topology Diffusion (GTD)}. Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards (e.g., accuracy, utility, cost), enabling real-time, gradient-free optimization towards task-adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single-step generative frameworks, enabling it to better navigate complex design trade-offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task-adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration.
摘要：由大型语言模型（LLM）驱动的多智能体系统的效率很大程度上取决于它们的通信拓扑。然而，设计最佳拓扑是一项不小的挑战，因为它需要平衡任务性能、通信成本和鲁棒性等相互竞争的目标。现有的框架通常依赖于静态或手工制作的拓扑，这本质上无法适应不同的任务要求，导致简单问题的过度令牌消耗或复杂问题的性能瓶颈。为了应对这一挑战，我们引入了一种新颖的生成框架，称为 \textit{引导拓扑扩散（GTD）}。受条件离散图扩散模型的启发，GTD 将拓扑综合表述为迭代构建过程。在每一步中，生成过程都由轻量级代理模型引导，该模型预测多目标奖励（例如，准确性、效用、成本），从而实现针对任务自适应拓扑的实时、无梯度优化。这种迭代的引导综合过程将 GTD 与单步生成框架区分开来，使其能够更好地进行复杂的设计权衡。我们在多个基准测试中验证了 GTD，实验表明该框架可以生成高度任务自适应、稀疏且高效的通信拓扑，显着优于 LLM 代理协作中的现有方法。

Title: AdaSwitch: Adaptive Switching Generation for Knowledge Distillation

Authors: Jingyu Peng, Maolin Wang, Hengyi Cai, Yuchen Li, Kai Zhang, Shuaiqiang Wang, Dawei Yin, Xiangyu Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07842
Pdf URL: https://arxiv.org/pdf/2510.07842
Copy Paste: [[2510.07842]] AdaSwitch: Adaptive Switching Generation for Knowledge Distillation(https://arxiv.org/abs/2510.07842)
Keywords: language model, llm
Abstract: Small language models (SLMs) are crucial for applications with strict latency and computational constraints, yet achieving high performance remains challenging. Knowledge distillation (KD) can transfer capabilities from large teacher models, but existing methods involve trade-offs: off-policy distillation provides high-quality supervision but introduces a training-inference mismatch, while on-policy approaches maintain consistency but rely on low-quality student outputs. To address these issues, we propose AdaSwitch, a novel approach that dynamically combines on-policy and off-policy generation at the token level. AdaSwitch allows the student to first explore its own predictions and then selectively integrate teacher guidance based on real-time quality assessment. This approach simultaneously preserves consistency and maintains supervision quality. Experiments on three datasets with two teacher-student LLM pairs demonstrate that AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
摘要：小语言模型 (SLM) 对于具有严格延迟和计算限制的应用程序至关重要，但实现高性能仍然具有挑战性。知识蒸馏（KD）可以从大型教师模型中转移能力，但现有方法涉及权衡：离策略蒸馏提供高质量的监督，但引入了训练-推理不匹配，而在策略方法保持一致性，但依赖于低质量的学生输出。为了解决这些问题，我们提出了 AdaSwitch，这是一种在代币级别动态结合在策略和离策略生成的新颖方法。 AdaSwitch 允许学生首先探索自己的预测，然后根据实时质量评估有选择地整合教师指导。这种方法同时保持一致性并保持监督质量。对两个师生 LLM 对的三个数据集进行的实验表明，AdaSwitch 持续提高了准确性，提供了一种实用且有效的方法，以可接受的额外开销来提炼 SLM。

Title: Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains

Authors: Md. Faiyaz Abdullah Sayeedi, Md. Mahbub Alam, Subhey Sadi Rahman, Md. Adnanul Islam, Jannatul Ferdous Deepti, Tasnim Mohiuddin, Md Mofijul Islam, Swakkhar Shatabda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07877
Pdf URL: https://arxiv.org/pdf/2510.07877
Copy Paste: [[2510.07877]] Ready to Translate, Not to Represent? Bias and Performance Gaps in Multilingual LLMs Across Language Families and Domains(https://arxiv.org/abs/2510.07877)
Keywords: language model, llm
Abstract: The rise of Large Language Models (LLMs) has redefined Machine Translation (MT), enabling context-aware and fluent translations across hundreds of languages and textual domains. Despite their remarkable capabilities, LLMs often exhibit uneven performance across language families and specialized domains. Moreover, recent evidence reveals that these models can encode and amplify different biases present in their training data, posing serious concerns for fairness, especially in low-resource languages. To address these gaps, we introduce Translation Tangles, a unified framework and dataset for evaluating the translation quality and fairness of open-source LLMs. Our approach benchmarks 24 bidirectional language pairs across multiple domains using different metrics. We further propose a hybrid bias detection pipeline that integrates rule-based heuristics, semantic similarity filtering, and LLM-based validation. We also introduce a high-quality, bias-annotated dataset based on human evaluations of 1,439 translation-reference pairs. The code and dataset are accessible on GitHub: this https URL
摘要：大型语言模型 (LLM) 的兴起重新定义了机器翻译 (MT)，实现了跨数百种语言和文本领域的上下文感知和流畅翻译。尽管法学硕士拥有卓越的能力，但它们在不同语言家族和专业领域的表现往往参差不齐。此外，最近的证据表明，这些模型可以编码和放大训练数据中存在的不同偏差，从而对公平性造成严重担忧，尤其是在资源匮乏的语言中。为了解决这些差距，我们引入了 Translation Tangles，这是一个统一的框架和数据集，用于评估开源法学硕士的翻译质量和公平性。我们的方法使用不同的指标对跨多个领域的 24 个双向语言对进行基准测试。我们进一步提出了一种混合偏差检测管道，集成了基于规则的启发式、语义相似性过滤和基于 LLM 的验证。我们还引入了基于 1,439 个翻译参考对的人工评估的高质量、带有偏差注释的数据集。代码和数据集可在 GitHub 上访问：此 https URL

Title: Do LLMs Really Need 10+ Thoughts for "Find the Time 1000 Days Later"? Towards Structural Understanding of LLM Overthinking

Authors: Xinliang Frederick Zhang, Anhad Mohananey, Alexandra Chronopoulou, Pinelopi Papalampidi, Somit Gupta, Tsendsuren Munkhdalai, Lu Wang, Shyam Upadhyay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07880
Pdf URL: https://arxiv.org/pdf/2510.07880
Copy Paste: [[2510.07880]] Do LLMs Really Need 10+ Thoughts for "Find the Time 1000 Days Later"? Towards Structural Understanding of LLM Overthinking(https://arxiv.org/abs/2510.07880)
Keywords: llm, chain-of-thought
Abstract: Models employing long chain-of-thought (CoT) reasoning have shown superior performance on complex reasoning tasks. Yet, this capability introduces a critical and often overlooked inefficiency -- overthinking -- models often engage in unnecessarily extensive reasoning even for simple queries, incurring significant computations without accuracy improvements. While prior work has explored solutions to mitigate overthinking, a fundamental gap remains in our understanding of its underlying causes. Most existing analyses are limited to superficial, profiling-based observations, failing to delve into LLMs' inner workings. This study introduces a systematic, fine-grained analyzer of LLMs' thought process to bridge the gap, TRACE. We first benchmark the overthinking issue, confirming that long-thinking models are five to twenty times slower on simple tasks with no substantial gains. We then use TRACE to first decompose the thought process into minimally complete sub-thoughts. Next, by inferring discourse relationships among sub-thoughts, we construct granular thought progression graphs and subsequently identify common thinking patterns for topically similar queries. Our analysis reveals two major patterns for open-weight thinking models -- Explorer and Late Landing. This finding provides evidence that over-verification and over-exploration are the primary drivers of overthinking in LLMs. Grounded in thought structures, we propose a utility-based definition of overthinking, which moves beyond length-based metrics. This revised definition offers a more insightful understanding of LLMs' thought progression, as well as practical guidelines for principled overthinking management.
摘要：采用长思维链 (CoT) 推理的模型在复杂推理任务上表现出了卓越的性能。然而，这种能力带来了一个严重且经常被忽视的低效率问题——过度思考——即使对于简单的查询，模型也经常进行不必要的广泛推理，从而导致大量计算，而精度却没有提高。虽然之前的工作已经探索了减轻过度思考的解决方案，但我们对其根本原因的理解仍然存在根本差距。大多数现有分析仅限于表面的、基于分析的观察，未能深入研究法学硕士的内部运作。本研究引入了一种系统的、细粒度的法学硕士思维过程分析仪来弥补差距，即 TRACE。我们首先对过度思考问题进行基准测试，确认长期思考模型在简单任务上的速度要慢五到二十倍，而且没有实质性的收益。然后，我们使用 TRACE 首先将思维过程分解为最小完整的子思维。接下来，通过推断子思想之间的话语关系，我们构建了细粒度的思想进展图，并随后识别了主题相似的查询的常见思维模式。我们的分析揭示了开放权重思维模型的两种主要模式——探索者和迟到。这一发现提供了证据，证明过度验证和过度探索是法学硕士过度思考的主要驱动因素。基于思想结构，我们提出了一个基于效用的过度思考定义，它超越了基于长度的指标。这一修订后的定义提供了对法学硕士思想进程的更深入的理解，以及有原则的过度思考管理的实用指南。

Title: CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching

Authors: Heyang Liu, Yuhao Wang, Ziyang Cheng, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07881
Pdf URL: https://arxiv.org/pdf/2510.07881
Copy Paste: [[2510.07881]] CS3-Bench: Evaluating and Enhancing Speech-to-Speech LLMs for Mandarin-English Code-Switching(https://arxiv.org/abs/2510.07881)
Keywords: language model, llm
Abstract: The advancement of multimodal large language models has accelerated the development of speech-to-speech interaction systems. While natural monolingual interaction has been achieved, we find existing models exhibit deficiencies in language alignment. In our proposed Code-Switching Speech-to-Speech Benchmark (CS3-Bench), experiments on 7 mainstream models demonstrate a relative performance drop of up to 66% in knowledge-intensive question answering and varying degrees of misunderstanding in open-ended conversations. Starting from a model with severe performance deterioration, we propose both data constructions and training approaches to improve the language alignment capabilities, specifically employing Chain of Recognition (CoR) to enhance understanding and Keyword Highlighting (KH) to guide generation. Our approach improves the knowledge accuracy from 25.14% to 46.13%, with open-ended understanding rate from 64.5% to 86.5%, and significantly reduces pronunciation errors in the secondary language. CS3-Bench is available at this https URL.
摘要：多模态大语言模型的进步加速了语音交互系统的发展。虽然已经实现了自然的单语交互，但我们发现现有模型在语言对齐方面存在缺陷。在我们提出的 Code-Switching Speech-to-Speech Benchmark (CS3-Bench) 中，对 7 个主流模型的实验表明，在知识密集型问答中相对性能下降高达 66%，并且在开放式对话中存在不同程度的误解。从性能严重恶化的模型开始，我们提出了数据构造和训练方法来提高语言对齐能力，特别是使用识别链（CoR）来增强理解，并使用关键字突出显示（KH）来指导生成。我们的方法将知识准确率从 25.14% 提高到 46.13%，开放式理解率从 64.5% 提高到 86.5%，并显着减少第二语言的发音错误。 CS3-Bench 可通过此 https URL 获取。

Title: Contrastive Weak-to-strong Generalization

Authors: Houcheng Jiang, Junfeng Fang, Jiaxin Wu, Tianyu Zhang, Chen Gao, Yong Li, Xiang Wang, Xiangnan He, Yang Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07884
Pdf URL: https://arxiv.org/pdf/2510.07884
Copy Paste: [[2510.07884]] Contrastive Weak-to-strong Generalization(https://arxiv.org/abs/2510.07884)
Keywords: language model, llm
Abstract: Weak-to-strong generalization provides a promising paradigm for scaling large language models (LLMs) by training stronger models on samples from aligned weaker ones, without requiring human feedback or explicit reward modeling. However, its robustness and generalization are hindered by the noise and biases in weak-model outputs, which limit its applicability in practice. To address this challenge, we leverage implicit rewards, which approximate explicit rewards through log-likelihood ratios, and reveal their structural equivalence with Contrastive Decoding (CD), a decoding strategy shown to reduce noise in LLM generation. Building on this connection, we propose Contrastive Weak-to-Strong Generalization (ConG), a framework that employs contrastive decoding between pre- and post-alignment weak models to generate higher-quality samples. This approach enables more reliable capability transfer, denoising, and improved robustness, substantially mitigating the limitations of traditional weak-to-strong methods. Empirical results across different model families confirm consistent improvements, demonstrating the generality and effectiveness of ConG. Taken together, our findings highlight the potential of ConG to advance weak-to-strong generalization and provide a promising pathway toward AGI.
摘要：弱到强泛化为扩展大型语言模型 (LLM) 提供了一种有前途的范例，通过在对齐的较弱模型上训练更强的模型，而不需要人类反馈或显式奖励建模。然而，其鲁棒性和泛化性受到弱模型输出中的噪声和偏差的阻碍，从而限制了其在实践中的适用性。为了应对这一挑战，我们利用隐式奖励，它通过对数似然比近似显式奖励，并揭示它们与对比解码（CD）的结构等价性，对比解码是一种被证明可以减少 LLM 生成中噪声的解码策略。基于这种联系，我们提出了对比弱到强泛化（ConG），该框架采用对齐前和对齐后弱模型之间的对比解码来生成更高质量的样本。这种方法可以实现更可靠的能力传输、去噪并提高鲁棒性，从而大大减轻传统弱到强方法的局限性。不同模型系列的实证结果证实了一致的改进，证明了 ConG 的通用性和有效性。总而言之，我们的研究结果凸显了 ConG 推进弱到强泛化的潜力，并为 AGI 提供了一条有希望的途径。

Title: Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models

Authors: Hyeonseok Moon, Seongtae Hong, Jaehyung Seo, Heuiseok Lim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07892
Pdf URL: https://arxiv.org/pdf/2510.07892
Copy Paste: [[2510.07892]] Metric Calculating Benchmark: Code-Verifiable Complicate Instruction Following Benchmark for Large Language Models(https://arxiv.org/abs/2510.07892)
Keywords: language model, llm
Abstract: Recent frontier-level LLMs have saturated many previously difficult benchmarks, leaving little room for further differentiation. This progress highlights the need for challenging benchmarks that provide objective verification. In this paper, we introduce MCBench, a benchmark designed to evaluate whether LLMs can execute string-matching NLP metrics by strictly following step-by-step instructions. Unlike prior benchmarks that depend on subjective judgments or general reasoning, MCBench offers an objective, deterministic and codeverifiable evaluation. This setup allows us to systematically test whether LLMs can maintain accurate step-by-step execution, including instruction adherence, numerical computation, and long-range consistency in handling intermediate results. To ensure objective evaluation of these abilities, we provide a parallel reference code that can evaluate the accuracy of LLM output. We provide three evaluative metrics and three benchmark variants designed to measure the detailed instruction understanding capability of LLMs. Our analyses show that MCBench serves as an effective and objective tool for evaluating the capabilities of cutting-edge LLMs.
摘要：最近的前沿水平法学硕士已经饱和了许多以前困难的基准，几乎没有留下进一步分化的空间。这一进展凸显了需要具有挑战性的基准来提供客观验证。在本文中，我们介绍了 MCBench，这是一个基准测试，旨在评估法学硕士是否可以通过严格遵循分步指令来执行字符串匹配的 NLP 指标。与之前依赖于主观判断或一般推理的基准不同，MCBench 提供客观、确定性和可代码验证的评估。这种设置使我们能够系统地测试法学硕士是否能够保持准确的逐步执行，包括指令遵守、数值计算以及处理中间结果的长期一致性。为了确保对这些能力进行客观评估，我们提供了一个并行参考代码，可以评估LLM输出的准确性。我们提供了三个评估指标和三个基准变体，旨在衡量法学硕士的详细指令理解能力。我们的分析表明，MCBench 是评估尖端法学硕士能力的有效且客观的工具。

Title: ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall

Authors: Jiayu Yang, Yuxuan Fan, Songning Lai, Shengen Wu, Jiaqi Tang, Chun Kang, Zhijiang Guo, Yutao Yue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07896
Pdf URL: https://arxiv.org/pdf/2510.07896
Copy Paste: [[2510.07896]] ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall(https://arxiv.org/abs/2510.07896)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) require efficient knowledge editing (KE) to update factual information, yet existing methods exhibit significant performance decay in multi-hop factual recall. This failure is particularly acute when edits involve intermediate implicit subjects within reasoning chains. Through causal analysis, we reveal that this limitation stems from an oversight of how chained knowledge is dynamically represented and utilized at the neuron level. We discover that during multi hop reasoning, implicit subjects function as query neurons, which sequentially activate corresponding value neurons across transformer layers to accumulate information toward the final answer, a dynamic prior KE work has overlooked. Guided by this insight, we propose ACE: Attribution-Controlled Knowledge Editing for Multi-hop Factual Recall, a framework that leverages neuron-level attribution to identify and edit these critical query-value (Q-V) pathways. ACE provides a mechanistically grounded solution for multi-hop KE, empirically outperforming state-of-the-art methods by 9.44% on GPT-J and 37.46% on Qwen3-8B. Our analysis further reveals more fine-grained activation patterns in Qwen3 and demonstrates that the semantic interpretability of value neurons is orchestrated by query-driven accumulation. These findings establish a new pathway for advancing KE capabilities based on the principled understanding of internal reasoning mechanisms.
摘要：大型语言模型（LLM）需要高效的知识编辑（KE）来更新事实信息，但现有方法在多跳事实回忆中表现出显着的性能下降。当编辑涉及推理链中的中间隐含主题时，这种失败尤其严重。通过因果分析，我们揭示了这种限制源于对链式知识如何在神经元级别动态表示和利用的监督。我们发现，在多跳推理过程中，隐式主体充当查询神经元，它顺序激活跨变压器层的相应值神经元，以积累最终答案的信息，这是动态先前的 KE 工作所忽略的。在这一见解的指导下，我们提出了 ACE：用于多跳事实回忆的归因控制知识编辑，这是一个利用神经元级归因来识别和编辑这些关键查询值 (Q-V) 路径的框架。 ACE 为多跳 KE 提供了一种基于机械原理的解决方案，在 GPT-J 上比最先进的方法高出 9.44%，在 Qwen3-8B 上比最先进的方法高出 37.46%。我们的分析进一步揭示了 Qwen3 中更细粒度的激活模式，并证明了值神经元的语义可解释性是通过查询驱动的累积来协调的。这些发现基于对内部推理机制的原则性理解，为提升 KE 能力建立了一条新途径。

Title: Towards Human-Like Grading: A Unified LLM-Enhanced Framework for Subjective Question Evaluation

Authors: Fanwei Zhua, Jiaxuan He, Xiaoxiao Chen, Zulong Chen, Quan Lu, Chenrui Mei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07912
Pdf URL: https://arxiv.org/pdf/2510.07912
Copy Paste: [[2510.07912]] Towards Human-Like Grading: A Unified LLM-Enhanced Framework for Subjective Question Evaluation(https://arxiv.org/abs/2510.07912)
Keywords: language model, llm
Abstract: Automatic grading of subjective questions remains a significant challenge in examination assessment due to the diversity in question formats and the open-ended nature of student responses. Existing works primarily focus on a specific type of subjective question and lack the generality to support comprehensive exams that contain diverse question types. In this paper, we propose a unified Large Language Model (LLM)-enhanced auto-grading framework that provides human-like evaluation for all types of subjective questions across various domains. Our framework integrates four complementary modules to holistically evaluate student answers. In addition to a basic text matching module that provides a foundational assessment of content similarity, we leverage the powerful reasoning and generative capabilities of LLMs to: (1) compare key knowledge points extracted from both student and reference answers, (2) generate a pseudo-question from the student answer to assess its relevance to the original question, and (3) simulate human evaluation by identifying content-related and non-content strengths and weaknesses. Extensive experiments on both general-purpose and domain-specific datasets show that our framework consistently outperforms traditional and LLM-based baselines across multiple grading metrics. Moreover, the proposed system has been successfully deployed in real-world training and certification exams at a major e-commerce enterprise.
摘要：由于问题格式的多样性和学生回答的开放性，主观问题的自动评分仍然是考试评估中的重大挑战。现有的著作主要关注特定类型的主观问题，缺乏支持包含多种问题类型的综合考试的通用性。在本文中，我们提出了一个统一的大语言模型（LLM）增强的自动评分框架，为各个领域的所有类型的主观问题提供类似人类的评估。我们的框架集成了四个补充模块来全面评估学生的答案。除了提供内容相似性基础评估的基本文本匹配模块之外，我们还利用法学硕士强大的推理和生成能力来：（1）比较从学生和参考答案中提取的关键知识点，（2）从学生答案中生成伪问题，以评估其与原始问题的相关性，以及（3）通过识别内容相关和非内容的优势和劣势来模拟人类评估。对通用和特定领域数据集的大量实验表明，我们的框架在多个评分指标上始终优于传统和基于 LLM 的基线。此外，所提出的系统已成功部署在一家大型电子商务企业的实际培训和认证考试中。

Title: STEPER: Step-wise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models

Authors: Kyumin Lee, Minjin Jeon, Sanghwan Jang, Hwanjo Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07923
Pdf URL: https://arxiv.org/pdf/2510.07923
Copy Paste: [[2510.07923]] STEPER: Step-wise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models(https://arxiv.org/abs/2510.07923)
Keywords: language model
Abstract: Answering complex real-world questions requires step-by-step retrieval and integration of relevant information to generate well-grounded responses. However, existing knowledge distillation methods overlook the need for different reasoning abilities at different steps, hindering transfer in multi-step retrieval-augmented frameworks. To address this, we propose Stepwise Knowledge Distillation for Enhancing Reasoning Ability in Multi-Step Retrieval-Augmented Language Models (StepER). StepER employs step-wise supervision to align with evolving information and reasoning demands across stages. Additionally, it incorporates difficulty-aware training to progressively optimize learning by prioritizing suitable steps. Our method is adaptable to various multi-step retrieval-augmented language models, including those that use retrieval queries for reasoning paths or decomposed questions. Extensive experiments show that StepER outperforms prior methods on multi-hop QA benchmarks, with an 8B model achieving performance comparable to a 70B teacher model.
摘要：回答复杂的现实问题需要逐步检索和整合相关信息，以生成有根据的答案。然而，现有的知识蒸馏方法忽视了不同步骤对不同推理能力的需求，阻碍了多步骤检索增强框架中的迁移。为了解决这个问题，我们提出了逐步知识蒸馏，以增强多步检索增强语言模型（StepER）中的推理能力。 StepER 采用逐步监督来适应跨阶段不断变化的信息和推理需求。此外，它还结合了难度意识培训，通过优先考虑适当的步骤来逐步优化学习。我们的方法适用于各种多步骤检索增强语言模型，包括那些使用检索查询来推理路径或分解问题的模型。大量实验表明，StepER 在多跳 QA 基准上的性能优于现有方法，8B 模型的性能可与 70B 教师模型相媲美。

Title: Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation

Authors: Adam Dejl, James Barry, Alessandra Pascale, Javier Carnerero Cano
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07926
Pdf URL: https://arxiv.org/pdf/2510.07926
Copy Paste: [[2510.07926]] Comprehensiveness Metrics for Automatic Evaluation of Factual Recall in Text Generation(https://arxiv.org/abs/2510.07926)
Keywords: language model, llm, hallucination
Abstract: Despite demonstrating remarkable performance across a wide range of tasks, large language models (LLMs) have also been found to frequently produce outputs that are incomplete or selectively omit key information. In sensitive domains, such omissions can result in significant harm comparable to that posed by factual inaccuracies, including hallucinations. In this study, we address the challenge of evaluating the comprehensiveness of LLM-generated texts, focusing on the detection of missing information or underrepresented viewpoints. We investigate three automated evaluation strategies: (1) an NLI-based method that decomposes texts into atomic statements and uses natural language inference (NLI) to identify missing links, (2) a Q&A-based approach that extracts question-answer pairs and compares responses across sources, and (3) an end-to-end method that directly identifies missing content using LLMs. Our experiments demonstrate the surprising effectiveness of the simple end-to-end approach compared to more complex methods, though at the cost of reduced robustness, interpretability and result granularity. We further assess the comprehensiveness of responses from several popular open-weight LLMs when answering user queries based on multiple sources.
摘要：尽管在广泛的任务中表现出了卓越的性能，但大型语言模型 (LLM) 也被发现经常产生不完整或选择性省略关键信息的输出。在敏感领域，这种遗漏可能会导致与事实不准确（包括幻觉）所造成的严重伤害相当的伤害。在这项研究中，我们解决了评估法学硕士生成文本的全面性的挑战，重点是检测缺失的信息或代表性不足的观点。我们研究了三种自动评估策略：(1) 基于 NLI 的方法，将文本分解为原子语句，并使用自然语言推理 (NLI) 来识别缺失的链接；(2) 基于问答的方法，提取问答对并比较跨来源的响应；(3) 一种使用 LLM 直接识别缺失内容的端到端方法。我们的实验证明，与更复杂的方法相比，简单的端到端方法具有令人惊讶的有效性，尽管其代价是稳健性、可解释性和结果粒度降低。在回答基于多个来源的用户查询时，我们进一步评估了几个流行的开放权重法学硕士的答复的全面性。

Title: Vision-Enabled LLMs in Historical Lexicography: Digitising and Enriching Estonian-German Dictionaries from the 17th and 18th Centuries

Authors: Madis Jürviste, Joonatan Jakobson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.07931
Pdf URL: https://arxiv.org/pdf/2510.07931
Copy Paste: [[2510.07931]] Vision-Enabled LLMs in Historical Lexicography: Digitising and Enriching Estonian-German Dictionaries from the 17th and 18th Centuries(https://arxiv.org/abs/2510.07931)
Keywords: language model, llm
Abstract: This article presents research conducted at the Institute of the Estonian Language between 2022 and 2025 on the application of large language models (LLMs) to the study of 17th and 18th century Estonian dictionaries. The authors address three main areas: enriching historical dictionaries with modern word forms and meanings; using vision-enabled LLMs to perform text recognition on sources printed in Gothic script (Fraktur); and preparing for the creation of a unified, cross-source dataset. Initial experiments with J. Gutslaff's 1648 dictionary indicate that LLMs have significant potential for semi-automatic enrichment of dictionary information. When provided with sufficient context, Claude 3.7 Sonnet accurately provided meanings and modern equivalents for 81% of headword entries. In a text recognition experiment with A. T. Helle's 1732 dictionary, a zero-shot method successfully identified and structured 41% of headword entries into error-free JSON-formatted output. For digitising the Estonian-German dictionary section of A. W. Hupel's 1780 grammar, overlapping tiling of scanned image files is employed, with one LLM being used for text recognition and a second for merging the structured output. These findings demonstrate that even for minor languages LLMs have a significant potential for saving time and financial resources.
摘要：本文介绍了爱沙尼亚语言研究所在 2022 年至 2025 年间进行的关于将大型语言模型 (LLM) 应用到 17 世纪和 18 世纪爱沙尼亚词典研究中的研究。作者讨论了三个主要领域：用现代单词形式和含义丰富历史词典；使用支持视觉的法学硕士对哥特体 (Fraktur) 打印的来源进行文本识别；并准备创建统一的跨源数据集。对 J. Gutslaff 的 1648 词典的初步实验表明，法学硕士在半自动丰富词典信息方面具有巨大潜力。当提供足够的上下文时，Claude 3.7 Sonnet 准确地提供了 81% 的词条条目的含义和现代对应词。在使用 A. T. Helle 的 1732 词典进行的文本识别实验中，零样本方法成功识别了 41% 的词条条目并将其结构化为无错误的 JSON 格式输出。为了数字化 A. W. Hupel 1780 年语法中的爱沙尼亚语-德语词典部分，采用了扫描图像文件的重叠平铺，其中一个法学硕士用于文本识别，另一个用于合并结构化输出。这些发现表明，即使对于小语种，法学硕士也具有节省时间和财务资源的巨大潜力。

Title: A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning

Authors: Fengji Zhang, Xinyao Niu, Chengyang Ying, Guancheng Lin, Zhongkai Hao, Zhou Fan, Chengen Huang, Jacky Keung, Bei Chen, Junyang Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07958
Pdf URL: https://arxiv.org/pdf/2510.07958
Copy Paste: [[2510.07958]] A$^2$Search: Ambiguity-Aware Question Answering with Reinforcement Learning(https://arxiv.org/abs/2510.07958)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) and Reinforcement Learning (RL) have led to strong performance in open-domain question answering (QA). However, existing models still struggle with questions that admit multiple valid answers. Standard QA benchmarks, which typically assume a single gold answer, overlook this reality and thus produce inappropriate training signals. Existing attempts to handle ambiguity often rely on costly manual annotation, which is difficult to scale to multi-hop datasets such as HotpotQA and MuSiQue. In this paper, we present A$^2$Search, an annotation-free, end-to-end training framework to recognize and handle ambiguity. At its core is an automated pipeline that detects ambiguous questions and gathers alternative answers via trajectory sampling and evidence verification. The model is then optimized with RL using a carefully designed $\mathrm{AnsF1}$ reward, which naturally accommodates multiple answers. Experiments on eight open-domain QA benchmarks demonstrate that A$^2$Search achieves new state-of-the-art performance. With only a single rollout, A$^2$Search-7B yields an average $\mathrm{AnsF1}@1$ score of $48.4\%$ across four multi-hop benchmarks, outperforming all strong baselines, including the substantially larger ReSearch-32B ($46.2\%$). Extensive analyses further show that A$^2$Search resolves ambiguity and generalizes across benchmarks, highlighting that embracing ambiguity is essential for building more reliable QA systems. Our code, data, and model weights can be found at this https URL
摘要：大型语言模型 (LLM) 和强化学习 (RL) 的最新进展带来了开放域问答 (QA) 方面的强劲表现。然而，现有模型仍然难以解决允许多个有效答案的问题。标准 QA 基准通常假设一个黄金答案，但忽视了这一现实，从而产生了不适当的训练信号。现有处理歧义的尝试通常依赖于昂贵的手动注释，这很难扩展到 HotpotQA 和 MuSiQue 等多跳数据集。在本文中，我们提出了 A$^2$Search，这是一种无注释的端到端训练框架，用于识别和处理歧义。其核心是一个自动化管道，可以检测模棱两可的问题并通过轨迹采样和证据验证收集替代答案。然后使用精心设计的 $\mathrm{AnsF1}$ 奖励通过 RL 对该模型进行优化，该奖励自然可以容纳多个答案。对八个开放域 QA 基准的实验表明，A$^2$Search 实现了新的最先进的性能。仅一次推出，A$^2$Search-7B 在四个多跳基准测试中的平均 $\mathrm{AnsF1}@1$ 得分为 $48.4\%$，优于所有强基线，包括大得多的 ReSearch-32B ($46.2\%$)。广泛的分析进一步表明，A$^2$Search 解决了歧义性并在基准之间进行了概括，强调接受歧义性对于构建更可靠的 QA 系统至关重要。我们的代码、数据和模型权重可以在此 https URL 找到

Title: LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

Authors: Jingyuan Wang, Yankai Chen, Zhonghang Li, Chao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07962
Pdf URL: https://arxiv.org/pdf/2510.07962
Copy Paste: [[2510.07962]] LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?(https://arxiv.org/abs/2510.07962)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter's unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert's advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: this https URL
摘要：大型语言模型（LLM）通常通过监督微调（SFT）在推理方面取得了显着的进步。然而，SFT 是资源密集型的，依赖于大型精选数据集、拒绝采样演示以及所有代币的统一优化，尽管只有一小部分具有有意义的学习价值。在这项工作中，我们探索了一个违反直觉的想法：较小的语言模型（SLM）能否通过揭示反映后者独特优势的高价值推理时刻来教授较大的语言模型（LLM）？我们提出了 LightReasoner，这是一种新颖的框架，它利用了更强的专家模型 (LLM) 和较弱的业余模型 (SLM) 之间的行为差异。 LightReasoner 分两个阶段运行：(1) 采样阶段，精确定位关键推理时刻并构建监督示例，通过专家与业余爱好者的对比来捕捉专家的优势；(2) 微调阶段，将专家模型与这些精炼示例结合起来，放大其推理优势。在七个数学基准中，LightReasoner 将准确性提高了 28.1%，同时减少了 90% 的时间消耗，将问题采样率提高了 80%，并将令牌使用率调整了 99%，所有这些都无需依赖真实标签。通过将较弱的 SLM 转化为有效的教学信号，LightReasoner 提供了一种可扩展且资源高效的方法来推进 LLM 推理。代码位于：此 https URL

Title: Active Confusion Expression in Large Language Models: Leveraging World Models toward Better Social Reasoning

Authors: Jialu Du, Guiyang Hou, Yihui Fu, Chen Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07974
Pdf URL: https://arxiv.org/pdf/2510.07974
Copy Paste: [[2510.07974]] Active Confusion Expression in Large Language Models: Leveraging World Models toward Better Social Reasoning(https://arxiv.org/abs/2510.07974)
Keywords: language model, llm, prompt, agent
Abstract: While large language models (LLMs) excel in mathematical and code reasoning, we observe they struggle with social reasoning tasks, exhibiting cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. Through deteiled analysis of DeepSeek-R1's reasoning trajectories, we find that LLMs frequently encounter reasoning impasses and tend to output contradictory terms like "tricky" and "confused" when processing scenarios with multiple participants and timelines, leading to erroneous reasoning or infinite loops. The core issue is their inability to disentangle objective reality from agents' subjective beliefs. To address this, we propose an adaptive world model-enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences. It dynamically monitors reasoning trajectories for confusion indicators and promptly intervenes by providing clear world state descriptions, helping models navigate through cognitive dilemmas. The mechanism mimics how humans use implicit world models to distinguish between external events and internal beliefs. Evaluations on three social benchmarks demonstrate significant improvements in accuracy (e.g., +10% in Hi-ToM) while reducing computational costs (up to 33.8% token reduction), offering a simple yet effective solution for deploying LLMs in social contexts.
摘要：虽然大型语言模型 (LLM) 在数学和代码推理方面表现出色，但我们观察到它们在社会推理任务中表现不佳，表现出认知混乱、逻辑不一致以及客观世界状态和主观信念状态之间的混淆。通过对DeepSeek-R1推理轨迹的详细分析，我们发现LLM在处理多参与者、多时间线的场景时，经常会遇到推理僵局，容易输出“tricky”、“confused”等相互矛盾的词语，导致推理错误或无限循环。核心问题是他们无法将客观现实与代理人的主观信念分开。为了解决这个问题，我们提出了一种自适应世界模型增强推理机制，该机制构建动态文本世界模型来跟踪实体状态和时间序列。它动态监控推理轨迹中的混乱指标，并通过提供清晰的世界状态描述及时进行干预，帮助模型克服认知困境。该机制模仿人类如何使用隐式世界模型来区分外部事件和内部信念。对三个社交基准的评估表明，准确性显着提高（例如，Hi-ToM 中 +10%），同时降低了计算成本（代币减少高达 33.8%），为在社交环境中部署法学硕士提供了简单而有效的解决方案。

Title: Leveraging Author-Specific Context for Scientific Figure Caption Generation: 3rd SciCap Challenge

Authors: Watcharapong Timklaypachara, Monrada Chiewhawan, Nopporn Lekuthai, Titipat Achakulvisut
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.07993
Pdf URL: https://arxiv.org/pdf/2510.07993
Copy Paste: [[2510.07993]] Leveraging Author-Specific Context for Scientific Figure Caption Generation: 3rd SciCap Challenge(https://arxiv.org/abs/2510.07993)
Keywords: prompt
Abstract: Scientific figure captions require both accuracy and stylistic consistency to convey visual information. Here, we present a domain-specific caption generation system for the 3rd SciCap Challenge that integrates figure-related textual context with author-specific writing styles using the LaMP-Cap dataset. Our approach uses a two-stage pipeline: Stage 1 combines context filtering, category-specific prompt optimization via DSPy's MIPROv2 and SIMBA, and caption candidate selection; Stage 2 applies few-shot prompting with profile figures for stylistic refinement. Our experiments demonstrate that category-specific prompts outperform both zero-shot and general optimized approaches, improving ROUGE-1 recall by +8.3\% while limiting precision loss to -2.8\% and BLEU-4 reduction to -10.9\%. Profile-informed stylistic refinement yields 40--48\% gains in BLEU scores and 25--27\% in ROUGE. Overall, our system demonstrates that combining contextual understanding with author-specific stylistic adaptation can generate captions that are both scientifically accurate and stylistically faithful to the source paper.
摘要：科学图形标题需要准确性和风格一致性来传达视觉信息。在这里，我们为第三届 SciCap 挑战赛提出了一个特定领域的标题生成系统，该系统使用 LaMP-Cap 数据集将图形相关的文本上下文与作者特定的写作风格集成在一起。我们的方法使用两阶段管道：第一阶段结合了上下文过滤、通过 DSPy 的 MIPROv2 和 SIMBA 进行的特定于类别的提示优化以及字幕候选选择；第二阶段应用带有个人资料的小镜头提示来完善风格。我们的实验表明，特定类别的提示优于零样本和一般优化方法，将 ROUGE-1 召回率提高了 +8.3\%，同时将精度损失限制为 -2.8\%，将 BLEU-4 降低到 -10.9\%。基于配置文件的风格细化使 BLEU 分数提高了 40--48\%，ROUGE 分数提高了 25--27\%。总体而言，我们的系统表明，将上下文理解与作者特定的文体适应相结合可以生成既科学准确又在文体上忠实于源论文的标题。

Title: Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks

Authors: Cheng Yang, Xuemeng Yang, Licheng Wen, Daocheng Fu, Jianbiao Mei, Rong Wu, Pinlong Cai, Yufan Shen, Nianchen Deng, Botian Shi, Yu Qiao, Haifeng Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08002
Pdf URL: https://arxiv.org/pdf/2510.08002
Copy Paste: [[2510.08002]] Learning on the Job: An Experience-Driven Self-Evolving Agent for Long-Horizon Tasks(https://arxiv.org/abs/2510.08002)
Keywords: language model, llm, agent
Abstract: Large Language Models have demonstrated remarkable capabilities across diverse domains, yet significant challenges persist when deploying them as AI agents for real-world long-horizon tasks. Existing LLM agents suffer from a critical limitation: they are test-time static and cannot learn from experience, lacking the ability to accumulate knowledge and continuously improve on the job. To address this challenge, we propose MUSE, a novel agent framework that introduces an experience-driven, self-evolving system centered around a hierarchical Memory Module. MUSE organizes diverse levels of experience and leverages them to plan and execute long-horizon tasks across multiple applications. After each sub-task execution, the agent autonomously reflects on its trajectory, converting the raw trajectory into structured experience and integrating it back into the Memory Module. This mechanism enables the agent to evolve beyond its static pretrained parameters, fostering continuous learning and self-evolution. We evaluate MUSE on the long-horizon productivity benchmark TAC. It achieves new SOTA performance by a significant margin using only a lightweight Gemini-2.5 Flash model. Sufficient Experiments demonstrate that as the agent autonomously accumulates experience, it exhibits increasingly superior task completion capabilities, as well as robust continuous learning and self-evolution capabilities. Moreover, the accumulated experience from MUSE exhibits strong generalization properties, enabling zero-shot improvement on new tasks. MUSE establishes a new paradigm for AI agents capable of real-world productivity task automation.
摘要：大型语言模型已经在不同领域展示了卓越的能力，但将它们部署为人工智能代理来执行现实世界的长期任务时仍然存在重大挑战。现有的法学硕士代理人面临着一个严重的限制：他们在测试时是静态的，无法从经验中学习，缺乏积累知识和在工作中持续改进的能力。为了应对这一挑战，我们提出了 MUSE，这是一种新颖的代理框架，它引入了一个以分层内存模块为中心的经验驱动、自我进化的系统。 MUSE 组织不同级别的经验，并利用它们跨多个应用程序规划和执行长期任务。每个子任务执行后，代理会自主反思其轨迹，将原始轨迹转换为结构化经验，并将其集成回内存模块。这种机制使代理能够超越其静态预训练参数，促进持续学习和自我进化。我们根据长期生产力基准 TAC 评估 MUSE。它仅使用轻量级 Gemini-2.5 Flash 模型就显着实现了新的 SOTA 性能。足够的实验表明，随着智能体自主积累经验，它表现出越来越优越的任务完成能力，以及强大的持续学习和自我进化能力。此外，MUSE 积累的经验具有很强的泛化特性，能够对新任务进行零样本改进。 MUSE 为人工智能代理建立了一个新的范式，能够实现现实世界生产力任务的自动化。

Title: ChatGPT as a Translation Engine: A Case Study on Japanese-English

Authors: Vincent Michael Sutanto, Giovanni Gatti De Giacomo, Toshiaki Nakazawa, Masaru Yamada
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08042
Pdf URL: https://arxiv.org/pdf/2510.08042
Copy Paste: [[2510.08042]] ChatGPT as a Translation Engine: A Case Study on Japanese-English(https://arxiv.org/abs/2510.08042)
Keywords: gpt, prompt, chat
Abstract: This study investigates ChatGPT for Japanese-English translation, exploring simple and enhanced prompts and comparing against commercially available translation engines. Performing both automatic and MQM-based human evaluations, we found that document-level translation outperforms sentence-level translation for ChatGPT. On the other hand, we were not able to determine if enhanced prompts performed better than simple prompts in our experiments. We also discovered that ChatGPT-3.5 was preferred by automatic evaluation, but a tradeoff exists between accuracy (ChatGPT-3.5) and fluency (ChatGPT-4). Lastly, ChatGPT yields competitive results against two widely-known translation systems.
摘要：本研究调查了 ChatGPT 的日语-英语翻译，探索简单和增强的提示，并与商用翻译引擎进行比较。通过执行自动评估和基于 MQM 的人工评估，我们发现 ChatGPT 的文档级翻译优于句子级翻译。另一方面，在我们的实验中，我们无法确定增强提示是否比简单提示表现更好。我们还发现自动评估更喜欢 ChatGPT-3.5，但准确性 (ChatGPT-3.5) 和流畅性 (ChatGPT-4) 之间存在权衡。最后，ChatGPT 与两个众所周知的翻译系统相比，产生了具有竞争力的结果。

Title: Climate Knowledge in Large Language Models

Authors: Ivan Kuznetsov (1), Jacopo Grassi (2), Dmitrii Pantiukhin (1), Boris Shapkin (1), Thomas Jung (1 and 3), Nikolay Koldunov (1) ((1) Alfred Wegener Institute, Helmholtz Centre for Polar and Marine Research, Bremerhaven, Germany., (2) Department of Environment, Land, and Infrastructure Engineering, Politecnico di Torino, Turin, Italy., (3) Institute of Environmental Physics, University of Bremen, Bremen, Germany.)
Subjects: cs.CL, cs.LG, physics.ao-ph
Abstract URL: https://arxiv.org/abs/2510.08043
Pdf URL: https://arxiv.org/pdf/2510.08043
Copy Paste: [[2510.08043]] Climate Knowledge in Large Language Models(https://arxiv.org/abs/2510.08043)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed for climate-related applications, where understanding internal climatological knowledge is crucial for reliability and misinformation risk assessment. Despite growing adoption, the capacity of LLMs to recall climate normals from parametric knowledge remains largely uncharacterized. We investigate the capacity of contemporary LLMs to recall climate normals without external retrieval, focusing on a prototypical query: mean July 2-m air temperature 1991-2020 at specified locations. We construct a global grid of queries at 1° resolution land points, providing coordinates and location descriptors, and validate responses against ERA5 reanalysis. Results show that LLMs encode non-trivial climate structure, capturing latitudinal and topographic patterns, with root-mean-square errors of 3-6 °C and biases of $\pm$1 °C. However, spatially coherent errors remain, particularly in mountains and high latitudes. Performance degrades sharply above 1500 m, where RMSE reaches 5-13 °C compared to 2-4 °C at lower elevations. We find that including geographic context (country, city, region) reduces errors by 27% on average, with larger models being most sensitive to location descriptors. While models capture the global mean magnitude of observed warming between 1950-1974 and 2000-2024, they fail to reproduce spatial patterns of temperature change, which directly relate to assessing climate change. This limitation highlights that while LLMs may capture present-day climate distributions, they struggle to represent the regional and local expression of long-term shifts in temperature essential for understanding climate dynamics. Our evaluation framework provides a reproducible benchmark for quantifying parametric climate knowledge in LLMs and complements existing climate communication assessments.
摘要：大型语言模型 (LLM) 越来越多地应用于气候相关应用，其中了解内部气候知识对于可靠性和错误信息风险评估至关重要。尽管采用率越来越高，法学硕士从参数知识中回忆气候常态的能力在很大程度上仍然没有得到表征。我们研究了当代法学硕士无需外部检索即可回忆气候常态的能力，重点关注一个典型查询：1991-2020 年指定地点 7 月 2 米平均气温。我们在 1° 分辨率着陆点构建了一个全局查询网格，提供坐标和位置描述符，并根据 ERA5 重新分析验证响应。结果表明，LLM 编码了非平凡的气候结构，捕获了纬度和地形模式，均方根误差为 3-6 °C，偏差为 $\pm$1 °C。然而，空间相干误差仍然存在，特别是在山区和高纬度地区。海拔超过 1500 m 时，性能急剧下降，RMSE 达到 5-13 °C，而海拔较低时为 2-4 °C。我们发现，纳入地理背景（国家、城市、地区）平均可以减少 27% 的错误，而较大的模型对位置描述符最为敏感。虽然模型捕捉了 1950-1974 年和 2000-2024 年期间观测到的全球变暖平均幅度，但它们无法重现与评估气候变化直接相关的温度变化的空间模式。这一局限性凸显出，虽然法学硕士可以捕捉当今的气候分布，但他们很难代表理解气候动态所必需的长期温度变化的区域和局部表达。我们的评估框架为量化法学硕士中的参数气候知识提供了可重复的基准，并补充了现有的气候传播评估。

Title: A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models

Authors: Congming Zheng, Jiachen Zhu, Zhuoying Ou, Yuxiang Chen, Kangning Zhang, Rong Shan, Zeyu Zheng, Mengyue Yang, Jianghao Lin, Yong Yu, Weinan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08049
Pdf URL: https://arxiv.org/pdf/2510.08049
Copy Paste: [[2510.08049]] A Survey of Process Reward Models: From Outcome Signals to Process Supervisions for Large Language Models(https://arxiv.org/abs/2510.08049)
Keywords: language model, llm, agent
Abstract: Although Large Language Models (LLMs) exhibit advanced reasoning ability, conventional alignment remains largely dominated by outcome reward models (ORMs) that judge only final answers. Process Reward Models(PRMs) address this gap by evaluating and guiding reasoning at the step or trajectory level. This survey provides a systematic overview of PRMs through the full loop: how to generate process data, build PRMs, and use PRMs for test-time scaling and reinforcement learning. We summarize applications across math, code, text, multimodal reasoning, robotics, and agents, and review emerging benchmarks. Our goal is to clarify design spaces, reveal open challenges, and guide future research toward fine-grained, robust reasoning alignment.
摘要：尽管大型语言模型 (LLM) 表现出先进的推理能力，但传统的对齐方式仍然主要由仅判断最终答案的结果奖励模型 (ORM) 主导。过程奖励模型（PRM）通过在步骤或轨迹级别评估和指导推理来解决这一差距。这项调查通过整个循环系统地概述了 PRM：如何生成过程数据、构建 PRM 以及使用 PRM 进行测试时间扩展和强化学习。我们总结了数学、代码、文本、多模式推理、机器人和代理的应用，并回顾了新兴的基准。我们的目标是阐明设计空间，揭示开放的挑战，并指导未来的研究朝着细粒度、稳健的推理方向发展。

Title: Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility

Authors: Shramay Palta, Peter Rankel, Sarah Wiegreffe, Rachel Rudinger
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2510.08091
Pdf URL: https://arxiv.org/pdf/2510.08091
Copy Paste: [[2510.08091]] Everything is Plausible: Investigating the Impact of LLM Rationales on Human Notions of Plausibility(https://arxiv.org/abs/2510.08091)
Keywords: llm
Abstract: We investigate the degree to which human plausibility judgments of multiple-choice commonsense benchmark answers are subject to influence by (im)plausibility arguments for or against an answer, in particular, using rationales generated by LLMs. We collect 3,000 plausibility judgments from humans and another 13,600 judgments from LLMs. Overall, we observe increases and decreases in mean human plausibility ratings in the presence of LLM-generated PRO and CON rationales, respectively, suggesting that, on the whole, human judges find these rationales convincing. Experiments with LLMs reveal similar patterns of influence. Our findings demonstrate a novel use of LLMs for studying aspects of human cognition, while also raising practical concerns that, even in domains where humans are ``experts'' (i.e., common sense), LLMs have the potential to exert considerable influence on people's beliefs.
摘要：我们研究了人类对多项选择常识基准答案的合理性判断受到支持或反对答案的（不）合理论据的影响的程度，特别是使用法学硕士产生的基本原理。我们收集了来自人类的 3,000 个合理性判断和来自法学硕士的另外 13,600 个判断。总体而言，我们观察到在法学硕士生成的 PRO 和 CON 理由存在的情况下，人类平均合理性评级分别增加和减少，这表明，总体而言，人类法官认为这些理由令人信服。法学硕士的实验揭示了类似的影响模式。我们的研究结果展示了法学硕士在研究人类认知方面的新颖用途，同时也引起了实际关注，即即使在人类是“专家”（即常识）的领域，法学硕士也有可能对人们的信仰产生相当大的影响。

Title: The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models

Authors: Sherzod Hakimov, Roland Bernard, Tim Leiber, Karl Osswald, Kristina Richert, Ruilin Yang, Raffaella Bernardi, David Schlangen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08098
Pdf URL: https://arxiv.org/pdf/2510.08098
Copy Paste: [[2510.08098]] The Price of Thought: A Multilingual Analysis of Reasoning, Performance, and Cost of Negotiation in Large Language Models(https://arxiv.org/abs/2510.08098)
Keywords: language model, gpt, llm, agent
Abstract: Negotiation is a fundamental challenge for AI agents, as it requires an ability to reason strategically, model opponents, and balance cooperation with competition. We conduct the first comprehensive study systematically evaluating the effect of (LLM-)reasoning on the negotiation abilities of both commercial and open-weight LLMs, and do this across three languages. Using a self-play setup across three diverse dialogue games, we analyse trade-offs between performance and cost, the language consistency of reasoning processes, and the nature of strategic adaptation exhibited by models. Our findings show that enabling reasoning-that is, scaling test time compute-significantly improves negotiation outcomes by enhancing collaboration and helping models overcome task complexities, but comes at a substantial computational cost: reasoning improves GPT-5's performance by 31.4 % while increasing its cost by nearly 400 %. Most critically, we uncover a significant multilingual reasoning distinction: open-weight models consistently switch to English for their internal reasoning steps, even when negotiating in German or Italian (and thus possibly impacting potential explainability gains through the disclosure of reasoning traces), while leading commercial models maintain language consistency between their reasoning and final output.
摘要：谈判是人工智能代理面临的一项基本挑战，因为它需要具有战略推理、建模对手以及平衡合作与竞争的能力。我们进行了第一项全面的研究，系统地评估了（法学硕士）推理对商业法学硕士和开放权重法学硕士谈判能力的影响，并跨三种语言进行了这项研究。我们使用三种不同对话游戏的自我对弈设置，分析性能和成本之间的权衡、推理过程的语言一致性以及模型所表现出的策略适应的本质。我们的研究结果表明，启用推理（即扩展测试时间计算）可以通过增强协作并帮助模型克服任务复杂性来显着改善谈判结果，但会带来巨大的计算成本：推理将 GPT-5 的性能提高了 31.4%，同时其成本增加了近 400%。最关键的是，我们发现了一个显着的多语言推理区别：开放权重模型在其内部推理步骤中始终切换到英语，即使在用德语或意大利语进行谈判时也是如此（因此可能通过推理痕迹的披露影响潜在的可解释性收益），而领先的商业模型则保持其推理和最终输出之间的语言一致性。

Title: Lossless Vocabulary Reduction for Auto-Regressive Language Models

Authors: Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Shin'ya Yamaguchi, Tomoya Ohba, Tamao Sakao, Susumu Takeuchi
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2510.08102
Pdf URL: https://arxiv.org/pdf/2510.08102
Copy Paste: [[2510.08102]] Lossless Vocabulary Reduction for Auto-Regressive Language Models(https://arxiv.org/abs/2510.08102)
Keywords: language model
Abstract: Tokenization -- the process of decomposing a given text into a sequence of subwords called tokens -- is one of the key components in the development of language models. Particularly, auto-regressive language models generate texts token by token, i.e., by predicting the next-token distribution given the previous ones, and thus tokenization directly affects their efficiency in text generation. Since each language model has their own vocabulary as a set of possible tokens, they struggle to cooperate with each other at the level of next-token distributions such as model ensemble. In this paper, we establish a theoretical framework of lossless vocabulary reduction, which efficiently converts a given auto-regressive language model into the one with an arbitrarily small vocabulary without any loss in accuracy. As an application, we demonstrate that language models with different tokenization can cooperate with each other efficiently through their maximal common vocabulary.
摘要：标记化（将给定文本分解为一系列称为标记的子词的过程）是语言模型开发的关键组成部分之一。特别是，自回归语言模型逐个标记地生成文本，即通过给定先前的标记来预测下一个标记的分布，因此标记化直接影响其文本生成的效率。由于每个语言模型都有自己的词汇表作为一组可能的标记，因此它们很难在下一个标记分布（例如模型集成）的级别上相互合作。在本文中，我们建立了一种无损词汇缩减的理论框架，该框架可以有效地将给定的自回归语言模型转换为具有任意小词汇量的语言模型，而不会损失任何准确性。作为一个应用程序，我们证明了具有不同标记化的语言模型可以通过其最大通用词汇量有效地相互协作。

Title: Evaluating LLM-Generated Legal Explanations for Regulatory Compliance in Social Media Influencer Marketing

Authors: Haoyang Gui, Thales Bertaglia, Taylor Annabell, Catalina Goanta, Tjomme Dooper, Gerasimos Spanakis
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2510.08111
Pdf URL: https://arxiv.org/pdf/2510.08111
Copy Paste: [[2510.08111]] Evaluating LLM-Generated Legal Explanations for Regulatory Compliance in Social Media Influencer Marketing(https://arxiv.org/abs/2510.08111)
Keywords: gpt, llm, prompt
Abstract: The rise of influencer marketing has blurred boundaries between organic content and sponsored content, making the enforcement of legal rules relating to transparency challenging. Effective regulation requires applying legal knowledge with a clear purpose and reason, yet current detection methods of undisclosed sponsored content generally lack legal grounding or operate as opaque "black boxes". Using 1,143 Instagram posts, we compare gpt-5-nano and gemini-2.5-flash-lite under three prompting strategies with controlled levels of legal knowledge provided. Both models perform strongly in classifying content as sponsored or not (F1 up to 0.93), though performance drops by over 10 points on ambiguous cases. We further develop a taxonomy of reasoning errors, showing frequent citation omissions (28.57%), unclear references (20.71%), and hidden ads exhibiting the highest miscue rate (28.57%). While adding regulatory text to the prompt improves explanation quality, it does not consistently improve detection accuracy. The contribution of this paper is threefold. First, it makes a novel addition to regulatory compliance technology by providing a taxonomy of common errors in LLM-generated legal reasoning to evaluate whether automated moderation is not only accurate but also legally robust, thereby advancing the transparent detection of influencer marketing content. Second, it features an original dataset of LLM explanations annotated by two students who were trained in influencer marketing law. Third, it combines quantitative and qualitative evaluation strategies for LLM explanations and critically reflects on how these findings can support advertising regulatory bodies in automating moderation processes on a solid legal foundation.
摘要：影响力营销的兴起模糊了有机内容和赞助内容之间的界限，使得与透明度相关的法律规则的执行面临挑战。有效的监管需要以明确的目的和理由运用法律知识，但目前对未公开赞助内容的检测方法普遍缺乏法律依据或作为不透明的“黑匣子”运作。我们使用 1,143 个 Instagram 帖子，在三种提示策略下比较 gpt-5-nano 和 gemini-2.5-flash-lite，并提供受控水平的法律知识。两种模型在将内容分类为赞助或非赞助方面都表现出色（F1 高达 0.93），尽管在不明确的情况下性能下降了 10 多点。我们进一步开发了推理错误的分类法，显示频繁的引用遗漏（28.57％）、不明确的参考文献（20.71％）以及表现出最高错误率的隐藏广告（28.57％）。虽然在提示中添加监管文本可以提高解释质量，但并不能持续提高检测准确性。本文的贡献有三个方面。首先，它通过提供法学硕士生成的法律推理中常见错误的分类来评估自动审核是否不仅准确而且在法律上稳健，从而对有影响力的营销内容进行透明检测，从而对监管合规技术进行了新颖的补充。其次，它具有法学硕士解释的原始数据集，由两名接受过影响者营销法培训的学生注释。第三，它结合了法学硕士解释的定量和定性评估策略，并批判性地反思了这些发现如何支持广告监管机构在坚实的法律基础上实现自动化审核流程。

Title: Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations

Authors: Jasmina Gajcin, Erik Miehling, Rahul Nair, Elizabeth Daly, Radu Marinescu, Seshu Tirupathi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08120
Pdf URL: https://arxiv.org/pdf/2510.08120
Copy Paste: [[2510.08120]] Interpreting LLM-as-a-Judge Policies via Verifiable Global Explanations(https://arxiv.org/abs/2510.08120)
Keywords: llm
Abstract: Using LLMs to evaluate text, that is, LLM-as-a-judge, is increasingly being used at scale to augment or even replace human annotations. As such, it is imperative that we understand the potential biases and risks of doing so. In this work, we propose an approach for extracting high-level concept-based global policies from LLM-as-a-Judge. Our approach consists of two algorithms: 1) CLoVE (Contrastive Local Verifiable Explanations), which generates verifiable, concept-based, contrastive local explanations and 2) GloVE (Global Verifiable Explanations), which uses iterative clustering, summarization and verification to condense local rules into a global policy. We evaluate GloVE on seven standard benchmarking datasets for content harm detection. We find that the extracted global policies are highly faithful to decisions of the LLM-as-a-Judge. Additionally, we evaluated the robustness of global policies to text perturbations and adversarial attacks. Finally, we conducted a user study to evaluate user understanding and satisfaction with global policies.
摘要：使用法学硕士来评估文本，即法学硕士作为法官，越来越多地被大规模使用，以增强甚至取代人类注释。因此，我们必须了解这样做的潜在偏见和风险。在这项工作中，我们提出了一种从法学硕士法官中提取基于概念的高级全球政策的方法。我们的方法由两种算法组成：1）CLoVE（对比局部可验证解释），它生成可验证的、基于概念的对比局部解释；2）GloVE（全局可验证解释），它使用迭代聚类、总结和验证将局部规则压缩为全局策略。我们在七个用于内容危害检测的标准基准数据集上评估 GloVE。我们发现提取的全球政策高度忠实于法学硕士作为法官的决定。此外，我们还评估了全球政策对文本扰动和对抗性攻击的稳健性。最后，我们进行了一项用户研究，以评估用户对全球政策的理解和满意度。

Title: Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling

Authors: Shuliang Liu, Zhipeng Xu, Zhenghao Liu, Yukun Yan, Minghe Yu, Yu Gu, Chong Chen, Huiyuan Xie, Ge Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08145
Pdf URL: https://arxiv.org/pdf/2510.08145
Copy Paste: [[2510.08145]] Mitigating Judgment Preference Bias in Large Language Models through Group-Based Polling(https://arxiv.org/abs/2510.08145)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) as automatic evaluators, commonly referred to as LLM-as-a-Judge, have also attracted growing attention. This approach plays a vital role in aligning LLMs with human judgments, providing accurate and reliable assessments. However, LLM-based judgment models often exhibit judgment preference bias during the evaluation phase, tending to favor responses generated by themselves, undermining the reliability of their judgments. This paper introduces the Group-Based Polling Optimization (Genii), an unsupervised multi-agent collaborative optimization framework that mitigates the inherent judgment preference bias of judgment models. Specifically, Genii integrates various LLM-based judgment models into a multi-agent system and simulates the interactive client-server polling mechanism to optimize each client agent unsupervisedly. Our experiments demonstrate that Genii outperforms supervised models trained on annotated judgment data, while requiring no human-labeled annotations. Genii consistently improves performance across different client agents during the polling, even when weaker models act as server agents. Further analysis reveals that Genii effectively mitigates judgment preference bias of LLM-based judgment models, demonstrating its effectiveness. All codes are available at this https URL.
摘要：作为自动评估器的大型语言模型（LLM），通常称为 LLM-as-a-Judge，也引起了越来越多的关注。这种方法在使法学硕士与人类判断保持一致、提供准确可靠的评估方面发挥着至关重要的作用。然而，基于LLM的判断模型在评估阶段往往表现出判断偏好偏差，倾向于偏向于自身产生的反应，从而损害了其判断的可靠性。本文介绍了基于组的轮询优化（Genii），这是一种无监督的多智能体协作优化框架，可以减轻判断模型固有的判断偏好偏差。具体来说，Genii将各种基于LLM的判断模型集成到多代理系统中，并模拟交互式客户端-服务器轮询机制，以无监督地优化每个客户端代理。我们的实验表明，Genii 的性能优于基于带注释的判断数据训练的监督模型，同时不需要人工标记的注释。 Genii 在轮询期间持续改进不同客户端代理的性能，即使较弱的模型充当服务器代理也是如此。进一步的分析表明，Genii有效地缓解了基于LLM的判断模型的判断偏好偏差，证明了其有效性。所有代码均可在此 https URL 中获取。

Title: AI Knowledge Assist: An Automated Approach for the Creation of Knowledge Bases for Conversational AI Agents

Authors: Md Tahmid Rahman Laskar, Julien Bouvier Tremblay, Xue-Yong Fu, Cheng Chen, Shashi Bhushan TN
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08149
Pdf URL: https://arxiv.org/pdf/2510.08149
Copy Paste: [[2510.08149]] AI Knowledge Assist: An Automated Approach for the Creation of Knowledge Bases for Conversational AI Agents(https://arxiv.org/abs/2510.08149)
Keywords: language model, llm, chat, retrieval augmented generation, agent
Abstract: The utilization of conversational AI systems by leveraging Retrieval Augmented Generation (RAG) techniques to solve customer problems has been on the rise with the rapid progress of Large Language Models (LLMs). However, the absence of a company-specific dedicated knowledge base is a major barrier to the integration of conversational AI systems in contact centers. To this end, we introduce AI Knowledge Assist, a system that extracts knowledge in the form of question-answer (QA) pairs from historical customer-agent conversations to automatically build a knowledge base. Fine-tuning a lightweight LLM on internal data demonstrates state-of-the-art performance, outperforming larger closed-source LLMs. More specifically, empirical evaluation on 20 companies demonstrates that the proposed AI Knowledge Assist system that leverages the LLaMA-3.1-8B model eliminates the cold-start gap in contact centers by achieving above 90% accuracy in answering information-seeking questions. This enables immediate deployment of RAG-powered chatbots.
摘要：随着大型语言模型（LLM）的快速发展，通过利用检索增强生成（RAG）技术来解决客户问题的对话式人工智能系统的使用不断增加。然而，缺乏公司特定的专用知识库是对话式人工智能系统在联络中心集成的主要障碍。为此，我们引入了 AI Knowledge Assist，该系统可以从历史客户与代理对话中以问答 (QA) 对的形式提取知识，以自动构建知识库。对内部数据进行微调的轻量级法学硕士展示了最先进的性能，优于大型闭源法学硕士。更具体地说，对 20 家公司的实证评估表明，所提出的利用 LLaMA-3.1-8B 模型的 AI 知识辅助系统通过在回答信息搜索问题方面实现 90% 以上的准确率，消除了联络中心的冷启动差距。这使得可以立即部署 RAG 支持的聊天机器人。

Title: DACIP-RC: Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension on Business Conversations

Authors: Elena Khasanova, Harsh Saini, Md Tahmid Rahman Laskar, Xue-Yong Fu, Cheng Chen, Shashi Bhushan TN
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08152
Pdf URL: https://arxiv.org/pdf/2510.08152
Copy Paste: [[2510.08152]] DACIP-RC: Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension on Business Conversations(https://arxiv.org/abs/2510.08152)
Keywords: language model, llm
Abstract: The rapid advancements in Large Language Models (LLMs) have enabled their adoption in real-world industrial scenarios for various natural language processing tasks. However, the high inference cost of large-scale LLMs makes their deployment impractical, necessitating the use of smaller models. Despite their efficiency, smaller LLMs lack robust zero-shot instruction-following capabilities across diverse domains, limiting their adaptability to dynamic user requirements. Traditional fine-tuning approaches exacerbate this issue by inducing catastrophic forgetting, reducing the model's generalization ability for unseen tasks. In this paper, we propose Domain Adaptive Continual Instruction Pre-Training via Reading Comprehension (DACIP-RC), a continual pre-training technique that enhances smaller LLMs' domain adaptability for business conversational tasks. Unlike conventional pre-training approaches that rely on next-token prediction, DACIP-RC generates diverse task instructions and responses via reading comprehension on conversation transcripts, enabling better instruction generalization. Our empirical evaluations demonstrate that DACIP-RC significantly improves zero-shot generalization across a wide range of business conversational tasks, including meeting summarization, action item generation, and call purpose identification. To the best of our knowledge, this is the first work to apply instruction pre-training on business conversational data, providing insights into how industries can leverage proprietary datasets for domain adaptation.
摘要：大型语言模型 (LLM) 的快速发展使其能够在现实工业场景中用于各种自然语言处理任务。然而，大规模法学硕士的高推理成本使其部署不切实际，需要使用较小的模型。尽管效率较高，但较小的法学硕士缺乏跨不同领域的强大的零样本指令跟踪能力，限制了它们对动态用户需求的适应性。传统的微调方法会引发灾难性遗忘，从而降低模型对未见过的任务的泛化能力，从而加剧了这个问题。在本文中，我们提出了通过阅读理解进行领域自适应持续指令预训练（DACIP-RC），这是一种持续预训练技术，可增强小型法学硕士对业务会话任务的领域适应性。与依赖下一个标记预测的传统预训练方法不同，DACIP-RC 通过对对话记录的阅读理解生成不同的任务指令和响应，从而实现更好的指令泛化。我们的实证评估表明，DACIP-RC 显着提高了各种业务对话任务的零样本泛化能力，包括会议总结、行动项目生成和呼叫目的识别。据我们所知，这是第一项将指令预训练应用于业务对话数据的工作，提供了有关行业如何利用专有数据集进行领域适应的见解。

Title: Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs

Authors: Shuzhou Yuan, Ercong Nie, Yinuo Sun, Chenxuan Zhao, William LaCroix, Michael Färber
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08158
Pdf URL: https://arxiv.org/pdf/2510.08158
Copy Paste: [[2510.08158]] Beyond Over-Refusal: Scenario-Based Diagnostics and Post-Hoc Mitigation for Exaggerated Refusals in LLMs(https://arxiv.org/abs/2510.08158)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) frequently produce false refusals, declining benign requests that contain terms resembling unsafe queries. We address this challenge by introducing two comprehensive benchmarks: the Exaggerated Safety Benchmark (XSB) for single-turn prompts, annotated with "Focus" keywords that identify refusal-inducing triggers, and the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB), which systematically evaluates refusal calibration in realistic, context-rich dialog settings. Our benchmarks reveal that exaggerated refusals persist across diverse recent LLMs and are especially pronounced in complex, multi-turn scenarios. To mitigate these failures, we leverage post-hoc explanation methods to identify refusal triggers and deploy three lightweight, model-agnostic approaches, ignore-word instructions, prompt rephrasing, and attention steering, at inference time, all without retraining or parameter access. Experiments on four instruction-tuned Llama models demonstrate that these strategies substantially improve compliance on safe prompts while maintaining robust safety protections. Our findings establish a reproducible framework for diagnosing and mitigating exaggerated refusals, highlighting practical pathways to safer and more helpful LLM deployments.
摘要：大型语言模型 (LLM) 经常产生错误拒绝，拒绝包含类似于不安全查询的术语的良性请求。我们通过引入两个全面的基准来应对这一挑战：用于单轮提示的夸大安全基准（XSB），用识别拒绝诱导触发器的“焦点”关键字进行注释，以及基于多轮场景的夸大安全基准（MS-XSB），它系统地评估现实的、上下文丰富的对话设置中的拒绝校准。我们的基准显示，最近的各种法学硕士都存在夸大的拒绝情况，并且在复杂的多轮场景中尤其明显。为了减轻这些失败，我们利用事后解释方法来识别拒绝触发器，并在推理时部署三种轻量级、模型无关的方法、忽略单词指令、提示改写和注意力引导，所有这些都无需重新训练或参数访问。对四个指令调整的 Llama 模型的实验表明，这些策略大大提高了对安全提示的遵从性，同时保持了强大的安全保护。我们的研究结果建立了一个可重复的框架，用于诊断和减轻夸大的拒绝，强调了更安全、更有用的法学硕士部署的实用途径。

Title: METRICALARGS: A Taxonomy for Studying Metrical Poetry with LLMs

Authors: Chalamalasetti Kranti, Sowmya Vajjala
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08188
Pdf URL: https://arxiv.org/pdf/2510.08188
Copy Paste: [[2510.08188]] METRICALARGS: A Taxonomy for Studying Metrical Poetry with LLMs(https://arxiv.org/abs/2510.08188)
Keywords: language model, llm
Abstract: Prior NLP work studying poetry has focused primarily on automatic poem generation and summarization. Many languages have well-studied traditions of poetic meter which enforce constraints on a poem in terms of syllable and phoneme patterns. Such advanced literary forms offer opportunities for probing deeper reasoning and language understanding in Large Language Models (LLMs) and their ability to follow strict pre-requisites and rules. In this paper, we introduce MetricalARGS, the first taxonomy of poetry-related NLP tasks designed to evaluate LLMs on metrical poetry across four dimensions: Analysis, Retrieval, Generation, and Support. We discuss how these tasks relate to existing NLP tasks, addressing questions around datasets and evaluation metrics. Taking Telugu as our example language, we illustrate how the taxonomy can be used in practice. MetricalARGS highlights the broader possibilities for understanding the capabilities and limitations of today's LLMs through the lens of metrical poetry.
摘要：之前研究诗歌的 NLP 工作主要集中在自动诗歌生成和摘要上。许多语言都有经过深入研究的诗歌韵律传统，这些传统在音节和音素模式方面对诗歌施加了限制。这种先进的文学形式为探索大型语言模型（LLM）中更深层的推理和语言理解及其遵循严格的先决条件和规则的能力提供了机会。在本文中，我们介绍了 MetricalARGS，这是第一个与诗歌相关的 NLP 任务分类法，旨在跨四个维度评估韵律诗歌的法学硕士：分析、检索、生成和支持。我们讨论这些任务如何与现有的 NLP 任务相关，解决有关数据集和评估指标的问题。以泰卢固语作为示例语言，我们说明了如何在实践中使用分类法。 MetricalARGS 强调了通过格律诗歌的视角理解当今法学硕士的能力和局限性的更广泛的可能性。

Title: Training-Free Group Relative Policy Optimization

Authors: Yuzheng Cai, Siqi Cai, Yuchen Shi, Zihan Xu, Lichao Chen, Yulei Qin, Xiaoyu Tan, Gang Li, Zongyi Li, Haojia Lin, Yong Mao, Ke Li, Xing Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08191
Pdf URL: https://arxiv.org/pdf/2510.08191
Copy Paste: [[2510.08191]] Training-Free Group Relative Policy Optimization(https://arxiv.org/abs/2510.08191)
Keywords: language model, llm, prompt, agent
Abstract: Recent advances in Large Language Model (LLM) agents have demonstrated their promising general capabilities. However, their performance in specialized real-world domains often degrades due to challenges in effectively integrating external tools and specific prompting strategies. While methods like agentic reinforcement learning have been proposed to address this, they typically rely on costly parameter updates, for example, through a process that uses Supervised Fine-Tuning (SFT) followed by a Reinforcement Learning (RL) phase with Group Relative Policy Optimization (GRPO) to alter the output distribution. However, we argue that LLMs can achieve a similar effect on the output distribution by learning experiential knowledge as a token prior, which is a far more lightweight approach that not only addresses practical data scarcity but also avoids the common issue of overfitting. To this end, we propose Training-Free Group Relative Policy Optimization (Training-Free GRPO), a cost-effective solution that enhances LLM agent performance without any parameter updates. Our method leverages the group relative semantic advantage instead of numerical ones within each group of rollouts, iteratively distilling high-quality experiential knowledge during multi-epoch learning on a minimal ground-truth data. Such knowledge serves as the learned token prior, which is seamlessly integrated during LLM API calls to guide model behavior. Experiments on mathematical reasoning and web searching tasks demonstrate that Training-Free GRPO, when applied to DeepSeek-V3.1-Terminus, significantly improves out-of-domain performance. With just a few dozen training samples, Training-Free GRPO outperforms fine-tuned small LLMs with marginal training data and cost.
摘要：大型语言模型（LLM）代理的最新进展已经证明了它们有前途的通用能力。然而，由于有效集成外部工具和特定提示策略方面的挑战，它们在专门的现实世界领域中的表现常常会下降。虽然已经提出了代理强化学习等方法来解决这个问题，但它们通常依赖于昂贵的参数更新，例如，通过使用监督微调 (SFT) 的过程，然后使用组相对策略优化 (GRPO) 的强化学习 (RL) 阶段来改变输出分布。然而，我们认为法学硕士可以通过学习经验知识作为令牌先验来实现对输出分布的类似效果，这是一种更轻量级的方法，不仅可以解决实际数据稀缺问题，还可以避免过度拟合的常见问题。为此，我们提出了免训练组相对策略优化（免训练GRPO），这是一种经济高效的解决方案，无需任何参数更新即可增强LLM代理性能。我们的方法利用组相对语义优势，而不是每组推出中的数值优势，在最小真实数据的多时代学习过程中迭代地提炼出高质量的经验知识。这些知识作为学习到的 token 先验，在 LLM API 调用期间无缝集成以指导模型行为。数学推理和网络搜索任务的实验表明，将 Training-Free GRPO 应用于 DeepSeek-V3.1-Terminus 时，可以显着提高域外性能。只需几十个训练样本，Training-Free GRPO 的性能就优于具有边际训练数据和成本的微调小型 LLM。

Title: Memory Retrieval and Consolidation in Large Language Models through Function Tokens

Authors: Shaohua Zhang, Yuan Lin, Hang Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08203
Pdf URL: https://arxiv.org/pdf/2510.08203
Copy Paste: [[2510.08203]] Memory Retrieval and Consolidation in Large Language Models through Function Tokens(https://arxiv.org/abs/2510.08203)
Keywords: language model, llm
Abstract: The remarkable success of large language models (LLMs) stems from their ability to consolidate vast amounts of knowledge into the memory during pre-training and to retrieve it from the memory during inference, enabling advanced capabilities such as knowledge memorization, instruction-following and reasoning. However, the mechanisms of memory retrieval and consolidation in LLMs remain poorly understood. In this paper, we propose the function token hypothesis to explain the workings of LLMs: During inference, function tokens activate the most predictive features from context and govern next token prediction (memory retrieval). During pre-training, predicting the next tokens (usually content tokens) that follow function tokens increases the number of learned features of LLMs and updates the model parameters (memory consolidation). Function tokens here roughly correspond to function words in linguistics, including punctuation marks, articles, prepositions, and conjunctions, in contrast to content tokens. We provide extensive experimental evidence supporting this hypothesis. Using bipartite graph analysis, we show that a small number of function tokens activate the majority of features. Case studies further reveal how function tokens activate the most predictive features from context to direct next token prediction. We also find that during pre-training, the training loss is dominated by predicting the next content tokens following function tokens, which forces the function tokens to select the most predictive features from context.
摘要：大型语言模型（LLM）的显着成功源于它们能够在预训练期间将大量知识巩固到内存中，并在推理过程中从内存中检索知识，从而实现知识记忆、指令跟踪和推理等高级功能。然而，人们对法学硕士的记忆检索和巩固机制仍然知之甚少。在本文中，我们提出函数令牌假设来解释 LLM 的工作原理：在推理过程中，函数令牌激活上下文中最具预测性的特征并控制下一个令牌预测（记忆检索）。在预训练期间，预测函数标记之后的下一个标记（通常是内容标记）会增加 LLM 学习到的特征数量并更新模型参数（内存整合）。这里的功能标记大致对应于语言学中的功能词，包括标点符号、冠词、介词和连词，与内容标记相反。我们提供了大量的实验证据来支持这一假设。使用二部图分析，我们表明少量功能标记激活了大多数功能。案例研究进一步揭示了功能标记如何激活从上下文到直接下一个标记预测的最具预测性的特征。我们还发现，在预训练期间，训练损失主要是预测功能标记之后的下一个内容标记，这迫使功能标记从上下文中选择最具预测性的特征。

Title: LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions

Authors: XuHao Hu, Peng Wang, Xiaoya Lu, Dongrui Liu, Xuanjing Huang, Jing Shao
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2510.08211
Pdf URL: https://arxiv.org/pdf/2510.08211
Copy Paste: [[2510.08211]] LLMs Learn to Deceive Unintentionally: Emergent Misalignment in Dishonesty from Misaligned Samples to Biased Human-AI Interactions(https://arxiv.org/abs/2510.08211)
Keywords: llm
Abstract: Previous research has shown that LLMs finetuned on malicious or incorrect completions within narrow domains (e.g., insecure code or incorrect medical advice) can become broadly misaligned to exhibit harmful behaviors, which is called emergent misalignment. In this work, we investigate whether this phenomenon can extend beyond safety behaviors to a broader spectrum of dishonesty and deception under high-stakes scenarios (e.g., lying under pressure and deceptive behavior). To explore this, we finetune open-sourced LLMs on misaligned completions across diverse domains. Experimental results demonstrate that LLMs show broadly misaligned behavior in dishonesty. Additionally, we further explore this phenomenon in a downstream combined finetuning setting, and find that introducing as little as 1% of misalignment data into a standard downstream task is sufficient to decrease honest behavior over 20%. Furthermore, we consider a more practical human-AI interaction environment where we simulate both benign and biased users to interact with the assistant LLM. Notably, we find that the assistant can be misaligned unintentionally to exacerbate its dishonesty with only 10% biased user population. In summary, we extend the study of emergent misalignment to the domain of dishonesty and deception under high-stakes scenarios, and demonstrate that this risk arises not only through direct finetuning, but also in downstream mixture tasks and practical human-AI interactions.
摘要：先前的研究表明，法学硕士对狭窄领域内的恶意或不正确的完成（例如不安全的代码或不正确的医疗建议）进行微调可能会出现广泛的错位，从而表现出有害行为，这被称为紧急错位。在这项工作中，我们研究了这种现象是否可以超越安全行为，扩展到高风险场景下更广泛的不诚实和欺骗行为（例如，在压力下和欺骗行为）。为了探索这一点，我们针对不同领域的不一致完成情况对开源法学硕士进行了微调。实验结果表明，法学硕士在不诚实方面表现出广泛的不一致行为。此外，我们在下游组合微调设置中进一步探索了这种现象，发现在标准下游任务中引入少至 1% 的错位数据就足以减少超过 20% 的诚实行为。此外，我们考虑了一个更实用的人机交互环境，在其中模拟良性和有偏见的用户与助理法学硕士进行交互。值得注意的是，我们发现，只有 10% 的有偏见的用户群体，助手可能会无意中出现偏差，从而加剧其不诚实行为。总之，我们将紧急失调的研究扩展到高风险场景下的不诚实和欺骗领域，并证明这种风险不仅是通过直接微调产生的，而且还存在于下游混合任务和实际的人机交互中。

Title: SenWave: A Fine-Grained Multi-Language Sentiment Analysis Dataset Sourced from COVID-19 Tweets

Authors: Qiang Yang, Xiuying Chen, Changsheng Ma, Rui Yin, Xin Gao, Xiangliang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08214
Pdf URL: https://arxiv.org/pdf/2510.08214
Copy Paste: [[2510.08214]] SenWave: A Fine-Grained Multi-Language Sentiment Analysis Dataset Sourced from COVID-19 Tweets(https://arxiv.org/abs/2510.08214)
Keywords: language model, gpt, chat
Abstract: The global impact of the COVID-19 pandemic has highlighted the need for a comprehensive understanding of public sentiment and reactions. Despite the availability of numerous public datasets on COVID-19, some reaching volumes of up to 100 billion data points, challenges persist regarding the availability of labeled data and the presence of coarse-grained or inappropriate sentiment labels. In this paper, we introduce SenWave, a novel fine-grained multi-language sentiment analysis dataset specifically designed for analyzing COVID-19 tweets, featuring ten sentiment categories across five languages. The dataset comprises 10,000 annotated tweets each in English and Arabic, along with 30,000 translated tweets in Spanish, French, and Italian, derived from English tweets. Additionally, it includes over 105 million unlabeled tweets collected during various COVID-19 waves. To enable accurate fine-grained sentiment classification, we fine-tuned pre-trained transformer-based language models using the labeled tweets. Our study provides an in-depth analysis of the evolving emotional landscape across languages, countries, and topics, revealing significant insights over time. Furthermore, we assess the compatibility of our dataset with ChatGPT, demonstrating its robustness and versatility in various applications. Our dataset and accompanying code are publicly accessible on the repository\footnote{this https URL}. We anticipate that this work will foster further exploration into fine-grained sentiment analysis for complex events within the NLP community, promoting more nuanced understanding and research innovations.
摘要：COVID-19 大流行的全球影响凸显了全面了解公众情绪和反应的必要性。尽管有大量关于 COVID-19 的公共数据集，其中一些数据点的数量高达 1000 亿个数据点，但在标记数据的可用性以及粗粒度或不适当的情绪标签的存在方面仍然存在挑战。在本文中，我们介绍了 SenWave，这是一种新颖的细粒度多语言情感分析数据集，专门用于分析 COVID-19 推文，具有跨五种语言的十个情感类别。该数据集包含 10,000 条带注释的英语和阿拉伯语推文，以及 30,000 条源自英语推文的西班牙语、法语和意大利语翻译推文。此外，它还包括在各种 COVID-19 浪潮中收集的超过 1.05 亿条未标记的推文。为了实现准确的细粒度情感分类，我们使用带标签的推文对基于变压器的预训练语言模型进行了微调。我们的研究对跨语言、国家和主题不断变化的情感景观进行了深入分析，随着时间的推移揭示了重要的见解。此外，我们评估了我们的数据集与 ChatGPT 的兼容性，证明了其在各种应用程序中的稳健性和多功能性。我们的数据集和随附的代码可以在存储库\脚注{此 https URL} 上公开访问。我们预计这项工作将促进对 NLP 社区内复杂事件的细粒度情感分析的进一步探索，促进更细致的理解和研究创新。

Title: The Alignment Waltz: Jointly Training Agents to Collaborate for Safety

Authors: Jingyu Zhang, Haozhu Wang, Eric Michael Smith, Sid Wang, Amr Sharaf, Mahesh Pasupuleti, Benjamin Van Durme, Daniel Khashabi, Jason Weston, Hongyuan Zhan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08240
Pdf URL: https://arxiv.org/pdf/2510.08240
Copy Paste: [[2510.08240]] The Alignment Waltz: Jointly Training Agents to Collaborate for Safety(https://arxiv.org/abs/2510.08240)
Keywords: llm, prompt, agent
Abstract: Harnessing the power of LLMs requires a delicate dance between being helpful and harmless. This creates a fundamental tension between two competing challenges: vulnerability to adversarial attacks that elicit unsafe content, and a tendency for overrefusal on benign but sensitive prompts. Current approaches often navigate this dance with safeguard models that completely reject any content that contains unsafe portions. This approach cuts the music entirely-it may exacerbate overrefusals and fails to provide nuanced guidance for queries it refuses. To teach models a more coordinated choreography, we propose WaltzRL, a novel multi-agent reinforcement learning framework that formulates safety alignment as a collaborative, positive-sum game. WaltzRL jointly trains a conversation agent and a feedback agent, where the latter is incentivized to provide useful suggestions that improve the safety and helpfulness of the conversation agent's responses. At the core of WaltzRL is a Dynamic Improvement Reward (DIR) that evolves over time based on how well the conversation agent incorporates the feedback. At inference time, unsafe or overrefusing responses from the conversation agent are improved rather than discarded. The feedback agent is deployed together with the conversation agent and only engages adaptively when needed, preserving helpfulness and low latency on safe queries. Our experiments, conducted across five diverse datasets, demonstrate that WaltzRL significantly reduces both unsafe responses (e.g., from 39.0% to 4.6% on WildJailbreak) and overrefusals (from 45.3% to 9.9% on OR-Bench) compared to various baselines. By enabling the conversation and feedback agents to co-evolve and adaptively apply feedback, WaltzRL enhances LLM safety without degrading general capabilities, thereby advancing the Pareto front between helpfulness and harmlessness.
摘要：利用法学硕士的力量需要在有益和无害之间进行微妙的平衡。这在两个相互竞争的挑战之间造成了根本的紧张关系：容易受到引发不安全内容的对抗性攻击，以及过度拒绝良性但敏感提示的倾向。当前的方法通常使用安全模型来引导这种舞蹈，完全拒绝任何包含不安全部分的内容。这种方法完全切断了音乐——它可能会加剧过度拒绝，并且无法为其拒绝的查询提供细致入微的指导。为了教授模型更加协调的编排，我们提出了 WaltzRL，这是一种新颖的多智能体强化学习框架，它将安全协调制定为协作的正和游戏。 WaltzRL 联合训练对话代理和反馈代理，后者被激励提供有用的建议，以提高对话代理响应的安全性和有用性。 WaltzRL 的核心是动态改进奖励 (DIR)，它根据对话代理整合反馈的程度随时间而变化。在推理时，来自对话代理的不安全或过度拒绝的响应会得到改进，而不是被丢弃。反馈代理与对话代理一起部署，仅在需要时自适应地参与，从而保留安全查询的帮助性和低延迟。我们在五个不同数据集上进行的实验表明，与各种基线相比，WaltzRL 显着减少了不安全反应（例如，在 WildJailbreak 上从 39.0% 减少到 4.6%）和过度拒绝（在 OR-Bench 上从 45.3% 减少到 9.9%）。通过使对话和反馈代理能够共同进化并自适应地应用反馈，WaltzRL 在不降低一般能力的情况下增强了 LLM 的安全性，从而在有帮助和无害之间推进了帕累托前沿。

Title: Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling

Authors: Jannek Ulm, Kevin Du, Vésteinn Snæbjarnarson
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08245
Pdf URL: https://arxiv.org/pdf/2510.08245
Copy Paste: [[2510.08245]] Contrastive Decoding for Synthetic Data Generation in Low-Resource Language Modeling(https://arxiv.org/abs/2510.08245)
Keywords: language model, llm
Abstract: Large language models (LLMs) are trained on huge amounts of textual data, and concerns have been raised that the limits of such data may soon be reached. A potential solution is to train on synthetic data sampled from LLMs. In this work, we build on this idea and investigate the benefits of contrastive decoding for generating synthetic corpora. In a controlled setting, we experiment with sampling corpora using the relative difference between a good and bad model trained on the same original corpus of 100 million words. By amplifying the signal from a model that has better performance, we create a synthetic corpus and mix it with the original training data. Our findings show that training on a mixture of synthesized and real data improves performance on the language modeling objective and a range of downstream tasks. In particular, we see that training with a mix of synthetic data from contrastive decoding benefits tasks that require more reasoning skills, while synthetic data from traditional sampling helps more on tasks dependent on surface level linguistic capabilities.
摘要：大型语言模型（LLM）是根据大量文本数据进行训练的，人们担心这些数据可能很快就会达到极限。一个潜在的解决方案是对从法学硕士采样的合成数据进行训练。在这项工作中，我们以这个想法为基础，研究对比解码对于生成合成语料库的好处。在受控环境中，我们利用在同一原始语料库（包含 1 亿个单词）上训练的好模型和坏模型之间的相对差异来对语料库进行抽样实验。通过放大具有更好性能的模型的信号，我们创建一个合成语料库并将其与原始训练数据混合。我们的研究结果表明，对合成数据和真实数据的混合进行训练可以提高语言建模目标和一系列下游任务的性能。特别是，我们发现使用来自对比解码的合成数据的混合进行训练有利于需要更多推理技能的任务，而来自传统采样的合成数据更有利于依赖于表面语言能力的任务。

Title: Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window

Authors: Qiaoyu Tang, Hao Xiang, Le Yu, Bowen Yu, Yaojie Lu, Xianpei Han, Le Sun, WenJuan Zhang, Pengbo Wang, Shixuan Liu, Zhenru Zhang, Jianhong Tu, Hongyu Lin, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08276
Pdf URL: https://arxiv.org/pdf/2510.08276
Copy Paste: [[2510.08276]] Beyond Turn Limits: Training Deep Search Agents with Dynamic Context Window(https://arxiv.org/abs/2510.08276)
Keywords: agent
Abstract: While recent advances in reasoning models have demonstrated cognitive behaviors through reinforcement learning, existing approaches struggle to invoke deep reasoning capabilities in multi-turn agents with long-horizon interactions. We propose DeepMiner, a novel framework that elicits such abilities by introducing high-difficulty training tasks and dynamic context window. DeepMiner presents a reverse construction method to generate complex but verifiable question-answer pairs from authentic web sources, which ensures the challenge and reliability of training data while injecting cognitive capabilities into multi-turn reasoning scenarios. We further design an elegant yet effective dynamic context management strategy for both training and inference, utilizing sliding window mechanisms while eliminating the dependency on external summarization models, thereby efficiently empowering the model to handle continuously expanding long-horizon contexts. Through reinforcement learning on Qwen3-32B, we develop DeepMiner-32B, which achieves substantial performance improvements across multiple search agent benchmarks. DeepMiner attains 33.5% accuracy on BrowseComp-en, surpassing the previous best open-source agent by almost 20 percentage points, and demonstrates consistent improvements on BrowseComp-zh, XBench-DeepSearch, and GAIA. Notably, our dynamic context management enables sustained interactions of nearly 100 turns within standard 32k context length, effectively addressing the context limitations that constrain existing multi-turn interaction systems.
摘要：虽然推理模型的最新进展已经通过强化学习证明了认知行为，但现有方法很难调用具有长视野交互的多轮智能体的深度推理能力。我们提出了 DeepMiner，这是一种新颖的框架，通过引入高难度的训练任务和动态上下文窗口来激发这种能力。 DeepMiner提出了一种逆向构造方法，从真实的网络来源生成复杂但可验证的问答对，保证了训练数据的挑战性和可靠性，同时为多轮推理场景注入认知能力。我们进一步为训练和推理设计了一种优雅而有效的动态上下文管理策略，利用滑动窗口机制，同时消除对外部摘要模型的依赖，从而有效地使模型能够处理不断扩展的长视野上下文。通过 Qwen3-32B 上的强化学习，我们开发了 DeepMiner-32B，它在多个搜索代理基准测试中实现了显着的性能改进。 DeepMiner 在 BrowseComp-en 上达到了 33.5% 的准确率，比之前最好的开源代理高出近 20 个百分点，并且在 BrowseComp-zh、XBench-DeepSearch 和 GAIA 上表现出一致的改进。值得注意的是，我们的动态上下文管理可以在标准 32k 上下文长度内实现近 100 轮的持续交互，有效解决限制现有多轮交互系统的上下文限制。

Title: Neuron-Level Analysis of Cultural Understanding in Large Language Models

Authors: Taisei Yamamoto, Ryoma Kumon, Danushka Bollegala, Hitomi Yanaka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08284
Pdf URL: https://arxiv.org/pdf/2510.08284
Copy Paste: [[2510.08284]] Neuron-Level Analysis of Cultural Understanding in Large Language Models(https://arxiv.org/abs/2510.08284)
Keywords: language model, llm
Abstract: As large language models (LLMs) are increasingly deployed worldwide, ensuring their fair and comprehensive cultural understanding is important. However, LLMs exhibit cultural bias and limited awareness of underrepresented cultures, while the mechanisms underlying their cultural understanding remain underexplored. To fill this gap, we conduct a neuron-level analysis to identify neurons that drive cultural behavior, introducing a gradient-based scoring method with additional filtering for precise refinement. We identify both culture-general neurons contributing to cultural understanding regardless of cultures, and culture-specific neurons tied to an individual culture. These neurons account for less than 1% of all neurons and are concentrated in shallow to middle MLP layers. We validate their role by showing that suppressing them substantially degrades performance on cultural benchmarks (by up to 30%), while performance on general natural language understanding (NLU) benchmarks remains largely unaffected. Moreover, we show that culture-specific neurons support knowledge of not only the target culture, but also related cultures. Finally, we demonstrate that training on NLU benchmarks can diminish models' cultural understanding when we update modules containing many culture-general neurons. These findings provide insights into the internal mechanisms of LLMs and offer practical guidance for model training and engineering. Our code is available at this https URL
摘要：随着大型语言模型（LLM）在全球范围内越来越多地部署，确保其公平和全面的文化理解非常重要。然而，法学硕士表现出文化偏见，对代表性不足的文化认识有限，而其文化理解背后的机制仍未得到充分探索。为了填补这一空白，我们进行了神经元级分析，以识别驱动文化行为的神经元，引入基于梯度的评分方法和额外的过滤以进行精确细化。我们识别出有助于文化理解的文化通用神经元（无论文化如何）和与个体文化相关的文化特异性神经元。这些神经元占所有神经元的比例不到 1%，并且集中在浅层到中层 MLP 层。我们通过证明抑制它们会大大降低文化基准的性能（高达 30%）来验证它们的作用，而一般自然语言理解 (NLU) 基准的性能基本上不受影响。此外，我们表明文化特异性神经元不仅支持目标文化的知识，而且支持相关文化的知识。最后，我们证明，当我们更新包含许多文化通用神经元的模块时，NLU 基准训练会削弱模型的文化理解。这些发现提供了对法学硕士内部机制的见解，并为模型训练和工程提供了实用指导。我们的代码可在此 https URL 获取

Title: AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming

Authors: Muxi Diao, Yutao Mou, Keqing He, Hanbo Song, Lulu Zhao, Shikun Zhang, Wei Ye, Kongming Liang, Zhanyu Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08329
Pdf URL: https://arxiv.org/pdf/2510.08329
Copy Paste: [[2510.08329]] AutoRed: A Free-form Adversarial Prompt Generation Framework for Automated Red Teaming(https://arxiv.org/abs/2510.08329)
Keywords: language model, llm, prompt
Abstract: The safety of Large Language Models (LLMs) is crucial for the development of trustworthy AI applications. Existing red teaming methods often rely on seed instructions, which limits the semantic diversity of the synthesized adversarial prompts. We propose AutoRed, a free-form adversarial prompt generation framework that removes the need for seed instructions. AutoRed operates in two stages: (1) persona-guided adversarial instruction generation, and (2) a reflection loop to iteratively refine low-quality prompts. To improve efficiency, we introduce a verifier to assess prompt harmfulness without querying the target models. Using AutoRed, we build two red teaming datasets -- AutoRed-Medium and AutoRed-Hard -- and evaluate eight state-of-the-art LLMs. AutoRed achieves higher attack success rates and better generalization than existing baselines. Our results highlight the limitations of seed-based approaches and demonstrate the potential of free-form red teaming for LLM safety evaluation. We will open source our datasets in the near future.
摘要：大型语言模型 (LLM) 的安全性对于开发值得信赖的人工智能应用程序至关重要。现有的红队方法通常依赖于种子指令，这限制了合成的对抗性提示的语义多样性。我们提出了 AutoRed，一种自由形式的对抗性提示生成框架，无需种子指令。 AutoRed 分两个阶段运行：(1) 角色引导的对抗性指令生成，(2) 反射循环以迭代地完善低质量提示。为了提高效率，我们引入了一个验证器来评估即时危害性，而无需查询目标模型。使用 AutoRed，我们构建了两个红队数据集——AutoRed-Medium 和 AutoRed-Hard——并评估了八个最先进的法学硕士。与现有基线相比，AutoRed 实现了更高的攻击成功率和更好的泛化能力。我们的结果强调了基于种子的方法的局限性，并证明了自由形式红队在法学硕士安全评估中的潜力。我们将在不久的将来开源我们的数据集。

Title: Two-Stage Voting for Robust and Efficient Suicide Risk Detection on Social Media

Authors: Yukai Song, Pengfei Zhou, César Escobar-Viera, Candice Biernesser, Wei Huang, Jingtong Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08365
Pdf URL: https://arxiv.org/pdf/2510.08365
Copy Paste: [[2510.08365]] Two-Stage Voting for Robust and Efficient Suicide Risk Detection on Social Media(https://arxiv.org/abs/2510.08365)
Keywords: language model, llm, prompt
Abstract: Suicide rates have risen worldwide in recent years, underscoring the urgent need for proactive prevention strategies. Social media provides valuable signals, as many at-risk individuals - who often avoid formal help due to stigma - choose instead to share their distress online. Yet detecting implicit suicidal ideation, conveyed indirectly through metaphor, sarcasm, or subtle emotional cues, remains highly challenging. Lightweight models like BERT handle explicit signals but fail on subtle implicit ones, while large language models (LLMs) capture nuance at prohibitive computational cost. To address this gap, we propose a two-stage voting architecture that balances efficiency and robustness. In Stage 1, a lightweight BERT classifier rapidly resolves high-confidence explicit cases. In Stage 2, ambiguous inputs are escalated to either (i) a multi-perspective LLM voting framework to maximize recall on implicit ideation, or (ii) a feature-based ML ensemble guided by psychologically grounded indicators extracted via prompt-engineered LLMs for efficiency and interpretability. To the best of our knowledge, this is among the first works to operationalize LLM-extracted psychological features as structured vectors for suicide risk detection. On two complementary datasets - explicit-dominant Reddit and implicit-only DeepSuiMind - our framework outperforms single-model baselines, achieving 98.0% F1 on explicit cases, 99.7% on implicit ones, and reducing the cross-domain gap below 2%, while significantly lowering LLM cost.
摘要：近年来，全球自杀率有所上升，凸显了采取积极预防策略的迫切需要。社交媒体提供了宝贵的信号，因为许多高危人群（他们经常因耻辱而避免获得正式帮助）选择在网上分享他们的痛苦。然而，检测通过隐喻、讽刺或微妙的情感暗示间接传达的隐含自杀意念仍然极具挑战性。像 BERT 这样的轻量级模型可以处理显式信号，但无法处理微妙的隐式信号，而大型语言模型 (LLM) 则以高昂的计算成本捕获细微差别。为了解决这一差距，我们提出了一种平衡效率和鲁棒性的两阶段投票架构。在第一阶段，轻量级 BERT 分类器快速解决高置信度的显式案例。在第 2 阶段，模糊的输入升级为 (i) 多视角 LLM 投票框架，以最大程度地回忆隐含想法，或 (ii) 基于特征的 ML 集成，以通过即时设计的 LLM 提取的心理基础指标为指导，以提高效率和可解释性。据我们所知，这是最早将法学硕士提取的心理特征作为自杀风险检测的结构化向量的作品之一。在两个互补的数据集上——显式主导的 Reddit 和仅隐式的 DeepSuiMind——我们的框架优于单模型基线，在显式案例上实现了 98.0% 的 F1，在隐式案例上实现了 99.7% 的 F1，并将跨域差距降低到 2% 以下，同时显着降低了 LLM 成本。

Title: On the Relationship Between the Choice of Representation and In-Context Learning

Authors: Ioana Marinescu, Kyunghyun Cho, Eric Karl Oermann
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08372
Pdf URL: https://arxiv.org/pdf/2510.08372
Copy Paste: [[2510.08372]] On the Relationship Between the Choice of Representation and In-Context Learning(https://arxiv.org/abs/2510.08372)
Keywords: language model, llm
Abstract: In-context learning (ICL) is the ability of a large language model (LLM) to learn a new task from a few demonstrations presented as part of the context. Past studies have attributed a large portion of the success of ICL to the way these in-context demonstrations are represented, particularly to how labels are represented in classification tasks. On the other hand, observations of the learning capacity of ICL (i.e., the extent to which more in-context demonstrations can lead to higher performance) have been mixed, and ICL is often thought to occur only under specific conditions. The interaction between these two aspects in ICL, representation and learning, has not been studied in depth until now. We hypothesize that they are largely independent of one another, such that the representation of demonstrations determines the baseline accuracy of ICL, while learning from additional demonstrations improves only on top of this baseline. We validate this hypothesis by developing an optimization algorithm that can enumerate a spectrum of possible label sets (representations) varying in semantic relevance. We then perform ICL with varying numbers of in-context demonstrations for each of these label sets. We observed that learning happens regardless of the quality of the label set itself, although its efficiency, measured by the slope of improvement over in-context demonstrations, is conditioned on both the label set quality and the parameter count of the underlying language model. Despite the emergence of learning, the relative quality (accuracy) of the choice of a label set (representation) is largely maintained throughout learning, confirming our hypothesis and implying their orthogonality. Our work reveals a previously underexplored aspect of ICL: the independent effects of learning from demonstrations and their representations on ICL performance.
摘要：上下文学习 (ICL) 是大型语言模型 (LLM) 从作为上下文一部分呈现的一些演示中学习新任务的能力。过去的研究将 ICL 的成功很大程度上归因于这些上下文演示的表示方式，特别是标签在分类任务中的表示方式。另一方面，对 ICL 学习能力（即更多的情境演示可以带来更高表现的程度）的观察结果褒贬不一，并且 ICL 通常被认为只在特定条件下发生。 ICL 中的表征和学习这两个方面之间的相互作用迄今为止尚未得到深入研究。我们假设它们在很大程度上彼此独立，因此演示的表示决定了 ICL 的基线准确性，而从其他演示中学习仅在此基线之上有所提高。我们通过开发一种优化算法来验证这一假设，该算法可以枚举一系列语义相关性不同的可能标签集（表示）。然后，我们对每个标签集进行不同数量的上下文演示来执行 ICL。我们观察到，无论标签集本身的质量如何，学习都会发生，尽管其效率（通过上下文演示的改进斜率来衡量）取决于标签集质量和底层语言模型的参数计数。尽管出现了学习，但标签集（表示）选择的相对质量（准确性）在整个学习过程中基本上保持不变，证实了我们的假设并暗示了它们的正交性。我们的工作揭示了 ICL 之前未被充分探索的一个方面：从演示及其表征中学习对 ICL 性能的独立影响。

Title: If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models

Authors: Jasmin Orth, Philipp Mondorf, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08388
Pdf URL: https://arxiv.org/pdf/2510.08388
Copy Paste: [[2510.08388]] If Probable, Then Acceptable? Understanding Conditional Acceptability Judgments in Large Language Models(https://arxiv.org/abs/2510.08388)
Keywords: language model, llm, prompt
Abstract: Conditional acceptability refers to how plausible a conditional statement is perceived to be. It plays an important role in communication and reasoning, as it influences how individuals interpret implications, assess arguments, and make decisions based on hypothetical scenarios. When humans evaluate how acceptable a conditional "If A, then B" is, their judgments are influenced by two main factors: the $\textit{conditional probability}$ of $B$ given $A$, and the $\textit{semantic relevance}$ of the antecedent $A$ given the consequent $B$ (i.e., whether $A$ meaningfully supports $B$). While prior work has examined how large language models (LLMs) draw inferences about conditional statements, it remains unclear how these models judge the $\textit{acceptability}$ of such statements. To address this gap, we present a comprehensive study of LLMs' conditional acceptability judgments across different model families, sizes, and prompting strategies. Using linear mixed-effects models and ANOVA tests, we find that models are sensitive to both conditional probability and semantic relevance-though to varying degrees depending on architecture and prompting style. A comparison with human data reveals that while LLMs incorporate probabilistic and semantic cues, they do so less consistently than humans. Notably, larger models do not necessarily align more closely with human judgments.
摘要：条件可接受性是指条件陈述被认为有多合理。它在沟通和推理中发挥着重要作用，因为它影响个人如何解释含义、评估论点以及根据假设场景做出决策。当人类评估条件“如果 A，则 B”的可接受程度时，他们的判断受到两个主要因素的影响：给定 $A$ 时 $B$ 的 $\textit{条件概率}$，以及给定后件 $B$ 的先行词 $A$ 的 $\textit{语义相关性}$（即，$A$ 是否有意义地支持 $B$）。虽然之前的工作已经研究了大型语言模型（LLM）如何对条件语句进行推论，但仍不清楚这些模型如何判断此类语句的 $\textit{acceptability}$。为了解决这一差距，我们对法学硕士在不同模型系列、规模和提示策略中的条件可接受性判断进行了全面研究。使用线性混合效应模型和方差分析测试，我们发现模型对条件概率和语义相关性都很敏感——尽管程度不同，具体取决于架构和提示风格。与人类数据的比较表明，虽然法学硕士结合了概率和语义线索，但它们的一致性不如人类。值得注意的是，更大的模型不一定更符合人类的判断。

Title: Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT

Authors: Noor Ul Zain, Mohsin Raza, Ahsan Adeel
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08404
Pdf URL: https://arxiv.org/pdf/2510.08404
Copy Paste: [[2510.08404]] Single layer tiny Co$^4$ outpaces GPT-2 and GPT-BERT(https://arxiv.org/abs/2510.08404)
Keywords: gpt
Abstract: We show that a tiny Co$^4$ machine(Adeel,2025) with a single layer, two heads, and 8M parameters, operating at an approximate cost of $O(N)$ (where $N$ is the number of input tokens), outpaces the BabyLM Challenge baselines GPT-2 (124M, 12 layers, $O(N^2))$ and GPT-BERT (30M, 12 layers, $O(N^2))$ in just two epochs, while both are trained for ten. Co$^4$ achieves orders-of-magnitude greater training efficiency on 10M tokens, demonstrating highly sample efficient pretraining. Using the BabyLM challenge evaluation pipeline across complex benchmarks, Co$^4$ exhibits strong zero-shot and fine-tuning performance on SuperGLUE tasks. Specifically, Co$^4$ outperforms GPT-2 on 5 out of 7 zero-shot metrics and 6 out of 7 fine-tuning tasks, and GPT-BERT on 4 out of 7 metrics in both cases. These results suggest the need to rethink prevailing deep learning paradigms and associated scaling laws.
摘要：我们展示了一个小型 Co$^4$ 机器（Adeel，2025），具有单层、两个头和 8M 参数，运行成本约为 $O(N)$（其中 $N$ 是输入令牌的数量），超过了 BabyLM 挑战基线 GPT-2（124M，12 层，$O(N^2)）$ 和 GPT-BERT（30M，12 层， $O(N^2))$ 只需两个 epoch，而两者都训练了 10 个。 Co$^4$ 在 10M 个令牌上实现了多个数量级的训练效率，展示了高度样本高效的预训练。使用跨复杂基准的 BabyLM 挑战评估管道，Co$^4$ 在 SuperGLUE 任务上表现出强大的零样本和微调性能。具体来说，Co$^4$ 在 7 个零样本指标中的 5 个和 7 个微调任务中的 6 个上优于 GPT-2，在这两种情况下，Co$^4$ 在 7 个指标中的 4 个上优于 GPT-BERT。这些结果表明需要重新思考流行的深度学习范式和相关的缩放法则。

Title: DeepPrune: Parallel Scaling without Inter-trace Redundancy

Authors: Shangqing Tu, Yaxuan Li, Yushi Bai, Lei Hou, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08483
Pdf URL: https://arxiv.org/pdf/2510.08483
Copy Paste: [[2510.08483]] DeepPrune: Parallel Scaling without Inter-trace Redundancy(https://arxiv.org/abs/2510.08483)
Keywords: language model, llm, chain-of-thought
Abstract: Parallel scaling has emerged as a powerful paradigm to enhance reasoning capabilities in large language models (LLMs) by generating multiple Chain-of-Thought (CoT) traces simultaneously. However, this approach introduces significant computational inefficiency due to inter-trace redundancy -- our analysis reveals that over 80% of parallel reasoning traces yield identical final answers, representing substantial wasted computation. To address this critical efficiency bottleneck, we propose DeepPrune, a novel framework that enables efficient parallel scaling through dynamic pruning. Our method features a specialized judge model trained with focal loss and oversampling techniques to accurately predict answer equivalence from partial reasoning traces which realizes 0.87 AUROC on equivalence prediction, combined with an online greedy clustering algorithm that dynamically prunes redundant paths while preserving answer diversity. Comprehensive evaluations across three challenging benchmarks (AIME 2024, AIME 2025, and GPQA) and multiple reasoning models demonstrate that DeepPrune achieves remarkable token reduction by over 80% compared to conventional consensus sampling on most cases, while maintaining competitive accuracy within 3 percentage points. Our work establishes a new standard for efficient parallel reasoning, making high-performance reasoning more efficient. Our code and data are here: this https URL
摘要：并行扩展已成为一种强大的范式，可通过同时生成多个思想链 (CoT) 轨迹来增强大型语言模型 (LLM) 的推理能力。然而，这种方法由于迹线间冗余而导致计算效率显着降低——我们的分析表明，超过 80% 的并行推理迹线会产生相同的最终答案，这意味着大量的计算浪费。为了解决这个关键的效率瓶颈，我们提出了 DeepPrune，这是一种新颖的框架，可以通过动态修剪实现高效的并行扩展。我们的方法采用专门的判断模型，通过焦点损失和过采样技术进行训练，以根据部分推理轨迹准确预测答案等价性，实现等价性预测的 0.87 AUROC，并结合在线贪婪聚类算法，动态修剪冗余路径，同时保留答案多样性。对三个具有挑战性的基准（AIME 2024、AIME 2025 和 GPQA）和多个推理模型的综合评估表明，在大多数情况下，DeepPrune 与传统共识采样相比，实现了超过 80% 的代币显着减少，同时保持了 3 个百分点以内的竞争准确性。我们的工作为高效并行推理建立了新标准，使高性能推理更加高效。我们的代码和数据在这里：这个 https URL

Title: Neologism Learning for Controllability and Self-Verbalization

Authors: John Hewitt, Oyvind Tafjord, Robert Geirhos, Been Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08506
Pdf URL: https://arxiv.org/pdf/2510.08506
Copy Paste: [[2510.08506]] Neologism Learning for Controllability and Self-Verbalization(https://arxiv.org/abs/2510.08506)
Keywords: llm
Abstract: Humans invent new words when there is a rising demand for a new useful concept (e.g., doomscrolling). We explore and validate a similar idea in our communication with LLMs: introducing new words to better understand and control the models, expanding on the recently introduced neologism learning. This method introduces a new word by adding a new word embedding and training with examples that exhibit the concept with no other changes in model parameters. We show that adding a new word allows for control of concepts such as flattery, incorrect answers, text length, as well as more complex concepts in AxBench. We discover that neologisms can also further our understanding of the model via self-verbalization: models can describe what each new word means to them in natural language, like explaining that a word that represents a concept of incorrect answers means ``a lack of complete, coherent, or meaningful answers...'' To validate self-verbalizations, we introduce plug-in evaluation: we insert the verbalization into the context of a model and measure whether it controls the target concept. In some self-verbalizations, we find machine-only synonyms: words that seem unrelated to humans but cause similar behavior in machines. Finally, we show how neologism learning can jointly learn multiple concepts in multiple words.
摘要：当对新的有用概念（例如，末日滚动）的需求不断增长时，人类就会发明新词。我们在与法学硕士的交流中探索并验证了类似的想法：引入新词以更好地理解和控制模型，扩展最近引入的新词学习。该方法通过添加新词嵌入和使用展示概念的示例进行训练来引入新词，而模型参数没有其他变化。我们证明，添加新单词可以控制诸如奉承、错误答案、文本长度以及 AxBench 中更复杂的概念等概念。我们发现新词还可以通过自我语言化进一步加深我们对模型的理解：模型可以用自然语言描述每个新单词对它们意味着什么，比如解释一个代表错误答案概念的单词意味着“缺乏完整、连贯或有意义的答案……”为了验证自我语言化，我们引入了插件评估：我们将语言化插入到模型的上下文中并测量它是否控制目标概念。在一些自我语言表达中，我们发现了机器专用的同义词：看似与人类无关但在机器中引起类似行为的单词。最后，我们展示了新词学习如何共同学习多个单词中的多个概念。

Title: Efficient Prompt Optimisation for Legal Text Classification with Proxy Prompt Evaluator

Authors: Hyunji Lee, Kevin Chenhao Li, Matthias Grabmair, Shanshan Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08524
Pdf URL: https://arxiv.org/pdf/2510.08524
Copy Paste: [[2510.08524]] Efficient Prompt Optimisation for Legal Text Classification with Proxy Prompt Evaluator(https://arxiv.org/abs/2510.08524)
Keywords: language model, prompt
Abstract: Prompt optimization aims to systematically refine prompts to enhance a language model's performance on specific tasks. Fairness detection in Terms of Service (ToS) clauses is a challenging legal NLP task that demands carefully crafted prompts to ensure reliable results. However, existing prompt optimization methods are often computationally expensive due to inefficient search strategies and costly prompt candidate scoring. In this paper, we propose a framework that combines Monte Carlo Tree Search (MCTS) with a proxy prompt evaluator to more effectively explore the prompt space while reducing evaluation costs. Experiments demonstrate that our approach achieves higher classification accuracy and efficiency than baseline methods under a constrained computation budget.
摘要：提示优化旨在系统地完善提示，以增强语言模型在特定任务上的性能。服务条款 (ToS) 条款中的公平性检测是一项具有挑战性的法律 NLP 任务，需要精心设计的提示以确保结果可靠。然而，由于低效的搜索策略和昂贵的提示候选评分，现有的提示优化方法通常在计算上昂贵。在本文中，我们提出了一个将蒙特卡罗树搜索（MCTS）与代理提示评估器相结合的框架，以更有效地探索提示空间，同时降低评估成本。实验表明，在计算预算有限的情况下，我们的方法比基线方法实现了更高的分类精度和效率。

Title: Which Heads Matter for Reasoning? RL-Guided KV Cache Compression

Authors: Wenjie Du, Li Jiang, Keda Tao, Xue Liu, Huan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.08525
Pdf URL: https://arxiv.org/pdf/2510.08525
Copy Paste: [[2510.08525]] Which Heads Matter for Reasoning? RL-Guided KV Cache Compression(https://arxiv.org/abs/2510.08525)
Keywords: language model, chain-of-thought
Abstract: Reasoning large language models exhibit complex reasoning behaviors through the extended chain-of-thought generation, creating unprecedented Key-Value (KV) cache overhead during the decoding phase. Existing KV cache compression methods underperform on reasoning models: token-dropping methods break reasoning integrity by discarding critical information, while head-reallocating methods mistakenly compress reasoning-critical heads since they are designed for retrieval tasks, resulting in significant performance degradation as compression rates increase. We hypothesize that KV heads exhibit functional heterogeneity in reasoning models-some heads are critical for chain-of-thought consistency while others are compressible. To validate and exploit this insight, we propose RLKV, a novel reasoning-critical head identification framework, which uses reinforcement learning to directly optimize the relationship between each head's cache usage and reasoning quality. As RLKV produces rewards from actual generated samples during training, it naturally identifies heads relevant to reasoning behaviors. We then allocate full KV cache to these heads while applying compressed constant KV cache to others for efficient inference. Our experiments reveal that only a small fraction of attention heads is essential for reasoning, enabling our KV compression approach to outperform baseline methods while achieving 20-50% cache reduction with near lossless performance compared to uncompressed results.
摘要：推理大型语言模型通过扩展的思想链生成表现出复杂的推理行为，在解码阶段产生前所未有的键值（KV）缓存开销。现有的 KV 缓存压缩方法在推理模型上表现不佳：令牌丢弃方法通过丢弃关键信息来破坏推理完整性，而头重新分配方法错误地压缩推理关键头，因为它们是为检索任务设计的，导致随着压缩率的增加而导致性能显着下降。我们假设 KV 头在推理模型中表现出功能异质性——一些头对于思想链的一致性至关重要，而另一些头则可压缩。为了验证和利用这一见解，我们提出了 RLKV，一种新颖的推理关键头识别框架，它使用强化学习来直接优化每个头的缓存使用和推理质量之间的关系。由于 RLKV 在训练期间从实际生成的样本中产生奖励，因此它自然会识别与推理行为相关的头。然后，我们将完整的 KV 缓存分配给这些头，同时将压缩的常量 KV 缓存应用于其他头以进行有效的推理。我们的实验表明，只有一小部分注意力头对于推理至关重要，这使得我们的 KV 压缩方法能够超越基线方法，同时与未压缩结果相比，实现 20-50% 的缓存减少，且性能接近无损。

Title: CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards

Authors: Xiangyuan Xue, Yifan Zhou, Guibin Zhang, Zaibin Zhang, Yijiang Li, Chen Zhang, Zhenfei Yin, Philip Torr, Wanli Ouyang, Lei Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.08529
Pdf URL: https://arxiv.org/pdf/2510.08529
Copy Paste: [[2510.08529]] CoMAS: Co-Evolving Multi-Agent Systems via Interaction Rewards(https://arxiv.org/abs/2510.08529)
Keywords: language model, llm, agent
Abstract: Self-evolution is a central research topic in enabling large language model (LLM)-based agents to continually improve their capabilities after pretraining. Recent research has witnessed a transition from reinforcement learning (RL)-free to RL-based methods. Current RL-based methods either rely on dense external reward signals or extract intrinsic reward signals from LLMs themselves. However, these approaches diverge from the self-evolution mechanisms observed in human intelligence, where individuals learn and improve through mutual discussion and collaboration. In this work, we introduce Co-Evolving Multi-Agent Systems (CoMAS), a novel framework that enables agents to improve autonomously by learning from inter-agent interactions without external supervision. CoMAS generates intrinsic rewards from rich discussion dynamics, employs an LLM-as-a-judge mechanism to formulate these rewards, and optimizes each agent's policy through RL, thereby enabling decentralized and scalable co-evolution. Experimental results demonstrate that CoMAS consistently outperforms untrained agents and achieves state-of-the-art performance across most evaluation settings. Ablation studies confirm the necessity of interaction-based reward signals and reveal promising scalability as the number and diversity of agents increase. These findings establish CoMAS as a novel and effective paradigm for self-evolution in LLM-based agents.
摘要：自我进化是一个中心研究主题，它使基于大语言模型（LLM）的代理能够在预训练后不断提高其能力。最近的研究见证了从无强化学习 (RL) 的方法到基于 RL 的方法的转变。当前基于强化学习的方法要么依赖密集的外部奖励信号，要么从法学硕士本身提取内在奖励信号。然而，这些方法与人类智能中观察到的自我进化机制不同，即个体通过相互讨论和协作来学习和改进。在这项工作中，我们引入了共同进化多代理系统（CoMAS），这是一种新颖的框架，使代理能够通过从代理间交互中学习而无需外部监督来自主改进。 CoMAS 从丰富的讨论动态中产生内在奖励，采用 LLM 作为法官的机制来制定这些奖励，并通过 RL 优化每个代理的策略，从而实现去中心化和可扩展的共同进化。实验结果表明，CoMAS 始终优于未经训练的智能体，并在大多数评估设置中实现了最先进的性能。消融研究证实了基于交互的奖励信号的必要性，并随着代理数量和多样性的增加而揭示了有希望的可扩展性。这些发现将 CoMAS 确立为基于 LLM 的智能体自我进化的新颖且有效的范例。

Title: ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation

Authors: Qin Liu, Jacob Dineen, Yuxi Huang, Sheng Zhang, Hoifung Poon, Ben Zhou, Muhao Chen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.08569
Pdf URL: https://arxiv.org/pdf/2510.08569
Copy Paste: [[2510.08569]] ArenaBencher: Automatic Benchmark Evolution via Multi-Model Competitive Evaluation(https://arxiv.org/abs/2510.08569)
Keywords: language model, llm
Abstract: Benchmarks are central to measuring the capabilities of large language models and guiding model development, yet widespread data leakage from pretraining corpora undermines their validity. Models can match memorized content rather than demonstrate true generalization, which inflates scores, distorts cross-model comparisons, and misrepresents progress. We introduce ArenaBencher, a model-agnostic framework for automatic benchmark evolution that updates test cases while preserving comparability. Given an existing benchmark and a diverse pool of models to be evaluated, ArenaBencher infers the core ability of each test case, generates candidate question-answer pairs that preserve the original objective, verifies correctness and intent with an LLM as a judge, and aggregates feedback from multiple models to select candidates that expose shared weaknesses. The process runs iteratively with in-context demonstrations that steer generation toward more challenging and diagnostic cases. We apply ArenaBencher to math problem solving, commonsense reasoning, and safety domains and show that it produces verified, diverse, and fair updates that uncover new failure modes, increase difficulty while preserving test objective alignment, and improve model separability. The framework provides a scalable path to continuously evolve benchmarks in step with the rapid progress of foundation models.
摘要：基准对于衡量大型语言模型的能力和指导模型开发至关重要，但预训练语料库的广泛数据泄漏破坏了其有效性。模型可以匹配记忆的内容，而不是展示真正的概括，这会夸大分数，扭曲跨模型比较，并歪曲进展。我们引入了 ArenaBencher，这是一个与模型无关的框架，用于自动基准演化，可以更新测试用例，同时保持可比性。给定现有的基准和要评估的不同模型池，ArenaBencher 推断每个测试用例的核心能力，生成保留原始目标的候选问答对，以法学硕士作为判断者验证正确性和意图，并汇总多个模型的反馈以选择暴露共同弱点的候选者。该过程通过上下文演示迭代运行，引导一代人走向更具挑战性和诊断性的案例。我们将 ArenaBencher 应用于数学问题解决、常识推理和安全领域，并表明它可以产生经过验证的、多样化的和公平的更新，从而发现新的故障模式，在保持测试目标一致性的同时增加难度，并提高模型的可分离性。该框架提供了一条可扩展的路径，可以随着基础模型的快速进展而不断发展基准。