2026-02-06

Title: BioACE: An Automated Framework for Biomedical Answer and Citation Evaluations

Authors: Deepak Gupta, Davis Bartels, Dina Demner-Fushman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04982
Pdf URL: https://arxiv.org/pdf/2602.04982
Copy Paste: [[2602.04982]] BioACE: An Automated Framework for Biomedical Answer and Citation Evaluations(https://arxiv.org/abs/2602.04982)
Keywords: language model, llm, retrieval-augmented generation
Abstract: With the increasing use of large language models (LLMs) for generating answers to biomedical questions, it is crucial to evaluate the quality of the generated answers and the references provided to support the facts in the generated answers. Evaluation of text generated by LLMs remains a challenge for question answering, retrieval-augmented generation (RAG), summarization, and many other natural language processing tasks in the biomedical domain, because expert assessment is needed to verify consistency with the scientific literature and because of complex medical terminology. In this work, we propose BioACE, an automated framework for evaluating biomedical answers and citations against the facts stated in the answers. The proposed BioACE framework considers multiple aspects, including completeness, correctness, precision, and recall, in relation to the ground-truth nuggets for answer evaluation. We developed automated approaches to evaluate each of the aforementioned aspects and performed extensive experiments to assess and analyze their correlation with human evaluations. In addition, we considered multiple existing approaches, such as natural language inference (NLI), pre-trained language models, and LLMs, to evaluate the quality of evidence provided to support the generated answers in the form of citations to the biomedical literature. With the detailed experiments and analysis, we provide the best approaches for biomedical answer and citation evaluation as part of the BioACE (this https URL) evaluation package.
摘要：随着越来越多地使用大型语言模型 (LLM) 来生成生物医学问题的答案，评估生成的答案的质量以及为支持生成的答案中的事实而提供的参考文献至关重要。由于需要专家评估来验证与科学文献和复杂医学术语的一致性，因此对法学硕士生成的文本的评估对于生物医学领域的问答、检索增强生成（RAG）、摘要和许多其他自然语言处理任务仍然是一个挑战。在这项工作中，我们提出了 BioACE，这是一个自动化框架，用于根据答案中陈述的事实评估生物医学答案和引用。拟议的 BioACE 框架考虑了与答案评估的真实数据相关的多个方面，包括完整性、正确性、精确性和召回率。我们开发了自动化方法来评估上述每个方面，并进行了广泛的实验来评估和分析它们与人类评估的相关性。此外，我们考虑了多种现有方法，例如自然语言推理（NLI）和预训练语言模型和法学硕士，以评估所提供的证据的质量，以支持以生物医学文献引用的形式生成的答案。通过详细的实验和分析，我们提供了生物医学答案和引文评估的最佳方法，作为 BioACE（此 https URL）评估包的一部分。
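
Note: the abstract above mentions precision and recall relative to ground-truth nuggets. As a rough illustration only, the sketch below assumes that some upstream matcher (not shown, and not taken from the paper) has already decided which answer claims match a nugget and which nuggets the answer covers. The function name `nugget_scores` and both inputs are hypothetical and do not reflect the BioACE package's actual interface.

```python
from typing import Sequence, Tuple


def nugget_scores(
    claim_is_supported: Sequence[bool],
    nugget_is_covered: Sequence[bool],
) -> Tuple[float, float, float]:
    """Toy nugget-based precision, recall, and F1 (hypothetical helper).

    claim_is_supported: one flag per claim in the generated answer,
        True if the claim matches at least one ground-truth nugget.
    nugget_is_covered: one flag per ground-truth nugget,
        True if the generated answer covers that nugget.
    """
    # Precision: share of answer claims that are backed by a nugget.
    precision = (
        sum(claim_is_supported) / len(claim_is_supported)
        if claim_is_supported else 0.0
    )
    # Recall: share of ground-truth nuggets that the answer covers.
    recall = (
        sum(nugget_is_covered) / len(nugget_is_covered)
        if nugget_is_covered else 0.0
    )
    # F1: harmonic mean, defined as 0 when both terms are 0.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Example: 3 of 4 answer claims match a nugget; 1 of 2 nuggets is covered.
p, r, f1 = nugget_scores([True, True, False, True], [True, False])
print(p, r, f1)
```

How claims are matched to nuggets is the hard part of such an evaluation; this sketch deliberately leaves it out.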

Title: CoWork-X: Experience-Optimized Co-Evolution for Multi-Agent Collaboration System

Authors: Zexin Lin, Jiachen Yu, Haoyang Zhang, Yuzhao Li, Zhonghang Li, Yujiu Yang, Junjie Wang, Xiaoqiang Ji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.05004
Pdf URL: https://arxiv.org/pdf/2602.05004
Copy Paste: [[2602.05004]] CoWork-X: Experience-Optimized Co-Evolution for Multi-Agent Collaboration System(https://arxiv.org/abs/2602.05004)
Keywords: language model, agent
Abstract: Large language models are enabling language-conditioned agents in interactive environments, but highly cooperative tasks often impose two simultaneous constraints: sub-second real-time coordination and sustained multi-episode adaptation under a strict online token budget. Existing approaches either rely on frequent in-episode reasoning that induces latency and timing jitter, or deliver post-episode improvements through unstructured text that is difficult to compile into reliable low-cost execution. We propose CoWork-X, an active co-evolution framework that casts peer collaboration as a closed-loop optimization problem across episodes, inspired by fast-slow memory separation. CoWork-X instantiates a Skill-Agent that executes via HTN (hierarchical task network)-based skill retrieval from a structured, interpretable, and compositional skill library, and a post-episode Co-Optimizer that performs patch-style skill consolidation with explicit budget constraints and drift regularization. Experiments in challenging Overcooked-AI-like real-time collaboration benchmarks demonstrate that CoWork-X achieves stable, cumulative performance gains while steadily reducing online latency and token usage.
摘要：大型语言模型正在交互式环境中启用语言条件代理，但高度协作的任务通常会同时施加两个约束：亚秒级实时协调和严格在线令牌预算下的持续多集适应。现有的方法要么依赖于频繁的剧集内推理，从而导致延迟和定时抖动，要么通过难以编译成可靠的低成本执行的非结构化文本来提供剧集后改进。我们提出了 CoWork-X，这是一种主动的共同进化框架，受快慢内存分离的启发，它将同伴协作视为跨情节的闭环优化问题。 CoWork-X 实例化了一个技能代理，该代理通过基于 HTN（分层任务网络）的技能从结构化、可解释和组合的技能库中检索来执行，以及一个后集协同优化器，该优化器通过明确的预算约束和偏差正则化来执行补丁式技能整合。在具有挑战性的类似 Overcooked-AI 的实时协作基准测试中进行的实验表明，CoWork-X 实现了稳定的累积性能增益，同时稳步降低了在线延迟和令牌使用量。

Title: Capacity Constraints and the Multilingual Penalty for Lexical Disambiguation

Authors: Sean Trott, Pamela D. Rivière
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05035
Pdf URL: https://arxiv.org/pdf/2602.05035
Copy Paste: [[2602.05035]] Capacity Constraints and the Multilingual Penalty for Lexical Disambiguation(https://arxiv.org/abs/2602.05035)
Keywords: language model
Abstract: Multilingual language models (LMs) sometimes under-perform their monolingual counterparts, possibly due to capacity limitations. We quantify this "multilingual penalty" for lexical disambiguation, a task requiring precise semantic representations and contextualization mechanisms, using controlled datasets of human relatedness judgments for ambiguous words in both English and Spanish. Comparing monolingual and multilingual LMs from the same families, we find consistently reduced performance in multilingual LMs. We then explore three potential capacity constraints: representational (reduced embedding isotropy), attentional (reduced attention to disambiguating cues), and vocabulary-related (increased multi-token segmentation). Multilingual LMs show some evidence of all three limitations; moreover, these factors statistically account for the variance formerly attributed to a model's multilingual status. These findings suggest both that multilingual LMs do suffer from multiple capacity constraints, and that these constraints correlate with reduced disambiguation performance.
摘要：多语言语言模型 (LM) 有时表现不佳，这可能是由于容量限制。我们使用人类相关性判断的受控数据集来量化词汇消歧的“多语言惩罚”——这是一项需要精确语义表示和语境化机制的任务——对英语和西班牙语中的歧义单词进行人类相关性判断。比较来自同一家族的单语言和多语言 LM，我们发现多语言 LM 的性能持续下降。然后，我们探索三个潜在的容量限制：表征（减少嵌入各向同性）、注意力（减少对消歧线索的关注）和词汇相关（增加多标记分割）。多语言 LM 显示了所有这三个局限性的一些证据；此外，这些因素在统计上解释了以前归因于模型的多语言状态的差异。这些发现表明，多语言语言模型确实受到多种能力限制，并且这些限制与消歧性能下降相关。

Title: Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories

Authors: Sidi Lu, Zhenwen Liang, Dongyang Ma, Yan Wang, Haitao Mi, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05085
Pdf URL: https://arxiv.org/pdf/2602.05085
Copy Paste: [[2602.05085]] Locas: Your Models are Principled Initializers of Locally-Supported Parametric Memories(https://arxiv.org/abs/2602.05085)
Keywords: language model, llm
Abstract: In this paper, we aim to bridge test-time-training with a new type of parametric memory that can be flexibly offloaded from or merged into model parameters. We present Locas, a Locally-Supported parametric memory that shares the design of FFN blocks in modern transformers, allowing it to be flexibly permanentized into the model parameters while supporting efficient continual learning. We discuss two major variants of Locas: one with a conventional two-layer MLP design that has a clearer theoretical guarantee; the other one shares the same GLU-FFN structure with SOTA LLMs, and can be easily attached to existing models for both parameter-efficient and computation-efficient continual learning. Crucially, we show that proper initialization of such low-rank sideway-FFN-style memories -- performed in a principled way by reusing model parameters, activations and/or gradients -- is essential for fast convergence, improved generalization, and catastrophic forgetting prevention. We validate the proposed memory mechanism on the PG-19 whole-book language modeling and LoCoMo long-context dialogue question answering tasks. With only 0.02% additional parameters in the lowest case, Locas-GLU is capable of storing the information from past context while maintaining a much smaller context window. We also test the model's general capability loss after memorizing the whole book with Locas, through comparative MMLU evaluation. Results show the promising ability of Locas to permanentize past context into parametric knowledge with minimized catastrophic forgetting of the model's existing internal knowledge.
摘要：在本文中，我们的目标是通过一种新型参数存储器来连接测试时训练，该参数存储器可以灵活地从模型参数中卸载或合并到模型参数中。我们推出了 Locas，一种本地支持的参数存储器，它共享现代变压器中 FFN 块的设计，使其能够灵活地永久化到模型参数中，同时支持高效的持续学习。我们讨论 Locas 的两种主要变体：一种采用传统的两层 MLP 设计，具有更清晰的理论保证；另一种采用传统的两层 MLP 设计，具有更清晰的理论保证；另一种与 SOTA LLM 共享相同的 GLU-FFN 结构，并且可以轻松附加到现有模型，以实现参数高效和计算高效的持续学习。至关重要的是，我们表明，这种低阶横向 FFN 式记忆的正确初始化（通过重用模型参数、激活和/或梯度以原则性方式执行）对于快速收敛、改进泛化和预防灾难性遗忘至关重要。我们在 PG-19 全书语言建模和 LoCoMo 长上下文对话问答任务上验证了所提出的记忆机制。在最低情况下仅需要 0.02% 的附加参数，Locas-GLU 就能够存储来自过去上下文的信息，同时保持更小的上下文窗口。此外，我们还通过对比MMLU评估，测试了用Locas背诵整本书后模型的综合能力损失。结果表明，Locas 具有将过去的上下文永久化为参数化知识的能力，同时最大限度地减少模型现有内部知识的灾难性遗忘。

Title: Data Kernel Perspective Space Performance Guarantees for Synthetic Data from Transformer Models

Authors: Michael Browder, Kevin Duh, J. David Harris, Vince Lyzinski, Paul McNamee, Youngser Park, Carey E. Priebe, Peter Viechnicki
Subjects: cs.CL, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2602.05106
Pdf URL: https://arxiv.org/pdf/2602.05106
Copy Paste: [[2602.05106]] Data Kernel Perspective Space Performance Guarantees for Synthetic Data from Transformer Models(https://arxiv.org/abs/2602.05106)
Keywords: llm
Abstract: Scarcity of labeled training data remains the long pole in the tent for building performant language technology and generative AI models. Transformer models -- particularly LLMs -- are increasingly being used to mitigate the data scarcity problem via synthetic data generation. However, because the models are black boxes, the properties of the synthetic data are difficult to predict. In practice it is common for language technology engineers to 'fiddle' with the LLM temperature setting and hope that what comes out the other end improves the downstream model. Faced with this uncertainty, here we propose Data Kernel Perspective Space (DKPS) to provide the foundation for mathematical analysis yielding concrete statistical guarantees for the quality of the outputs of transformer models. We first show the mathematical derivation of DKPS and how it provides performance guarantees. Next we show how DKPS performance guarantees can elucidate performance of a downstream task, such as neural machine translation models or LLMs trained using Contrastive Preference Optimization (CPO). Limitations of the current work and future research are also discussed.
摘要：标记训练数据的稀缺仍然是构建高性能语言技术和生成人工智能模型的长杆。 Transformer 模型（尤其是法学硕士）越来越多地用于通过合成数据生成来缓解数据稀缺问题。然而，由于模型是黑匣子，因此合成数据的属性很难预测。在实践中，语言技术工程师通常会“摆弄”LLM 温度设置，并希望另一端的输出能够改善下游模型。面对这种不确定性，我们在这里提出数据内核透视空间（DKPS），为数学分析提供基础，为变压器模型的输出质量提供具体的统计保证。我们首先展示 DKPS 的数学推导以及它如何提供性能保证。接下来，我们将展示 DKPS 性能保证如何阐明下游任务的性能，例如使用对比偏好优化 (CPO) 训练的神经机器翻译模型或 LLM。还讨论了当前工作和未来研究的局限性。

Title: GreekMMLU: A Native-Sourced Multitask Benchmark for Evaluating Language Models in Greek

Authors: Yang Zhang, Mersin Konomi, Christos Xypolopoulos, Konstantinos Divriotis, Konstantinos Skianis, Giannis Nikolentzos, Giorgos Stamou, Guokan Shang, Michalis Vazirgiannis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05150
Pdf URL: https://arxiv.org/pdf/2602.05150
Copy Paste: [[2602.05150]] GreekMMLU: A Native-Sourced Multitask Benchmark for Evaluating Language Models in Greek(https://arxiv.org/abs/2602.05150)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are commonly trained on multilingual corpora that include Greek, yet reliable evaluation benchmarks for Greek, particularly those based on authentic, native-sourced content, remain limited. Existing datasets are often machine-translated from English, failing to capture Greek linguistic and cultural characteristics. We introduce GreekMMLU, a native-sourced benchmark for massive multitask language understanding in Greek, comprising 21,805 multiple-choice questions across 45 subject areas, organized under a newly defined subject taxonomy and annotated with educational difficulty levels spanning primary to professional examinations. All questions are sourced or authored in Greek from academic, professional, and governmental exams. We publicly release 16,857 samples and reserve 4,948 samples for a private leaderboard to enable robust and contamination-resistant evaluation. Evaluations of over 80 open- and closed-source LLMs reveal substantial performance gaps between frontier and open-weight models, as well as between Greek-adapted models and general multilingual ones. Finally, we provide a systematic analysis of factors influencing performance, including model scale, adaptation, and prompting, and derive insights for improving LLM capabilities in Greek.
摘要：大型语言模型 (LLM) 通常在包括希腊语在内的多语言语料库上进行训练，但希腊语的可靠评估基准（尤其是基于真实的本地内容的评估基准）仍然有限。现有的数据集通常是从英语机器翻译的，无法捕捉希腊的语言和文化特征。我们引入了 GreekMMLU，这是一个用于希腊语大规模多任务语言理解的原生基准，包含 45 个学科领域的 21,805 个多项选择题，按照新定义的学科分类法进行组织，并注释了从初级到专业考试的教育难度级别。所有问题均来自学术、专业和政府考试的希腊语或以希腊语撰写。我们公开发布了 16,857 个样本，并为私人排行榜保留了 4,948 个样本，以实现稳健且抗污染的评估。对 80 多个开源和闭源法学硕士的评估揭示了前沿模型和开放权重模型之间以及希腊语模型和一般多语言模型之间存在巨大的性能差距。最后，我们对影响绩效的因素（包括模型规模、适应性和提示）进行了系统分析，并得出了提高希腊语法学硕士能力的见解。

Title: Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems

Authors: Ziyuan Yang, Wenxuan Ding, Shangbin Feng, Yulia Tsvetkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05176
Pdf URL: https://arxiv.org/pdf/2602.05176
Copy Paste: [[2602.05176]] Among Us: Measuring and Mitigating Malicious Contributions in Model Collaboration Systems(https://arxiv.org/abs/2602.05176)
Keywords: language model, llm, agent
Abstract: Language models (LMs) are increasingly used in collaboration: multiple LMs trained by different parties collaborate through routing systems, multi-agent debate, model merging, and more. Critical safety risks remain in this decentralized paradigm: what if some of the models in multi-LLM systems are compromised or malicious? We first quantify the impact of malicious models by engineering four categories of malicious LMs, plugging them into four types of popular model collaboration systems, and evaluating the compromised systems across 10 datasets. We find that malicious models have a severe impact on the multi-LLM systems, especially for reasoning and safety domains, where performance is lowered by 7.12% and 7.94% on average, respectively. We then propose mitigation strategies to alleviate the impact of malicious components by employing external supervisors that oversee model collaboration and disable or mask out those components to reduce their influence. On average, these strategies recover 95.31% of the initial performance, while making model collaboration systems fully resistant to malicious models remains an open research question.
摘要：语言模型 (LM) 越来越多地用于协作：由不同方训练的多个 LM 通过路由系统、多代理辩论、模型合并等进行协作。这种去中心化范式中仍然存在关键的安全风险：如果多法学硕士系统中的某些模型受到损害或恶意怎么办？我们首先通过设计四类恶意 LM，将它们插入四种类型的流行模型协作系统中，并跨 10 个数据集评估受损系统，从而量化恶意模型的影响。我们发现恶意模型对多 LLM 系统产生严重影响，特别是推理和安全领域，性能平均降低 7.12% 和 7.94%。然后，我们提出缓解策略，通过聘请外部监督者来监督模型协作来禁用/屏蔽恶意组件以减少其影响，从而减轻恶意组件的影响。平均而言，这些策略恢复了 95.31% 的初始性能，而使模型协作系统完全抵抗恶意模型仍然是一个悬而未决的研究问题。

Title: The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems

Authors: Shangbin Feng, Kishan Panaganti, Yulia Tsvetkov, Wenhao Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05182
Pdf URL: https://arxiv.org/pdf/2602.05182
Copy Paste: [[2602.05182]] The Single-Multi Evolution Loop for Self-Improving Model Collaboration Systems(https://arxiv.org/abs/2602.05182)
Keywords: language model
Abstract: Model collaboration -- systems where multiple language models (LMs) collaborate -- combines the strengths of diverse models but incurs the cost of loading multiple LMs. We improve efficiency while preserving the strengths of collaboration by distilling collaborative patterns into a single model, where the model is trained on the outputs of the model collaboration system. At inference time, only the distilled model is employed: it imitates the collaboration while only incurring the cost of a single model. Furthermore, we propose the single-multi evolution loop: multiple LMs collaborate, each distills from the collaborative outputs, and these post-distillation improved LMs collaborate again, forming a collective evolution ecosystem where models evolve and self-improve by interacting with an environment of other models. Extensive experiments with 7 collaboration strategies and 15 tasks (QA, reasoning, factuality, etc.) demonstrate that: 1) individual models improve by 8.0% on average, absorbing the strengths of collaboration while reducing the cost to a single model; 2) the collaboration also benefits from the stronger and more synergistic LMs after distillation, improving over initial systems without evolution by 14.9% on average. Analysis reveals that the single-multi evolution loop outperforms various existing evolutionary AI methods, is compatible with diverse model/collaboration/distillation settings, and helps solve problems that the initial model/system struggles with.
摘要：模型协作——多个语言模型 (LM) 协作的系统——结合了不同模型的优势和加载多个 LM 的成本。我们通过将协作模式提炼成单个模型来提高效率，同时保留协作的优势，其中该模型根据模型协作系统的输出进行训练。在推理时，仅采用蒸馏模型：它模仿协作，同时仅产生单个模型的成本。此外，我们提出了单多进化循环：多个 LM 协作，每个 LM 都从协作输出中提取，这些精炼后改进的 LM 再次协作，形成一个集体进化生态系统，模型通过与其他模型的环境交互来进化和自我改进。对 7 种协作策略和 15 项任务（QA、推理、事实性等）进行的大量实验表明：1）单个模型平均提高了 8.0%，吸收了协作的优势，同时降低了单个模型的成本； 2) 合作还受益于蒸馏后更强大、更具协同性的语言模型，比没有进化的初始系统平均提高了 14.9%。分析表明，单多进化循环优于现有的各种进化人工智能方法，兼容多种模型/协作/蒸馏设置，有助于解决初始模型/系统难以解决的问题。

Title: Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky

Authors: Hsuan-Yu Chou, Wajiha Naveed, Shuyan Zhou, Xiaowei Yang
Subjects: cs.CL, cs.HC, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2602.05189
Pdf URL: https://arxiv.org/pdf/2602.05189
Copy Paste: [[2602.05189]] Are Open-Weight LLMs Ready for Social Media Moderation? A Comparative Study on Bluesky(https://arxiv.org/abs/2602.05189)
Keywords: language model, llm
Abstract: As internet access expands, so does exposure to harmful content, increasing the need for effective moderation. Research has demonstrated that large language models (LLMs) can be effectively utilized for social media moderation tasks, including harmful content detection. While proprietary LLMs have been shown to outperform traditional machine learning models in zero-shot settings, the out-of-the-box capability of open-weight LLMs remains an open question. Motivated by recent developments of reasoning LLMs, we evaluate seven state-of-the-art models: four proprietary and three open-weight. Testing with real-world posts on Bluesky, moderation decisions by Bluesky Moderation Service, and annotations by two authors, we find a considerable degree of overlap between the sensitivity (81% to 97%) and specificity (91% to 100%) of the open-weight LLMs and those (72% to 98%, and 93% to 99%) of the proprietary ones. Additionally, our analysis reveals that specificity exceeds sensitivity for rudeness detection, but the opposite holds for intolerance and threats. Lastly, we identify inter-rater agreement across human moderators and the LLMs, highlighting considerations for deploying LLMs in both platform-scale and personalized moderation contexts. These findings show open-weight LLMs can support privacy-preserving moderation on consumer-grade hardware and suggest new directions for designing moderation systems that balance community values with individual user preferences.
摘要：随着互联网接入的扩大，接触有害内容的机会也在增加，从而增加了有效审核的必要性。研究表明，大型语言模型（LLM）可以有效地用于社交媒体审核任务，包括有害内容检测。虽然专有法学硕士已被证明零样本优于传统机器学习模型，但开放权重法学硕士的开箱即用能力仍然是一个悬而未决的问题。受推理法学硕士最新发展的推动，我们评估了七个最先进的模型：四个专有模型和三个开放权重模型。对 Bluesky 上的真实帖子、Bluesky 审核服务的审核决策以及两位作者的注释进行测试，我们发现开放权重法学硕士的敏感性 (81%--97%) 和特异性 (91%--100%) 与专有法学硕士的敏感性 (81%--97%) 和特异性 (91%--100%) 和专有法学硕士 (72%--98% 和 93%--99%) 之间有相当程度的重叠。此外，我们的分析表明，特异性超过了粗鲁行为检测的灵敏度，但对于不容忍和威胁则相反。最后，我们确定了人类审核员和法学硕士之间的评估者间协议，强调了在平台规模和个性化审核环境中部署法学硕士的考虑因素。这些发现表明，开放权重法学硕士可以支持消费级硬件上的隐私保护审核，并为设计平衡社区价值观与个人用户偏好的审核系统提出了新方向。
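
Note: the abstract above reports sensitivity and specificity ranges. For readers less familiar with the two metrics, here is a minimal sketch using their standard definitions on binary harmful/not-harmful labels. The function name and the toy labels are illustrative assumptions, not data or code from the study.

```python
import math
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels (1 = harmful)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Sensitivity: fraction of truly harmful posts that were flagged.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    # Specificity: fraction of benign posts that were left alone.
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return sensitivity, specificity


# Toy data: 4 harmful posts (3 flagged), 6 benign posts (1 wrongly flagged).
truth = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
preds = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(sensitivity_specificity(truth, preds))
```

A moderation model with high specificity but lower sensitivity rarely flags benign posts but misses some harmful ones, which is the pattern the abstract describes for rudeness detection.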

Title: Aligning Large Language Model Behavior with Human Citation Preferences

Authors: Kenichiro Ando, Tatsuya Harada
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.05205
Pdf URL: https://arxiv.org/pdf/2602.05205
Copy Paste: [[2602.05205]] Aligning Large Language Model Behavior with Human Citation Preferences(https://arxiv.org/abs/2602.05205)
Keywords: language model, llm
Abstract: Most services built on powerful large-scale language models (LLMs) add citations to their output to enhance credibility. Recent research has paid increasing attention to the question of what reference documents to link to outputs. However, how LLMs recognize cite-worthiness and how this process should be controlled remains underexplored. In this study, we focus on what kinds of content LLMs currently tend to cite and how well that behavior aligns with human preferences. We construct a dataset to characterize the relationship between human citation preferences and LLM behavior. Web-derived texts are categorized into eight citation-motivation types, and pairwise citation preferences are exhaustively evaluated across all type combinations to capture fine-grained contrasts. Our results show that humans most frequently seek citations for medical text, and stronger models display a similar tendency. We also find that current models are as much as 27% more likely than humans to add citations to text that is explicitly marked as needing citations on sources such as Wikipedia, and this overemphasis reduces alignment accuracy. Conversely, models systematically underselect numeric sentences (-22.6% relative to humans) and sentences containing personal names (-20.1%), categories for which humans typically demand citations. Furthermore, experiments with Direct Preference Optimization demonstrate that model behavior can be calibrated to better match human citation preferences. We expect this study to provide a foundation for more fine-grained investigations into LLM citation preferences.
摘要：大多数基于强大的大规模语言模型 (LLM) 构建的服务都会在其输出中添加引用以提高可信度。最近的研究越来越关注将哪些参考文档链接到输出的问题。然而，法学硕士如何识别引用价值以及如何控制这一过程仍有待探索。在这项研究中，我们关注法学硕士目前倾向于引用哪些类型的内容，以及这种行为与人类偏好的契合程度。我们构建了一个数据集来描述人类引用偏好和法学硕士行为之间的关系。来自网络的文本被分为八种引用动机类型，并且对所有类型组合的成对引用偏好进行详尽评估，以捕获细粒度的对比。我们的结果表明，人类最常寻求对医学文本的引用，并且更强的模型也表现出类似的趋势。我们还发现，当前模型比人类更有可能在维基百科等来源明确标记为需要引用的文本中添加引用，而这种过分强调会降低对齐准确性。相反，模型系统地低估了数字句子（相对于人类而言 $-22.6\%$）和包含人名的句子（$-20.1\%$），而人类通常要求引用这些类别。此外，直接偏好优化的实验表明，可以校准模型行为以更好地匹配人类的引用偏好。我们希望这项研究能为对法学硕士引用偏好进行更细粒度的调查奠定基础。

Title: FedMosaic: Federated Retrieval-Augmented Generation via Parametric Adapters

Authors: Zhilin Liang, Yuxiang Wang, Zimu Zhou, Hainan Zhang, Boyi Liu, Yongxin Tong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05235
Pdf URL: https://arxiv.org/pdf/2602.05235
Copy Paste: [[2602.05235]] FedMosaic: Federated Retrieval-Augmented Generation via Parametric Adapters(https://arxiv.org/abs/2602.05235)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by grounding generation in external knowledge to improve factuality and reduce hallucinations. Yet most deployments assume a centralized corpus, which is infeasible in privacy-aware domains where knowledge remains siloed. This motivates federated RAG (FedRAG), where a central LLM server collaborates with distributed silos without sharing raw documents. In-context RAG violates this requirement by transmitting verbatim documents, whereas parametric RAG encodes documents into lightweight adapters that merge with a frozen LLM at inference, avoiding raw-text exchange. We adopt the parametric approach but face two unique challenges induced by FedRAG: high storage and communication costs from per-document adapters, and destructive aggregation caused by indiscriminately merging multiple adapters. We present FedMosaic, the first federated RAG framework built on parametric adapters. FedMosaic clusters semantically related documents into multi-document adapters with document-specific masks to reduce overhead while preserving specificity, and performs selective adapter aggregation to combine only relevance-aligned, nonconflicting adapters. Experiments show that FedMosaic achieves an average 10.9% higher accuracy than state-of-the-art methods in four categories, while lowering storage costs by 78.8% to 86.3% and communication costs by 91.4%, and never sharing raw documents.
摘要：检索增强生成（RAG）通过将生成建立在外部知识的基础上来增强大型语言模型（LLM），以提高事实性并减少幻觉。然而，大多数部署都采用集中式语料库，这在知识仍然孤立的隐私意识领域是不可行的。这激发了联合 RAG (FedRAG)，其中中央 LLM 服务器与分布式孤岛协作，而不共享原始文档。在上下文中，RAG 通过传输逐字文档来违反此要求，而参数化 RAG 将文档编码到轻量级适配器中，在推理时与冻结的 LLM 合并，从而避免原始文本交换。我们采用参数化方法，但面临 FedRAG 带来的两个独特挑战：每个文档适配器的高存储和通信，以及不加区别地合并多个适配器造成的破坏性聚合。我们推出了 FedMosaic，这是第一个基于参数适配器构建的联合 RAG 框架。 FedMosaic 将语义相关的文档聚类为具有文档特定掩码的多文档适配器，以减少开销，同时保留特异性，并执行选择性适配器聚合以仅组合相关性对齐、不冲突的适配器。实验表明，FedMosaic 在四个类别中的准确率平均比最先进的方法高 10.9%，同时将存储成本降低 78.8% 至 86.3%，将通信成本降低 91.4%，并且从不共享原始文档。

Title: Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks

Authors: Guangwei Zhang, Jianing Zhu, Cheng Qian, Neil Gong, Rada Mihalcea, Zhaozhuo Xu, Jingrui He, Jiaqi Ma, Yun Huang, Chaowei Xiao, Bo Li, Ahmed Abbasi, Dongwon Lee, Heng Ji, Denghui Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05252
Pdf URL: https://arxiv.org/pdf/2602.05252
Copy Paste: [[2602.05252]] Copyright Detective: A Forensic System to Evidence LLMs Flickering Copyright Leakage Risks(https://arxiv.org/abs/2602.05252)
Keywords: llm, prompt
Abstract: We present Copyright Detective, the first interactive forensic system for detecting, analyzing, and visualizing potential copyright risks in LLM outputs. The system treats copyright infringement versus compliance as an evidence discovery process rather than a static classification task due to the complex nature of copyright law. It integrates multiple detection paradigms, including content recall testing, paraphrase-level similarity analysis, persuasive jailbreak probing, and unlearning verification, within a unified and extensible framework. Through interactive prompting, response collection, and iterative workflows, our system enables systematic auditing of verbatim memorization and paraphrase-level leakage, supporting responsible deployment and transparent evaluation of LLM copyright risks even with black-box access.
摘要：我们推出了版权侦探，这是第一个交互式取证系统，用于检测、分析和可视化法学硕士输出中的潜在版权风险。由于版权法的复杂性，该系统将版权侵权与合规视为证据发现过程，而不是静态分类任务。它在统一且可扩展的框架内集成了多种检测范例，包括内容回忆测试、释义级相似性分析、说服性越狱探测和遗忘验证。通过交互式提示、回复收集和迭代工作流程，我们的系统能够对逐字记忆和释义级别的泄漏进行系统审核，支持负责任的部署和透明评估法学硕士版权风险，即使在黑盒访问的情况下也是如此。

Title: CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs

Authors: Haoran Li, Sucheng Ren, Alan Yuille, Feng Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.05258
Pdf URL: https://arxiv.org/pdf/2602.05258
Copy Paste: [[2602.05258]] CoPE: Clipped RoPE as A Scalable Free Lunch for Long Context LLMs(https://arxiv.org/abs/2602.05258)
Keywords: language model, llm, long context
Abstract: Rotary Positional Embedding (RoPE) is a key component of context scaling in Large Language Models (LLMs). While various methods have been proposed to adapt RoPE to longer contexts, their guiding principles generally fall into two categories: (1) out-of-distribution (OOD) mitigation, which scales RoPE frequencies to accommodate unseen positions, and (2) semantic modeling, which posits that the attention scores computed with RoPE should always prioritize semantically similar tokens. In this work, we unify these seemingly distinct objectives through a minimalist intervention, namely CoPE: soft clipping of the low-frequency components of RoPE. CoPE not only eliminates OOD outliers and refines semantic signals, but also prevents spectral leakage caused by hard clipping. Extensive experiments demonstrate that simply applying our soft clipping strategy to RoPE yields significant performance gains that scale up to 256k context length, validating our theoretical analysis and establishing CoPE as a new state-of-the-art for length generalization. Our code, data, and models are available at this https URL.
摘要：旋转位置嵌入 (RoPE) 是大型语言模型 (LLM) 中上下文扩展的关键组成部分。虽然已经提出了各种方法来使 RoPE 适应更长的上下文，但它们的指导原则通常分为两类：(1) 分布外 (OOD) 缓解，它可以缩放 RoPE 频率以适应看不见的位置；(2) 语义建模，它假设使用 RoPE 计算的注意力分数应始终优先考虑语义相似的标记。在这项工作中，我们通过极简主义干预来统一这些看似不同的目标，即 CoPE：软削波 RoPE 的低频分量。 CoPE不仅可以消除OOD异常值并细化语义信号，还可以防止硬削波导致的频谱泄漏。大量实验表明，简单地将我们的软裁剪策略应用于 RoPE 即可产生显着的性能提升，可扩展至 256k 上下文长度，验证了我们的理论分析并将 CoPE 确立为长度泛化的新的最先进技术。我们的代码、数据和模型可在此 https URL 中获取。

Title: Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Authors: Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng, Haibo Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05261
Pdf URL: https://arxiv.org/pdf/2602.05261
Copy Paste: [[2602.05261]] Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR(https://arxiv.org/abs/2602.05261)
Keywords: language model, llm
Abstract: Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.
摘要：最近，具有可验证奖励的强化学习 (RLVR) 在大型语言模型 (LLM) 和视觉语言模型 (VLM) 中的应用在增强复杂任务的推理能力方面取得了巨大成功。在RLVR训练过程中，反应长度的增加通常被认为是促进推理能力增长的关键因素。然而，在训练过程中，不同 RLVR 算法的响应长度变化模式存在显着差异。为了从根本上解释这些变化，本文对主流 RLVR 算法的组成部分进行了深入分析。我们对影响响应长度的因素进行了理论分析，并通过广泛的实验验证了我们的理论。基于这些理论发现，我们提出了长度无偏序列策略优化（LUSPO）算法。具体来说，我们纠正了组序列策略优化（GSPO）中固有的长度偏差，使其损失函数相对于响应长度无偏差，从而解决了响应长度崩溃的问题。我们在数学推理基准和多模态推理场景中进行了广泛的实验，其中 LUSPO 始终实现卓越的性能。实证结果表明，与 GRPO 和 GSPO 等现有方法相比，LUSPO 代表了一种新颖、最先进的优化策略。

Title: Towards a Science of Collective AI: LLM-based Multi-Agent Systems Need a Transition from Blind Trial-and-Error to Rigorous Science

Authors: Jingru Fan, Dewen Liu, Yufan Dang, Huatao Li, Yuheng Wang, Wei Liu, Feiyu Duan, Xuanwen Ding, Shu Yao, Lin Wu, Ruijie Shi, Wai-Shing Leung, Yuan Cheng, Zhongyu Wei, Cheng Yang, Chen Qian, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2602.05289
Pdf URL: https://arxiv.org/pdf/2602.05289
Copy Paste: [[2602.05289]] Towards a Science of Collective AI: LLM-based Multi-Agent Systems Need a Transition from Blind Trial-and-Error to Rigorous Science(https://arxiv.org/abs/2602.05289)
Keywords: language model, llm, agent
Abstract: Recent advancements in Large Language Models (LLMs) have greatly extended the capabilities of Multi-Agent Systems (MAS), demonstrating significant effectiveness across a wide range of complex and open-ended domains. However, despite this rapid progress, the field still relies heavily on empirical trial-and-error. It lacks a unified and principled scientific framework necessary for systematic optimization and improvement. This bottleneck stems from the ambiguity of attribution: first, the absence of a structured taxonomy of factors leaves researchers restricted to unguided adjustments; second, the lack of a unified metric fails to distinguish genuine collaboration gain from mere resource accumulation. In this paper, we advocate for a transition to design science through an integrated framework. We advocate to establish the collaboration gain metric ($\Gamma$) as the scientific standard to isolate intrinsic gains from increased budgets. Leveraging $\Gamma$, we propose a factor attribution paradigm to systematically identify collaboration-driving factors. To support this, we construct a systematic MAS factor library, structuring the design space into control-level presets and information-level dynamics. Ultimately, this framework facilitates the transition from blind experimentation to rigorous science, paving the way towards a true science of Collective AI.
摘要：大型语言模型 (LLM) 的最新进展极大地扩展了多代理系统 (MAS) 的功能，在广泛的复杂和开放领域中展示了显着的有效性。然而，尽管取得了如此迅速的进展，该领域仍然严重依赖于经验试错。缺乏系统优化和改进所必需的统一、原则性的科学框架。这一瓶颈源于归因的模糊性：首先，缺乏结构化的因素分类导致研究人员只能进行无指导的调整；其次，缺乏统一的衡量标准无法区分真正的协作收益和单纯的资源积累。在本文中，我们主张通过综合框架向设计科学过渡。我们主张建立协作收益指标（$\Gamma$）作为科学标准，以将内在收益与增加的预算隔离开来。利用$\Gamma$，我们提出了一种因素归因范式来系统地识别协作驱动因素。为了支持这一点，我们构建了一个系统的 MAS 因子库，将设计空间构建为控制级预设和信息级动态。最终，这个框架促进了从盲目实验到严谨科学的转变，为真正的集体人工智能科学铺平了道路。

Title: MentorCollab: Selective Large-to-Small Inference-Time Guidance for Efficient Reasoning

Authors: Haojin Wang, Yike Wang, Shangbin Feng, Hannaneh Hajishirzi, Yulia Tsvetkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05307
Pdf URL: https://arxiv.org/pdf/2602.05307
Copy Paste: [[2602.05307]] MentorCollab: Selective Large-to-Small Inference-Time Guidance for Efficient Reasoning(https://arxiv.org/abs/2602.05307)
Keywords: language model
Abstract: Large reasoning models (LRMs) achieve strong performance by producing long chains of thought, but their inference costs are high and they often generate redundant reasoning. Small language models (SLMs) are far more efficient, yet struggle on multi-step reasoning tasks. A natural idea is to let a large model guide a small one at inference time as a mentor, yet existing collaboration methods often promote imitation, resulting in verbose reasoning without consistent error correction. We propose MentorCollab, an inference-time collaboration method in which an LRM selectively and sparsely guides an SLM, rather than taking over generation. At randomly sampled token positions, we probe for divergences between the two models and use a lightweight verifier to decide whether the SLM should follow a short lookahead segment from its mentor or continue on its own. Across 15 SLM--LRM pairs and 3 domains (math reasoning, general knowledge, and commonsense reasoning), our method improves performance in 12 settings, with average gains of 3.0% and up to 8.0%, while having only 18.4% of tokens generated by the expensive mentor model on average. We find that short segments and selective probing are sufficient for effective collaboration. Our results show that selective inference-time guidance restores large-model reasoning ability without substantial inference overhead.
摘要：大型推理模型（LRM）通过产生长的思维链来实现强大的性能，但其推理成本很高，并且经常产生冗余推理。小语言模型 (SLM) 的效率要高得多，但在多步骤推理任务上却很困难。一个自然的想法是让一个大模型作为导师在推理时指导一个小模型，但现有的协作方法经常促进模仿，导致冗长的推理而没有一致的错误纠正。我们提出了 MentorCollab，一种推理时协作方法，其中 LRM 有选择地、稀疏地指导 SLM，而不是接管生成。在随机采样的令牌位置上，我们探究两个模型之间的差异，并使用轻量级验证器来决定 SLM 是否应该遵循其导师的简短前瞻片段，或者继续自己继续。在 15 个 SLM-LRM 对和 3 个领域（数学推理、一般知识和常识推理）中，我们的方法在 12 种设置中提高了性能，平均增益为 3.0% 和高达 8.0%，而平均只采用由昂贵的导师模型生成的 18.4% 的令牌。我们发现短片段和选择性探索足以实现有效的合作。我们的结果表明，选择性推理时间指导可以恢复大模型推理能力，而无需大量推理开销。
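The control flow described in the MentorCollab abstract (probe at random positions, check for divergence, let a verifier decide whether to follow a short mentor segment) can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: `slm_next`, `lrm_segment`, `verifier`, `probe_prob`, and `lookahead` are hypothetical names, and the real models and verifier are replaced by plain Python callables.

```python
import random

def collaborate(slm_next, lrm_segment, verifier, prompt,
                max_tokens=32, probe_prob=0.2, lookahead=3, seed=0):
    """Toy student/mentor decoding loop (all callables are hypothetical stand-ins).

    slm_next(tokens)       -> next token proposed by the small model
    lrm_segment(tokens, n) -> up to n lookahead tokens proposed by the large model
    verifier(tokens, student_token, segment) -> True to follow the mentor segment
    Returns the generated tokens and the share of them produced by the mentor.
    """
    rng = random.Random(seed)
    tokens = list(prompt)
    n_prompt = len(tokens)
    mentor_tokens = 0
    while len(tokens) - n_prompt < max_tokens:
        remaining = max_tokens - (len(tokens) - n_prompt)
        student_tok = slm_next(tokens)
        # Probe only at randomly sampled positions.
        if rng.random() < probe_prob:
            segment = lrm_segment(tokens, lookahead)[:remaining]
            diverged = bool(segment) and segment[0] != student_tok
            # Follow the mentor only if the models diverge and the verifier agrees.
            if diverged and verifier(tokens, student_tok, segment):
                tokens.extend(segment)
                mentor_tokens += len(segment)
                continue
        tokens.append(student_tok)
    generated = tokens[n_prompt:]
    return generated, mentor_tokens / len(generated)
```

With `probe_prob=0.0` the loop reduces to plain small-model decoding, so the mentor share returned by the function gives a direct handle on the cost trade-off the abstract reports.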

Title: How Do Language Models Acquire Character-Level Information?

Authors: Soma Sato, Ryohei Sasano
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05347
Pdf URL: https://arxiv.org/pdf/2602.05347
Copy Paste: [[2602.05347]] How Do Language Models Acquire Character-Level Information?(https://arxiv.org/abs/2602.05347)
Keywords: language model
Abstract: Language models (LMs) have been reported to implicitly encode character-level information, despite this information not being explicitly provided during training. However, the mechanisms underlying this phenomenon remain largely unexplored. To reveal the mechanisms, we analyze how models acquire character-level knowledge by comparing LMs trained under controlled settings, such as specifying the pre-training dataset or tokenizer, with those trained under standard settings. We categorize the contributing factors into those arising from tokenization and those independent of tokenization. Our analysis reveals that merge rules and orthographic constraints constitute primary factors arising from tokenization, whereas semantic associations of substrings and syntactic information function as key factors independent of tokenization.
摘要：据报道，语言模型（LM）隐式编码字符级信息，尽管在训练期间没有明确提供。然而，这种现象背后的机制在很大程度上仍未被探索。为了揭示这些机制，我们通过比较在受控设置（例如指定预训练数据集或分词器）下训练的 LM 与在标准设置下训练的 LM 来分析模型如何获取字符级知识。我们将影响因素分类为与标记化无关的因素。我们的分析表明，合并规则和拼写约束构成了标记化产生的主要因素，而子字符串和句法信息的语义关联是独立于标记化的关键因素。

Title: PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning

Authors: Jun Rao, Zixiong Yu, Xuebo Liu, Guhan Chen, Jing Li, Jiansheng Wei, Xiaojun Meng, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05370
Pdf URL: https://arxiv.org/pdf/2602.05370
Copy Paste: [[2602.05370]] PACE: Defying the Scaling Hypothesis of Exploration in Iterative Alignment for Mathematical Reasoning(https://arxiv.org/abs/2602.05370)
Keywords: language model
Abstract: Iterative Direct Preference Optimization has emerged as the state-of-the-art paradigm for aligning Large Language Models on reasoning tasks. Standard implementations (DPO-R1) rely on Best-of-N sampling (e.g., $N \ge 8$) to mine golden trajectories from the distribution tail. In this paper, we challenge this scaling hypothesis and reveal a counter-intuitive phenomenon: in mathematical reasoning, aggressive exploration yields diminishing returns and even catastrophic policy collapse. We theoretically demonstrate that scaling $N$ amplifies verifier noise and induces detrimental distribution shifts. To resolve this, we introduce PACE (Proximal Alignment via Corrective Exploration), which replaces brute-force mining with a generation-based corrective strategy. Operating with a minimal budget ($2
摘要：迭代直接偏好优化已成为在推理任务上调整大型语言模型的最先进范例。标准实现 (DPO-R1) 依靠 Best-of-N 采样（例如 $N \ge 8$）从分布尾部挖掘黄金轨迹。在本文中，我们挑战了这种尺度假设，并揭示了一个反直觉的现象：在数学推理中，激进的探索会导致收益递减，甚至导致灾难性的政策崩溃。我们从理论上证明，缩放 $N$ 会放大验证者的噪音并导致有害的分布变化。为了解决这个问题，我们引入了 \textbf{PACE} （通过校正探索进行近端对齐），它用基于生成的校正策略取代了暴力挖掘。 PACE 以最低预算（$2

Title: Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks

Authors: Chaimae Abouzahir, Congbo Ma, Nizar Habash, Farah E. Shamout
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.05374
Pdf URL: https://arxiv.org/pdf/2602.05374
Copy Paste: [[2602.05374]] Cross-Lingual Empirical Evaluation of Large Language Models for Arabic Medical Tasks(https://arxiv.org/abs/2602.05374)
Keywords: language model, llm
Abstract: In recent years, Large Language Models (LLMs) have become widely used in medical applications, such as clinical decision support, medical education, and medical question answering. Yet, these models are often English-centric, limiting their robustness and reliability for linguistically diverse communities. Recent work has highlighted discrepancies in performance in low-resource languages for various medical tasks, but the underlying causes remain poorly understood. In this study, we conduct a cross-lingual empirical analysis of LLM performance on Arabic and English medical question answering. Our findings reveal a persistent language-driven performance gap that intensifies with increasing task complexity. Tokenization analysis exposes structural fragmentation in Arabic medical text, while reliability analysis suggests that model-reported confidence and explanations exhibit limited correlation with correctness. Together, these findings underscore the need for language-aware design and evaluation strategies in LLMs for medical tasks.
摘要：近年来，大型语言模型（LLM）已广泛应用于医学应用，例如临床决策支持、医学教育和医学问答。然而，这些模型通常以英语为中心，限制了它们对于语言多样化社区的稳健性和可靠性。最近的工作强调了各种医疗任务中低资源语言的性能差异，但其根本原因仍然知之甚少。在本研究中，我们对法学硕士在阿拉伯语和英语医学问答方面的表现进行了跨语言实证分析。我们的研究结果表明，语言驱动的绩效差距持续存在，并且随着任务复杂性的增加而加剧。标记化分析揭示了阿拉伯医学文本中的结构碎片，而可靠性分析表明模型报告的置信度和解释与正确性的相关性有限。总之，这些发现强调了法学硕士在医学任务中需要语言感知的设计和评估策略。

Title: IESR:Efficient MCTS-Based Modular Reasoning for Text-to-SQL with Large Language Models

Authors: Tao Liu, Jiafan Lu, Bohan Yu, Pengcheng Wu, Liu Haixin, Guoyu Xu, Li Xiangheng, Lixiao Li, Jiaming Hou, Zhao Shijun, Xinglin Lyu, Kunli Zhang, Yuxiang Jia, Hongyin Zan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05385
Pdf URL: https://arxiv.org/pdf/2602.05385
Copy Paste: [[2602.05385]] IESR:Efficient MCTS-Based Modular Reasoning for Text-to-SQL with Large Language Models(https://arxiv.org/abs/2602.05385)
Keywords: language model, llm
Abstract: Text-to-SQL is a key natural language processing task that maps natural language questions to SQL queries, enabling intuitive interaction with web-based databases. Although current methods perform well on benchmarks like BIRD and Spider, they struggle with complex reasoning, domain knowledge, and hypothetical queries, and remain costly in enterprise deployment. To address these issues, we propose a framework named IESR (Information Enhanced Structured Reasoning) for lightweight large language models that (i) leverages LLMs for key information understanding and schema linking, and decouples mathematical computation from SQL generation, (ii) integrates a multi-path reasoning mechanism based on Monte Carlo Tree Search (MCTS) with majority voting, and (iii) introduces a trajectory consistency verification module with a discriminator model to ensure accuracy and consistency. Experimental results demonstrate that IESR achieves state-of-the-art performance on the complex reasoning benchmark LogicCat (24.28 EX) and the Archer dataset (37.28 EX) using only compact lightweight models without fine-tuning. Furthermore, our analysis reveals that current coder models exhibit notable biases and deficiencies in physical knowledge, mathematical computation, and common-sense reasoning, highlighting important directions for future research. We released code at this https URL.
摘要：文本到 SQL 是一项关键的自然语言处理任务，它将自然语言问题映射到 SQL 查询，从而实现与基于 Web 的数据库的直观交互。尽管当前的方法在 BIRD 和 Spider 等基准测试中表现良好，但它们在复杂的推理、领域知识和假设查询方面表现不佳，并且在企业部署中仍然成本高昂。为了解决这些问题，我们提出了一个名为IESR（信息增强结构化推理）的轻量级大型语言模型框架：（i）利用LLM进行关键信息理解和模式链接，并将数学计算和SQL生成解耦，（ii）集成基于蒙特卡罗树搜索（MCTS）和多数投票的多路径推理机制，（iii）引入带有判别器模型的轨迹一致性验证模块以确保准确性和一致性。实验结果表明，IESR 仅使用紧凑的轻量级模型，无需进行微调，即可在复杂推理基准 LogicCat (24.28 EX) 和 Archer 数据集 (37.28 EX) 上实现最先进的性能。此外，我们的分析表明，当前的编码器模型在物理知识、数学计算和常识推理方面表现出明显的偏差和缺陷，突出了未来研究的重要方向。我们在此 https URL 发布了代码。

Title: Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances

Authors: Jiyun Chun, Eric Fosler-Lussier, Michael White, Andrew Perrault
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.05392
Pdf URL: https://arxiv.org/pdf/2602.05392
Copy Paste: [[2602.05392]] Beyond Length: Context-Aware Expansion and Independence as Developmentally Sensitive Evaluation in Child Utterances(https://arxiv.org/abs/2602.05392)
Keywords: llm
Abstract: Evaluating the quality of children's utterances in adult-child dialogue remains challenging due to insufficient context-sensitive metrics. Common proxies such as Mean Length of Utterance (MLU), lexical diversity (vocd-D), and readability indices (Flesch-Kincaid Grade Level, Gunning Fog Index) are dominated by length and ignore conversational context, missing aspects of response quality such as reasoning depth, topic maintenance, and discourse planning. We introduce an LLM-as-a-judge framework that first classifies the Previous Adult Utterance Type and then scores the child's response along two axes: Expansion (contextual elaboration and inferential depth) and Independence (the child's contribution to advancing the discourse). These axes reflect fundamental dimensions in child language development, where Expansion captures elaboration, clause combining, and causal and contrastive connectives. Independence captures initiative, topic control, decreasing reliance on adult scaffolding through growing self-regulation, and audience design. We establish developmental validity by showing age-related patterns and demonstrate predictive value by improving age estimation over common baselines. We further confirm semantic sensitivity by detecting differences tied to discourse relations. Our metrics align with human judgments, enabling large-scale evaluation. This shifts child utterance assessment from simply measuring length to evaluating how meaningfully the child's speech contributes to and advances the conversation within its context.
摘要：由于上下文相关指标不足，评估成人与儿童对话中儿童话语的质量仍然具有挑战性。平均话语长度（MLU）、词汇多样性（vocd-D）和可读性指数（Flesch-Kincaid Grade Level、Gunning Fog Index）等常见指标以长度为主，忽略对话上下文，缺少响应质量的方面，例如推理深度、主题维护和话语规划。我们引入了一个法学硕士作为法官的框架，该框架首先对以前的成人话语类型进行分类，然后沿着两个轴对孩子的反应进行评分：扩展（上下文阐述和推理深度）和独立性（孩子对推进话语的贡献）。这些轴反映了儿童语言发展的基本维度，其中扩展捕获了阐述、子句组合以及因果和对比连接词。独立性体现了主动性、主题控制、通过增强自我调节和受众设计来减少对成人脚手架的依赖。我们通过显示与年龄相关的模式来建立发育有效性，并通过改进共同基线的年龄估计来证明预测价值。我们通过检测与话语关系相关的差异来进一步确认语义敏感性。我们的指标与人类判断一致，从而实现大规模评估。这将儿童言语评估从简单地测量长度转变为评估儿童的言语在其上下文中对对话的贡献和推进的意义。

Title: Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better

Authors: Ji Zhao, Yufei Gu, Shitong Shao, Xun Zhou, Liang Xiang, Zeke Xie
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.05393
Pdf URL: https://arxiv.org/pdf/2602.05393
Copy Paste: [[2602.05393]] Late-to-Early Training: LET LLMs Learn Earlier, So Faster and Better(https://arxiv.org/abs/2602.05393)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) achieve remarkable empirical success through scaling model and data size, pretraining has become increasingly critical yet computationally prohibitive, hindering rapid development. Despite the availability of numerous pretrained LLMs developed at significant computational expense, a fundamental real-world question remains underexplored: Can we leverage existing small pretrained models to accelerate the training of larger models? In this paper, we propose a Late-to-Early Training (LET) paradigm that enables LLMs to explicitly learn later knowledge in earlier steps and earlier layers. The core idea is to guide the early layers of an LLM during early training using representations from the late layers of a pretrained (i.e. late training phase) model. We identify two key mechanisms that drive LET's effectiveness: late-to-early-step learning and late-to-early-layer learning. These mechanisms significantly accelerate training convergence while robustly enhancing both language modeling capabilities and downstream task performance, enabling faster training with superior performance. Extensive experiments on 1.4B and 7B parameter models demonstrate LET's efficiency and effectiveness. Notably, when training a 1.4B LLM on the Pile dataset, our method achieves up to 1.6$\times$ speedup with nearly 5% improvement in downstream task accuracy compared to standard training, even when using a pretrained model with 10$\times$ fewer parameters than the target model.
摘要：随着大型语言模型 (LLM) 通过扩展模型和数据大小取得了显着的经验成功，预训练变得越来越重要，但计算量却令人望而却步，阻碍了快速发展。尽管有大量以大量计算成本开发的预训练法学硕士，但一个基本的现实问题仍未得到充分探索：\textit{我们能否利用现有的小型预训练模型来加速大型模型的训练？}在本文中，我们提出了一种后期到早期训练（LET）范式，使法学硕士能够在早期步骤和早期层中明确学习后期知识。核心思想是在早期训练期间使用预训练（即后期训练阶段）模型后期层的表示来指导法学硕士的早期层。我们确定了推动 LET 有效性的两个关键机制：晚到早步骤学习和晚到早层学习。这些机制显着加速了训练收敛，同时有力地增强了语言建模能力和下游任务性能，从而实现了更快的训练和卓越的性能。对 1.4B 和 7B 参数模型的大量实验证明了 LET 的效率和有效性。值得注意的是，当在 Pile 数据集上训练 1.4B LLM 时，与标准训练相比，我们的方法实现了高达 1.6$\times$ 的加速，下游任务准确性提高了近 5\%，即使使用参数比目标模型少 10$\times$ 的预训练模型也是如此。

Title: OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Authors: Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05400
Pdf URL: https://arxiv.org/pdf/2602.05400
Copy Paste: [[2602.05400]] OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration(https://arxiv.org/abs/2602.05400)
Keywords: language model, gpt
Abstract: As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
摘要：随着高质量的公共文本接近耗尽（一种称为数据墙的现象），预训练正在从更多令牌转向更好的令牌。然而，现有方法要么依赖于忽略训练动态的启发式静态过滤器，要么使用基于原始梯度的动态但与优化器无关的标准。我们提出了 OPUS（优化器诱导的预计效用选择），这是一种动态数据选择框架，它定义了优化器诱导的更新空间中的效用。 OPUS 通过将现代优化器形成的有效更新投影到源自稳定的分布代理的目标方向来对候选人进行评分。为了确保可扩展性，我们采用 Ghost 技术和 CountSketch 来提高计算效率，并采用玻尔兹曼采样来实现数据多样性，仅产生 4.7% 的额外计算开销。 OPUS 在不同的语料库、质量等级、优化器和模型规模上取得了显着的成果。在使用 30B 代币对 FineWeb 和 FineWeb-Edu 上的 GPT-2 Large/XL 进行预训练时，OPUS 的性能优于工业级基线，甚至优于完整的 200B 代币训练。此外，当与工业级静态滤波器结合使用时，OPUS 进一步提高了预训练效率，即使数据质量较低。此外，在 SciencePedia 上对 Qwen3-8B-Base 的持续预训练中，与使用 3B 令牌的完整训练相比，OPUS 仅使用 0.5B 令牌就实现了卓越的性能，证明了专业领域的数据效率显着提高。
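The OPUS abstract mentions Boltzmann sampling as the mechanism that keeps selected data diverse. A minimal sketch of generic Boltzmann (softmax) sampling over utility scores is given below; the function names, the `temperature` parameter, and sampling with replacement are assumptions made for illustration, and the abstract does not say how OPUS configures this step.

```python
import math
import random

def boltzmann_probs(utilities, temperature=1.0):
    """Softmax over utilities: p_i is proportional to exp(u_i / temperature)."""
    m = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp((u - m) / temperature) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]

def boltzmann_sample(utilities, k, temperature=1.0, seed=0):
    """Draw k candidate indices (with replacement) according to boltzmann_probs."""
    rng = random.Random(seed)
    probs = boltzmann_probs(utilities, temperature)
    return rng.choices(range(len(utilities)), weights=probs, k=k)
```

A low temperature concentrates the draw on the highest-utility candidates, while a high temperature approaches uniform sampling; this is the usual knob for trading utility against diversity.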

Title: Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models

Authors: Basel Mousi, Fahim Dalvi, Shammur Chowdhury, Firoj Alam, Nadir Durrani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05437
Pdf URL: https://arxiv.org/pdf/2602.05437
Copy Paste: [[2602.05437]] Once Correct, Still Wrong: Counterfactual Hallucination in Multilingual Vision-Language Models(https://arxiv.org/abs/2602.05437)
Keywords: language model, hallucination, prompt
Abstract: Vision-language models (VLMs) can achieve high accuracy while still accepting culturally plausible but visually incorrect interpretations. Existing hallucination benchmarks rarely test this failure mode, particularly outside Western contexts and English. We introduce M2CQA, a culturally grounded multimodal benchmark built from images spanning 17 MENA countries, paired with contrastive true and counterfactual statements in English, Arabic, and its dialects. To isolate hallucination beyond raw accuracy, we propose the CounterFactual Hallucination Rate (CFHR), which measures counterfactual acceptance conditioned on correctly answering the true statement. Evaluating state-of-the-art VLMs under multiple prompting strategies, we find that CFHR rises sharply in Arabic, especially in dialects, even when true-statement accuracy remains high. Moreover, reasoning-first prompting consistently increases counterfactual hallucination, while answering before justifying improves robustness. We will make the experimental resources and dataset publicly available for the community.
摘要：视觉语言模型（VLM）可以实现高精度，同时仍然接受文化上合理但视觉上不正确的解释。现有的幻觉基准很少测试这种失败模式，特别是在西方环境和英语之外。我们介绍 M2CQA，这是一个基于文化的多模式基准，根据跨越 17 个 MENA 国家的图像构建，并搭配英语、阿拉伯语及其方言的真实和反事实陈述的对比。为了隔离超出原始准确性的幻觉，我们提出了反事实幻觉率（CFHR），它衡量以正确回答真实陈述为条件的反事实接受度。在多种提示策略下评估最先进的 VLM，我们发现阿拉伯语中的 CFHR 急剧上升，尤其是在方言中，即使真实陈述的准确性仍然很高。此外，推理优先的提示始终会增加反事实幻觉，而在证明合理性之前回答会提高稳健性。我们将向社区公开提供实验资源和数据集。
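The abstract defines the CounterFactual Hallucination Rate (CFHR) only in words: counterfactual acceptance conditioned on correctly answering the true statement. One way to read that definition is sketched below. The record layout (a pair of booleans per image) and the function name are assumptions; the paper's exact formula may differ in details such as aggregation across languages.

```python
def cfhr(records):
    """Counterfactual acceptance rate among items whose true statement was answered correctly.

    records: iterable of (true_correct, counterfactual_accepted) boolean pairs,
             one pair per image/statement pair (hypothetical layout).
    Returns None when no item has a correctly answered true statement.
    """
    eligible = [accepted for true_correct, accepted in records if true_correct]
    if not eligible:
        return None
    return sum(1 for accepted in eligible if accepted) / len(eligible)
```

Because the denominator counts only items the model already answered correctly, a model can score high true-statement accuracy and a high CFHR at the same time, which is the failure mode the benchmark targets.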

Title: Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs

Authors: Yao Zhou, Zeen Song, Wenwen Qiang, Fengge Wu, Shuyi Zhou, Changwen Zheng, Hui Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05444
Pdf URL: https://arxiv.org/pdf/2602.05444
Copy Paste: [[2602.05444]] Causal Front-Door Adjustment for Robust Jailbreak Attacks on LLMs(https://arxiv.org/abs/2602.05444)
Keywords: language model, llm
Abstract: Safety alignment mechanisms in Large Language Models (LLMs) often operate as latent internal states, obscuring the model's inherent capabilities. Building on this observation, we model the safety mechanism as an unobserved confounder from a causal perspective. Then, we propose the Causal Front-Door Adjustment Attack (CFA$^2$) to jailbreak LLMs, which is a framework that leverages Pearl's Front-Door Criterion to sever the confounding associations for robust jailbreaking. Specifically, we employ Sparse Autoencoders (SAEs) to physically strip defense-related features, isolating the core task intent. We further reduce computationally expensive marginalization to a deterministic intervention with low inference complexity. Experiments demonstrate that CFA$^2$ achieves state-of-the-art attack success rates while offering a mechanistic interpretation of the jailbreaking process.
摘要：大型语言模型 (LLM) 中的安全对齐机制通常作为潜在内部状态运行，从而掩盖了模型的固有功能。基于这一观察，我们从因果角度将安全机制建模为未观察到的混杂因素。然后，我们提出 \textbf{C}ausal \textbf{F}ront-Door \textbf{A}djustment \textbf{A}tack ({\textbf{CFA}}$^2$) 来越狱 LLM，这是一个利用 Pearl 的前门准则来切断混杂关联以实现稳健越狱的框架。具体来说，我们采用稀疏自动编码器（SAE）来物理剥离防御相关的功能，隔离核心任务意图。我们进一步将计算成本高昂的边缘化减少为具有低推理复杂性的确定性干预。实验表明，{CFA}$^2$ 实现了最先进的攻击成功率，同时提供了越狱过程的机械解释。
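For reference, the general front-door adjustment formula from Pearl's causal framework, which the abstract invokes, is reproduced below. The symbols are the textbook ones: $X$ is the treatment, $M$ a mediator that satisfies the front-door criterion, and $Y$ the outcome. How the paper maps prompts, SAE features, and model responses onto $X$, $M$, and $Y$ is not stated in the abstract, so no such mapping is assumed here.

```latex
P\bigl(y \mid \mathrm{do}(x)\bigr)
  = \sum_{m} P(m \mid x) \sum_{x'} P(y \mid m, x')\, P(x')
```

The inner sum over $x'$ is the marginalization step; the abstract's "deterministic intervention" presumably replaces it with a cheaper approximation, though the details are not given.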

Title: Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale

Authors: Damon McMillan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.05447
Pdf URL: https://arxiv.org/pdf/2602.05447
Copy Paste: [[2602.05447]] Structured Context Engineering for File-Native Agentic Systems: Evaluating Schema Accuracy, Format Effectiveness, and Multi-File Navigation at Scale(https://arxiv.org/abs/2602.05447)
Keywords: language model, gpt, llm, agent
Abstract: Large Language Model agents increasingly operate external systems through programmatic interfaces, yet practitioners lack empirical guidance on how to structure the context these agents consume. Using SQL generation as a proxy for programmatic agent operations, we present a systematic study of context engineering for structured data, comprising 9,649 experiments across 11 models, 4 formats (YAML, Markdown, JSON, Token-Oriented Object Notation [TOON]), and schemas ranging from 10 to 10,000 tables. Our findings challenge common assumptions. First, architecture choice is model-dependent: file-based context retrieval improves accuracy for frontier-tier models (Claude, GPT, Gemini; +2.7%, p=0.029) but shows mixed results for open source models (aggregate -7.7%, p<0.001), with deficits varying substantially by model. Second, format does not significantly affect aggregate accuracy (chi-squared=2.45, p=0.484), though individual models, particularly open source, exhibit format-specific sensitivities. Third, model capability is the dominant factor, with a 21 percentage point accuracy gap between frontier and open source tiers that dwarfs any format or architecture effect. Fourth, file-native agents scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. Fifth, file size does not predict runtime efficiency: compact formats can consume significantly more tokens at scale due to format-unfamiliar search patterns. These findings provide practitioners with evidence-based guidance for deploying LLM agents on structured systems, demonstrating that architectural decisions should be tailored to model capability rather than assuming universal best practices.
摘要：大型语言模型代理越来越多地通过编程接口操作外部系统，但实践者缺乏关于如何构建这些代理所使用的上下文的经验指导。使用 SQL 生成作为编程代理操作的代理，我们对结构化数据的上下文工程进行了系统研究，包括跨 11 个模型、4 种格式（YAML、Markdown、JSON、面向令牌的对象表示法 [TOON]）的 9,649 个实验以及从 10 到 10,000 个表的模式。我们的发现挑战了常见的假设。首先，架构选择取决于模型：基于文件的上下文检索提高了前沿模型的准确性（Claude、GPT、Gemini；+2.7%，p=0.029），但开源模型的结果好坏参半（总计 -7.7%，p<0.001），不同模型的缺陷差异很大。其次，格式不会显着影响总体准确性（卡方 = 2.45，p = 0.484），尽管个别模型（尤其是开源模型）表现出特定于格式的敏感性。第三，模型能力是主导因素，前沿层和开源层之间的准确度差距达 21 个百分点，这让任何格式或架构效果都相形见绌。第四，文件本机代理通过域分区模式扩展到 10,000 个表，同时保持高导航准确性。第五，文件大小并不能预测运行时效率：由于格式不熟悉的搜索模式，紧凑格式可能会消耗更多的标记。这些发现为从业者提供了在结构化系统上部署 LLM 代理的基于证据的指导，表明架构决策应根据模型功能进行定制，而不是假设通用的最佳实践。

Title: LinguistAgent: A Reflective Multi-Model Platform for Automated Linguistic Annotation

Authors: Bingru Li
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2602.05493
Pdf URL: https://arxiv.org/pdf/2602.05493
Copy Paste: [[2602.05493]] LinguistAgent: A Reflective Multi-Model Platform for Automated Linguistic Annotation(https://arxiv.org/abs/2602.05493)
Keywords: language model, llm, prompt, retrieval-augmented generation, agent
Abstract: Data annotation remains a significant bottleneck in the Humanities and Social Sciences, particularly for complex semantic tasks such as metaphor identification. While Large Language Models (LLMs) show promise, a significant gap remains between the theoretical capability of LLMs and their practical utility for researchers. This paper introduces LinguistAgent, an integrated, user-friendly platform that leverages a reflective multi-model architecture to automate linguistic annotation. The system implements a dual-agent workflow, comprising an Annotator and a Reviewer, to simulate a professional peer-review process. LinguistAgent supports comparative experiments across three paradigms: Prompt Engineering (Zero/Few-shot), Retrieval-Augmented Generation, and Fine-tuning. We demonstrate LinguistAgent's efficacy using the task of metaphor identification as an example, providing real-time token-level evaluation (Precision, Recall, and $F_1$ score) against human gold standards. The application and code are released at this https URL.
摘要：数据注释仍然是人文和社会科学的一个重要瓶颈，特别是对于隐喻识别等复杂的语义任务。虽然大型语言模型 (LLM) 展现出良好的前景，但 LLM 的理论能力与其对研究人员的实际用途之间仍然存在巨大差距。本文介绍了 LinguistAgent，这是一个集成的、用户友好的平台，它利用反射多模型架构来自动进行语言注释。该系统实现了双代理工作流程，包括注释者和审阅者，以模拟专业的同行评审过程。 LinguistAgent 支持跨三种范式的比较实验：即时工程（零/少样本）、检索增强生成和微调。我们以隐喻识别任务为例展示了 LinguistAgent 的功效，根据人类黄金标准提供实时标记级别评估（精确度、召回率和 $F_1$ 分数）。应用程序和代码在此 https URL 上发布。
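Token-level Precision, Recall, and $F_1$ against a gold standard, as mentioned in the LinguistAgent abstract, can be computed as in the sketch below. The binary label encoding (1 for a metaphor token, 0 otherwise) and the function name are assumptions for illustration; the platform's own evaluation code is not described in the abstract.

```python
def token_prf(gold, pred):
    """Token-level precision, recall, and F1 for binary labels (1 = positive token)."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Returning 0.0 when a denominator is empty is one common convention; other tools report these cases as undefined instead.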

Title: Transport and Merge: Cross-Architecture Merging for Large Language Models

Authors: Chenhang Cui, Binyun Yang, Fei Shen, Yuxin Chen, Jingnan Zheng, Xiang Wang, An Zhang, Tat-Seng Chua
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.05495
Pdf URL: https://arxiv.org/pdf/2602.05495
Copy Paste: [[2602.05495]] Transport and Merge: Cross-Architecture Merging for Large Language Models(https://arxiv.org/abs/2602.05495)
Keywords: language model, llm
Abstract: Large language models (LLMs) achieve strong capabilities by scaling model capacity and training data, yet many real-world deployments rely on smaller models trained or adapted from low-resource data. This gap motivates the need for mechanisms to transfer knowledge from large, high-resource models to smaller, low-resource targets. While model merging provides an effective transfer mechanism, most existing approaches assume architecture-compatible models and therefore cannot directly transfer knowledge from large high-resource LLMs to heterogeneous low-resource targets. In this work, we propose a cross-architecture merging framework based on optimal transport (OT) that aligns activations to infer cross-neuron correspondences between heterogeneous models. The resulting transport plans are then used to guide direct weight-space fusion, enabling effective high-resource to low-resource transfer using only a small set of inputs. Extensive experiments across low-resource languages and specialized domains demonstrate consistent improvements over target models.
摘要：大型语言模型 (LLM) 通过扩展模型容量和训练数据来实现强大的功能，但许多现实世界的部署依赖于由低资源数据训练或改编的较小模型。这种差距激发了对将知识从大型、高资源模型转移到较小、低资源目标的机制的需求。虽然模型合并提供了有效的转移机制，但大多数现有方法都假设架构兼容的模型，因此无法直接将知识从大型高资源 LLM 转移到异构低资源目标。在这项工作中，我们提出了一种基于最佳传输（OT）的跨架构合并框架，该框架可以对齐激活以推断异构模型之间的跨神经元对应关系。由此产生的运输计划随后用于指导直接权重空间融合，从而仅使用少量输入即可实现有效的高资源到低资源转移。跨低资源语言和专业领域的广泛实验证明了目标模型的持续改进。

Title: A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering

Authors: Larissa Pusch, Alexandre Courtiol, Tim Conrad
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.05512
Pdf URL: https://arxiv.org/pdf/2602.05512
Copy Paste: [[2602.05512]] A Human-in-the-Loop, LLM-Centered Architecture for Knowledge-Graph Question Answering(https://arxiv.org/abs/2602.05512)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) excel at language understanding but remain limited in knowledge-intensive domains due to hallucinations, outdated information, and limited explainability. Text-based retrieval-augmented generation (RAG) helps ground model outputs in external sources but struggles with multi-hop reasoning. Knowledge Graphs (KGs), in contrast, support precise, explainable querying, yet require a knowledge of query languages. This work introduces an interactive framework in which LLMs generate and explain Cypher graph queries and users iteratively refine them through natural language. Applied to real-world KGs, the framework improves accessibility to complex datasets while preserving factual accuracy and semantic rigor and provides insight into how model performance varies across domains. Our core quantitative evaluation is a 90-query benchmark on a synthetic movie KG that measures query explanation quality and fault detection across multiple LLMs, complemented by two smaller real-life query-generation experiments on a Hyena KG and the MaRDI (Mathematical Research Data Initiative) KG.
摘要：大型语言模型 (LLM) 擅长语言理解，但由于幻觉、过时的信息和有限的可解释性，在知识密集型领域仍然受到限制。基于文本的检索增强生成（RAG）有助于在外部源中基础模型输出，但在多跳推理方面遇到困难。相比之下，知识图 (KG) 支持精确、可解释的查询，但需要查询语言的知识。这项工作引入了一个交互式框架，其中法学硕士生成并解释密码图查询，用户通过自然语言迭代地完善它们。该框架应用于现实世界的知识图谱，提高了对复杂数据集的可访问性，同时保持事实准确性和语义严谨性，并提供了对模型性能如何跨领域变化的洞察。我们的核心定量评估是基于合成电影 KG 的 90 个查询基准，该基准衡量跨多个 LLM 的查询解释质量和故障检测，并辅以 Hyena KG 和 MaRDI（数学研究数据计划）KG 上的两个较小的现实生活查询生成实验。

Title: Multi-Task GRPO: Reliable LLM Reasoning Across Tasks

Authors: Shyam Sundhar Ramesh, Xiaotong Ji, Matthieu Zimmer, Sangwoong Yoon, Zhiyong Wang, Haitham Bou Ammar, Aurelien Lucchi, Ilija Bogunovic
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.05547
Pdf URL: https://arxiv.org/pdf/2602.05547
Copy Paste: [[2602.05547]] Multi-Task GRPO: Reliable LLM Reasoning Across Tasks(https://arxiv.org/abs/2602.05547)
Keywords: language model, llm, prompt
Abstract: RL-based post-training with GRPO is widely used to improve large language models on individual reasoning tasks. However, real-world deployment requires reliable performance across diverse tasks. A straightforward multi-task adaptation of GRPO often leads to imbalanced outcomes, with some tasks dominating optimization while others stagnate. Moreover, tasks can vary widely in how frequently prompts yield zero advantages (and thus zero gradients), which further distorts their effective contribution to the optimization signal. To address these issues, we propose a novel Multi-Task GRPO (MT-GRPO) algorithm that (i) dynamically adapts task weights to explicitly optimize worst-task performance and promote balanced progress across tasks, and (ii) introduces a ratio-preserving sampler to ensure task-wise policy gradients reflect the adapted weights. Experiments on both 3-task and 9-task settings show that MT-GRPO consistently outperforms baselines in worst-task accuracy. In particular, MT-GRPO achieves 16-28% and 6% absolute improvement on worst-task performance over standard GRPO and DAPO, respectively, while maintaining competitive average accuracy. Moreover, MT-GRPO requires 50% fewer training steps to reach 50% worst-task accuracy in the 3-task setting, demonstrating substantially improved efficiency in achieving reliable performance across tasks.
摘要：基于 RL 的 GRPO 后训练被广泛用于改进单个推理任务的大型语言模型。然而，现实世界的部署需要跨不同任务的可靠性能。 GRPO 的直接多任务适应通常会导致不平衡的结果，某些任务主导优化，而另一些任务则停滞不前。此外，任务在提示产生零优势（从而零梯度）的频率方面可能存在很大差异，这进一步扭曲了它们对优化信号的有效贡献。为了解决这些问题，我们提出了一种新颖的多任务GRPO（MT-GRPO）算法，该算法（i）动态调整任务权重以显式优化最差任务性能并促进跨任务的平衡进展，（ii）引入比例保持采样器以确保任务方面的策略梯度反映调整后的权重。 3 任务和 9 任务设置的实验表明，MT-GRPO 在最差任务精度方面始终优于基线。特别是，与标准 GRPO 和 DAPO 相比，MT-GRPO 在最差任务性能方面分别实现了 16-28% 和 6% 的绝对改进，同时保持了有竞争力的平均精度。此外，MT-GRPO 需要减少 50% 的训练步骤才能在 3 任务设置中达到 50% 最差任务的准确性，这表明在跨任务实现可靠性能方面的效率得到了显着提高。
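The remark that some prompts "yield zero advantages (and thus zero gradients)" follows from how group-relative advantages are usually computed in GRPO: each sampled response's reward is normalized by the mean and standard deviation of its group, so a group with identical rewards contributes nothing. The sketch below shows that commonly used formulation and a per-task zero-advantage rate. The function names and the `eps` constant are assumptions, and MT-GRPO's task weighting and ratio-preserving sampler are not reproduced here.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r_i - mean) / (std + eps) over one prompt's samples."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def zero_advantage_rate(groups):
    """Fraction of prompts whose sampled rewards are all identical (no learning signal)."""
    return sum(1 for g in groups if len(set(g)) == 1) / len(groups)
```

A task that is too easy or too hard for the current policy tends to have a high zero-advantage rate, so its share of the gradient shrinks even if its nominal sampling weight is unchanged.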

Title: CASTLE: A Comprehensive Benchmark for Evaluating Student-Tailored Personalized Safety in Large Language Models

Authors: Rui Jia, Ruiyi Lan, Fengrui Liu, Zhongxiang Dai, Bo Jiang, Jing Shao, Jingyuan Chen, Guandong Xu, Fei Wu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05633
Pdf URL: https://arxiv.org/pdf/2602.05633
Copy Paste: [[2602.05633]] CASTLE: A Comprehensive Benchmark for Evaluating Student-Tailored Personalized Safety in Large Language Models(https://arxiv.org/abs/2602.05633)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have advanced the development of personalized learning in education. However, their inherent generation mechanisms often produce homogeneous responses to identical prompts. This one-size-fits-all mechanism overlooks the substantial heterogeneity in students' cognitive and psychological characteristics, thereby posing potential safety risks to vulnerable groups. Existing safety evaluations primarily rely on context-independent metrics such as factual accuracy, bias, or toxicity, which fail to capture the divergent harms that the same response might cause across different student attributes. To address this gap, we propose the concept of Student-Tailored Personalized Safety and construct CASTLE based on educational theories. This benchmark covers 15 educational safety risks and 14 student attributes, comprising 92,908 bilingual scenarios. We further design three evaluation metrics: Risk Sensitivity, measuring the model's ability to detect risks; Emotional Empathy, evaluating the model's capacity to recognize student states; and Student Alignment, assessing the match between model responses and student attributes. Experiments on 18 SOTA LLMs demonstrate that CASTLE poses a significant challenge: all models scored below an average safety rating of 2.3 out of 5, indicating substantial deficiencies in personalized safety assurance.

Title: MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations

Authors: Congbo Ma, Yichun Zhang, Yousef Al-Jazzazi, Ahamed Foisal, Laasya Sharma, Yousra Sadqi, Khaled Saleh, Jihad Mallat, Farah E. Shamout
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05692
Pdf URL: https://arxiv.org/pdf/2602.05692
Copy Paste: [[2602.05692]] MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations(https://arxiv.org/abs/2602.05692)
Keywords: language model, llm
Abstract: Inaccuracies in existing or generated clinical text may lead to serious adverse consequences, especially when they involve a misdiagnosis or an incorrect treatment suggestion. With Large Language Models (LLMs) increasingly being used across diverse healthcare applications, comprehensive evaluation through dedicated benchmarks is crucial. However, such datasets remain scarce, especially across diverse languages and contexts. In this paper, we introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction, developed under the guidance of experienced clinicians. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese, with natural clinical cases annotated and reviewed by domain experts. We assessed the performance of a range of general-purpose, language-specific, and medical-domain language models across all three tasks. Our results reveal notable performance gaps, particularly in non-English settings, highlighting the need for clinically grounded, language-aware systems. By making MedErrBench and our evaluation protocols publicly available, we aim to advance multilingual clinical NLP to promote safer and more equitable AI-based healthcare globally. The dataset is available in the supplementary material. An anonymized version of the dataset is available at: this https URL.

Title: Consensus-Aligned Neuron Efficient Fine-Tuning Large Language Models for Multi-Domain Machine Translation

Authors: Shuting Jiang, Ran Song, Yuxin Huang, Yan Xiang, Yantuan Xian, Shengxiang Gao, Zhengtao Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05694
Pdf URL: https://arxiv.org/pdf/2602.05694
Copy Paste: [[2602.05694]] Consensus-Aligned Neuron Efficient Fine-Tuning Large Language Models for Multi-Domain Machine Translation(https://arxiv.org/abs/2602.05694)
Keywords: language model, llm
Abstract: Multi-domain machine translation (MDMT) aims to build a unified model capable of translating content across diverse domains. Despite the impressive machine translation capabilities demonstrated by large language models (LLMs), domain adaptation remains a challenge for LLMs. Existing MDMT methods such as in-context learning and parameter-efficient fine-tuning often suffer from domain shift, parameter interference and limited generalization. In this work, we propose a neuron-efficient fine-tuning framework for MDMT that identifies and updates consensus-aligned neurons within LLMs. These neurons are selected by maximizing the mutual information between neuron behavior and domain features, enabling LLMs to capture both generalizable translation patterns and domain-specific nuances. Our method then fine-tunes LLMs guided by these neurons, effectively mitigating parameter interference and domain-specific overfitting. Comprehensive experiments on three LLMs across ten German-English and Chinese-English translation domains show that our method consistently outperforms strong PEFT baselines on both seen and unseen domains, achieving state-of-the-art performance.

Title: CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering

Authors: Hao Yang, Zhiyu Yang, Xupeng Zhang, Wei Wei, Yunjie Zhang, Lin Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.05728
Pdf URL: https://arxiv.org/pdf/2602.05728
Copy Paste: [[2602.05728]] CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering(https://arxiv.org/abs/2602.05728)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has become a key paradigm for knowledge-intensive question answering. However, existing multi-hop RAG systems remain inefficient, as they alternate between retrieval and reasoning at each step, resulting in repeated LLM calls, high token consumption, and unstable entity grounding across hops. We propose CompactRAG, a simple yet effective framework that decouples offline corpus restructuring from online reasoning. In the offline stage, an LLM reads the corpus once and converts it into an atomic QA knowledge base, which represents knowledge as minimal, fine-grained question-answer pairs. In the online stage, complex queries are decomposed and carefully rewritten to preserve entity consistency, and are resolved through dense retrieval followed by RoBERTa-based answer extraction. Notably, during inference, the LLM is invoked only twice in total - once for sub-question decomposition and once for final answer synthesis - regardless of the number of reasoning hops. Experiments on HotpotQA, 2WikiMultiHopQA, and MuSiQue demonstrate that CompactRAG achieves competitive accuracy while substantially reducing token consumption compared to iterative RAG baselines, highlighting a cost-efficient and practical approach to multi-hop reasoning over large knowledge corpora. The implementation is available at GitHub.

Title: LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards

Authors: Bowen Ping, Zijun Chen, Yiyao Yu, Tingfeng Hui, Junchi Yan, Baobao Chang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05758
Pdf URL: https://arxiv.org/pdf/2602.05758
Copy Paste: [[2602.05758]] LongR: Unleashing Long-Context Reasoning via Reinforcement Learning with Dense Utility Rewards(https://arxiv.org/abs/2602.05758)
Keywords: llm
Abstract: Reinforcement Learning has emerged as a key driver for LLM reasoning. This capability is equally pivotal in long-context scenarios, such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to effectively guide complex long-context reasoning. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain to quantify the utility of the relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR consistently enhances performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses to investigate the impact of reasoning chain length on efficiency and the model's robustness against distractors.

Title: Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors

Authors: Adnan Al Ali, Jindřich Helcl, Jindřich Libovický
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05769
Pdf URL: https://arxiv.org/pdf/2602.05769
Copy Paste: [[2602.05769]] Different Time, Different Language: Revisiting the Bias Against Non-Native Speakers in GPT Detectors(https://arxiv.org/abs/2602.05769)
Keywords: gpt, llm, chat
Abstract: LLM-based assistants have been widely popularised after the release of ChatGPT. Concerns have been raised about their misuse in academia, given the difficulty of distinguishing between human-written and generated text. To combat this, automated techniques have been developed and shown to be effective, to some extent. However, prior work suggests that these methods often falsely flag essays from non-native speakers as generated, due to their low perplexity extracted from an LLM, which is supposedly a key feature of the detectors. We revisit these statements two years later, specifically in the Czech language setting. We show that the perplexity of texts from non-native speakers of Czech is not lower than that of native speakers. We further examine detectors from three separate families and find no systematic bias against non-native speakers. Finally, we demonstrate that contemporary detectors operate effectively without relying on perplexity.
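Since the abstract turns on perplexity, a short reminder of the quantity may help: perplexity is the exponential of the average negative log-probability a language model assigns to each token. The sketch below assumes the per-token natural-log probabilities have already been obtained from some model; the values and names are hypothetical and not taken from the paper.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-probability (natural log) per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Hypothetical per-token log-probabilities for two texts of 8 tokens each.
predictable = [math.log(0.5)] * 8    # every token assigned probability 0.5
surprising = [math.log(0.05)] * 8    # every token assigned probability 0.05

print(perplexity(predictable))  # approximately 2
print(perplexity(surprising))   # approximately 20
```

A detector that thresholds on this value would treat the more predictable text as more likely to be generated, which is the mechanism behind the bias concern the abstract revisits.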

Title: Reinforcement World Model Learning for LLM-based Agents

Authors: Xiao Yu, Baolin Peng, Ruize Xu, Yelong Shen, Pengcheng He, Suman Nath, Nikhil Singh, Jiangfeng Gao, Zhou Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05842
Pdf URL: https://arxiv.org/pdf/2602.05842
Copy Paste: [[2602.05842]] Reinforcement World Model Learning for LLM-based Agents(https://arxiv.org/abs/2602.05842)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and $\tau^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and $\tau^2$ Bench respectively, while matching the performance of expert-data training.
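The idea of rewarding agreement between a simulated and a realized next state in an embedding space can be sketched as a similarity score. This is only an illustration under stated assumptions: the bag-of-words `embed` function is a toy stand-in for a pre-trained encoder, and using plain cosine similarity as the reward is a guess at the general shape, not the paper's exact reward.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for a pre-trained text encoder: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def gap_reward(simulated_next_state, realized_next_state):
    """Higher when the predicted next state is closer to the observed one."""
    return cosine(embed(simulated_next_state), embed(realized_next_state))

# Hypothetical textual states from an ALFWorld-like environment.
realized = "you open the drawer and see a key"
print(gap_reward("you open the drawer and see a key", realized))  # close to 1
print(gap_reward("the drawer is locked", realized))               # noticeably lower
```

With a real sentence encoder in place of `embed`, a paraphrase of the realized state would also score highly, which is the contrast with token-level next-state prediction drawn in the abstract.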

Title: OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions

Authors: Fangzhi Xu, Hang Yan, Qiushi Sun, Jinyang Wu, Zixian Huang, Muye Huang, Jingyang Gong, Zichen Ding, Kanzhi Cheng, Yian Wang, Xinyu Che, Zeyi Sun, Jian Zhang, Zhangyue Yin, Haoran Luo, Xuanjing Huang, Ben Kao, Jun Liu, Qika Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05843
Pdf URL: https://arxiv.org/pdf/2602.05843
Copy Paste: [[2602.05843]] OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions(https://arxiv.org/abs/2602.05843)
Keywords: language model, llm, agent
Abstract: The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to discover latent transition laws from experience autonomously, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit a deficiency in inductive scenarios, identifying a critical bottleneck in the pursuit of autonomous discovery in complex environments. Our code and data are available at this https URL

Title: RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference

Authors: Siran Liu, Guoxia Wang, Sa Wang, Jinle Zeng, HaoYang Xie, Siyu Lou, JiaBin Yang, DianHai Yu, Haifeng Wang, Chao Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05853
Pdf URL: https://arxiv.org/pdf/2602.05853
Copy Paste: [[2602.05853]] RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference(https://arxiv.org/abs/2602.05853)
Keywords: language model, long context
Abstract: The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head round-robin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$\tau$ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving a 2.4$\times$ speedup at 128K context length and outperforming existing dynamic sparse attention methods.
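One way to picture the round-robin idea is to let each head sample one query per stride, with the offset inside the stride shifted by the head index. The abstract does not give the exact rule, so the `head % stride` offset, the function names, and the sizes below are assumptions made purely for illustration.

```python
def sampled_queries(seq_len, stride, head):
    """Query positions a head samples: one per stride, shifted by the head index."""
    offset = head % stride
    return list(range(offset, seq_len, stride))

def estimation_cost(seq_len, stride):
    """Score entries when sampled queries meet stride-aggregated keys."""
    return (seq_len // stride) ** 2

L, S, H = 32, 4, 8   # hypothetical sequence length, stride, and head count
per_head = [sampled_queries(L, S, h) for h in range(H)]

print(per_head[0][:4])                   # [0, 4, 8, 12]
print(per_head[1][:4])                   # [1, 5, 9, 13]
covered = set().union(*per_head)
print(len(covered) == L)                 # True: the heads jointly see every position
print(L * L // estimation_cost(L, S))    # 16, i.e. S**2 fewer score entries
```

Under this reading, each head scores only L/S queries against L/S aggregated key groups, which matches the stated reduction from $O(L^2)$ to $O(L^2/S^2)$, while the rotation keeps any single query position from being ignored by all heads.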

Title: xList-Hate: A Checklist-Based Framework for Interpretable and Generalizable Hate Speech Detection

Authors: Adrián Girón, Pablo Miralles, Javier Huertas-Tato, Sergio D'Antonio, David Camacho
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.05874
Pdf URL: https://arxiv.org/pdf/2602.05874
Copy Paste: [[2602.05874]] xList-Hate: A Checklist-Based Framework for Interpretable and Generalizable Hate Speech Detection(https://arxiv.org/abs/2602.05874)
Keywords: language model, llm
Abstract: Hate speech detection is commonly framed as a direct binary classification problem despite being a composite concept defined through multiple interacting factors that vary across legal frameworks, platform policies, and annotation guidelines. As a result, supervised models often overfit dataset-specific definitions and exhibit limited robustness under domain shift and annotation noise. We introduce xList-Hate, a diagnostic framework that decomposes hate speech detection into a checklist of explicit, concept-level questions grounded in widely shared normative criteria. Each question is independently answered by a large language model (LLM), producing a binary diagnostic representation that captures hateful content features without directly predicting the final label. These diagnostic signals are then aggregated by a lightweight, fully interpretable decision tree, yielding transparent and auditable predictions. We evaluate it across multiple hate speech benchmarks and model families, comparing it against zero-shot LLM classification and in-domain supervised fine-tuning. While supervised methods typically maximize in-domain performance, xList-Hate consistently improves cross-dataset robustness and relative performance under domain shift. In addition, qualitative analysis of disagreement cases provides evidence that the framework can be less sensitive to certain forms of annotation inconsistency and contextual ambiguity. Crucially, the approach enables fine-grained interpretability through explicit decision paths and factor-level analysis. Our results suggest that reframing hate speech detection as a diagnostic reasoning task, rather than a monolithic classification problem, provides a robust, explainable, and extensible alternative for content moderation.
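The two-stage design (binary checklist answers, then an interpretable tree) can be sketched as follows. Everything here is hypothetical: the three checklist questions are invented for illustration, the answers are supplied by hand rather than by an LLM, and the tree is hand-written, whereas the abstract describes a decision tree fitted on the diagnostic signals.

```python
# Hypothetical checklist questions; the paper's actual checklist is not given in the abstract.
CHECKLIST = (
    "targets_protected_group",
    "expresses_hostility",
    "is_quoted_or_reported",
)

def classify(answers):
    """Walk a tiny hand-written tree; return (label, decision path)."""
    missing = [q for q in CHECKLIST if q not in answers]
    if missing:
        raise KeyError(f"unanswered checklist items: {missing}")
    path = []
    for q in ("targets_protected_group", "expresses_hostility"):
        path.append(f"{q}={answers[q]}")
        if not answers[q]:
            return "not_hate", path
    path.append(f"is_quoted_or_reported={answers['is_quoted_or_reported']}")
    if answers["is_quoted_or_reported"]:
        return "not_hate", path
    return "hate", path

# Answers as they might come back from one LLM call per question.
answers = {
    "targets_protected_group": True,
    "expresses_hostility": True,
    "is_quoted_or_reported": False,
}
label, path = classify(answers)
print(label)
print(" -> ".join(path))
```

The returned path is what makes the prediction auditable: each label can be traced back to the specific checklist answers that produced it.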

Title: EuroLLM-22B: Technical Report

Authors: Miguel Moura Ramos, Duarte M. Alves, Hippolyte Gisserot-Boukhlef, João Alves, Pedro Henrique Martins, Patrick Fernandes, José Pombal, Nuno M. Guerreiro, Ricardo Rei, Nicolas Boizard, Amin Farajian, Mateusz Klimaszewski, José G. C. de Souza, Barry Haddow, François Yvon, Pierre Colombo, Alexandra Birch, André F. T. Martins
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.05879
Pdf URL: https://arxiv.org/pdf/2602.05879
Copy Paste: [[2602.05879]] EuroLLM-22B: Technical Report(https://arxiv.org/abs/2602.05879)
Keywords: language model, llm
Abstract: This report presents EuroLLM-22B, a large language model trained from scratch to support the needs of European citizens by covering all 24 official European Union languages and 11 additional languages. EuroLLM addresses the issue of European languages being underrepresented and underserved in existing open large language models. We provide a comprehensive overview of EuroLLM-22B's development, including tokenizer design, architectural specifications, data filtering, and training procedures. Across a broad set of multilingual benchmarks, EuroLLM-22B demonstrates strong performance in reasoning, instruction following, and translation, achieving results competitive with models of comparable size. To support future research, we release our base and instruction-tuned models, our multilingual web pretraining data and updated EuroBlocks instruction datasets, as well as our pre-training and evaluation codebases.

Title: Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models

Authors: Shuo Nie, Hexuan Deng, Chao Wang, Ruiyu Fang, Xuebo Liu, Shuangyong Song, Yu Li, Min Zhang, Xuelong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05897
Pdf URL: https://arxiv.org/pdf/2602.05897
Copy Paste: [[2602.05897]] Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models(https://arxiv.org/abs/2602.05897)
Keywords: language model, hallucination, chain-of-thought
Abstract: As large language models become smaller and more efficient, small reasoning models (SRMs) are crucial for enabling chain-of-thought (CoT) reasoning in resource-constrained settings. However, they are prone to faithfulness hallucinations, especially in intermediate reasoning steps. Existing mitigation methods based on online reinforcement learning rely on outcome-based rewards or coarse-grained CoT evaluation, which can inadvertently reinforce unfaithful reasoning when the final answer is correct. To address these limitations, we propose Faithfulness-Aware Step-Level Reinforcement Learning (FaithRL), introducing step-level supervision via explicit faithfulness rewards from a process reward model, together with an implicit truncated resampling strategy that generates contrastive signals from faithful prefixes. Experiments across multiple SRMs and Open-Book QA benchmarks demonstrate that FaithRL consistently reduces hallucinations in both the CoT and final answers, leading to more faithful and reliable reasoning. Code is available at this https URL.

Title: Codified Finite-state Machines for Role-playing

Authors: Letian Peng, Yupeng Hou, Kun Zhou, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05905
Pdf URL: https://arxiv.org/pdf/2602.05905
Copy Paste: [[2602.05905]] Codified Finite-state Machines for Role-playing(https://arxiv.org/abs/2602.05905)
Keywords: language model, llm, prompt
Abstract: Modeling latent character states is crucial for consistent and engaging role-playing (RP) with large language models (LLMs). Yet, existing prompting-based approaches mainly capture surface actions, often failing to track the latent states that drive interaction. We revisit finite-state machines (FSMs), long used in game design to model state transitions. While effective in small, well-specified state spaces, traditional hand-crafted, rule-based FSMs struggle to adapt to the open-ended semantic space of RP. To address this, we introduce Codified Finite-State Machines (CFSMs), a framework that automatically codifies textual character profiles into FSMs using LLM-based coding. CFSMs extract key states and transitions directly from the profile, producing interpretable structures that enforce character consistency. To further capture uncertainty and variability, we extend CFSMs into Codified Probabilistic Finite-State Machines (CPFSMs), where transitions are modeled as probability distributions over states. Through both synthetic evaluations and real-world RP scenarios in established artifacts, we demonstrate that CFSM and CPFSM outperform generally applied baselines, verifying effectiveness not only in structured tasks but also in open-ended stochastic state exploration.
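The difference between a plain FSM and its probabilistic extension can be shown with a toy character whose mood is the latent state. This is a hand-written sketch: the states, events, and probabilities are invented, and in the paper's framework such tables would be derived from a character profile by an LLM rather than typed in.

```python
import random

# Deterministic FSM: (state, event) -> next state.
TRANSITIONS = {
    ("calm", "insulted"): "angry",
    ("angry", "apologized_to"): "calm",
    ("calm", "praised"): "happy",
    ("happy", "insulted"): "calm",
}

def step(state, event):
    """Unknown (state, event) pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)

# Probabilistic FSM: (state, event) -> distribution over next states.
PROB_TRANSITIONS = {
    ("calm", "insulted"): {"angry": 0.7, "calm": 0.3},
    ("angry", "apologized_to"): {"calm": 0.6, "angry": 0.4},
}

def prob_step(state, event, rng):
    """Sample the next state; unknown pairs keep the current state."""
    dist = PROB_TRANSITIONS.get((state, event), {state: 1.0})
    states, weights = zip(*dist.items())
    return rng.choices(states, weights=weights, k=1)[0]

state = "calm"
for event in ["insulted", "apologized_to", "praised"]:
    state = step(state, event)
print(state)  # happy

print(prob_step("calm", "insulted", random.Random(0)))  # "angry" or "calm"
```

Keeping the state explicit means the role-play prompt can be conditioned on it at every turn, instead of hoping the model infers the character's mood from the dialogue history.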

Title: KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs

Authors: Jian Chen, Zhuoran Wang, Jiayu Qin, Ming Li, Meng Wang, Changyou Chen, Yin Chen, Qizhen Weng, Yirui Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05929
Pdf URL: https://arxiv.org/pdf/2602.05929
Copy Paste: [[2602.05929]] KV-CoRE: Benchmarking Data-Dependent Low-Rank Compressibility of KV-Caches in LLMs(https://arxiv.org/abs/2602.05929)
Keywords: language model, llm
Abstract: Large language models rely on KV-caches to avoid redundant computation during autoregressive decoding, but as context length grows, reading and writing the cache can quickly saturate GPU memory bandwidth. Recent work has explored KV-cache compression, yet most approaches neglect the data-dependent nature of KV-caches and their variation across layers. We introduce KV-CoRE (KV-cache Compressibility by Rank Evaluation), an SVD-based method for quantifying the data-dependent low-rank compressibility of KV-caches. KV-CoRE computes the optimal low-rank approximation under the Frobenius norm and, being gradient-free and incremental, enables efficient dataset-level, layer-wise evaluation. Using this method, we analyze multiple models and datasets spanning five English domains and sixteen languages, uncovering systematic patterns that link compressibility to model architecture, training data, and language coverage. As part of this analysis, we employ the Normalized Effective Rank as a metric of compressibility and show that it correlates strongly with performance degradation under compression. Our study establishes a principled evaluation framework and the first large-scale benchmark of KV-cache compressibility in LLMs, offering insights for dynamic, data-aware compression and data-centric model development.
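The optimal low-rank approximation under the Frobenius norm is given by the truncated SVD (the Eckart-Young theorem), and its error equals the root of the summed squares of the discarded singular values. The sketch below illustrates this on a synthetic matrix with NumPy. The data are random rather than a real KV-cache, and the definition used for the normalized effective rank (exponential of the entropy of the normalized singular values, divided by their count) is an assumption that may differ from the paper's.

```python
import numpy as np

def low_rank_approx(X, k):
    """Best rank-k approximation of X under the Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k], s

def normalized_effective_rank(s):
    """exp(entropy of normalized singular values) / number of singular values.
    Assumed definition; the paper's exact normalization may differ."""
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()) / len(s))

rng = np.random.default_rng(0)
# Synthetic stand-in for a key cache: 256 tokens x 64 dims, energy mostly in 8 directions.
K = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 64)) + 0.01 * rng.normal(size=(256, 64))

K8, s = low_rank_approx(K, 8)
rel_err = np.linalg.norm(K - K8) / np.linalg.norm(K)
print(rel_err)                        # small: rank 8 captures almost everything
print(normalized_effective_rank(s))   # well below 1 for this compressible matrix
```

A value near 1 would indicate that the singular values are spread evenly and the cache is hard to compress, while a value near 0 indicates that a few directions dominate.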

Title: Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions

Authors: Léo Labat, Etienne Ollion, François Yvon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05932
Pdf URL: https://arxiv.org/pdf/2602.05932
Copy Paste: [[2602.05932]] Polyglots or Multitudes? Multilingual LLM Answers to Value-laden Multiple-Choice Questions(https://arxiv.org/abs/2602.05932)
Keywords: language model, llm, prompt
Abstract: Multiple-Choice Questions (MCQs) are often used to assess knowledge, reasoning abilities, and even values encoded in large language models (LLMs). While the effect of multilingualism has been studied on LLM factual recall, this paper seeks to investigate the less explored question of language-induced variation in value-laden MCQ responses. Are multilingual LLMs consistent in their responses across languages, i.e. behave like theoretical polyglots, or do they answer value-laden MCQs depending on the language of the question, like a multitude of monolingual models expressing different values through a single model? We release a new corpus, the Multilingual European Value Survey (MEVS), which, unlike prior work relying on machine translation or ad hoc prompts, solely comprises human-translated survey questions aligned in 8 European languages. We administer a subset of those questions to over thirty multilingual LLMs of various sizes, manufacturers and alignment-fine-tuning status under comprehensive, controlled prompt variations including answer order, symbol type, and tail character. Our results show that while larger, instruction-tuned models display higher overall consistency, the robustness of their responses varies greatly across questions, with certain MCQs eliciting total agreement within and across models while others leave LLM answers split. Language-specific behavior seems to arise in all consistent, instruction-fine-tuned models, but only on certain questions, warranting a further study of the selective effect of preference fine-tuning.

Title: DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

Authors: Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.05992
Pdf URL: https://arxiv.org/pdf/2602.05992
Copy Paste: [[2602.05992]] DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs(https://arxiv.org/abs/2602.05992)
Keywords: language model, llm
Abstract: Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at this https URL.

Title: A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies

Authors: Panagiotis Kaliosis, Adithya V Ganesan, Oscar N.E. Kjell, Whitney Ringwald, Scott Feltman, Melissa A. Carr, Dimitris Samaras, Camilo Ruggero, Benjamin J. Luft, Roman Kotov, Andrew H. Schwartz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.06015
Pdf URL: https://arxiv.org/pdf/2602.06015
Copy Paste: [[2602.06015]] A Systematic Evaluation of Large Language Models for PTSD Severity Estimation: The Role of Contextual Knowledge and Modeling Strategies(https://arxiv.org/abs/2602.06015)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly being used in a zero-shot fashion to assess mental health conditions, yet we have limited knowledge of which factors affect their accuracy. In this study, we utilize a clinical dataset of natural language narratives and self-reported PTSD severity scores from 1,437 individuals to comprehensively evaluate the performance of 11 state-of-the-art LLMs. To understand the factors affecting accuracy, we systematically varied (i) contextual knowledge such as subscale definitions, distribution summary, and interview questions, and (ii) modeling strategies including zero-shot vs. few-shot prompting, amount of reasoning effort, model sizes, structured subscales vs. direct scalar prediction, output rescaling, and nine ensemble methods. Our findings indicate that (a) LLMs are most accurate when provided with detailed construct definitions and context of the narrative; (b) increased reasoning effort leads to better estimation accuracy; (c) the performance of open-weight models (Llama, DeepSeek) plateaus beyond 70B parameters, while closed-weight models (o3-mini, gpt-5) improve with newer generations; and (d) the best performance is achieved when ensembling a supervised model with the zero-shot LLMs. Taken together, the results suggest that the choice of contextual knowledge and modeling strategies is important for deploying LLMs to accurately assess mental health.

Title: Multi-Token Prediction via Self-Distillation

Authors: John Kirchenbauer, Abhimanyu Hans, Brian Bartoldson, Micah Goldblum, Ashwinee Panda, Tom Goldstein
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.06019
Pdf URL: https://arxiv.org/pdf/2602.06019
Copy Paste: [[2602.06019]] Multi-Token Prediction via Self-Distillation(https://arxiv.org/abs/2602.06019)
Keywords: language model
Abstract: Existing techniques for accelerating language model inference, such as speculative decoding, require training auxiliary speculator models and building and deploying complex inference pipelines. We consider a new approach for converting a pretrained autoregressive language model from a slow single next token prediction model into a fast standalone multi-token prediction model using a simple online distillation objective. The final model retains the exact same implementation as the pretrained initial checkpoint and is deployable without the addition of any auxiliary verifier or other specialized inference code. On GSM8K, our method produces models that can decode more than $3\times$ faster on average at $<5\%$ drop in accuracy relative to single token decoding performance.

Title: Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory

Authors: Haozhen Zhang, Haodong Yue, Tao Feng, Quanyu Long, Jianzhu Bao, Bowen Jin, Weizhi Zhang, Xiao Li, Jiaxuan You, Chengwei Qin, Wenya Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.06025
Pdf URL: https://arxiv.org/pdf/2602.06025
Copy Paste: [[2602.06025]] Learning Query-Aware Budget-Tier Routing for Runtime Agent Memory(https://arxiv.org/abs/2602.06025)
Keywords: language model, llm, agent
Abstract: Memory is increasingly central to Large Language Model (LLM) agents operating beyond a single context window, yet most existing systems rely on offline, query-agnostic memory construction that can be inefficient and may discard query-critical information. Although runtime memory utilization is a natural alternative, prior work often incurs substantial overhead and offers limited explicit control over the performance-cost trade-off. In this work, we present BudgetMem, a runtime agent memory framework for explicit, query-aware performance-cost control. BudgetMem structures memory processing as a set of memory modules, each offered in three budget tiers (i.e., Low/Mid/High). A lightweight router performs budget-tier routing across modules to balance task performance and memory construction cost, which is implemented as a compact neural policy trained with reinforcement learning. Using BudgetMem as a unified testbed, we study three complementary strategies for realizing budget tiers: implementation (method complexity), reasoning (inference behavior), and capacity (module model size). Across LoCoMo, LongMemEval, and HotpotQA, BudgetMem surpasses strong baselines when performance is prioritized (i.e., high-budget setting), and delivers better accuracy-cost frontiers under tighter budgets. Moreover, our analysis disentangles the strengths and weaknesses of different tiering strategies, clarifying when each axis delivers the most favorable trade-offs under varying budget regimes.

Title: DFlash: Block Diffusion for Flash Speculative Decoding

Authors: Jian Chen, Yesheng Liang, Zhijian Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.06036
Pdf URL: https://arxiv.org/pdf/2602.06036
Copy Paste: [[2602.06036]] DFlash: Block Diffusion for Flash Speculative Decoding(https://arxiv.org/abs/2602.06036)
Keywords: language model, llm
Abstract: Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
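The verification step of speculative decoding mentioned above can be sketched in its simplest greedy form. This is a generic toy illustration, not DFlash's implementation: `verify_draft` and `toy_target` are hypothetical names, the "target model" is a stand-in rule rather than an LLM, and the loop checks positions one at a time for readability, whereas a real system would score all draft positions in a single parallel forward pass of the target model.

```python
from typing import Callable, List


def verify_draft(
    prefix: List[int],
    draft: List[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Greedy verification of a drafted block (hypothetical helper).

    Accept draft tokens while they match the target model's greedy choice.
    At the first mismatch, emit the target's token instead and stop. If the
    whole draft is accepted, append one extra token from the target.
    The result equals what greedy decoding with the target alone would
    produce, which is why this scheme is called lossless.
    """
    accepted: List[int] = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)  # correct the first wrong position
            return accepted
        accepted.append(token)
    accepted.append(target_next(prefix + accepted))  # bonus token
    return accepted


def toy_target(sequence: List[int]) -> int:
    """Stand-in for the target model: next token is (last + 1) mod 10."""
    return (sequence[-1] + 1) % 10


# The drafter guessed two tokens correctly, then went wrong.
tokens = verify_draft([1, 2], [3, 4, 9, 9], toy_target)
```

In this sketch, each verification round always yields at least one token, so a poor draft costs a wasted draft pass but never changes the output; a higher acceptance rate means more tokens per target pass.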