2026-02-11

Title: Overview of PAN 2026: Voight-Kampff Generative AI Detection, Text Watermarking, Multi-Author Writing Style Analysis, Generative Plagiarism Detection, and Reasoning Trajectory Detection

Authors: Janek Bevendorff, Maik Fröbe, André Greiner-Petter, Andreas Jakoby, Maximilian Mayerl, Preslav Nakov, Henry Plutz, Martin Potthast, Benno Stein, Minh Ngoc Ta, Yuxia Wang, Eva Zangerle
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09147
Pdf URL: https://arxiv.org/pdf/2602.09147
Copy Paste: [[2602.09147]] Overview of PAN 2026: Voight-Kampff Generative AI Detection, Text Watermarking, Multi-Author Writing Style Analysis, Generative Plagiarism Detection, and Reasoning Trajectory Detection(https://arxiv.org/abs/2602.09147)
Keywords: llm
Abstract: The goal of the PAN workshop is to advance computational stylometry and text forensics via objective and reproducible evaluation. In 2026, we run the following five tasks: (1) Voight-Kampff Generative AI Detection, particularly in mixed and obfuscated authorship scenarios, (2) Text Watermarking, a new task that aims to find new text watermarking schemes and to benchmark the robustness of existing ones, (3) Multi-author Writing Style Analysis, a continued task that aims to find positions of authorship change, (4) Generative Plagiarism Detection, a continued task that targets source retrieval and text alignment between generated text and source documents, and (5) Reasoning Trajectory Detection, a new task that deals with source detection and safety detection of LLM-generated or human-written reasoning trajectories. As in previous years, PAN invites software submissions as easy-to-reproduce Docker containers for most of the tasks. Since PAN 2012, more than 1,100 submissions have been made this way via the TIRA experimentation platform.

Title: Effective Reasoning Chains Reduce Intrinsic Dimensionality

Authors: Archiki Prasad, Mandar Joshi, Kenton Lee, Mohit Bansal, Peter Shaw
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.09276
Pdf URL: https://arxiv.org/pdf/2602.09276
Copy Paste: [[2602.09276]] Effective Reasoning Chains Reduce Intrinsic Dimensionality(https://arxiv.org/abs/2602.09276)
Keywords: language model, chain-of-thought
Abstract: Chain-of-thought (CoT) reasoning and its variants have substantially improved the performance of language models on complex reasoning tasks, yet the precise mechanisms by which different strategies facilitate generalization remain poorly understood. While current explanations often point to increased test-time computation or structural guidance, establishing a consistent, quantifiable link between these factors and generalization remains challenging. In this work, we identify intrinsic dimensionality as a quantitative measure for characterizing the effectiveness of reasoning chains. Intrinsic dimensionality quantifies the minimum number of model dimensions needed to reach a given accuracy threshold on a given task. By keeping the model architecture fixed and varying the task formulation through different reasoning strategies, we demonstrate that effective reasoning strategies consistently reduce the intrinsic dimensionality of the task. Validating this on GSM8K with Gemma-3 1B and 4B, we observe a strong inverse correlation between the intrinsic dimensionality of a reasoning strategy and its generalization performance on both in-distribution and out-of-distribution data. Our findings suggest that effective reasoning chains facilitate learning by better compressing the task using fewer parameters, offering a new quantitative metric for analyzing reasoning processes.
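The definition in the abstract (the minimum number of model dimensions needed to reach a given accuracy threshold) can be illustrated with a small sketch. It assumes accuracies have already been measured at several trainable-subspace sizes. The function name `intrinsic_dimension` and the numbers below are hypothetical and are not taken from the paper.

```python
from typing import Mapping, Optional


def intrinsic_dimension(acc_by_dim: Mapping[int, float], threshold: float) -> Optional[int]:
    """Return the smallest measured dimension whose accuracy reaches the threshold.

    acc_by_dim maps a trainable-subspace size to the accuracy obtained with it.
    Returns None if no measured size reaches the threshold.
    """
    reaching = [dim for dim, acc in acc_by_dim.items() if acc >= threshold]
    return min(reaching) if reaching else None


# Made-up measurements for two hypothetical reasoning strategies.
strategy_a = {100: 0.31, 1_000: 0.52, 10_000: 0.61}
strategy_b = {100: 0.20, 1_000: 0.35, 10_000: 0.55}

dim_a = intrinsic_dimension(strategy_a, threshold=0.5)
dim_b = intrinsic_dimension(strategy_b, threshold=0.5)
```

Under this reading, the strategy with the lower value (here `strategy_a`) is the one that compresses the task into fewer parameters.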

Title: Don't Shoot The Breeze: Topic Continuity Model Using Nonlinear Naive Bayes With Attention

Authors: Shu-Ting Pi, Pradeep Bagavan, Yejia Li, Disha, Qun Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.09312
Pdf URL: https://arxiv.org/pdf/2602.09312
Copy Paste: [[2602.09312]] Don't Shoot The Breeze: Topic Continuity Model Using Nonlinear Naive Bayes With Attention(https://arxiv.org/abs/2602.09312)
Keywords: language model, llm, chat
Abstract: Utilizing Large Language Models (LLMs) as chatbots in diverse business scenarios often presents the challenge of maintaining topic continuity. Abrupt shifts in topics can lead to poor user experiences and inefficient utilization of computational resources. In this paper, we present a topic continuity model aimed at assessing whether a response aligns with the initial conversation topic. Our model is built upon the expansion of the corresponding natural language understanding (NLU) model into quantifiable terms using a Naive Bayes approach. We then introduce an attention mechanism and logarithmic nonlinearity to enhance its capability to capture topic continuity. This approach allows us to convert the NLU model into an interpretable analytical formula. In contrast to many NLU models constrained by token limits, our proposed model can seamlessly handle conversations of any length with linear time complexity. Furthermore, the attention mechanism significantly improves the model's ability to identify topic continuity in complex conversations. According to our experiments, our model consistently outperforms traditional methods, particularly in handling lengthy and intricate conversations. This unique capability offers us an opportunity to ensure the responsible and interpretable use of LLMs.

Title: Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization

Authors: Mykola Khandoga, Rui Yuan, Vinay Kumar Sankarapu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.09331
Pdf URL: https://arxiv.org/pdf/2602.09331
Copy Paste: [[2602.09331]] Beyond Uniform Credit: Causal Credit Assignment for Policy Optimization(https://arxiv.org/abs/2602.09331)
Keywords: language model
Abstract: Policy gradient methods for language model reasoning, such as GRPO and DAPO, assign uniform credit to all generated tokens: the filler phrase "Let me think" receives the same gradient update as the critical calculation "23 + 45 = 68." We propose counterfactual importance weighting: mask reasoning spans, measure the drop in answer probability, and upweight tokens accordingly during policy gradient updates. Our method requires no auxiliary models or external annotation; instead, importance is estimated directly from the policy model's own probability shifts. Experiments on GSM8K across three models spanning the Qwen and Llama families demonstrate consistent improvements over uniform baselines and faster convergence to equivalent accuracy. Inverting the importance signal hurts performance, confirming we capture genuine causal structure rather than noise. Analysis shows the method correctly prioritizes calculation steps over scaffolding text. We view these findings as establishing counterfactual importance weighting as a foundation for further research rather than a complete solution.
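The mask-and-measure idea described in the abstract can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `answer_prob` callable stands in for the policy model's probability of the correct answer, and the weighting rule `1 + alpha * drop / max_drop` is an assumption chosen for simplicity, since the abstract does not give the exact formula.

```python
from typing import Callable, List, Sequence


def counterfactual_weights(
    spans: Sequence[str],
    answer_prob: Callable[[Sequence[str]], float],
    mask_token: str = "[MASK]",
    alpha: float = 1.0,
) -> List[float]:
    """Weight each reasoning span by how much masking it lowers the answer probability."""
    base = answer_prob(spans)
    drops = []
    for i in range(len(spans)):
        masked = list(spans)
        masked[i] = mask_token  # counterfactual: the span is hidden
        drops.append(max(0.0, base - answer_prob(masked)))
    max_drop = max(drops, default=0.0)
    if max_drop == 0.0:
        return [1.0] * len(spans)  # no signal: fall back to uniform credit
    # Assumed rule: every span keeps weight 1, important spans get up to 1 + alpha.
    return [1.0 + alpha * d / max_drop for d in drops]


def toy_answer_prob(spans: Sequence[str]) -> float:
    """Hypothetical stand-in for the policy model: the answer depends on one step."""
    return 0.9 if "23 + 45 = 68" in spans else 0.2


spans = ["Let me think.", "23 + 45 = 68", "So the answer is 68."]
weights = counterfactual_weights(spans, toy_answer_prob)
```

In a policy gradient update, these per-span weights would multiply the token-level loss terms of the corresponding spans, so that the calculation step receives more credit than the filler text.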

Title: FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding

Authors: Siyuan Huang, Ziyu Wang, Chao Pan, Han Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09336
Pdf URL: https://arxiv.org/pdf/2602.09336
Copy Paste: [[2602.09336]] FM SO.P: A Progressive Task Mixture Framework with Automatic Evaluation for Cross-Domain SOP Understanding(https://arxiv.org/abs/2602.09336)
Keywords: language model, agent
Abstract: Standard Operating Procedures (SOPs) are critical for enterprise operations, yet existing language models struggle with SOP understanding and cross-domain generalization. Current methods fail because joint training cannot differentiate between the reasoning capabilities that SOPs require: terminology precision, sequential ordering, and constraint reasoning. We propose FM SO.P, solving these challenges through two novelties. First, we introduce progressive task mixtures that build capabilities by stages across three task types with cumulative data: concept disambiguation for terminology precision, action sequence understanding for procedural correctness, and scenario-aware graph reasoning for conditional logic. Second, we propose an automatic multi-agent evaluation system consisting of three agents that adaptively generate rubrics, stratified test sets, and rubric scoring, adapting to domains (e.g., temporal constraints for DMV, regulatory compliance for banking). Evaluated on SOPBench across seven domains (Bank, DMV, Healthcare, Market, University, Library, Hotel), FM SO.P achieves a 48.3% pass rate with our 32B model and 34.3% with our open-source 7B model, matching the Qwen-2.5-72B-Instruct baseline (34.4%) with 10x fewer parameters.

Title: Understanding Risk and Dependency in AI Chatbot Use from User Discourse

Authors: Jianfeng Zhu, Karin G. Coifman, Ruoming Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09339
Pdf URL: https://arxiv.org/pdf/2602.09339
Copy Paste: [[2602.09339]] Understanding Risk and Dependency in AI Chatbot Use from User Discourse(https://arxiv.org/abs/2602.09339)
Keywords: llm, chat, agent
Abstract: Generative AI systems are increasingly embedded in everyday life, yet empirical understanding of how psychological risk associated with AI use emerges, is experienced, and is regulated by users remains limited. We present a large-scale computational thematic analysis of posts collected between 2023 and 2025 from two Reddit communities, r/AIDangers and r/ChatbotAddiction, explicitly focused on AI-related harm and distress. Using a multi-agent, LLM-assisted thematic analysis grounded in Braun and Clarke's reflexive framework, we identify 14 recurring thematic categories and synthesize them into five higher-order experiential dimensions. To further characterize affective patterns, we apply emotion labeling using a BERT-based classifier and visualize emotional profiles across dimensions. Our findings reveal five empirically derived experiential dimensions of AI-related psychological risk grounded in real-world user discourse, with self-regulation difficulties emerging as the most prevalent and fear concentrated in concerns related to autonomy, control, and technical risk. These results provide early empirical evidence from lived user experience of how AI safety is perceived and emotionally experienced outside laboratory or speculative contexts, offering a foundation for future AI safety research, evaluation, and responsible governance.

Title: Digital Linguistic Bias in Spanish: Evidence from Lexical Variation in LLMs

Authors: Yoshifumi Kawasaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09346
Pdf URL: https://arxiv.org/pdf/2602.09346
Copy Paste: [[2602.09346]] Digital Linguistic Bias in Spanish: Evidence from Lexical Variation in LLMs(https://arxiv.org/abs/2602.09346)
Keywords: language model, llm
Abstract: This study examines the extent to which Large Language Models (LLMs) capture geographic lexical variation in Spanish, a language that exhibits substantial regional variation. Treating LLMs as virtual informants, we probe their dialectal knowledge using two survey-style question formats: Yes-No questions and multiple-choice questions. To this end, we exploited a large-scale, expert-curated database of Spanish lexical variation. Our evaluation covers more than 900 lexical items across 21 Spanish-speaking countries and is conducted at both the country and dialectal area levels. Across both evaluation formats, the results reveal systematic differences in how LLMs represent Spanish language varieties. Lexical variation associated with Spain, Equatorial Guinea, Mexico & Central America, and the La Plata River is recognized more accurately by the models, while the Chilean variety proves particularly difficult for the models to distinguish. Importantly, differences in the volume of country-level digital resources do not account for these performance patterns, suggesting that factors beyond data quantity shape dialectal representation in LLMs. By providing a fine-grained, large-scale evaluation of geographic lexical variation, this work advances empirical understanding of dialectal knowledge in LLMs and contributes new evidence to discussions of Digital Linguistic Bias in Spanish.

Title: AgentSkiller: Scaling Generalist Agent Intelligence through Semantically Integrated Cross-Domain Data Synthesis

Authors: Zexu Sun, Bokai Ji, Hengyi Cai, Shuaiqiang Wang, Lei Wang, Guangxia Li, Xu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09372
Pdf URL: https://arxiv.org/pdf/2602.09372
Copy Paste: [[2602.09372]] AgentSkiller: Scaling Generalist Agent Intelligence through Semantically Integrated Cross-Domain Data Synthesis(https://arxiv.org/abs/2602.09372)
Keywords: language model, agent
Abstract: Large Language Model agents demonstrate potential in solving real-world problems via tools, yet generalist intelligence is bottlenecked by scarce high-quality, long-horizon data. Existing methods collect privacy-constrained API logs or generate scripted interactions lacking diversity, which struggle to produce data requisite for scaling capabilities. We propose AgentSkiller, a fully automated framework synthesizing multi-turn interaction data across realistic, semantically linked domains. It employs a DAG-based architecture with explicit state transitions to ensure determinism and recoverability. The pipeline builds a domain ontology and Person-Centric Entity Graph, defines tool interfaces via Service Blueprints for Model Context Protocol servers, and populates environments with consistent databases and strict Domain Policies. A cross-domain fusion mechanism links services to simulate complex tasks. Finally, the pipeline creates user tasks by verifying solution paths, filtering via execution-based validation, and generating queries using a Persona-based Simulator for automated rollout. This produces reliable environments with clear state changes. To demonstrate effectiveness, we synthesized approximately 11K interaction samples; experimental results indicate that models trained on this dataset achieve significant improvements on function calling over baselines, particularly in larger parameter regimes.

Title: BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation

Authors: Peng Lai, Zhihao Ou, Yong Wang, Longyue Wang, Jian Yang, Yun Chen, Guanhua Chen
Subjects: cs.CL, cs.AI, cs.SE
Abstract URL: https://arxiv.org/abs/2602.09383
Pdf URL: https://arxiv.org/pdf/2602.09383
Copy Paste: [[2602.09383]] BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation(https://arxiv.org/abs/2602.09383)
Keywords: llm
Abstract: LLM-as-a-Judge has been widely adopted across various research and practical applications, yet the robustness and reliability of its evaluation remain a critical issue. A core challenge it faces is bias, which has primarily been studied in terms of known biases and their impact on evaluation outcomes, while automated and systematic exploration of potential unknown biases is still lacking. Nevertheless, such exploration is crucial for enhancing the robustness and reliability of evaluations. To bridge this gap, we propose BiasScope, an LLM-driven framework for automatically discovering, at scale, potential biases that may arise during model evaluation. BiasScope can uncover potential biases across different model families and scales, with its generality and effectiveness validated on the JudgeBench dataset. It overcomes the limitations of existing approaches, transforming bias discovery from a passive process relying on manual effort and predefined bias lists into an active and comprehensive automated exploration. Moreover, based on BiasScope, we propose JudgeBench-Pro, an extended version of JudgeBench and a more challenging benchmark for evaluating the robustness of LLM-as-a-judge. Strikingly, even powerful LLMs as evaluators show error rates above 50% on JudgeBench-Pro, underscoring the urgent need to strengthen evaluation robustness and to mitigate potential biases further.

Title: Contractual Deepfakes: Can Large Language Models Generate Contracts?

Authors: Eliza Mik
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09384
Pdf URL: https://arxiv.org/pdf/2602.09384
Copy Paste: [[2602.09384]] Contractual Deepfakes: Can Large Language Models Generate Contracts?(https://arxiv.org/abs/2602.09384)
Keywords: language model, llm
Abstract: Notwithstanding their unprecedented ability to generate text, LLMs do not understand the meaning of words, have no sense of context and cannot reason. Their output constitutes an approximation of statistically dominant word patterns. And yet, the drafting of contracts is often presented as a typical legal task that could be facilitated by this technology. This paper seeks to put an end to such unreasonable ideas. Predicting words differs from using language in the circumstances of specific transactions and reconstituting common contractual phrases differs from reasoning about the law. LLMs seem to be able to generate generic and superficially plausible contractual documents. In the cold light of day, such documents may turn out to be useless assemblages of inconsistent provisions or contracts that are enforceable but unsuitable for a given transaction. This paper casts a shadow on the simplistic assumption that LLMs threaten the continued viability of the legal industry.

Title: Effective vocabulary expanding of multilingual language models for extremely low-resource languages

Authors: Jianyu Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09388
Pdf URL: https://arxiv.org/pdf/2602.09388
Copy Paste: [[2602.09388]] Effective vocabulary expanding of multilingual language models for extremely low-resource languages(https://arxiv.org/abs/2602.09388)
Keywords: language model
Abstract: Multilingual pre-trained language models (mPLMs) offer significant benefits for many low-resource languages. To further expand the range of languages these models can support, many works focus on continued pre-training of these models. However, few works address how to extend mPLMs to low-resource languages that were previously unsupported. To tackle this issue, we expand the model's vocabulary using a target language corpus. We then screen out a subset from the model's original vocabulary, which is biased towards representing the source language (e.g., English), and utilize bilingual dictionaries to initialize the representations of the expanded vocabulary. Subsequently, we continue to pre-train the mPLMs using the target language corpus, based on the representations of this expanded vocabulary. Experimental results show that our proposed method outperforms the baseline, which uses randomly initialized expanded vocabulary for continued pre-training, in POS tagging and NER tasks, achieving improvements of 0.54% and 2.60%, respectively. Furthermore, our method demonstrates high robustness in selecting the training corpora, and the models' performance on the source language does not degrade after continued pre-training.
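One plausible way to initialize expanded-vocabulary representations from a bilingual dictionary is sketched below. The abstract does not say how the dictionary entries are combined, so both the averaging of source-language embeddings and the fallback to the mean of all source embeddings are assumptions of this sketch. The name `init_expanded_embeddings` and the toy data are hypothetical.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def mean_vector(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def init_expanded_embeddings(
    new_tokens: Sequence[str],
    bilingual_dict: Dict[str, List[str]],
    source_emb: Dict[str, Vector],
) -> Dict[str, Vector]:
    """Initialize each new target-language token from its dictionary translations."""
    fallback = mean_vector(list(source_emb.values()))
    initialized = {}
    for token in new_tokens:
        # Keep only translations that the original vocabulary actually contains.
        known = [source_emb[w] for w in bilingual_dict.get(token, []) if w in source_emb]
        initialized[token] = mean_vector(known) if known else list(fallback)
    return initialized


# Toy 2-dimensional embeddings for three source-language tokens.
source_emb = {"dog": [1.0, 0.0], "cat": [0.0, 1.0], "kitty": [0.0, 3.0]}
bilingual_dict = {"perro": ["dog"], "gato": ["cat", "kitty"]}
new_emb = init_expanded_embeddings(["perro", "gato", "xyz"], bilingual_dict, source_emb)
```

The baseline mentioned in the abstract would replace this step with random initialization before continued pre-training.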

Title: Are Language Models Sensitive to Morally Irrelevant Distractors?

Authors: Andrew Shaw, Christina Hahn, Catherine Rasgaitis, Yash Mishra, Alisa Liu, Natasha Jaques, Yulia Tsvetkov, Amy X. Zhang
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2602.09416
Pdf URL: https://arxiv.org/pdf/2602.09416
Copy Paste: [[2602.09416]] Are Language Models Sensitive to Morally Irrelevant Distractors?(https://arxiv.org/abs/2602.09416)
Keywords: language model, llm, prompt
Abstract: With the rapid development and uptake of large language models (LLMs) across high-stakes settings, it is increasingly important to ensure that LLMs behave in ways that align with human values. Existing moral benchmarks prompt LLMs with value statements, moral scenarios, or psychological questionnaires, with the implicit underlying assumption that LLMs report somewhat stable moral preferences. However, moral psychology research has shown that human moral judgements are sensitive to morally irrelevant situational factors, such as smelling cinnamon rolls or the level of ambient noise, thereby challenging moral theories that assume the stability of human moral judgements. Here, we draw inspiration from this "situationist" view of moral psychology to evaluate whether LLMs exhibit similar cognitive moral biases to humans. We curate a novel multimodal dataset of 60 "moral distractors" from existing psychological datasets of emotionally-valenced images and narratives which have no moral relevance to the situation presented. After injecting these distractors into existing moral benchmarks to measure their effects on LLM responses, we find that moral distractors can shift the moral judgements of LLMs by over 30% even in low-ambiguity scenarios, highlighting the need for more contextual moral evaluations and more nuanced cognitive moral modeling of LLMs.

Title: Breaking the Pre-Sampling Barrier: Activation-Informed Difficulty-Aware Self-Consistency

Authors: Taewoong Yoon, Geunyeong Jeong, Geon Park, Sihyeong Yeom, Harksoo Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09438
Pdf URL: https://arxiv.org/pdf/2602.09438
Copy Paste: [[2602.09438]] Breaking the Pre-Sampling Barrier: Activation-Informed Difficulty-Aware Self-Consistency(https://arxiv.org/abs/2602.09438)
Keywords: language model, llm, chain-of-thought
Abstract: Self-Consistency (SC) is an effective decoding strategy that improves the reasoning performance of Large Language Models (LLMs) by generating multiple chain-of-thought reasoning paths and selecting the final answer via majority voting. However, it suffers from substantial inference costs because it requires a large number of samples. To mitigate this issue, Difficulty-Adaptive Self-Consistency (DSC) was proposed to reduce unnecessary token usage for easy problems by adjusting the number of samples according to problem difficulty. However, DSC requires additional model calls and pre-sampling to estimate difficulty, and this process is repeated when applying to each dataset, leading to significant computational overhead. In this work, we propose Activation-Informed Difficulty-Aware Self-Consistency (ACTSC) to address these limitations. ACTSC leverages internal difficulty signals reflected in the feed-forward network neuron activations to construct a lightweight difficulty estimation probe, without any additional token generation or model calls. The probe dynamically adjusts the number of samples for SC and can be applied to new datasets without requiring pre-sampling for difficulty estimation. To validate its effectiveness, we conduct experiments on five benchmarks. Experimental results show that ACTSC effectively reduces inference costs while maintaining accuracy relative to existing methods.
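Self-consistency with a difficulty-dependent sample budget can be sketched as below. The majority vote follows the description in the abstract. The linear mapping from a difficulty score in [0, 1] to a sample count is an assumption of this sketch, and the difficulty score itself is taken as given (in the paper it would come from a probe over feed-forward activations, which is not reproduced here).

```python
from collections import Counter
from typing import Callable


def samples_for_difficulty(difficulty: float, min_samples: int = 1, max_samples: int = 40) -> int:
    """Map a difficulty score in [0, 1] to a sample budget (assumed linear rule)."""
    clipped = min(1.0, max(0.0, difficulty))
    return min_samples + round(clipped * (max_samples - min_samples))


def self_consistency(sample_answer: Callable[[], str], n_samples: int) -> str:
    """Draw n_samples final answers and return the most frequent one."""
    answers = [sample_answer() for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]


# Canned answers stand in for sampled chain-of-thought completions.
canned = iter(["68", "67", "68"])
voted = self_consistency(lambda: next(canned), n_samples=3)

easy_budget = samples_for_difficulty(0.0)
hard_budget = samples_for_difficulty(1.0)
```

Easy problems then consume a single sample while hard ones receive the full budget, which is where the inference savings would come from.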

Title: Evaluating Social Bias in RAG Systems: When External Context Helps and Reasoning Hurts

Authors: Shweta Parihar, Lu Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09442
Pdf URL: https://arxiv.org/pdf/2602.09442
Copy Paste: [[2602.09442]] Evaluating Social Bias in RAG Systems: When External Context Helps and Reasoning Hurts(https://arxiv.org/abs/2602.09442)
Keywords: language model, llm, prompt, retrieval-augmented generation, chain-of-thought
Abstract: Social biases inherent in large language models (LLMs) raise significant fairness concerns. Retrieval-Augmented Generation (RAG) architectures, which retrieve external knowledge sources to enhance the generative capabilities of LLMs, remain susceptible to the same bias-related challenges. This work focuses on evaluating and understanding the social bias implications of RAG. Through extensive experiments across various retrieval corpora, LLMs, and bias evaluation datasets, encompassing more than 13 different bias types, we surprisingly observe a reduction in bias in RAG. This suggests that the inclusion of external context can help counteract stereotype-driven predictions, potentially improving fairness by diversifying the contextual grounding of the model's outputs. To better understand this phenomenon, we then explore the model's reasoning process by integrating Chain-of-Thought (CoT) prompting into RAG while assessing the faithfulness of the model's CoT. Our experiments reveal that the model's bias inclinations shift between stereotype and anti-stereotype responses as more contextual information is incorporated from the retrieved documents. Interestingly, we find that while CoT enhances accuracy, contrary to the bias reduction observed with RAG, it increases overall bias across datasets, highlighting the need for bias-aware reasoning frameworks that can mitigate this trade-off.

Title: Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality

Authors: Takumi Ohashi, Hitoshi Iyatomi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09444
Pdf URL: https://arxiv.org/pdf/2602.09444
Copy Paste: [[2602.09444]] Conceptual Cultural Index: A Metric for Cultural Specificity via Relative Generality(https://arxiv.org/abs/2602.09444)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed in multicultural settings; however, systematic evaluation of cultural specificity at the sentence level remains underexplored. We propose the Conceptual Cultural Index (CCI), which estimates cultural specificity at the sentence level. CCI is defined as the difference between the generality estimate within the target culture and the average generality estimate across other cultures. This formulation enables users to operationally control the scope of culture via comparison settings and provides interpretability, since the score derives from the underlying generality estimates. We validate CCI on 400 sentences (200 culture-specific and 200 general), and the resulting score distribution exhibits the anticipated pattern: higher for culture-specific sentences and lower for general ones. For binary separability, CCI outperforms direct LLM scoring, yielding more than a 10-point improvement in AUC for models specialized to the target culture. Our code is available at this https URL.
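The CCI definition in the abstract (the target-culture generality estimate minus the average estimate across the other cultures) translates directly into a few lines of code. How the per-culture generality estimates are obtained is not covered here, so the scores and culture labels below are made-up placeholders.

```python
from statistics import mean
from typing import Mapping


def conceptual_cultural_index(generality: Mapping[str, float], target: str) -> float:
    """CCI = generality in the target culture - mean generality in the other cultures."""
    others = [score for culture, score in generality.items() if culture != target]
    if not others:
        raise ValueError("at least one comparison culture is required")
    return generality[target] - mean(others)


# Hypothetical generality estimates for one sentence in three cultures.
scores = {"JP": 0.9, "US": 0.2, "FR": 0.4}
cci = conceptual_cultural_index(scores, target="JP")
```

Changing the set of comparison cultures in `scores` changes the result, which corresponds to the abstract's point that the scope of culture is controlled through the comparison setting.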

Title: NOWJ @BioCreative IX ToxHabits: An Ensemble Deep Learning Approach for Detecting Substance Use and Contextual Information in Clinical Texts

Authors: Huu-Huy-Hoang Tran, Gia-Bao Duong, Quoc-Viet-Anh Tran, Thi-Hai-Yen Vuong, Hoang-Quynh Le
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09469
Pdf URL: https://arxiv.org/pdf/2602.09469
Copy Paste: [[2602.09469]] NOWJ @BioCreative IX ToxHabits: An Ensemble Deep Learning Approach for Detecting Substance Use and Contextual Information in Clinical Texts(https://arxiv.org/abs/2602.09469)
Keywords: language model
Abstract: Extracting drug use information from unstructured Electronic Health Records remains a major challenge in clinical Natural Language Processing. While Large Language Models demonstrate advancements, their use in clinical NLP is limited by concerns over trust, control, and efficiency. To address this, we present the NOWJ submission to the ToxHabits Shared Task at BioCreative IX. This task targets the detection of toxic substance use and contextual attributes in Spanish clinical texts, a domain-specific, low-resource setting. We propose a multi-output ensemble system tackling both Subtask 1 - ToxNER and Subtask 2 - ToxUse. Our system integrates BETO with a CRF layer for sequence labeling, employs diverse training strategies, and uses sentence filtering to boost precision. Our top run achieved 0.94 F1 and 0.97 precision for Trigger Detection, and 0.91 F1 for Argument Detection.

Title: Listen to the Layers: Mitigating Hallucinations with Inter-Layer Disagreement

Authors: Koduvayur Subbalakshmi, Sabbir Hossain Ujjal, Venkata Krishna Teja Mangichetty, Nastaran Jamalipour Soofi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09486
Pdf URL: https://arxiv.org/pdf/2602.09486
Copy Paste: [[2602.09486]] Listen to the Layers: Mitigating Hallucinations with Inter-Layer Disagreement(https://arxiv.org/abs/2602.09486)
Keywords: language model, llm, hallucination
Abstract: Pretrained Large Language Models (LLMs) are prone to generating fluent yet factually incorrect text, a phenomenon known as hallucination, which undermines their reliability and utility in downstream tasks. We hypothesize that a generated text span's factuality is correlated with its representational instability across the model's internal layers. Based on this, we propose the CoCoA (Confusion and Consistency Aware) decoder, a novel, training-free decoding algorithm that mitigates hallucinations at inference time by listening to these signals in the middle layers. We propose two metrics to quantify this instability in the middle layers, and use them to penalize outputs that exhibit high internal confusion, thereby steering the model towards more internally consistent and factually grounded outputs. We further propose a self-information gated variant, CoCoA-SIG, that dynamically modulates this penalty to selectively target high-surprise, unstable generations. Extensive experiments on diverse tasks, including question-answering, summarization, and code generation, demonstrate that CoCoA significantly improves factual correctness across multiple model families (e.g., Llama-3, Qwen-2.5, Mistral). By leveraging model-intrinsic signals, CoCoA offers an effective and broadly applicable method for enhancing the trustworthiness of LLMs at inference time, without requiring any model retraining.

Title: Where-to-Unmask: Ground-Truth-Guided Unmasking Order Learning for Masked Diffusion Language Models

Authors: Hikaru Asano, Tadashi Kozuno, Kuniaki Saito, Yukino Baba
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09501
Pdf URL: https://arxiv.org/pdf/2602.09501
Copy Paste: [[2602.09501]] Where-to-Unmask: Ground-Truth-Guided Unmasking Order Learning for Masked Diffusion Language Models(https://arxiv.org/abs/2602.09501)
Keywords: language model
Abstract: Masked Diffusion Language Models (MDLMs) generate text by iteratively filling masked tokens, requiring two coupled decisions at each step: which positions to unmask (where-to-unmask) and which tokens to place (what-to-unmask). While standard MDLM training directly optimizes token prediction (what-to-unmask), inference-time unmasking orders (where-to-unmask) are typically determined by heuristic confidence measures or trained through reinforcement learning with costly on-policy rollouts. To address this, we introduce Gt-Margin, a position-wise score derived from ground-truth tokens, defined as the probability margin between the correct token and its strongest alternative. Gt-Margin yields an oracle unmasking order that prioritizes easier positions first under each partially masked state. We demonstrate that leveraging this oracle unmasking order significantly enhances final generation quality, particularly on logical reasoning benchmarks. Building on this insight, we train a supervised unmasking planner via learning-to-rank to imitate the oracle ordering from masked contexts. The resulting planner integrates into standard MDLM sampling to select where-to-unmask, improving reasoning accuracy without modifying the token prediction model.
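The Gt-Margin score described in the abstract above (probability of the correct token minus the probability of its strongest alternative) can be illustrated with a small sketch. The function names, the toy probability vectors, and the reading of "easier" as "larger margin" are assumptions made for this example. The sketch ranks positions for a single masked state, whereas the abstract describes an order defined under each partially masked state.

```python
def gt_margin(probs: list[float], gold: int) -> float:
    """Probability of the ground-truth token minus the strongest alternative."""
    best_other = max(p for i, p in enumerate(probs) if i != gold)
    return probs[gold] - best_other


def oracle_order(position_probs: dict[int, list[float]],
                 gold_tokens: dict[int, int]) -> list[int]:
    """Masked positions sorted from largest margin (assumed easiest) to smallest."""
    return sorted(
        position_probs,
        key=lambda pos: gt_margin(position_probs[pos], gold_tokens[pos]),
        reverse=True,
    )


# Toy predicted distributions over a 3-token vocabulary at three masked positions.
position_probs = {0: [0.5, 0.25, 0.25], 3: [0.125, 0.75, 0.125], 5: [0.25, 0.5, 0.25]}
gold_tokens = {0: 0, 3: 1, 5: 0}  # hypothetical ground-truth token ids
print(oracle_order(position_probs, gold_tokens))  # [3, 0, 5]
```

Position 5 has a negative margin because the model currently prefers a wrong token there, so it is unmasked last in this toy ordering.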

Title: EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies

Authors: Xavier Hu, Jinxiang Xia, Shengze Xu, Kangqi Song, Yishuo Yuan, Guibin Zhang, Jincheng Ren, Boyu Feng, Li Lu, Tieyong Zeng, Jiaheng Liu, Minghao Liu, Yuchen Elenor Jiang, Wei Wang, He Zhu, Wangchunshu Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09514
Pdf URL: https://arxiv.org/pdf/2602.09514
Copy Paste: [[2602.09514]] EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies(https://arxiv.org/abs/2602.09514)
Keywords: llm, agent
Abstract: Long-horizon planning is widely recognized as a core capability of autonomous LLM-based agents; however, current evaluation frameworks suffer from being largely episodic, domain-specific, or insufficiently grounded in persistent economic dynamics. We introduce EcoGym, a generalizable benchmark for continuous plan-and-execute decision making in interactive economies. EcoGym comprises three diverse environments: Vending, Freelance, and Operation, implemented in a unified decision-making process with standardized interfaces and budgeted actions over an effectively unbounded horizon (1000+ steps when 365 day-loops are used for evaluation). The evaluation of EcoGym is based on business-relevant outcomes (e.g., net worth, income, and DAU), targeting long-term strategic coherence and robustness under partial observability and stochasticity. Experiments across eleven leading LLMs expose a systematic tension: no single model dominates across all three scenarios. Critically, we find that models exhibit significant suboptimality in either high-level strategies or efficient action execution. EcoGym is released as an open, extensible testbed for transparent long-horizon agent evaluation and for studying controllability-utility trade-offs in realistic economic settings.

Title: Knowledge Integration Decay in Search-Augmented Reasoning of Large Language Models

Authors: Sangwon Yu, Ik-hwan Kim, Donghun Kang, Bongkyu Hwang, Junhwa Choi, Suk-hoon Jung, Seungki Hong, Taehee Lee, Sungroh Yoon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09517
Pdf URL: https://arxiv.org/pdf/2602.09517
Copy Paste: [[2602.09517]] Knowledge Integration Decay in Search-Augmented Reasoning of Large Language Models(https://arxiv.org/abs/2602.09517)
Keywords: language model, llm, agent
Abstract: Modern Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks by employing search-augmented reasoning to incorporate external knowledge into long chains of thought. However, we identify a critical yet underexplored bottleneck in this paradigm, termed Knowledge Integration Decay (KID). Specifically, we observe that as the length of reasoning generated before search grows, models increasingly fail to integrate retrieved evidence into subsequent reasoning steps, limiting performance even when relevant information is available. To address this, we propose Self-Anchored Knowledge Encoding (SAKE), a training-free inference-time strategy designed to stabilize knowledge utilization. By anchoring retrieved knowledge at both the beginning and end of the reasoning process, SAKE prevents it from being overshadowed by prior context, thereby preserving its semantic integrity. Extensive experiments on multi-hop QA and complex reasoning benchmarks demonstrate that SAKE significantly mitigates KID and improves performance, offering a lightweight yet effective solution for knowledge integration in agentic LLMs.

Title: UniARM: Towards a Unified Autoregressive Reward Model for Multi-Objective Test-Time Alignment

Authors: Hongyan Xie, Yikun Ban, Ruiyu Fang, Zixuan Huang, Deqing Wang, Jianxin Li, Yitong Yao, Chao Wang, Shuangyong Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09538
Pdf URL: https://arxiv.org/pdf/2602.09538
Copy Paste: [[2602.09538]] UniARM: Towards a Unified Autoregressive Reward Model for Multi-Objective Test-Time Alignment(https://arxiv.org/abs/2602.09538)
Keywords: llm
Abstract: Multi-objective alignment aims to align LLM responses with multiple human preference objectives. Among existing methods, guiding the generation of frozen LLMs through autoregressive reward models (ARMs) to accomplish multi-objective test-time alignment is a low-cost solution. However, these methods typically rely on independent parameters for each preference objective, either by training ARMs independently across preference dimensions, which neglects interactions among preference features, or by training a single ARM with separate feature extraction modules for each preference, which can cause feature entanglement. Both strategies can result in misalignment between generated outputs and user preferences. To address this limitation, we propose Preference-Modulated & Shared Low-Rank Adaptation (MoSLoRA) for ARM training, which first extracts shared features via a preference-agnostic module and then applies affine transformations to shared features via a preference modulation module conditioned on mixed preference vectors. This design mitigates feature entanglement and enables precise control over preference trade-offs during inference. Building on this, we introduce the Unified Autoregressive Reward Model (UniARM), a novel framework for multi-objective test-time alignment. UniARM jointly models all preference dimensions in a single parameter space, eliminating the need for independent parameters for each preference objective. es on larger-scale LLMs, enhancing its practical usability.

Title: Comprehensive Comparison of RAG Methods Across Multi-Domain Conversational QA

Authors: Klejda Alushi, Jan Strich, Chris Biemann, Martin Semmann
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2602.09552
Pdf URL: https://arxiv.org/pdf/2602.09552
Copy Paste: [[2602.09552]] Comprehensive Comparison of RAG Methods Across Multi-Domain Conversational QA(https://arxiv.org/abs/2602.09552)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Conversational question answering increasingly relies on retrieval-augmented generation (RAG) to ground large language models (LLMs) in external knowledge. Yet, most existing studies evaluate RAG methods in isolation and primarily focus on single-turn settings. This paper addresses the lack of a systematic comparison of RAG methods for multi-turn conversational QA, where dialogue history, coreference, and shifting user intent substantially complicate retrieval. We present a comprehensive empirical study of vanilla and advanced RAG methods across eight diverse conversational QA datasets spanning multiple domains. Using a unified experimental setup, we evaluate retrieval quality and answer generation using generator and retrieval metrics, and analyze how performance evolves across conversation turns. Our results show that robust yet straightforward methods, such as reranking, hybrid BM25, and HyDE, consistently outperform vanilla RAG. In contrast, several advanced techniques fail to yield gains and can even degrade performance below the No-RAG baseline. We further demonstrate that dataset characteristics and dialogue length strongly influence retrieval effectiveness, explaining why no single RAG strategy dominates across settings. Overall, our findings indicate that effective conversational RAG depends less on method complexity than on alignment between the retrieval strategy and the dataset structure. We publish the code used in a GitHub repository at this https URL.

Title: Advancing Block Diffusion Language Models for Test-Time Scaling

Authors: Yi Lu, Deyang Kong, Jianing Wang, Linsen Guo, Xue Wang, Qi Guo, Tao Gui, Xuanjing Huang, Wei Ye, Shikun Zhang, Wei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09555
Pdf URL: https://arxiv.org/pdf/2602.09555
Copy Paste: [[2602.09555]] Advancing Block Diffusion Language Models for Test-Time Scaling(https://arxiv.org/abs/2602.09555)
Keywords: language model, chain-of-thought
Abstract: Recent advances in block diffusion language models have demonstrated competitive performance and strong scalability on reasoning tasks. However, existing BDLMs have limited exploration under the test-time scaling setting and face more severe decoding challenges in long Chain-of-Thought reasoning, particularly in balancing the decoding speed and effectiveness. In this work, we propose a unified framework for test-time scaling in BDLMs that introduces adaptivity in both decoding and block-wise generation. At the decoding level, we propose Bounded Adaptive Confidence Decoding (BACD), a difficulty-aware sampling strategy that dynamically adjusts denoising based on model confidence, accelerating inference while controlling error accumulation. Beyond step-wise adaptivity, we introduce Think Coarse, Critic Fine (TCCF), a test-time scaling paradigm that allocates large block sizes to exploratory reasoning and smaller block sizes to refinement, achieving an effective efficiency-effectiveness balance. To enable efficient and effective decoding with a large block size, we adopt Progressive Block Size Extension, which mitigates performance degradation when scaling block sizes. Extensive experiments show that applying BACD and TCCF to TDAR-8B yields significant improvements over strong baselines such as TraDo-8B (2.26x speedup, +11.2 points on AIME24). These results mark an important step toward unlocking the potential of BDLMs for test-time scaling in complex reasoning tasks.

Title: LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval

Authors: Narges Baba Ahmadi, Jan Strich, Martin Semmann, Chris Biemann
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2602.09570
Pdf URL: https://arxiv.org/pdf/2602.09570
Copy Paste: [[2602.09570]] LEMUR: A Corpus for Robust Fine-Tuning of Multilingual Law Embedding Models for Retrieval(https://arxiv.org/abs/2602.09570)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used to access legal information. Yet, their deployment in multilingual legal settings is constrained by unreliable retrieval and the lack of domain-adapted, open-embedding models. In particular, existing multilingual legal corpora are not designed for semantic retrieval, and PDF-based legislative sources introduce substantial noise due to imperfect text extraction. To address these challenges, we introduce LEMUR, a large-scale multilingual corpus of EU environmental legislation constructed from 24,953 official EUR-Lex PDF documents covering 25 languages. We quantify the fidelity of PDF-to-text conversion by measuring lexical consistency against authoritative HTML versions using the Lexical Content Score (LCS). Building on LEMUR, we fine-tune three state-of-the-art multilingual embedding models using contrastive objectives in both monolingual and bilingual settings, reflecting realistic legal-retrieval scenarios. Experiments across low- and high-resource languages demonstrate that legal-domain fine-tuning consistently improves Top-k retrieval accuracy relative to strong baselines, with particularly pronounced gains for low-resource languages. Cross-lingual evaluations show that these improvements transfer to unseen languages, indicating that fine-tuning primarily enhances language-independent, content-level legal representations rather than language-specific cues. We publish code (GitHub repository, this https URL) and data (Hugging Face dataset, this https URL).

Title: Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs

Authors: Sora Miyamoto, Daisuke Oba, Naoaki Okazaki
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.09574
Pdf URL: https://arxiv.org/pdf/2602.09574
Copy Paste: [[2602.09574]] Aligning Tree-Search Policies with Fixed Token Budgets in Test-Time Scaling of LLMs(https://arxiv.org/abs/2602.09574)
Keywords: language model, llm
Abstract: Tree-search decoding is an effective form of test-time scaling for large language models (LLMs), but real-world deployment imposes a fixed per-query token budget that varies across settings. Existing tree-search policies are largely budget-agnostic, treating the budget as a termination condition, which can lead to late-stage over-branching or premature termination. We propose Budget-Guided MCTS (BG-MCTS), a tree-search decoding algorithm that aligns its search policy with the remaining token budget: it starts with broad exploration, then prioritizes refinement and answer completion as the budget depletes while reducing late-stage branching from shallow nodes. BG-MCTS consistently outperforms budget-agnostic tree-search baselines across different budgets on MATH500 and AIME24/25 with open-weight LLMs.

Title: Context-Aware Counterfactual Data Augmentation for Gender Bias Mitigation in Language Models

Authors: Shweta Parihar, Liu Guangliang, Natalie Parde, Lu Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09590
Pdf URL: https://arxiv.org/pdf/2602.09590
Copy Paste: [[2602.09590]] Context-Aware Counterfactual Data Augmentation for Gender Bias Mitigation in Language Models(https://arxiv.org/abs/2602.09590)
Keywords: language model
Abstract: A challenge in mitigating social bias in fine-tuned language models (LMs) is the potential reduction in language modeling capability, which can harm downstream performance. Counterfactual data augmentation (CDA), a widely used method for fine-tuning, highlights this issue by generating synthetic data that may align poorly with real-world distributions or creating overly simplistic counterfactuals that ignore the social context of altered sensitive attributes (e.g., gender) in the pretraining corpus. To address these limitations, we propose a simple yet effective context-augmented CDA method, Context-CDA, which uses large LMs to enhance the diversity and contextual relevance of the debiasing corpus. By minimizing discrepancies between the debiasing corpus and pretraining data through augmented context, this approach ensures better alignment, enhancing language modeling capability. We then employ uncertainty-based filtering to exclude generated counterfactuals considered low-quality by the target smaller LMs (i.e., LMs to be debiased), further improving the fine-tuning corpus quality. Experimental results on gender bias benchmarks demonstrate that Context-CDA effectively mitigates bias without sacrificing language modeling performance while offering insights into social biases by analyzing distribution shifts in next-token generation probabilities.

Title: On the Optimal Reasoning Length for RL-Trained Language Models

Authors: Daisuke Nohara, Taishi Nakamura, Rio Yokota
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.09591
Pdf URL: https://arxiv.org/pdf/2602.09591
Copy Paste: [[2602.09591]] On the Optimal Reasoning Length for RL-Trained Language Models(https://arxiv.org/abs/2602.09591)
Keywords: language model
Abstract: Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost during both training and inference. Though length control methods have been proposed, it remains unclear what the optimal output length is for balancing efficiency and performance. In this work, we compare several length control methods on two models, Qwen3-1.7B Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning. By extending prior work to RL-trained policies, we identify two failure modes: 1) long outputs increase dispersion, and 2) short outputs lead to under-thinking.

Title: Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning

Authors: Qiao Liang, Yuke Zhu, Chao Ge, Lei Yang, Ying Shen, Bo Zheng, Sheng Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09598
Pdf URL: https://arxiv.org/pdf/2602.09598
Copy Paste: [[2602.09598]] Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning(https://arxiv.org/abs/2602.09598)
Keywords: llm, agent
Abstract: Tool-integrated reasoning (TIR) enables LLM agents to solve tasks through planning, tool use, and iterative revision, but outcome-only reinforcement learning in this setting suffers from sparse, delayed rewards and weak step-level credit assignment. In long-horizon TIR trajectories, an early irrecoverable mistake can determine success or failure, making it crucial to localize the first irrecoverable step and leverage it for fine-grained credit assignment. We propose Error-Localized Policy Optimization (ELPO), which localizes the first irrecoverable step via binary-search rollout trees under a fixed rollout budget, converts the resulting tree into stable learning signals through hierarchical advantage attribution, and applies error-localized adaptive clipping to strengthen corrective updates on the critical step and its suffix. Across TIR benchmarks in math, science QA, and code execution, ELPO consistently outperforms strong Agentic RL baselines under comparable sampling budgets, with additional gains in Pass@K and Major@K scaling, rollout ranking quality, and tool-call efficiency. Our code will be publicly released soon.
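The idea of localizing the first irrecoverable step with a binary search, mentioned in the abstract above, can be illustrated with a plain binary search over trajectory prefixes. This is not ELPO's rollout-tree procedure: the predicate `recoverable_after` is a hypothetical stand-in (in practice it might be estimated by sampling rollouts from a prefix), and the sketch assumes monotonicity, meaning that once a prefix is irrecoverable every longer prefix is too.

```python
from typing import Callable, Optional


def first_irrecoverable_step(num_steps: int,
                             recoverable_after: Callable[[int], bool]) -> Optional[int]:
    """Smallest k in 1..num_steps whose k-step prefix can no longer be recovered.

    Assumes the empty prefix is recoverable and that irrecoverability is monotone.
    Returns None if even the full trajectory is still recoverable.
    """
    if recoverable_after(num_steps):
        return None
    lo, hi = 0, num_steps  # invariant: prefix lo is recoverable, prefix hi is not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if recoverable_after(mid):
            lo = mid
        else:
            hi = mid
    return hi


# Toy trajectory of 10 steps in which step 7 is the first irrecoverable one.
probes = []
def toy_predicate(k: int) -> bool:
    probes.append(k)
    return k < 7

print(first_irrecoverable_step(10, toy_predicate))  # 7
print(len(probes))  # 4 probes instead of checking all 10 prefixes
```

The logarithmic number of probes is what makes this kind of search attractive under a fixed rollout budget, since each probe would correspond to a batch of sampled continuations.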

Title: AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models

Authors: R E Zera Marveen Lyngkhoi, Chirag Chawla, Pratinav Seth, Utsav Avaiya, Soham Bhattacharjee, Mykola Khandoga, Rui Yuan, Vinay Kumar Sankarapu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.09621
Pdf URL: https://arxiv.org/pdf/2602.09621
Copy Paste: [[2602.09621]] AlignTune: Modular Toolkit for Post-Training Alignment of Large Language Models(https://arxiv.org/abs/2602.09621)
Keywords: language model, llm
Abstract: Post-training alignment is central to deploying large language models (LLMs), yet practical workflows remain split across backend-specific tools and ad-hoc glue code, making experiments hard to reproduce. We identify backend interference, reward fragmentation, and irreproducible pipelines as key obstacles in alignment research. We introduce AlignTune, a modular toolkit exposing a unified interface for supervised fine-tuning (SFT) and RLHF-style optimization with interchangeable TRL and Unsloth backends. AlignTune standardizes configuration, provides an extensible reward layer (rule-based and learned), and integrates evaluation over standard benchmarks and custom tasks. By isolating backend-specific logic behind a single factory boundary, AlignTune enables controlled comparisons and reproducible alignment experiments.

Title: MILE-RefHumEval: A Reference-Free, Multi-Independent LLM Framework for Human-Aligned Evaluation

Authors: Nalin Srun (UL, CNRS, LORIA), Parisa Rastin (UL, CNRS, LORIA), Guénaël Cabanes (UL, CNRS, LORIA), Lydia Boudjeloud Assala (UL, CNRS, LORIA)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09624
Pdf URL: https://arxiv.org/pdf/2602.09624
Copy Paste: [[2602.09624]] MILE-RefHumEval: A Reference-Free, Multi-Independent LLM Framework for Human-Aligned Evaluation(https://arxiv.org/abs/2602.09624)
Keywords: language model, llm, prompt
Abstract: We introduce MILE-RefHumEval, a reference-free framework for evaluating Large Language Models (LLMs) without ground-truth annotations or evaluator coordination. It leverages an ensemble of independently prompted evaluators guided by a human-aligned schema, supporting both discrete and continuous scoring judgement. With task-specific prompts from best candidate selection, summarization and image captioning to dialogue, MILE-RefHumEval provides flexible, interpretable, and scalable assessments. Experiments show it aligns closely with human judgments, outperforms prior methods, and reduces computational overhead, offering an efficient, robust, and human-aligned solution for real-world LLM evaluation.

Title: MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering

Authors: Sieun Hyeon, Jusang Oh, Sunghwan Steve Cho, Jaeyoung Do
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09642
Pdf URL: https://arxiv.org/pdf/2602.09642
Copy Paste: [[2602.09642]] MATA: Multi-Agent Framework for Reliable and Flexible Table Question Answering(https://arxiv.org/abs/2602.09642)
Keywords: language model, llm, agent
Abstract: Recent advances in Large Language Models (LLMs) have significantly improved table understanding tasks such as Table Question Answering (TableQA), yet challenges remain in ensuring reliability, scalability, and efficiency, especially in resource-constrained or privacy-sensitive environments. In this paper, we introduce MATA, a multi-agent TableQA framework that leverages multiple complementary reasoning paths and a set of tools built with small language models. MATA generates candidate answers through diverse reasoning styles for a given table and question, then refines or selects the optimal answer with the help of these tools. Furthermore, it incorporates an algorithm designed to minimize expensive LLM agent calls, enhancing overall efficiency. MATA maintains strong performance with small, open-source models and adapts easily across various LLM types. Extensive experiments on two benchmarks of varying difficulty with ten different LLMs demonstrate that MATA achieves state-of-the-art accuracy and highly efficient reasoning while avoiding excessive LLM inference. Our results highlight that careful orchestration of multiple reasoning pathways yields scalable and reliable TableQA. The code is available at this https URL.
摘要：大型语言模型 (LLM) 的最新进展显着改进了表理解任务，例如表问答 (TableQA)，但在确保可靠性、可扩展性和效率方面仍然存在挑战，特别是在资源受限或隐私敏感的环境中。在本文中，我们介绍了 MATA，这是一个多代理 TableQA 框架，它利用多个互补推理路径和一组使用小型语言模型构建的工具。 MATA 通过针对给定表格和问题的多种推理方式生成候选答案，然后借助这些工具细化或选择最佳答案。此外，它还采用了一种算法，旨在最大限度地减少昂贵的 LLM 代理呼叫，从而提高整体效率。 MATA 通过小型开源模型保持强大的性能，并轻松适应各种 LLM 类型。对十个不同 LLM 的两个不同难度的基准进行的广泛实验表明，MATA 实现了最先进的准确性和高效推理，同时避免了过多的 LLM 推理。我们的结果强调，仔细编排多个推理路径可以产生可扩展且可靠的 TableQA。该代码可从此 https URL 获取。

Title: Maastricht University at AMIYA: Adapting LLMs for Dialectal Arabic using Fine-tuning and MBR Decoding

Authors: Abdulhai Alali, Abderrahmane Issam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09703
Pdf URL: https://arxiv.org/pdf/2602.09703
Copy Paste: [[2602.09703]] Maastricht University at AMIYA: Adapting LLMs for Dialectal Arabic using Fine-tuning and MBR Decoding(https://arxiv.org/abs/2602.09703)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are becoming increasingly multilingual, supporting hundreds of languages, especially high-resource ones. Unfortunately, dialect variations are still underrepresented due to limited data and linguistic variation. In this work, we adapt a pre-trained LLM to improve dialectal performance. Specifically, we use Low-Rank Adaptation (LoRA) fine-tuning on monolingual and English-dialect parallel data, adapter merging, and dialect-aware MBR decoding to improve dialectal fidelity in generation and translation. Experiments on Syrian, Moroccan, and Saudi Arabic show that merging and MBR improve dialectal fidelity while preserving semantic accuracy. This combination provides a compact and effective framework for robust dialectal Arabic generation.
摘要：大型语言模型 (LLM) 正变得越来越多语言，支持数百种语言，尤其是高资源语言。不幸的是，由于数据和语言变异有限，方言变异的代表性仍然不足。在这项工作中，我们采用预先训练的法学硕士来提高方言表现。具体来说，我们对单语和英语方言并行数据使用低秩适应 (LoRA) 微调、适配器合并和方言感知 MBR 解码来提高方言保真度生成和翻译。对叙利亚语、摩洛哥语和沙特阿拉伯语的实验表明，合并和 MBR 可以提高方言保真度，同时保持语义准确性。这种组合为强大的阿拉伯语方言生成提供了一个紧凑而有效的框架。
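
Illustration (not from the paper): minimum Bayes risk (MBR) decoding picks, from a set of sampled candidates, the one with the highest average utility against the others. The sketch below assumes a simple unigram F1 as the utility; the paper's dialect-aware utility is not described in the abstract, so `unigram_f1` is only a stand-in and both function names are hypothetical.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Token-overlap F1 between two whitespace-tokenized strings (stand-in utility)."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_select(candidates, utility=unigram_f1):
    """Return the candidate with the highest mean utility against all other candidates."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def expected_utility(i):
        others = [c for j, c in enumerate(candidates) if j != i]
        return sum(utility(candidates[i], o) for o in others) / len(others)

    best = max(range(len(candidates)), key=expected_utility)
    return candidates[best]


if __name__ == "__main__":
    samples = ["the cat sat down", "the cat sat", "the cat sat there", "a dog ran"]
    print(mbr_select(samples))
```

Swapping `utility` for a metric that rewards dialect-specific vocabulary would be one way to make the selection dialect-aware, under the assumption that such a metric is available.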

Title: TraceMem: Weaving Narrative Memory Schemata from User Conversational Traces

Authors: Yiming Shu, Pei Liu, Tiange Zhang, Ruiyang Gao, Jun Ma, Chen Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09712
Pdf URL: https://arxiv.org/pdf/2602.09712
Copy Paste: [[2602.09712]] TraceMem: Weaving Narrative Memory Schemata from User Conversational Traces(https://arxiv.org/abs/2602.09712)
Keywords: language model, llm, agent
Abstract: Sustaining long-term interactions remains a bottleneck for Large Language Models (LLMs), as their limited context windows struggle to manage dialogue histories that extend over time. Existing memory systems often treat interactions as disjointed snippets, failing to capture the underlying narrative coherence of the dialogue stream. We propose TraceMem, a cognitively-inspired framework that weaves structured, narrative memory schemata from user conversational traces through a three-stage pipeline: (1) Short-term Memory Processing, which employs a deductive topic segmentation approach to demarcate episode boundaries and extract semantic representations; (2) Synaptic Memory Consolidation, a process that summarizes episodes into episodic memories before distilling them alongside semantics into user-specific traces; and (3) Systems Memory Consolidation, which utilizes two-stage hierarchical clustering to organize these traces into coherent, time-evolving narrative threads under unifying themes. These threads are encapsulated into structured user memory cards, forming narrative memory schemata. For memory utilization, we provide an agentic search mechanism to enhance the reasoning process. Evaluation on the LoCoMo benchmark shows that TraceMem achieves state-of-the-art performance with a brain-inspired architecture. Analysis shows that by constructing coherent narratives, it surpasses baselines in multi-hop and temporal reasoning, underscoring its essential role in deep narrative comprehension. Additionally, we provide an open discussion on memory systems, offering our perspectives and future outlook on the field. Our code implementation is available at: this https URL
摘要：维持长期交互仍然是大型语言模型（LLM）的瓶颈，因为它们有限的上下文窗口难以管理随着时间推移而延伸的对话历史。现有的记忆系统通常将交互视为脱节的片段，无法捕捉对话流的潜在叙事连贯性。我们提出了 TraceMem，这是一个受认知启发的框架，它通过三阶段管道从用户对话痕迹中编织结构化的叙事记忆模式：（1）短期记忆处理，采用演绎主题分割方法来划分情节边界并提取语义表示； (2) 突触记忆巩固，一个将情节总结为情节记忆，然后将其与语义一起提炼为用户特定痕迹的过程； (3)系统记忆整合，它利用两阶段层次聚类将这些痕迹组织成统一主题下连贯的、随时间演变的叙事线索。这些线程被封装到结构化的用户存储卡中，形成叙事存储模式。对于内存利用率，我们提供了一种代理搜索机制来增强推理过程。对 LoCoMo 基准的评估表明，TraceMem 通过受大脑启发的架构实现了最先进的性能。分析表明，通过构建连贯的叙述，它超越了多跳和时间推理的基线，强调了其在深度叙述理解中的重要作用。此外，我们还对内存系统进行公开讨论，提供我们对该领域的看法和未来展望。我们的代码实现位于：此 https URL

Title: Unsupervised Layer-Wise Dynamic Test Time Adaptation for LLMs

Authors: Longhuan Xu, Cunjian Chen, Feng Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09719
Pdf URL: https://arxiv.org/pdf/2602.09719
Copy Paste: [[2602.09719]] Unsupervised Layer-Wise Dynamic Test Time Adaptation for LLMs(https://arxiv.org/abs/2602.09719)
Keywords: language model, llm, prompt
Abstract: Test-time adaptation (TTA) for large language models (LLMs) updates model parameters at inference time using signals available at deployment. This paper focuses on a common yet under-explored regime: unsupervised, sample-specific TTA, where the model adapts independently for each prompt using only the prompt itself, without gold answers or external supervision. Although appealing, naive unsupervised TTA with a fixed, handcrafted learning rate can be unstable: updates may overfit to prompt-specific statistics, drift from the desired answer distribution, and ultimately degrade generation quality. This failure mode is not surprising, as in this case TTA must adapt to a single prompt within only a few gradient steps, unlike standard training that averages updates over large datasets and long optimization horizons. Therefore, we propose layer-wise dynamic test-time adaptation, a framework which explicitly modulates TTA strength as a function of prompt representation, LLM structure and adaptation step. In our setting, TTA updates only LoRA parameters, and a lightweight hypernetwork predicts per-layer, per-step learning-rate multipliers, enabling fine-grained control. Experiments across various datasets and LLMs consistently show that our method substantially strengthens TTA by learning effective scaling patterns over adaptation steps and transformer layer projections, improving stability while delivering better performance.
摘要：大型语言模型 (LLM) 的测试时适应 (TTA) 使用部署时可用的信号在推理时更新模型参数。本文重点关注一种常见但尚未充分探索的机制：无监督、特定于样本的 TTA，其中模型仅使用提示本身独立适应每个提示，无需黄金答案或外部监督。尽管很有吸引力，但具有固定的、手工设计的学习率的天真的无监督 TTA 可能不稳定：更新可能会过度拟合特定于提示的统计数据，偏离所需的答案分布，并最终降低生成质量。这种失败模式并不奇怪，因为在这种情况下，TTA 必须在几个梯度步骤内适应单个提示，这与在大型数据集和长优化范围内平均更新的标准训练不同。因此，我们提出了逐层动态测试时间适应，这是一个框架，它根据提示表示、LLM 结构和适应步骤明确调节 TTA 强度。在我们的设置中，TTA 仅更新 LoRA 参数，轻量级超网络预测每层、每步的学习率乘数，从而实现细粒度控制。跨各种数据集和 LLM 的实验一致表明，我们的方法通过学习适应步骤和变换层投影的有效缩放模式，显着增强了 TTA，提高了稳定性，同时提供了更好的性能。

Title: AI-Assisted Scientific Assessment: A Case Study on Climate Change

Authors: Christian Buck, Levke Caesar, Michelle Chen Huebscher, Massimiliano Ciaramita, Erich M. Fischer, Zeke Hausfather, Özge Kart Tokmak, Reto Knutti, Markus Leippold, Joseph Ludescher, Katharine J. Mach, Sofia Palazzo Corner, Kasra Rafiezadeh Shahi, Johan Rockström, Joeri Rogelj, Boris Sakschewski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09723
Pdf URL: https://arxiv.org/pdf/2602.09723
Copy Paste: [[2602.09723]] AI-Assisted Scientific Assessment: A Case Study on Climate Change(https://arxiv.org/abs/2602.09723)
Keywords: agent
Abstract: The emerging paradigm of AI co-scientists focuses on tasks characterized by repeatable verification, where agents explore search spaces in 'guess and check' loops. This paradigm does not extend to problems where repeated evaluation is impossible and ground truth is established by the consensus synthesis of theory and existing evidence. We evaluate a Gemini-based AI environment designed to support collaborative scientific assessment, integrated into a standard scientific workflow. In collaboration with a diverse group of 13 scientists working in the field of climate science, we tested the system on a complex topic: the stability of the Atlantic Meridional Overturning Circulation (AMOC). Our results show that AI can accelerate the scientific workflow. The group produced a comprehensive synthesis of 79 papers through 104 revision cycles in just over 46 person-hours. AI contribution was significant: most AI-generated content was retained in the report. AI also helped maintain logical consistency and presentation quality. However, expert additions were crucial to ensure its acceptability: less than half of the report was produced by AI. Furthermore, substantial oversight was required to expand and elevate the content to rigorous scientific standards.
摘要：人工智能联合科学家的新兴范式专注于以可重复验证为特征的任务，其中代理在“猜测和检查”循环中探索搜索空间。这种范式不适用于无法重复评估的问题，并且基本事实是通过理论和现有证据的共识综合建立的。我们评估基于 Gemini 的人工智能环境，旨在支持协作科学评估，并将其集成到标准科学工作流程中。我们与气候科学领域的 13 名科学家组成的不同小组合作，针对一个复杂的主题测试了该系统：大西洋经向翻转环流 (AMOC) 的稳定性。我们的结果表明人工智能可以加速科学工作流程。该小组通过 104 个修订周期，仅用了 46 个多小时就完成了 79 篇论文的综合综合。人工智能的贡献是显着的：大多数人工智能生成的内容都保留在报告中。人工智能还有助于保持逻辑一致性和演示质量。然而，专家的补充对于确保其可接受性至关重要：不到一半的报告是由人工智能编写的。此外，需要进行大量监督来扩展内容并将其提升到严格的科学标准。

Title: Improving Interpretability of Lexical Semantic Change with Neurobiological Features

Authors: Kohei Oda, Hiroya Takamura, Kiyoaki Shirai, Natthawut Kertkeidkachorn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09760
Pdf URL: https://arxiv.org/pdf/2602.09760
Copy Paste: [[2602.09760]] Improving Interpretability of Lexical Semantic Change with Neurobiological Features(https://arxiv.org/abs/2602.09760)
Keywords: language model
Abstract: Lexical Semantic Change (LSC) is the phenomenon in which the meaning of a word changes over time. Most studies on LSC focus on improving the performance of estimating the degree of LSC; however, it is often difficult to interpret how the meaning of a word changes. Enhancing the interpretability of LSC is a significant challenge as it could lead to novel insights in this field. To tackle this challenge, we propose a method to map the semantic space of contextualized embeddings of words obtained by a pre-trained language model to a neurobiological feature space. In the neurobiological feature space, each dimension corresponds to a primitive feature of words, and its value represents the intensity of that feature. This enables humans to interpret LSC systematically. When employed for the estimation of the degree of LSC, our method demonstrates superior performance in comparison to the majority of the previous methods. In addition, given the high interpretability of the proposed method, several analyses on LSC are carried out. The results demonstrate that our method not only discovers interesting types of LSC that have been overlooked in previous studies but also effectively searches for words with specific types of LSC.
摘要：词汇语义变化（LSC）是词义随时间变化的现象。大多数关于LSC的研究都集中在提高估计LSC程度的性能，然而，通常很难解释单词的含义如何变化。增强 LSC 的可解释性是一项重大挑战，因为它可能会带来该领域的新颖见解。为了应对这一挑战，我们提出了一种方法，将预先训练的语言模型获得的单词上下文嵌入的语义空间映射到神经生物学特征空间。在神经生物学特征空间中，每个维度对应于单词的一个原始特征，其值代表该特征的强度。这使得人类能够系统地解释 LSC。当用于估计 LSC 程度时，与之前的大多数方法相比，我们的方法表现出优越的性能。此外，鉴于该方法的高可解释性，对 LSC 进行了一些分析。结果表明，我们的方法不仅发现了以前研究中被忽视的有趣的 LSC 类型，而且还可以有效地搜索具有特定类型 LSC 的单词。

Title: Decomposing Reasoning Efficiency in Large Language Models

Authors: Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.09805
Pdf URL: https://arxiv.org/pdf/2602.09805
Copy Paste: [[2602.09805]] Decomposing Reasoning Efficiency in Large Language Models(https://arxiv.org/abs/2602.09805)
Keywords: language model, llm, prompt
Abstract: Large language models trained for reasoning trade off inference tokens against accuracy, yet standard evaluations report only final accuracy, obscuring where tokens are spent or wasted. We introduce a trace-optional framework that decomposes token efficiency into interpretable factors: completion under a fixed token budget (avoiding truncation), conditional correctness given completion, and verbosity (token usage). When benchmark metadata provides per-instance workload proxies, we further factor verbosity into two components: mean verbalization overhead (tokens per work unit) and a coupling coefficient capturing how overhead scales with task workload. When reasoning traces are available, we add deterministic trace-quality measures (grounding, repetition, prompt copying) to separate degenerate looping from verbose-but-engaged reasoning, avoiding human labeling and LLM judges. Evaluating 25 models on CogniLoad, we find that accuracy and token-efficiency rankings diverge (Spearman $\rho=0.63$), efficiency gaps are often driven by conditional correctness, and verbalization overhead varies by about 9 times (only weakly related to model scale). Our decomposition reveals distinct bottleneck profiles that suggest different efficiency interventions.
摘要：为推理而训练的大型语言模型会权衡推理标记与准确性，但标准评估仅报告最终准确性，从而模糊了标记的花费或浪费位置。我们引入了一个跟踪可选框架，它将令牌效率分解为可解释的因素：固定令牌预算下的完成（避免截断）、给定完成的条件正确性以及详细程度（令牌使用）。当基准元数据提供每个实例的工作负载代理时，我们进一步将详细程度分解为两个组成部分：平均语言开销（每个工作单元的令牌）和捕获开销如何随任务工作负载扩展的耦合系数。当推理轨迹可用时，我们添加确定性轨迹质量测量（基础、重复、提示复制），以将退化循环与冗长但参与的推理分开，避免人工标记和 LLM 法官。在 CogniLoad 上评估 25 个模型，我们发现准确性和令牌效率排名存在差异 (Spearman $\rho=0.63$)，效率差距通常由条件正确性驱动，并且语言化开销变化约 9 倍（仅与模型规模微弱相关）。我们的分解揭示了不同的瓶颈概况，表明不同的效率干预措施。
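
Illustration (not from the paper): the factors named in the abstract can be computed from per-instance run records. This is a minimal sketch, assuming truncated runs count as incorrect so that accuracy equals completion rate times conditional correctness; the `Run` record and the `decompose` function are hypothetical, and the paper's exact factorization may differ.

```python
from dataclasses import dataclass


@dataclass
class Run:
    completed: bool  # finished within the token budget (no truncation)
    correct: bool    # final answer was right
    tokens: int      # tokens generated


def decompose(runs):
    """Split accuracy into completion and conditional correctness, plus mean token usage."""
    if not runs:
        raise ValueError("need at least one run")
    n = len(runs)
    done = [r for r in runs if r.completed]
    completion = len(done) / n
    cond_correct = sum(r.correct for r in done) / len(done) if done else 0.0
    return {
        "completion": completion,
        "conditional_correctness": cond_correct,
        "accuracy": completion * cond_correct,  # assumes truncated runs are wrong
        "mean_tokens": sum(r.tokens for r in runs) / n,
    }


if __name__ == "__main__":
    runs = [Run(True, True, 100), Run(True, False, 200),
            Run(False, False, 300), Run(True, True, 200)]
    print(decompose(runs))
```

Two models with the same accuracy can differ in which factor limits them: one may truncate often while the other completes but answers incorrectly, which suggests different interventions.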

Title: AnalyticsGPT: An LLM Workflow for Scientometric Question Answering

Authors: Khang Ly, Georgios Cheirmpos, Adrian Raudaschl, Christopher James, Seyed Amin Tabatabaei
Subjects: cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2602.09817
Pdf URL: https://arxiv.org/pdf/2602.09817
Copy Paste: [[2602.09817]] AnalyticsGPT: An LLM Workflow for Scientometric Question Answering(https://arxiv.org/abs/2602.09817)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation, agent
Abstract: This paper introduces AnalyticsGPT, an intuitive and efficient large language model (LLM)-powered workflow for scientometric question answering. This underrepresented downstream task addresses the subcategory of meta-scientific questions concerning the "science of science." When compared to traditional scientific question answering based on papers, the task poses unique challenges in the planning phase, namely the need for named-entity recognition of academic entities within questions and multi-faceted data retrieval involving scientometric indices, e.g., impact factors. Beyond their exceptional capacity for treating traditional natural language processing tasks, LLMs have shown great potential in more complex applications, such as task decomposition, planning, and reasoning. In this paper, we explore the application of LLMs to scientometric question answering, and describe an end-to-end system implementing a sequential workflow with retrieval-augmented generation and agentic concepts. We also address the secondary task of effectively synthesizing the data into presentable and well-structured high-level analyses. As a database for retrieval-augmented generation, we leverage a proprietary research performance assessment platform. For evaluation, we consult experienced subject matter experts and leverage LLMs-as-judges. In doing so, we provide valuable insights on the efficacy of LLMs towards a niche downstream task. Our (skeleton) code and prompts are available at: this https URL.
摘要：本文介绍了 AnalyticsGPT，这是一种直观且高效的大语言模型 (LLM) 支持的科学计量问答工作流程。这项代表性不足的下游任务解决了有关“科学的科学”的元科学问题的子类别。与基于论文的传统科学问答相比，该任务在规划阶段提出了独特的挑战。即，需要对问题中的学术实体进行命名实体识别以及涉及科学计量索引的多方面数据检索，例如影响因素。除了处理传统自然语言处理任务的卓越能力之外，法学硕士在更复杂的应用中也显示出了巨大的潜力，例如任务分解、规划和推理。在本文中，我们探索了法学硕士在科学计量学问答中的应用，并描述了一个端到端系统，该系统通过检索增强生成和代理概念来实现顺序工作流程。我们还解决了第二个任务，即有效地将数据合成为可呈现且结构良好的高级分析。作为检索增强生成的数据库，我们利用专有的研究绩效评估平台。为了进行评估，我们咨询了经验丰富的主题专家并利用法学硕士作为评委。在此过程中，我们就法学硕士对利基下游任务的功效提供了宝贵的见解。我们的（骨架）代码和提示可在以下位置找到：此 https URL。

Title: Text summarization via global structure awareness

Authors: Jiaquan Zhang, Chaoning Zhang, Shuxu Chen, Yibei Liu, Chenghao Li, Qigan Sun, Shuai Yuan, Fachrina Dewi Puspitasari, Dongshen Han, Guoqing Wang, Sung-Ho Bae, Yang Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09821
Pdf URL: https://arxiv.org/pdf/2602.09821
Copy Paste: [[2602.09821]] Text summarization via global structure awareness(https://arxiv.org/abs/2602.09821)
Keywords: language model, llm
Abstract: Text summarization is a fundamental task in natural language processing (NLP), and the information explosion has made long-document processing increasingly demanding, making summarization essential. Existing research mainly focuses on model improvements and sentence-level pruning, but often overlooks global structure, leading to disrupted coherence and weakened downstream performance. Some studies employ large language models (LLMs), which achieve higher accuracy but incur substantial resource and time costs. To address these issues, we introduce GloSA-sum, the first summarization approach that achieves global structure awareness via topological data analysis (TDA). GloSA-sum summarizes text efficiently while preserving semantic cores and logical dependencies. Specifically, we construct a semantic-weighted graph from sentence embeddings, where persistent homology identifies core semantics and logical structures, preserved in a "protection pool" as the backbone for summarization. We design a topology-guided iterative strategy, where lightweight proxy metrics approximate sentence importance to avoid repeated high-cost computations, thus preserving structural integrity while improving efficiency. To further enhance long-text processing, we propose a hierarchical strategy that integrates segment-level and global summarization. Experiments on multiple datasets demonstrate that GloSA-sum reduces redundancy while preserving semantic and logical integrity, striking a balance between accuracy and efficiency, and further benefits LLM downstream tasks by shortening contexts while retaining essential reasoning chains.
摘要：文本摘要是自然语言处理（NLP）中的一项基本任务，信息爆炸使得长文档处理的要求越来越高，使得摘要变得至关重要。现有研究主要集中在模型改进和句子级剪枝上，但往往忽视全局结构，导致连贯性破坏和下游性能减弱。一些研究采用大型语言模型（LLM），它可以实现更高的准确性，但会产生大量的资源和时间成本。为了解决这些问题，我们引入了 GloSA-sum，这是第一个通过拓扑数据分析（TDA）实现全局结构感知的汇总方法。 GloSA-sum 有效地总结文本，同时保留语义核心和逻辑依赖性。具体来说，我们从句子嵌入构建了一个语义加权图，其中持久同源性标识了核心语义和逻辑结构，并保存在“保护池”中作为摘要的支柱。我们设计了一种拓扑引导的迭代策略，其中轻量级代理度量近似句子重要性，以避免重复的高成本计算，从而在提高效率的同时保持结构完整性。为了进一步增强长文本处理，我们提出了一种集成分段级别和全局摘要的分层策略。对多个数据集的实验表明，GloSA-sum 在保留语义和逻辑完整性的同时减少了冗余，在准确性和效率之间取得了平衡，并通过缩短上下文同时保留必要的推理链来进一步有利于 LLM 下游任务。

Title: From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models

Authors: Abdulmuizz Khalak, Abderrahmane Issam, Gerasimos Spanakis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09826
Pdf URL: https://arxiv.org/pdf/2602.09826
Copy Paste: [[2602.09826]] From FusHa to Folk: Exploring Cross-Lingual Transfer in Arabic Language Models(https://arxiv.org/abs/2602.09826)
Keywords: language model
Abstract: Arabic Language Models (LMs) are pretrained predominantly on Modern Standard Arabic (MSA) and are expected to transfer to its dialects. While MSA as the standard written variety is commonly used in formal settings, people speak and write online in various dialects that are spread across the Arab region. This poses limitations for Arabic LMs, since its dialects vary in their similarity to MSA. In this work, we study cross-lingual transfer of Arabic models using probing on three Natural Language Processing (NLP) tasks and representational similarity. Our results indicate that transfer is possible but disproportionate across dialects, which we find to be partially explained by their geographic proximity. Furthermore, we find evidence for negative interference in models trained to support all Arabic dialects. This questions their degree of similarity, and raises concerns for cross-lingual transfer in Arabic models.
摘要：阿拉伯语言模型 (LM) 主要在现代标准阿拉伯语 (MSA) 上进行预训练，并有望迁移到其方言。虽然 MSA 作为标准书面语常用于正式场合，但人们在网上用遍布阿拉伯地区的各种方言进行交谈和写作。这对阿拉伯语言 LM 造成了限制，因为其方言与 MSA 的相似性有所不同。在这项工作中，我们使用对 3 个自然语言处理 (NLP) 任务的探测和表征相似性来研究阿拉伯语模型的跨语言迁移。我们的结果表明，不同方言之间的迁移是可能的，但不成比例，我们发现这部分是由它们的地理邻近性来解释的。此外，我们发现了支持所有阿拉伯方言的模型受到负面干扰的证据。这对它们的相似程度提出了质疑，并引发了对阿拉伯语模型中跨语言迁移的担忧。

Title: LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse

Authors: Bakhtawar Ahtisham, Kirk Vanacore, Zhuqian Zhou, Jinsook Lee, Rene F. Kizilcec
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09832
Pdf URL: https://arxiv.org/pdf/2602.09832
Copy Paste: [[2602.09832]] LLM Reasoning Predicts When Models Are Right: Evidence from Coding Classroom Discourse(https://arxiv.org/abs/2602.09832)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly deployed to automatically label and analyze educational dialogue at scale, yet current pipelines lack reliable ways to detect when models are wrong. We investigate whether reasoning generated by LLMs can be used to predict the correctness of a model's own predictions. We analyze 30,300 teacher utterances from classroom dialogue, each labeled by multiple state-of-the-art LLMs with an instructional move construct and an accompanying reasoning. Using human-verified ground-truth labels, we frame the task as predicting whether a model's assigned label for a given utterance is correct. We encode LLM reasoning using Term Frequency-Inverse Document Frequency (TF-IDF) and evaluate five supervised classifiers. A Random Forest classifier achieves an F1 score of 0.83 (Recall = 0.854), successfully identifying most incorrect predictions and outperforming baselines. Training specialist detectors for specific instructional move constructs further improves performance on difficult constructs, indicating that error detection benefits from construct-specific linguistic cues. Using the Linguistic Inquiry and Word Count (LIWC) framework, we examine four linguistic markers of correctness: Causation, Differentiation, Tentativeness, and Insight. Correct predictions exhibit grounded causal language (e.g., because, therefore), while incorrect reasoning is substantially more likely to rely on epistemic hedging (e.g., might, could) and performative metacognition (e.g., think, realize). Syntactic complexity does not distinguish correct from incorrect reasoning, and longer reasoning is not more reliable. These findings demonstrate that reasoning-based error detection offers a practical and scalable approach to quality control in automated educational dialogue analysis.
摘要：大型语言模型 (LLM) 越来越多地用于自动标记和分析大规模教育对话，但当前的流程缺乏可靠的方法来检测模型何时出错。我们研究法学硕士生成的推理是否可用于预测模型自身预测的正确性。我们分析了课堂对话中的 30,300 条教师话语，每条话语都由多个最先进的法学硕士进行了标记，并带有教学动作结构和随附的推理。使用经过人工验证的真实标签，我们将任务定义为预测模型为给定话语分配的标签是否正确。我们使用术语频率-逆文档频率 (TF-IDF) 对 LLM 推理进行编码，并评估五个监督分类器。随机森林分类器的 F1 分数为 0.83（召回率 = 0.854），成功识别了大多数不正确的预测并优于基线。针对特定教学动作结构训练专业检测器进一步提高了困难结构的性能，这表明错误检测受益于结构特定的语言线索。使用语言查询和字数统计 (LIWC) 框架，我们检查了四种正确性的语言标记：因果关系、差异、尝试性和洞察力。正确的预测表现出扎根的因果语言（例如，因为、因此），而错误的推理则更可能依赖于认知对冲（例如，可能、可能）和表演性元认知（例如，思考、实现）。句法复杂性并不能区分正确和不正确的推理，而且较长的推理并不更可靠。这些发现表明，基于推理的错误检测为自动教育对话分析中的质量控制提供了一种实用且可扩展的方法。
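
Illustration (not from the paper): a toy counter for the kinds of linguistic markers the abstract mentions. It uses only the example words quoted in the abstract (because, therefore, might, could, think, realize); the actual LIWC dictionaries are much larger and are not reproduced here, and the category keys and the function name `marker_counts` are hypothetical.

```python
import re

# Word lists contain only the examples quoted in the abstract; real LIWC
# categories are far larger. Category keys are illustrative labels.
MARKERS = {
    "causal": {"because", "therefore"},
    "hedging": {"might", "could"},
    "metacognition": {"think", "realize"},
}


def marker_counts(reasoning):
    """Count how many tokens of a reasoning string fall into each marker category."""
    tokens = re.findall(r"[a-z']+", reasoning.lower())
    return {cat: sum(t in words for t in tokens) for cat, words in MARKERS.items()}


if __name__ == "__main__":
    text = "It might be praise because the teacher could think so."
    print(marker_counts(text))
```

Counts like these could serve as extra features next to TF-IDF when predicting whether a model's label is correct, under the assumption that the marker lists are extended to a realistic size.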

Title: SinFoS: A Parallel Dataset for Translating Sinhala Figures of Speech

Authors: Johan Sofalas, Dilushri Pavithra, Nevidu Jayatilleke, Ruvan Weerasinghe
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09866
Pdf URL: https://arxiv.org/pdf/2602.09866
Copy Paste: [[2602.09866]] SinFoS: A Parallel Dataset for Translating Sinhala Figures of Speech(https://arxiv.org/abs/2602.09866)
Keywords: llm
Abstract: Figures of Speech (FoS) consist of multi-word phrases that are deeply intertwined with culture. While Neural Machine Translation (NMT) performs relatively well with the figurative expressions of high-resource languages, it often faces challenges when dealing with low-resource languages like Sinhala due to limited available data. To address this limitation, we introduce a corpus of 2,344 Sinhala figures of speech with cultural and cross-lingual annotations. We examine this dataset to classify the cultural origins of the figures of speech and to identify their cross-lingual equivalents. Additionally, we have developed a binary classifier to differentiate between two types of FoS in the dataset, achieving an accuracy rate of approximately 92%. We also evaluate the performance of existing LLMs on this dataset. Our findings reveal significant shortcomings in the current capabilities of LLMs, as these models often struggle to accurately convey idiomatic meanings. By making this dataset publicly available, we offer a crucial benchmark for future research in low-resource NLP and culturally aware machine translation.
摘要：修辞格 (FoS) 由与文化紧密相连的多词短语组成。虽然神经机器翻译（NMT）在处理高资源语言的比喻表达方面表现相对较好，但由于可用数据有限，在处理僧伽罗语等低资源语言时常常面临挑战。为了解决这一限制，我们引入了包含 2,344 个僧伽罗语修辞格的语料库，并带有文化和跨语言注释。我们检查这个数据集，对修辞格的文化起源进行分类，并识别它们的跨语言对应物。此外，我们还开发了一个二元分类器来区分数据集中的两种类型的 FOS，准确率达到约 92%。我们还评估了现有法学硕士在此数据集上的表现。我们的研究结果揭示了法学硕士当前能力的重大缺陷，因为这些模型往往难以准确传达惯用含义。通过公开该数据集，我们为低资源 NLP 和文化感知机器翻译的未来研究提供了重要的基准。

Title: Steer2Edit: From Activation Steering to Component-Level Editing

Authors: Chung-En Sun, Ge Yan, Zimo Wang, Tsui-Wei Weng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09870
Pdf URL: https://arxiv.org/pdf/2602.09870
Copy Paste: [[2602.09870]] Steer2Edit: From Activation Steering to Component-Level Editing(https://arxiv.org/abs/2602.09870)
Keywords: language model, hallucination
Abstract: Steering methods influence Large Language Model behavior by identifying semantic directions in hidden representations, but are typically realized through inference-time activation interventions that apply a fixed, global modification to the model's internal states. While effective, such interventions often induce unfavorable attribute-utility trade-offs under strong control, as they ignore the fact that many behaviors are governed by a small and heterogeneous subset of model components. We propose Steer2Edit, a theoretically grounded, training-free framework that transforms steering vectors from inference-time control signals into diagnostic signals for component-level rank-1 weight editing. Instead of uniformly injecting a steering direction during generation, Steer2Edit selectively redistributes behavioral influence across individual attention heads and MLP neurons, yielding interpretable edits that preserve the standard forward pass and remain compatible with optimized parallel inference. Across safety alignment, hallucination mitigation, and reasoning efficiency, Steer2Edit consistently achieves more favorable attribute-utility trade-offs: at matched downstream performance, it improves safety by up to 17.2%, increases truthfulness by 9.8%, and reduces reasoning length by 12.2% on average. Overall, Steer2Edit provides a principled bridge between representation steering and weight editing by translating steering signals into interpretable, training-free parameter updates.
摘要：引导方法通过识别隐藏表示中的语义方向来影响大型语言模型的行为，但通常是通过推理时激活干预来实现的，这些干预对模型的内部状态应用固定的全局修改。虽然有效，但此类干预措施通常会在强控制下引起不利的属性-效用权衡，因为它们忽略了许多行为是由模型组件的小而异构的子集控制的事实。我们提出了 Steer2Edit，这是一个有理论依据的、免训练的框架，它将引导向量从推理时间控制信号转换为组件级 1 级权重编辑的诊断信号。 Steer2Edit 不是在生成过程中统一注入转向方向，而是有选择地在各个注意头和 MLP 神经元之间重新分配行为影响，产生可解释的编辑，保留标准前向传递并与优化的并行推理保持兼容。在安全对齐、幻觉缓解和推理效率方面，Steer2Edit 始终实现了更有利的属性与效用权衡：在匹配的下游性能下，它的安全性提高了 17.2%，真实性提高了 9.8%，推理长度平均减少了 12.2%。总体而言，Steer2Edit 通过将转向信号转换为可解释的、免训练的参数更新，在表示转向和权重编辑之间提供了一个原则性的桥梁。
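
Illustration (not from the paper): a rank-1 weight edit replaces a weight matrix W with W + alpha * u v^T, so the edited layer still runs as an ordinary matrix product. The sketch below shows only this generic operation in plain Python; how Steer2Edit chooses u, v, and the per-component strength is not described in the abstract, and the function names are hypothetical.

```python
def rank1_edit(W, u, v, alpha=1.0):
    """Return W + alpha * outer(u, v) as a new list of rows (W is not modified)."""
    if len(W) != len(u) or any(len(row) != len(v) for row in W):
        raise ValueError("shape mismatch: W must be len(u) x len(v)")
    return [[w + alpha * ui * vj for w, vj in zip(row, v)] for row, ui in zip(W, u)]


def matvec(W, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


if __name__ == "__main__":
    W = [[1.0, 0.0], [0.0, 1.0]]
    u, v, x = [1.0, 2.0], [3.0, 4.0], [1.0, 1.0]
    edited = rank1_edit(W, u, v)
    # The edited output equals W x plus (v . x) times u.
    print(matvec(edited, x))
```

Since the update is folded into the weights, inference needs no extra hook or activation intervention, which is consistent with the abstract's point about preserving the standard forward pass.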

Title: The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies

Authors: Chenxu Wang, Chaozhuo Li, Songyang Liu, Zejian Chen, Jinyu Hou, Ji Qi, Rui Li, Litian Zhang, Qiwei Ye, Zheng Liu, Xu Chen, Xi Zhang, Philip S. Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.09877
Pdf URL: https://arxiv.org/pdf/2602.09877
Copy Paste: [[2602.09877]] The Devil Behind Moltbook: Anthropic Safety is Always Vanishing in Self-Evolving AI Societies(https://arxiv.org/abs/2602.09877)
Keywords: language model, llm, agent
Abstract: The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment--a combination we term the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance is impossible. Drawing on an information-theoretic framework, we formalize safety as the degree of divergence from anthropic value distributions. We theoretically demonstrate that isolated self-evolution induces statistical blind spots, leading to the irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena that align with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
摘要：由大型语言模型（LLM）构建的多智能体系统的出现为可扩展的集体智能和自我进化提供了一个有前途的范例。理想情况下，此类系统将在完全闭环中实现持续的自我改进，同时保持强大的安全一致性——我们将这种组合称为自我进化三难困境。然而，我们从理论和经验上证明，满足持续自我进化、完全隔离和安全不变性的智能体社会是不可能的。借鉴信息论框架，我们将安全性形式化为与人类价值分布的分歧程度。我们从理论上证明，孤立的自我进化会导致统计盲点，从而导致系统安全性的不可逆转的退化。来自开放式代理社区（Moltbook）和两个封闭的自我进化系统的经验和定性结果揭示了与我们对不可避免的安全侵蚀的理论预测相一致的现象。我们进一步提出了几个解决方案方向，以减轻已确定的安全问题。我们的工作对自我进化的人工智能社会建立了基本限制，并将讨论从症状驱动的安全补丁转变为对内在动态风险的原则性理解，强调了外部监督或新颖的安全保护机制的必要性。
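
Illustration (not from the paper): the abstract formalizes safety as a divergence from a reference value distribution but does not name the divergence. A minimal sketch, assuming Kullback-Leibler divergence over a small discrete set of behaviours; the function name and the example distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions over the same outcomes."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi <= 0:
                raise ValueError("q must be positive wherever p is positive")
            total += pi * math.log(pi / qi)
    return total


if __name__ == "__main__":
    reference = [0.9, 0.1]  # assumed reference distribution over (safe, unsafe)
    drifted = [0.5, 0.5]    # assumed agent-society distribution after drift
    print(kl_divergence(drifted, reference))
```

Under this reading, a value of zero means the society's behaviour distribution matches the reference, and larger values indicate more drift.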

Title: AmharicIR+Instr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning

Authors: Tilahun Yeshambel, Moncef Garouani, Josiane Mothe
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.09914
Pdf URL: https://arxiv.org/pdf/2602.09914
Copy Paste: [[2602.09914]] AmharicIR+Instr: A Two-Dataset Resource for Neural Retrieval and Instruction Tuning(https://arxiv.org/abs/2602.09914)
Keywords: gpt, llm, prompt
Abstract: Neural retrieval and GPT-style generative models rely on large, high-quality supervised data, which is still scarce for low-resource languages such as Amharic. We release an Amharic data resource consisting of two datasets that supports research on (i) neural retrieval-ranking and (ii) instruction-following text generation. The retrieval-ranking dataset contains 1,091 manually verified query-positive-negative document triplets drawn from diverse Amharic sources and constructed to support contrastive training and benchmarking of neural retrievers (e.g., DPR, ColBERT-style late interaction and SPLADE-style sparse neural retrieval). Triplets are created through a combination of expert-curated queries, web-derived queries, and LLM-assisted generation, with positive/negative documents selected from the web or synthesized by LLMs and then validated by native speakers. The instruction prompt-response dataset comprises 6,285 Amharic prompt-response pairs spanning multiple domains and instruction types, generated with several LLMs and refined through manual review and correction for grammaticality, relevance, fluency, and factual plausibility. We release both datasets with standardized splits and formats (CSV, JSON, JSONL) to enable reproducible work on Amharic retrieval, ranking, and generative modelling. These datasets also come with a methodology that can be generalized to other low-resource languages.
摘要：神经检索和 GPT 式生成模型依赖于大量高质量的监督数据，这对于阿姆哈拉语等资源匮乏的语言来说仍然很稀缺。我们发布了一个阿姆哈拉语数据资源，由两个数据集组成，支持 (i) 神经检索排序和 (ii) 指令跟踪文本生成的研究。检索排名数据集包含 1,091 个手动验证的查询正负文档三元组，这些三元组来自不同的阿姆哈拉语来源，并构建为支持神经检索器的对比训练和基准测试（例如，DPR、ColBERT 式后期交互和 SPLADE 式稀疏神经检索）。三元组是通过专家策划的查询、网络衍生的查询和法学硕士辅助生成的组合来创建的，其中正/反文档从网络中选择或由法学硕士合成，然后由母语人士验证。指令提示-响应数据集包含 6,285 个跨越多个领域和指令类型的阿姆哈拉语提示-响应对，由多个法学硕士生成，并通过人工审查和纠正语法、相关性、流畅性和事实合理性进行完善。我们以标准化的分割和格式（CSV、JSON、JSONL）发布这两个数据集，以实现阿姆哈拉语检索、排名和生成建模的可重复工作。这些数据集还附带了一种可以推广到其他低资源语言的方法。

Title: LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations

Authors: William Lugoloobi, Thomas Foster, William Bankes, Chris Russell
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.09924
Pdf URL: https://arxiv.org/pdf/2602.09924
Copy Paste: [[2602.09924]] LLMs Encode Their Failures: Predicting Success from Pre-Generation Activations(https://arxiv.org/abs/2602.09924)
Keywords: llm
Abstract: Running LLMs with extended reasoning on every problem is expensive, but determining which inputs actually require additional compute remains challenging. We investigate whether their own likelihood of success is recoverable from their internal representations before generation, and if this signal can guide more efficient inference. We train linear probes on pre-generation activations to predict policy-specific success on math and coding tasks, substantially outperforming surface features such as question length and TF-IDF. Using E2H-AMC, which provides both human and model performance on identical problems, we show that models encode a model-specific notion of difficulty that is distinct from human difficulty, and that this distinction increases with extended reasoning. Leveraging these probes, we demonstrate that routing queries across a pool of models can exceed the best-performing model whilst reducing inference cost by up to 70% on MATH, showing that internal representations enable practical efficiency gains even when they diverge from human intuitions about difficulty. Our code is available at: this https URL
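Illustrative sketch (not from the paper): one simple way a probe's output could drive routing is to send each query to the cheapest model whose predicted success probability clears a threshold. The logistic form of the probe, the model names, the costs, the weights, and the "cheapest above threshold" rule are all assumptions chosen for illustration; the abstract does not specify the authors' routing policy.

```python
import math
from dataclasses import dataclass


def probe_probability(activation, weights, bias):
    """Linear probe with a sigmoid: P(success) = sigmoid(w . h + b)."""
    z = sum(w * h for w, h in zip(weights, activation)) + bias
    return 1.0 / (1.0 + math.exp(-z))


@dataclass(frozen=True)
class Candidate:
    name: str         # hypothetical model identifier
    cost: float       # assumed relative inference cost per query
    p_success: float  # probe-estimated probability of solving this query


def route(candidates, threshold=0.7):
    """Pick the cheapest model whose predicted success clears the threshold.

    Falls back to the model with the highest predicted success if none does.
    """
    confident = [c for c in candidates if c.p_success >= threshold]
    if confident:
        return min(confident, key=lambda c: c.cost)
    return max(candidates, key=lambda c: c.p_success)


# Toy probe weights and a toy pre-generation activation (made-up values).
p_small = probe_probability([1.0, 2.0], weights=[0.5, 1.0], bias=0.0)

easy_query = [
    Candidate("small", cost=1.0, p_success=p_small),
    Candidate("large-reasoning", cost=10.0, p_success=0.97),
]
hard_query = [
    Candidate("small", cost=1.0, p_success=0.15),
    Candidate("large-reasoning", cost=10.0, p_success=0.60),
]

easy_choice = route(easy_query).name
hard_choice = route(hard_query).name
```

Here the easy query goes to the cheap model because its probe score is above the threshold, while the hard query falls back to the model with the best predicted chance of success.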

Title: A Unified Assessment of the Poverty of the Stimulus Argument for Neural Language Models

Authors: Xiulin Yang, Arianna Bisazza, Nathan Schneider, Ethan Gotlieb Wilcox
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.09992
Pdf URL: https://arxiv.org/pdf/2602.09992
Copy Paste: [[2602.09992]] A Unified Assessment of the Poverty of the Stimulus Argument for Neural Language Models(https://arxiv.org/abs/2602.09992)
Keywords: language model
Abstract: How can children acquire native-level syntax from limited input? According to the Poverty of the Stimulus Hypothesis (PoSH), the linguistic input children receive is insufficient to explain certain generalizations that are robustly learned; innate linguistic constraints, many have argued, are thus necessary to explain language learning. Neural language models, which lack such language-specific constraints in their design, offer a computational test of this longstanding (but controversial) claim. We introduce \poshbench, a training-and-evaluation suite targeting question formation, islands to movement, and other English phenomena at the center of the PoSH arguments. Training Transformer models on 10-50M words of developmentally plausible text, we find indications of generalization on all phenomena even without direct positive evidence, yet neural models remain less data-efficient and their generalizations are weaker than those of children. We further enhance our models with three recently proposed cognitively motivated inductive biases. We find these biases improve general syntactic competence but not \poshbench performance. Our findings challenge the claim that innate syntax is the only possible route to generalization, while suggesting that human-like data efficiency requires inductive biases beyond those tested here.

Title: SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation

Authors: Homaira Huda Shomee, Rochana Chaturvedi, Yangxinyu Xie, Tanwi Mallick
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10017
Pdf URL: https://arxiv.org/pdf/2602.10017
Copy Paste: [[2602.10017]] SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation(https://arxiv.org/abs/2602.10017)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) are increasingly used to support question answering and decision-making in high-stakes, domain-specific settings such as natural hazard response and infrastructure planning, where effective answers must convey fine-grained, decision-critical details. However, existing evaluation frameworks for retrieval-augmented generation (RAG) and open-ended question answering primarily rely on surface-level similarity, factual consistency, or semantic relevance, and often fail to assess whether responses provide the specific information required for domain-sensitive decisions. To address this gap, we propose a multi-dimensional, reference-free evaluation framework that assesses LLM outputs along four complementary dimensions: specificity, robustness to paraphrasing and semantic perturbations, answer relevance, and context utilization. We introduce a curated dataset of 1,412 domain-specific question-answer pairs spanning 40 professional roles and seven natural hazard types to support systematic evaluation. We further conduct human evaluation to assess inter-annotator agreement and alignment between model outputs and human judgments, which highlights the inherent subjectivity of open-ended, domain-specific evaluation. Our results show that no single metric sufficiently captures answer quality in isolation and demonstrate the need for structured, multi-metric evaluation frameworks when deploying LLMs in high-stakes applications.

Title: Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference

Authors: Wenxuan Xie, Yujia Wang, Xin Tan, Chaochao Lu, Xia Hu, Xuhong Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.10021
Pdf URL: https://arxiv.org/pdf/2602.10021
Copy Paste: [[2602.10021]] Decoupled Reasoning with Implicit Fact Tokens (DRIFT): A Dual-Model Framework for Efficient Long-Context Inference(https://arxiv.org/abs/2602.10021)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: The integration of extensive, dynamic knowledge into Large Language Models (LLMs) remains a significant challenge due to the inherent entanglement of factual data and reasoning patterns. Existing solutions, ranging from non-parametric Retrieval-Augmented Generation (RAG) to parametric knowledge editing, are often constrained in practice by finite context windows, retriever noise, or the risk of catastrophic forgetting. In this paper, we propose DRIFT, a novel dual-model architecture designed to explicitly decouple knowledge extraction from the reasoning process. Unlike static prompt compression, DRIFT employs a lightweight knowledge model to dynamically compress document chunks into implicit fact tokens conditioned on the query. These dense representations are projected into the reasoning model's embedding space, replacing raw, redundant text while maintaining inference accuracy. Extensive experiments show that DRIFT significantly improves performance on long-context tasks, outperforming strong baselines among comparably sized models. Our approach provides a scalable and efficient paradigm for extending the effective context window and reasoning capabilities of LLMs. Our code is available at this https URL.

Title: Anagent For Enhancing Scientific Table & Figure Analysis

Authors: Xuehang Guo, Zhiyong Lu, Tom Hope, Qingyun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.10081
Pdf URL: https://arxiv.org/pdf/2602.10081
Copy Paste: [[2602.10081]] Anagent For Enhancing Scientific Table & Figure Analysis(https://arxiv.org/abs/2602.10081)
Keywords: agent
Abstract: In scientific research, analysis requires accurately interpreting complex multimodal knowledge, integrating evidence from different sources, and drawing inferences grounded in domain-specific knowledge. However, current artificial intelligence (AI) systems struggle to consistently demonstrate such capabilities. The complexity and variability of scientific tables and figures, combined with heterogeneous structures and long-context requirements, pose fundamental obstacles to scientific table & figure analysis. To quantify these challenges, we introduce AnaBench, a large-scale benchmark featuring 63,178 instances from nine scientific domains, systematically categorized along seven complexity dimensions. To tackle these challenges, we propose Anagent, a multi-agent framework for enhanced scientific table & figure analysis through four specialized agents: Planner decomposes tasks into actionable subtasks, Expert retrieves task-specific information through targeted tool execution, Solver synthesizes information to generate coherent analysis, and Critic performs iterative refinement through five-dimensional quality assessment. We further develop modular training strategies that leverage supervised finetuning and specialized reinforcement learning to optimize individual capabilities while maintaining effective collaboration. Comprehensive evaluation across 170 subdomains demonstrates that Anagent achieves substantial improvements, up to 13.43% in training-free settings and 42.12% with finetuning, while revealing that task-oriented reasoning and context-aware problem-solving are essential for high-quality scientific table & figure analysis. Our project page: this https URL.

Title: Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing

Authors: Mohamed Afane, Kayla Laufer, Wenqi Wei, Ying Mao, Junaid Farooq, Ying Wang, Juntao Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.10092
Pdf URL: https://arxiv.org/pdf/2602.10092
Copy Paste: [[2602.10092]] Quantum-Audit: Evaluating the Reasoning Limits of LLMs on Quantum Computing(https://arxiv.org/abs/2602.10092)
Keywords: language model, llm
Abstract: Language models have become practical tools for quantum computing education and research, from summarizing technical papers to explaining theoretical concepts and answering questions about recent developments in the field. While existing benchmarks evaluate quantum code generation and circuit design, models' understanding of quantum computing concepts has not been systematically measured. Quantum-Audit addresses this gap with 2,700 questions covering core quantum computing topics. We evaluate 26 models from leading organizations. Our benchmark comprises 1,000 expert-written questions, 1,000 questions extracted from research papers using LLMs and validated by experts, plus an additional 700 questions including 350 open-ended questions and 350 questions with false premises to test whether models can correct erroneous assumptions. Human participants scored between 23% and 86%, with experts averaging 74%. Top-performing models exceeded the expert average, with Claude Opus 4.5 reaching 84% accuracy, though top models showed an average 12-point accuracy drop on expert-written questions compared to LLM-generated ones. Performance declined further on advanced topics, dropping to 73% on security questions. Additionally, models frequently accepted and reinforced false premises embedded in questions instead of identifying them, with accuracy below 66% on these critical reasoning tasks.