2026-02-04

Title: The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders

Authors: Shikhar Shiromani, Archie Chaudhury, Sri Pranav Kunda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02496
Pdf URL: https://arxiv.org/pdf/2602.02496
Copy Paste: [[2602.02496]] The Hypocrisy Gap: Quantifying Divergence Between Internal Belief and Chain-of-Thought Explanation via Sparse Autoencoders(https://arxiv.org/abs/2602.02496)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) frequently exhibit unfaithful behavior, producing a final answer that differs significantly from their internal chain of thought (CoT) reasoning in order to appease the user they are conversing with. In order to better detect this behavior, we introduce the Hypocrisy Gap, a mechanistic metric utilizing Sparse Autoencoders (SAEs) to quantify the divergence between a model's internal reasoning and its final generation. By mathematically comparing an internal truth belief, derived via sparse linear probes, to the final generated trajectory in latent space, we quantify and detect a model's tendency to engage in unfaithful behavior. Experiments on Gemma, Llama, and Qwen models using Anthropic's Sycophancy benchmark show that our method achieves an AUROC of 0.55-0.73 for detecting sycophantic runs and 0.55-0.74 for hypocritical cases where the model internally "knows" the user is wrong, consistently outperforming a decision-aligned log-probability baseline (0.41-0.50 AUROC).
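
The AUROC figures quoted above summarize how well a scalar detection score separates flagged runs from clean ones. As a minimal sketch of that evaluation step only (not the authors' pipeline; the scores and labels below are made up for illustration), AUROC can be computed as the fraction of positive/negative pairs that the score ranks correctly:

```python
def auroc(scores, labels):
    """Rank-based AUROC: probability that a positive outscores a negative.

    Ties count as half a win. `scores` stands in for any per-run detection
    score (hypothetical here); `labels` are 1 for flagged runs, 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative values only, not results from the paper.
example = auroc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

A value of 0.5 corresponds to chance-level ranking, which is the reference point for reading the 0.41-0.50 baseline range.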

Title: STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models

Authors: Xuzhao Li, Xuchen Li, Jian Zhao, Shiyu Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02497
Pdf URL: https://arxiv.org/pdf/2602.02497
Copy Paste: [[2602.02497]] STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models(https://arxiv.org/abs/2602.02497)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) achieve significant breakthroughs in complex reasoning tasks, evaluating their proficiency in science, technology, engineering, and mathematics (STEM) has become a primary method for measuring machine intelligence. However, current evaluation paradigms often treat benchmarks as isolated "silos," offering only monolithic aggregate scores that neglect the intricacies of both academic specialization and cognitive depth. This result-oriented approach fails to distinguish whether model errors stem from insufficient domain knowledge or deficiencies in cognitive capacity, thereby limiting the diagnostic value. To address this, we propose STEMVerse, a diagnostic framework designed to systematically analyze the STEM reasoning capabilities of LLMs. This framework characterizes model performance across academic specialization and cognitive complexity to map the capability required for reasoning. We re-aggregate over 20,000 STEM problems from mainstream benchmarks into a unified "Discipline $\times$ Cognition" capability space, assigning dual-axis labels to every instance. Utilizing this unified diagnostic framework, we systematically evaluate representative LLM families across varying parameter scales and training paradigms. Our empirical results reveal structural failure patterns in STEM reasoning. By integrating multi-disciplinary coverage and fine-grained cognitive stratification into a unified framework, STEMVerse provides a clear and actionable perspective for understanding the scientific reasoning characteristics of LLMs.

Title: Test-Time Detoxification without Training or Learning Anything

Authors: Baturay Saglam, Dionysis Kalogerias
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.02498
Pdf URL: https://arxiv.org/pdf/2602.02498
Copy Paste: [[2602.02498]] Test-Time Detoxification without Training or Learning Anything(https://arxiv.org/abs/2602.02498)
Keywords: language model, prompt
Abstract: Large language models can produce toxic or inappropriate text even for benign inputs, creating risks when deployed at scale. Detoxification is therefore important for safety and user trust, particularly when we want to reduce harmful content without sacrificing the model's generation quality. Many existing approaches rely on model retraining, gradients, or learned auxiliary components, which can be costly and may not transfer across model families or to truly black-box settings. We introduce a test-time procedure that approximates the gradient of completion toxicity with respect to the input embeddings and uses a small number of descent steps to steer generation toward less toxic continuations. This is achieved with zeroth-order optimization that requires only access to input embeddings, a toxicity scoring function, and forward evaluations of the model. Empirically, the approach delivers robust toxicity reductions across models and prompts and, in most settings, achieves the best overall toxicity-quality trade-off. More broadly, our work positions word embeddings as effective control variables and encourages wider use of black-box optimization to guide autoregressive language models toward scalable, safer text generation, without requiring any training or access to intermediate computations.
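
The zeroth-order idea in this abstract can be sketched in a few lines: estimate a gradient from function evaluations alone, then take a few descent steps. The sketch below uses a standard two-point random-direction estimator on a toy objective. The function `toy_toxicity`, the step sizes, and the number of directions are all assumptions for illustration; the paper's actual scorer, estimator, and hyperparameters are not given in the abstract.

```python
import random


def zo_gradient(f, x, num_dirs=16, mu=1e-2, rng=None):
    """Two-point zeroth-order gradient estimate using Gaussian directions.

    Only forward evaluations of f are used; no analytic gradient is needed.
    """
    rng = rng or random.Random(0)
    grad = [0.0] * len(x)
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Finite-difference slope of f along direction u.
        slope = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        for i, ui in enumerate(u):
            grad[i] += slope * ui / num_dirs
    return grad


def zo_descent(f, x0, steps=50, lr=0.1, num_dirs=16, mu=1e-2, seed=0):
    """Run a small number of descent steps with the estimated gradient."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        g = zo_gradient(f, x, num_dirs=num_dirs, mu=mu, rng=rng)
        x = [xi - lr * gi for xi, gi in zip(x, g)]
    return x


def toy_toxicity(embedding):
    """Hypothetical stand-in for a toxicity score of the model's completion."""
    return sum((e - 1.0) ** 2 for e in embedding)


start = [0.0, 0.0, 0.0, 0.0]
steered = zo_descent(toy_toxicity, start)
```

In a real setting each call to `f` would involve a model forward pass plus a toxicity scorer, so the number of directions and steps directly controls the inference cost.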

Title: ROSA-Tuning: Enhancing Long-Context Modeling via Suffix Matching

Authors: Yunao Zheng, Xiaojie Wang, Lei Ren, Wei Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02499
Pdf URL: https://arxiv.org/pdf/2602.02499
Copy Paste: [[2602.02499]] ROSA-Tuning: Enhancing Long-Context Modeling via Suffix Matching(https://arxiv.org/abs/2602.02499)
Keywords: language model, long context
Abstract: Long-context capability and computational efficiency are among the central challenges facing today's large language models. Existing efficient attention methods reduce computational complexity, but they typically suffer from a limited coverage of the model state. This paper proposes ROSA-Tuning, a retrieval-and-recall mechanism for enhancing the long-context modeling ability of pretrained models. Beyond the standard attention mechanism, ROSA-Tuning introduces in parallel a CPU-based ROSA (RWKV Online Suffix Automaton) retrieval module, which efficiently locates historical positions in long contexts that are relevant to the current query, and injects the retrieved information into the model state in a trainable manner; subsequent weighted fusion can then be handled by range-restricted attention. To enable end-to-end training, we design a binary discretization strategy and a counterfactual gradient algorithm, and further optimize overall execution efficiency via an asynchronous CPU-GPU pipeline. Systematic evaluations on Qwen3-Base-1.7B show that ROSA-Tuning substantially restores the long-context modeling ability of windowed-attention models, achieving performance close to and in some cases matching global attention on benchmarks such as LongBench, while maintaining computational efficiency and GPU memory usage that are nearly comparable to windowed-attention methods, offering a new technical path for efficient long-context processing. The example code can be found at this https URL.
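
To make the suffix-matching retrieval step concrete, here is a deliberately naive sketch: for the current end of the sequence, find the earlier position whose preceding tokens share the longest suffix with the current context. This quadratic scan is a stand-in chosen for readability; it is not the suffix-automaton implementation the abstract refers to, and the function name and `max_len` cap are assumptions.

```python
def longest_suffix_match(tokens, max_len=32):
    """Return (position, match_length) for the best earlier suffix match.

    `position` is an index j < len(tokens) such that tokens[:j] ends with the
    longest suffix also found at the end of tokens; tokens[j] is then the
    token that followed that earlier occurrence. Returns (None, 0) if no
    earlier token matches the last token.
    """
    n = len(tokens)
    best_pos, best_len = None, 0
    for j in range(1, n):
        k = 0
        # Walk backwards from both ends while the tokens agree.
        while k < min(j, max_len) and tokens[j - 1 - k] == tokens[n - 1 - k]:
            k += 1
        if k > best_len:
            best_pos, best_len = j, k
    return best_pos, best_len


history = list("abcXabc")
pos, length = longest_suffix_match(history)
```

Here the context ends in "abc", which also appears at the start of the history, so the match points at the token that followed it earlier. A suffix automaton answers the same query online without rescanning the history.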

Title: Graph-Augmented Reasoning with Large Language Models for Tobacco Pest and Disease Management

Authors: Siyu Li, Chenwei Song, Qi Zhou, Wan Zhou, Xinyi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02635
Pdf URL: https://arxiv.org/pdf/2602.02635
Copy Paste: [[2602.02635]] Graph-Augmented Reasoning with Large Language Models for Tobacco Pest and Disease Management(https://arxiv.org/abs/2602.02635)
Keywords: language model, llm, chat
Abstract: This paper proposes a graph-augmented reasoning framework for tobacco pest and disease management that integrates structured domain knowledge into large language models. Building on GraphRAG, we construct a domain-specific knowledge graph and retrieve query-relevant subgraphs to provide relational evidence during answer generation. The framework adopts ChatGLM as the Transformer backbone with LoRA-based parameter-efficient fine-tuning, and employs a graph neural network to learn node representations that capture symptom-disease-treatment dependencies. By explicitly modeling diseases, symptoms, pesticides, and control measures as linked entities, the system supports evidence-aware retrieval beyond surface-level text similarity. Retrieved graph evidence is incorporated into the LLM input to guide generation toward domain-consistent recommendations and to mitigate hallucinated or inappropriate treatments. Experimental results show consistent improvements over text-only baselines, with the largest gains observed on multi-hop and comparative reasoning questions that require chaining multiple relations.

Title: WideSeek: Advancing Wide Research via Multi-Agent Scaling

Authors: Ziyang Huang, Haolin Ren, Xiaowei Yuan, Jiawei Wang, Zhongtao Jiang, Kun Xu, Shizhu He, Jun Zhao, Kang Liu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2602.02636
Pdf URL: https://arxiv.org/pdf/2602.02636
Copy Paste: [[2602.02636]] WideSeek: Advancing Wide Research via Multi-Agent Scaling(https://arxiv.org/abs/2602.02636)
Keywords: agent
Abstract: Search intelligence is evolving from Deep Research to Wide Research, a paradigm essential for retrieving and synthesizing comprehensive information under complex constraints in parallel. However, progress in this field is impeded by the lack of dedicated benchmarks and optimization methodologies for search breadth. To address these challenges, we take a deep dive into Wide Research from two perspectives: Data Pipeline and Agent Optimization. First, we produce WideSeekBench, a General Broad Information Seeking (GBIS) benchmark constructed via a rigorous multi-phase data pipeline to ensure diversity across the target information volume, logical constraints, and domains. Second, we introduce WideSeek, a dynamic hierarchical multi-agent architecture that can autonomously fork parallel sub-agents based on task requirements. Furthermore, we design a unified training framework that linearizes multi-agent trajectories and optimizes the system using end-to-end RL. Experimental results demonstrate the effectiveness of WideSeek and multi-agent RL, highlighting that scaling the number of agents is a promising direction for advancing the Wide Research paradigm.

Title: Monotonicity as an Architectural Bias for Robust Language Models

Authors: Patrick Cooper, Alireza Nadali, Ashutosh Trivedi, Alvaro Velasquez
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2602.02686
Pdf URL: https://arxiv.org/pdf/2602.02686
Copy Paste: [[2602.02686]] Monotonicity as an Architectural Bias for Robust Language Models(https://arxiv.org/abs/2602.02686)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are known to exhibit brittle behavior under adversarial prompts and jailbreak attacks, even after extensive alignment and fine-tuning. This fragility reflects a broader challenge of modern neural language models: small, carefully structured perturbations in high-dimensional input spaces can induce large and unpredictable changes in internal semantic representations and output. We investigate monotonicity as an architectural inductive bias for improving the robustness of Transformer-based language models. Monotonicity constrains semantic transformations so that strengthening information, evidence, or constraints cannot lead to regressions in the corresponding internal representations. Such order-preserving behavior has long been exploited in control and safety-critical systems to simplify reasoning and improve robustness, but has traditionally been viewed as incompatible with the expressivity required by neural language models. We show that this trade-off is not inherent. By enforcing monotonicity selectively in the feed-forward sublayers of sequence-to-sequence Transformers -- while leaving attention mechanisms unconstrained -- we obtain monotone language models that preserve the performance of their pretrained counterparts. This architectural separation allows negation, contradiction, and contextual interactions to be introduced explicitly through attention, while ensuring that subsequent semantic refinement is order-preserving. Empirically, monotonicity substantially improves robustness: adversarial attack success rates drop from approximately 69% to 19%, while standard summarization performance degrades only marginally.

Title: InfMem: Learning System-2 Memory Control for Long-Context Agent

Authors: Xinyu Wang, Mingze Li, Peng Lu, Xiao-Wen Chang, Lifeng Shang, Jinping Li, Fei Mi, Prasanna Parthasarathi, Yufei Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02704
Pdf URL: https://arxiv.org/pdf/2602.02704
Copy Paste: [[2602.02704]] InfMem: Learning System-2 Memory Control for Long-Context Agent(https://arxiv.org/abs/2602.02704)
Keywords: agent
Abstract: Reasoning over ultra-long documents requires synthesizing sparse evidence scattered across distant segments under strict memory constraints. While streaming agents enable scalable processing, their passive memory update strategy often fails to preserve low-salience bridging evidence required for multi-hop reasoning. We propose InfMem, a control-centric agent that instantiates System-2-style control via a PreThink-Retrieve-Write protocol. InfMem actively monitors evidence sufficiency, performs targeted in-document retrieval, and applies evidence-aware joint compression to update a bounded memory. To ensure reliable control, we introduce a practical SFT-to-RL training recipe that aligns retrieval, writing, and stopping decisions with end-task correctness. On ultra-long QA benchmarks from 32k to 1M tokens, InfMem consistently outperforms MemAgent across backbones. Specifically, InfMem improves average absolute accuracy by +10.17, +11.84, and +8.23 points on Qwen3-1.7B, Qwen3-4B, and Qwen2.5-7B, respectively, while reducing inference time by $3.9\times$ on average (up to $5.1\times$) via adaptive early stopping.

Title: Predicting first-episode homelessness among US Veterans using longitudinal EHR data: time-varying models and social risk factors

Authors: Rohan Pandey, Haijuan Yan, Hong Yu, Jack Tsai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02731
Pdf URL: https://arxiv.org/pdf/2602.02731
Copy Paste: [[2602.02731]] Predicting first-episode homelessness among US Veterans using longitudinal EHR data: time-varying models and social risk factors(https://arxiv.org/abs/2602.02731)
Keywords: language model, llm
Abstract: Homelessness among US veterans remains a critical public health challenge, yet risk prediction offers a pathway for proactive intervention. In this retrospective prognostic study, we analyzed electronic health record (EHR) data from 4,276,403 Veterans Affairs patients during a 2016 observation period to predict first-episode homelessness occurring 3-12 months later in 2017 (prevalence: 0.32-1.19%). We constructed static and time-varying EHR representations, utilizing clinician-informed logic to model the persistence of clinical conditions and social risks over time. We then compared the performance of classical machine learning, transformer-based masked language models, and fine-tuned large language models (LLMs). We demonstrate that incorporating social and behavioral factors into longitudinal models improved precision-recall area under the curve (PR-AUC) by 15-30%. In the top 1% risk tier, models yielded positive predictive values ranging from 3.93-4.72% at 3 months, 7.39-8.30% at 6 months, 9.84-11.41% at 9 months, and 11.65-13.80% at 12 months across model architectures. Large language models underperformed encoder-based models on discrimination but showed smaller performance disparities across racial groups. These results demonstrate that longitudinal, socially informed EHR modeling concentrates homelessness risk into actionable strata, enabling targeted and data-informed prevention strategies for at-risk veterans.
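
The "top 1% risk tier" figures are positive predictive values: among the patients a model ranks highest, the share who actually experience the outcome. A minimal sketch of that calculation follows, using synthetic scores and labels invented for illustration (no data from the study):

```python
def ppv_at_top_fraction(scores, labels, fraction=0.01):
    """Positive predictive value among the highest-scoring `fraction` of cases."""
    n_top = max(1, round(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:n_top]
    return sum(label for _, label in top) / n_top


# Synthetic cohort of 1,000 cases; the five highest-risk cases are positives.
scores = [i / 1000 for i in range(1000)]
labels = [1 if i >= 995 else 0 for i in range(1000)]
ppv = ppv_at_top_fraction(scores, labels, fraction=0.01)
```

With an outcome prevalence near 1%, a PPV of 10% in the top tier means the flagged group is roughly ten times more concentrated in cases than the population as a whole.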

Title: From Task Solving to Robust Real-World Adaptation in LLM Agents

Authors: Pouya Pezeshkpour, Estevam Hruschka
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.02760
Pdf URL: https://arxiv.org/pdf/2602.02760
Copy Paste: [[2602.02760]] From Task Solving to Robust Real-World Adaptation in LLM Agents(https://arxiv.org/abs/2602.02760)
Keywords: language model, llm, agent
Abstract: Large language models are increasingly deployed as specialized agents that plan, call tools, and take actions over extended horizons. Yet many existing evaluations assume a "clean interface" where dynamics are specified and stable, tools and sensors are reliable, and success is captured by a single explicit objective, often overestimating real-world readiness. In practice, agents face underspecified rules, unreliable signals, shifting environments, and implicit, multi-stakeholder goals. The challenge is therefore not just solving tasks, but adapting while solving: deciding what to trust, what is wanted, when to verify, and when to fall back or escalate. We stress-test deployment-relevant robustness under four operational circumstances: partial observability, dynamic environments, noisy signals, and dynamic agent state. We benchmark agentic LLMs in a grid-based game with a simple goal but long-horizon execution. Episodes violate clean-interface assumptions yet remain solvable, forcing agents to infer rules, pay for information, adapt to environmental and internal shifts, and act cautiously under noise. Across five state-of-the-art LLM agents, we find large gaps between nominal task-solving and deployment-like robustness. Performance generally degrades as grid size and horizon increase, but rankings are unstable: weaker models can beat stronger ones when strategy matches the uncertainty regime. Despite no explicit instruction, agents trade off completion, efficiency, and penalty avoidance, suggesting partial objective inference. Ablations and feature analyses reveal model-specific sensitivities and failure drivers, motivating work on verification, safe action selection, and objective inference under partial observability, noise, and non-stationarity.

Title: AmharicStoryQA: A Multicultural Story Question Answering Benchmark in Amharic

Authors: Israel Abebe Azime, Abenezer Kebede Angamo, Hana Mekonen Tamiru, Dagnachew Mekonnen Marilign, Philipp Slusallek, Seid Muhie Yimam, Dietrich Klakow
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02774
Pdf URL: https://arxiv.org/pdf/2602.02774
Copy Paste: [[2602.02774]] AmharicStoryQA: A Multicultural Story Question Answering Benchmark in Amharic(https://arxiv.org/abs/2602.02774)
Keywords: language model, llm
Abstract: With the growing emphasis on multilingual and cultural evaluation benchmarks for large language models, language and culture are often treated as synonymous, and performance is commonly used as a proxy for a model's understanding of a given language. In this work, we argue that such evaluations overlook meaningful cultural variation that exists within a single language. We address this gap by focusing on narratives from different regions of Ethiopia and demonstrate that, despite shared linguistic characteristics, region-specific and domain-specific content substantially influences language evaluation outcomes. To this end, we introduce AmharicStoryQA, a long-sequence story question answering benchmark grounded in culturally diverse narratives from Amharic-speaking regions. Using this benchmark, we reveal a significant narrative understanding gap in existing LLMs, highlight pronounced regional differences in evaluation results, and show that supervised fine-tuning yields uneven improvements across regions and evaluation settings. Our findings emphasize the need for culturally grounded benchmarks that go beyond language-level evaluation to more accurately assess and improve narrative understanding in low-resource languages.

Title: R2-Router: A New Paradigm for LLM Routing with Reasoning

Authors: Jiaqi Xue, Qian Lou, Jiarong Xing, Heng Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02823
Pdf URL: https://arxiv.org/pdf/2602.02823
Copy Paste: [[2602.02823]] R2-Router: A New Paradigm for LLM Routing with Reasoning(https://arxiv.org/abs/2602.02823)
Keywords: llm
Abstract: As LLMs proliferate with diverse capabilities and costs, LLM routing has emerged by learning to predict each LLM's quality and cost for a given query, then selecting the one with high quality and low cost. However, existing routers implicitly assume a single fixed quality and cost per LLM for each query, ignoring that the same LLM's quality varies with its output length. This causes routers to exclude powerful LLMs when their estimated cost exceeds the budget, missing the opportunity that these LLMs could still deliver high quality at reduced cost with shorter outputs. To address this, we introduce R2-Router, which treats output length budget as a controllable variable and jointly selects the best LLM and length budget, enforcing the budget via length-constrained instructions. This enables R2-Router to discover that a powerful LLM with constrained output can outperform a weaker LLM at comparable cost-efficient configurations invisible to prior methods. Together with the router framework, we construct R2-Bench, the first routing dataset capturing LLM behavior across diverse output length budgets. Experiments show that R2-Router achieves state-of-the-art performance at 4-5x lower cost compared with existing routers. This work opens a new direction: routing as reasoning, where routers evolve from reactive selectors to deliberate reasoners that explore which LLM to use and at what cost budget.
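
The core routing decision described here, choosing a model and an output-length budget together rather than a model alone, can be sketched as a search over (model, budget) pairs. The quality and cost numbers below are invented placeholders for what a learned predictor would supply, and the selection rule (highest predicted quality within a cost cap, ties broken by lower cost) is an assumption rather than the paper's objective.

```python
def route(candidates, cost_budget):
    """Pick the (model, length_budget) pair with the best predicted quality.

    `candidates` holds tuples of
    (model_name, length_budget_tokens, predicted_quality, predicted_cost).
    Returns None when nothing fits within `cost_budget`.
    """
    feasible = [c for c in candidates if c[3] <= cost_budget]
    if not feasible:
        return None
    # Prefer higher quality; break ties by lower cost.
    return max(feasible, key=lambda c: (c[2], -c[3]))


# Hypothetical predictions for one query.
candidates = [
    ("strong-llm", 1024, 0.90, 10.0),
    ("strong-llm", 256, 0.80, 3.0),
    ("weak-llm", 1024, 0.60, 2.0),
]
choice = route(candidates, cost_budget=4.0)
```

A router that only knew the full-length cost of the strong model would exclude it under this budget and fall back to the weak model; exposing the shorter budget keeps the stronger model in play.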

Title: CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment

Authors: Zhengbang Yang, Yisheng Zhong, Junyuan Hong, Zhuangdi Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02824
Pdf URL: https://arxiv.org/pdf/2602.02824
Copy Paste: [[2602.02824]] CATNIP: LLM Unlearning via Calibrated and Tokenized Negative Preference Alignment(https://arxiv.org/abs/2602.02824)
Keywords: llm
Abstract: Pretrained knowledge memorized in LLMs raises critical concerns over safety and privacy, which has motivated LLM Unlearning as a technique for selectively removing the influences of undesirable knowledge. Existing approaches, rooted in Gradient Ascent (GA), often degrade general domain knowledge while relying on retention data or curated contrastive pairs, which can be either impractical or data and computationally prohibitive. Negative Preference Alignment has been explored for unlearning to tackle the limitations of GA, which, however, remains confined by its choice of reference model and shows undermined performance in realistic data settings. These limitations raise two key questions: i) Can we achieve effective unlearning that quantifies model confidence in undesirable knowledge and uses it to calibrate gradient updates more precisely, thus reducing catastrophic forgetting? ii) Can we make unlearning robust to data scarcity and length variation? We answer both questions affirmatively with CATNIP (Calibrated and Tokenized Negative Preference Alignment), a principled method that rescales unlearning effects in proportion to the model's token-level confidence, thus ensuring fine-grained control over forgetting. Extensive evaluations on MUSE and WMDP benchmarks demonstrated that our work enables effective unlearning without requiring retention data or contrastive unlearning response pairs, with stronger knowledge forgetting and preservation tradeoffs than state-of-the-art methods.

Title: Act or Clarify? Modeling Sensitivity to Uncertainty and Cost in Communication

Authors: Polina Tsvilodub, Karl Mulligan, Todd Snider, Robert D. Hawkins, Michael Franke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02843
Pdf URL: https://arxiv.org/pdf/2602.02843
Copy Paste: [[2602.02843]] Act or Clarify? Modeling Sensitivity to Uncertainty and Cost in Communication(https://arxiv.org/abs/2602.02843)
Keywords: agent
Abstract: When deciding how to act under uncertainty, agents may choose to act to reduce uncertainty or they may act despite that uncertainty. In communicative settings, an important way of reducing uncertainty is by asking clarification questions (CQs). We predict that the decision to ask a CQ depends on both contextual uncertainty and the cost of alternative actions, and that these factors interact: uncertainty should matter most when acting incorrectly is costly. We formalize this interaction in a computational model based on expected regret: how much an agent stands to lose by acting now rather than with full information. We test these predictions in two experiments, one examining purely linguistic responses to questions and another extending to choices between clarification and non-linguistic action. Taken together, our results suggest a rational tradeoff: humans tend to seek clarification proportional to the risk of substantial loss when acting under uncertainty.
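
The expected-regret quantity named in this abstract can be illustrated with a small calculation: compare the expected payoff of the best action under the current belief with the expected payoff if the true state were known before acting. This is one plausible reading of the abstract's definition, not the authors' model; the payoff tables, the belief, and the fixed question cost are all assumptions.

```python
def expected_regret(belief, utility):
    """Expected loss from acting now instead of acting with full information.

    `belief` maps each state to its probability.
    `utility` maps each action to a dict of per-state payoffs.
    """
    # Best single action under the current belief.
    value_now = max(
        sum(belief[s] * payoff[s] for s in belief) for payoff in utility.values()
    )
    # Best action chosen separately in each state, averaged over the belief.
    value_informed = sum(
        p * max(payoff[s] for payoff in utility.values()) for s, p in belief.items()
    )
    return value_informed - value_now


def should_clarify(belief, utility, question_cost):
    """Ask a clarification question when expected regret exceeds its cost."""
    return expected_regret(belief, utility) > question_cost


uncertain = {"a": 0.5, "b": 0.5}
low_stakes = {"act_a": {"a": 1.0, "b": 0.0}, "act_b": {"a": 0.0, "b": 1.0}}
high_stakes = {"act_a": {"a": 1.0, "b": -10.0}, "act_b": {"a": -10.0, "b": 1.0}}
```

Under the same 50/50 belief, the regret is 0.5 in the low-stakes table and 5.5 in the high-stakes one, so a question costing 1.0 is only worth asking in the second case. That is the uncertainty-by-cost interaction the abstract predicts.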

Title: Which course? Discourse! Teaching Discourse and Generation in the Era of LLMs

Authors: Junyi Jessy Li, Yang Janet Liu, Kanishka Misra, Valentina Pyatkin, William Sheffield
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02878
Pdf URL: https://arxiv.org/pdf/2602.02878
Copy Paste: [[2602.02878]] Which course? Discourse! Teaching Discourse and Generation in the Era of LLMs(https://arxiv.org/abs/2602.02878)
Keywords: llm
Abstract: The field of NLP has undergone vast, continuous transformations over the past few years, sparking debates going beyond discipline boundaries. This begs important questions in education: how do we design courses that bridge sub-disciplines in this shifting landscape? This paper explores this question from the angle of discourse processing, an area with rich linguistic insights and computational models for the intentional, attentional, and coherence structure of language. Discourse is highly relevant for open-ended or long-form text generation, yet this connection is under-explored in existing undergraduate curricula. We present a new course, "Computational Discourse and Natural Language Generation". The course is collaboratively designed by a team with complementary expertise and was offered for the first time in Fall 2025 as an upper-level undergraduate course, cross-listed between Linguistics and Computer Science. Our philosophy is to deeply integrate the theoretical and empirical aspects, and create an exploratory mindset inside the classroom and in the assignments. This paper describes the course in detail and concludes with takeaways from an independent survey as well as our vision for future directions.
摘要：过去几年，NLP 领域经历了巨大、持续的变革，引发了超越学科界限的争论。这就引出了教育中的重要问题：我们如何设计在这种不断变化的环境中连接子学科的课程？本文从话语处理的角度探讨了这个问题，这个领域拥有丰富的语言见解和语言的意图、注意力和连贯结构的计算模型。话语与开放式或长篇文本生成高度相关，但现有本科课程中尚未充分探索这种联系。我们推出了一门新课程“计算话语和自然语言生成”。该课程由具有互补专业知识的团队合作设计，于 2025 年秋季首次作为高级本科课程提供，在语言学和计算机科学之间交叉列出。我们的理念是深度整合理论和实证，并在课堂和作业中创造探索性思维。本文详细描述了该课程，并以一项独立调查的结论以及我们对未来方向的愿景作为总结。

Title: HALT: Hallucination Assessment via Log-probs as Time series

Authors: Ahmad Shapiro, Karan Taneja, Ashok Goel
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02888
Pdf URL: https://arxiv.org/pdf/2602.02888
Copy Paste: [[2602.02888]] HALT: Hallucination Assessment via Log-probs as Time series(https://arxiv.org/abs/2602.02888)
Keywords: language model, llm, hallucination, chat
Abstract: Hallucinations remain a major obstacle for large language models (LLMs), especially in safety-critical domains. We present HALT (Hallucination Assessment via Log-probs as Time series), a lightweight hallucination detector that leverages only the top-20 token log-probabilities from LLM generations as a time series. HALT uses a gated recurrent unit model combined with entropy-based features to learn model calibration bias, providing an extremely efficient alternative to large encoders. Unlike white-box approaches, HALT does not require access to hidden states or attention maps, relying only on output log-probabilities. Unlike black-box approaches, it operates on log-probs rather than surface-form text, which enables stronger domain generalization and compatibility with proprietary LLMs without requiring access to internal weights. To benchmark performance, we introduce HUB (Hallucination detection Unified Benchmark), which consolidates prior datasets into ten capabilities covering both reasoning tasks (Algorithmic, Commonsense, Mathematical, Symbolic, Code Generation) and general purpose skills (Chat, Data-to-Text, Question Answering, Summarization, World Knowledge). While being 30x smaller, HALT outperforms Lettuce, a fine-tuned ModernBERT-base encoder, and achieves a 60x speedup on HUB. HALT and HUB together establish an effective framework for hallucination detection across diverse LLM capabilities.
摘要：幻觉仍然是大型语言模型（LLM）的主要障碍，特别是在安全关键领域。我们提出了 HALT（通过对数概率作为时间序列进行幻觉评估），这是一种轻量级幻觉检测器，仅利用 LLM 世代的前 20 个令牌对数概率作为时间序列。 HALT 使用门控循环单元模型与基于熵的特征相结合来学习模型校准偏差，为大型编码器提供了极其有效的替代方案。与白盒方法不同，HALT 不需要访问隐藏状态或注意力图，仅依赖于输出对数概率。与黑盒方法不同，它在对数概率而不是表面形式文本上运行，这使得域泛化能力更强，并且与专有的 LLM 兼容，而无需访问内部权重。为了对性能进行基准测试，我们引入了 HUB（幻觉检测统一基准），它将先前的数据集整合为十个功能，涵盖推理任务（算法、常识、数学、符号、代码生成）和通用技能（聊天、数据到文本、问答、总结、世界知识）。 HALT 虽然体积缩小了 30 倍，但其性能优于 Lettuce（一种经过微调的现代 BERT 编码器），在 HUB 上实现了 60 倍的加速增益。 HALT 和 HUB 共同建立了一个跨不同法学硕士能力的幻觉检测的有效框架。
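
To make the "log-probs as a time series" idea concrete, here is a minimal Python sketch that turns per-token top-k log-probabilities into per-step features. The feature set (renormalised entropy, top log-prob, top-1/top-2 margin) and the function names are assumptions chosen for illustration. The abstract does not specify HALT's exact features, and the GRU that would consume this series is not shown.

```python
import math
from typing import List


def topk_entropy(logprobs: List[float]) -> float:
    """Shannon entropy (in nats) of the top-k distribution after renormalising."""
    m = max(logprobs)  # subtract the max for numerical stability
    weights = [math.exp(lp - m) for lp in logprobs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def step_features(series: List[List[float]]) -> List[List[float]]:
    """Map a generation (one top-k log-prob list per token) to a feature series.

    Each step yields [entropy, best log-prob, margin between best and second].
    """
    feats = []
    for logprobs in series:
        ranked = sorted(logprobs, reverse=True)
        margin = ranked[0] - ranked[1] if len(ranked) > 1 else 0.0
        feats.append([topk_entropy(logprobs), ranked[0], margin])
    return feats
```

A sequence model such as a GRU could then be trained on the resulting per-token feature vectors. Only output log-probabilities are needed, which matches the abstract's point that no hidden states or weights are required.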

Title: Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness

Authors: Alireza Amiri-Margavi, Arshia Gharagozlou, Amin Gholami Davodi, Seyed Pouyan Mousavi Davoudi, Hamidreza Hasani Balyani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02932
Pdf URL: https://arxiv.org/pdf/2602.02932
Copy Paste: [[2602.02932]] Equal Access, Unequal Interaction: A Counterfactual Audit of LLM Fairness(https://arxiv.org/abs/2602.02932)
Keywords: language model, gpt, llm, prompt
Abstract: Prior work on fairness in large language models (LLMs) has primarily focused on access-level behaviors such as refusals and safety filtering. However, equitable access does not ensure equitable interaction quality once a response is provided. In this paper, we conduct a controlled fairness audit examining how LLMs differ in tone, uncertainty, and linguistic framing across demographic identities after access is granted. Using a counterfactual prompt design, we evaluate GPT-4 and LLaMA-3.1-70B on career advice tasks while varying identity attributes along age, gender, and nationality. We assess access fairness through refusal analysis and measure interaction quality using automated linguistic metrics, including sentiment, politeness, and hedging. Identity-conditioned differences are evaluated using paired statistical tests. Both models exhibit zero refusal rates across all identities, indicating uniform access. Nevertheless, we observe systematic, model-specific disparities in interaction quality: GPT-4 expresses significantly higher hedging toward younger male users, while LLaMA exhibits broader sentiment variation across identity groups. These results show that fairness disparities can persist at the interaction level even when access is equal, motivating evaluation beyond refusal-based audits.
摘要：先前有关大型语言模型 (LLM) 公平性的工作主要集中在访问级行为，例如拒绝和安全过滤。然而，一旦提供了响应，公平的访问并不能确保公平的交互质量。在本文中，我们进行了受控的公平性审计，检查了法学硕士在获得访问权限后在不同人口身份的语气、不确定性和语言框架方面有何不同。使用反事实提示设计，我们在职业建议任务上评估 GPT-4 和 LLaMA-3.1-70B，同时根据年龄、性别和国籍改变身份属性。我们通过拒绝分析评估访问公平性，并使用自动化语言指标（包括情绪、礼貌和对冲）衡量交互质量。使用配对统计检验评估身份条件差异。两种模型在所有身份上都表现出零拒绝率，表明访问是统一的。尽管如此，我们观察到交互质量方面系统性的、特定于模型的差异：GPT-4 对年轻男性用户表现出明显更高的对冲，而 LLaMA 在不同身份群体中表现出更广泛的情绪变化。这些结果表明，即使访问平等，公平差异也可能在交互层面持续存在，从而激发评估超越基于拒绝的审计。
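
The abstract mentions paired statistical tests over counterfactual prompts but does not name them. As one possible illustration, the Python sketch below computes a paired t statistic over per-prompt metric scores (for example a hedging score) for two identity variants of the same prompts. The function name and the example scores are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Paired t statistic for matched score lists a[i] and b[i].

    a[i] and b[i] are assumed to be the same prompt scored under two
    identity conditions, so only the per-prompt differences matter.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long lists with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0.0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Made-up hedging scores for four prompts under two identity variants.
variant_a = [3.0, 5.0, 4.0, 6.0]
variant_b = [2.0, 3.0, 3.0, 3.0]
t_value = paired_t_statistic(variant_a, variant_b)
```

Pairing on the prompt removes between-prompt variance, which is why a counterfactual design can detect small identity-conditioned shifts.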

Title: Where Norms and References Collide: Evaluating LLMs on Normative Reasoning

Authors: Mitchell Abrams, Kaveh Eskandari Miandoab, Felix Gervits, Vasanth Sarathy, Matthias Scheutz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.02975
Pdf URL: https://arxiv.org/pdf/2602.02975
Copy Paste: [[2602.02975]] Where Norms and References Collide: Evaluating LLMs on Normative Reasoning(https://arxiv.org/abs/2602.02975)
Keywords: language model, llm, agent
Abstract: Embodied agents, such as robots, will need to interact in situated environments where successful communication often depends on reasoning over social norms: shared expectations that constrain what actions are appropriate in context. A key capability in such settings is norm-based reference resolution (NBRR), where interpreting referential expressions requires inferring implicit normative expectations grounded in physical and social context. Yet it remains unclear whether Large Language Models (LLMs) can support this kind of reasoning. In this work, we introduce SNIC (Situated Norms in Context), a human-validated diagnostic testbed designed to probe how well state-of-the-art LLMs can extract and utilize normative principles relevant to NBRR. SNIC emphasizes physically grounded norms that arise in everyday tasks such as cleaning, tidying, and serving. Across a range of controlled evaluations, we find that even the strongest LLMs struggle to consistently identify and apply social norms, particularly when norms are implicit, underspecified, or in conflict. These findings reveal a blind spot in current LLMs and highlight a key challenge for deploying language-based systems in socially situated, embodied settings.
摘要：诸如机器人之类的实体主体需要在特定的环境中进行交互，在这种环境中，成功的沟通通常取决于对社会规范的推理：共同的期望限制了在上下文中适当的行动。此类环境中的一个关键功能是基于规范的参考解析（NBRR），其中解释参考表达需要推断基于物理和社会背景的隐含规范期望。然而，目前尚不清楚大型语言模型（LLM）是否可以支持这种推理。在这项工作中，我们介绍了 SNIC（情景规范），这是一个经过人工验证的诊断测试平台，旨在探讨最先进的法学硕士如何提取和利用与 NBRR 相关的规范原则。 SNIC 强调清洁、整理和服务等日常任务中出现的基于物理的规范。在一系列受控评估中，我们发现即使是最强大的法学硕士也很难一致地识别和应用社会规范，特别是当规范是隐含的、不明确的或冲突的时候。这些发现揭示了当前法学硕士的盲点，并强调了在社会情境中部署基于语言的系统所面临的关键挑战。

Title: CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning

Authors: Ran Li, Zeyuan Liu, Yinghao Chen, Bingxiang He, Jiarui Yuan, Zixuan Fu, Weize Chen, Jinyi Hu, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02979
Pdf URL: https://arxiv.org/pdf/2602.02979
Copy Paste: [[2602.02979]] CPMobius: Iterative Coach-Player Reasoning for Data-Free Reinforcement Learning(https://arxiv.org/abs/2602.02979)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have demonstrated strong potential in complex reasoning, yet their progress remains fundamentally constrained by reliance on massive high-quality human-curated tasks and labels, either through supervised fine-tuning (SFT) or reinforcement learning (RL) on reasoning-specific data. This dependence renders supervision-heavy training paradigms increasingly unsustainable, with signs of diminishing scalability already evident in practice. To overcome this limitation, we introduce CPMöbius (CPMobius), a collaborative Coach-Player paradigm for data-free reinforcement learning of reasoning models. Unlike traditional adversarial self-play, CPMöbius, inspired by real world human sports collaboration and multi-agent collaboration, treats the Coach and Player as independent but cooperative roles. The Coach proposes instructions targeted at the Player's capability and receives rewards based on changes in the Player's performance, while the Player is rewarded for solving the increasingly instructive tasks generated by the Coach. This cooperative optimization loop is designed to directly enhance the Player's mathematical reasoning ability. Remarkably, CPMöbius achieves substantial improvement without relying on any external training data, outperforming existing unsupervised approaches. For example, on Qwen2.5-Math-7B-Instruct, our method improves accuracy by an overall average of +4.9 and an out-of-distribution average of +5.4, exceeding RENT by +1.5 on overall accuracy and R-zero by +4.2 on OOD accuracy.
摘要：大型语言模型（LLM）在复杂推理方面表现出了强大的潜力，但其进步仍然从根本上受到对大量高质量人类策划任务和标签的依赖的限制，无论是通过监督微调（SFT）还是对推理特定数据的强化学习（RL）。这种依赖性使得监督繁重的训练范式越来越不可持续，实践中可扩展性减弱的迹象已经很明显。为了克服这一限制，我们引入了 CPMöbius (CPMobius)，这是一种用于推理模型的无数据强化学习的协作教练-玩家范例。与传统的对抗性自我对战不同，CPMöbius 受到现实世界人类运动协作和多智能体协作的启发，将教练和玩家视为独立但合作的角色。教练针对玩家的能力提出指示，并根据玩家表现的变化获得奖励，而玩家则因解决教练提出的越来越有指导性的任务而获得奖励。这种合作优化循环旨在直接增强玩家的数学推理能力。值得注意的是，CPMöbius 在不依赖任何外部训练数据的情况下实现了实质性改进，优于现有的无监督方法。例如，在 Qwen2.5-Math-7B-Instruct 上，我们的方法将总体平均值提高了 +4.9，分布外平均值提高了 +5.4，整体精度超过 RENT +1.5，OOD 精度超过 R-zero +4.2。

Title: LatentMem: Customizing Latent Memory for Multi-Agent Systems

Authors: Muxin Fu, Guibin Zhang, Xiangyuan Xue, Yafu Li, Zefeng He, Siyuan Huang, Xiaoye Qu, Yu Cheng, Yang Yang
Subjects: cs.CL, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2602.03036
Pdf URL: https://arxiv.org/pdf/2602.03036
Copy Paste: [[2602.03036]] LatentMem: Customizing Latent Memory for Multi-Agent Systems(https://arxiv.org/abs/2602.03036)
Keywords: language model, llm, agent
Abstract: Large language model (LLM)-powered multi-agent systems (MAS) demonstrate remarkable collective intelligence, wherein multi-agent memory serves as a pivotal mechanism for continual adaptation. However, existing multi-agent memory designs remain constrained by two fundamental bottlenecks: (i) memory homogenization arising from the absence of role-aware customization, and (ii) information overload induced by excessively fine-grained memory entries. To address these limitations, we propose LatentMem, a learnable multi-agent memory framework designed to customize agent-specific memories in a token-efficient manner. Specifically, LatentMem comprises an experience bank that stores raw interaction trajectories in a lightweight form, and a memory composer that synthesizes compact latent memories conditioned on retrieved experience and agent-specific contexts. Further, we introduce Latent Memory Policy Optimization (LMPO), which propagates task-level optimization signals through latent memories to the composer, encouraging it to produce compact and high-utility representations. Extensive experiments across diverse benchmarks and mainstream MAS frameworks show that LatentMem achieves a performance gain of up to 19.36% over vanilla settings and consistently outperforms existing memory architectures, without requiring any modifications to the underlying frameworks.
摘要：由大语言模型（LLM）驱动的多智能体系统（MAS）展示了卓越的集体智慧，其中多智能体记忆是持续适应的关键机制。然而，现有的多智能体内存设计仍然受到两个基本瓶颈的限制：（i）由于缺乏角色感知定制而产生的内存同质化，以及（ii）过度细粒度的内存条目引起的信息过载。为了解决这些限制，我们提出了 LatentMem，这是一种可学习的多智能体记忆框架，旨在以令牌有效的方式定制特定于智能体的记忆。具体来说，LatentMem 包括一个以轻量级形式存储原始交互轨迹的经验库，以及一个根据检索到的经验和特定于代理的上下文来合成紧凑的潜在记忆的记忆合成器。此外，我们引入了潜在内存策略优化（LMPO），它通过潜在内存将任务级优化信号传播给作曲家，鼓励它产生紧凑且高效的表示。跨不同基准和主流 MAS 框架的大量实验表明，LatentMem 比普通设置实现了高达 19.36% 的性能提升，并且始终优于现有内存架构，而无需对底层框架进行任何修改。

Title: SAES-SVD: Self-Adaptive Suppression of Accumulated and Local Errors for SVD-based LLM Compression

Authors: Xing Hu, Dawei Yang, Yuan Cheng, Zhixuan Chen, Zukang Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03051
Pdf URL: https://arxiv.org/pdf/2602.03051
Copy Paste: [[2602.03051]] SAES-SVD: Self-Adaptive Suppression of Accumulated and Local Errors for SVD-based LLM Compression(https://arxiv.org/abs/2602.03051)
Keywords: language model, llm
Abstract: The rapid growth in the parameter scale of large language models (LLMs) has created a high demand for efficient compression techniques. As a hardware-agnostic and highly compatible technique, low-rank compression has been widely adopted. However, existing methods typically compress each layer independently by minimizing per-layer reconstruction error, overlooking a critical limitation: the reconstruction error propagates and accumulates through the network, which leads to amplified global deviations from the full-precision baseline. To address this, we propose Self-Adaptive Error Suppression SVD (SAES-SVD), an LLM compression framework that jointly optimizes intra-layer reconstruction and inter-layer error compensation. SAES-SVD is composed of two novel components: (1) Cumulative Error-Aware Layer Compression (CEALC), which formulates the compression objective as a combination of local reconstruction and weighted cumulative error compensation. Based on it, we derive a closed-form low-rank solution that relies on second-order activation statistics and explicitly aligns each layer's output with its full-precision counterpart to compensate for accumulated errors. (2) Adaptive Collaborative Error Suppression (ACES), which automatically adjusts the weighting coefficient to enhance the low-rank structure of the compression objective in CEALC. Specifically, the coefficient is optimized to maximize the ratio between the Frobenius norm of the compressed layer's output and that of the compression objective under a fixed rank, thus ensuring that the rank budget is utilized effectively. Extensive experiments across multiple LLM architectures and tasks show that, without fine-tuning or mixed-rank strategies, SAES-SVD consistently improves post-compression performance.
摘要：大型语言模型（LLM）参数规模的快速增长对高效压缩技术产生了很高的需求。低秩压缩作为一种与硬件无关且高度兼容的技术，已被广泛采用。然而，现有方法通常通过最小化每层重建误差来独立压缩每一层，忽略了一个关键限制：重建误差通过网络传播和累积，这导致与全精度基线的全局偏差放大。为了解决这个问题，我们提出了自适应误差抑制 SVD (SAES-SVD)，这是一种 LLM 压缩框架，可以联合优化层内重建和层间误差补偿。 SAES-SVD 由两个新颖的组件组成：（1）累积误差感知层压缩（CEALC），它将压缩目标制定为局部重建和加权累积误差补偿的组合。基于此，我们得出了一个依赖于二阶激活统计的封闭式低秩解决方案，该解决方案将每个层的输出与其全精度对应部分明确对齐，以补偿累积的误差。 (2)自适应协同误差抑制(ACES)，自动调整权重系数以增强CEALC中压缩目标的低秩结构。具体来说，对系数进行优化，使固定秩下压缩层输出的Frobenius范数与压缩目标的Frobenius范数之比最大化，从而保证秩预算得到有效利用。跨多个 LLM 架构和任务的大量实验表明，无需微调或混合秩策略，SAES-SVD 就能持续提高压缩后性能。

Title: ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution

Authors: Junjie Huang, Jiarui Qin, Di Yin, Weiwen Liu, Yong Yu, Xing Sun, Weinan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03075
Pdf URL: https://arxiv.org/pdf/2602.03075
Copy Paste: [[2602.03075]] ReMiT: RL-Guided Mid-Training for Iterative LLM Evolution(https://arxiv.org/abs/2602.03075)
Keywords: language model, llm
Abstract: Standard training pipelines for large language models (LLMs) are typically unidirectional, progressing from pre-training to post-training. However, the potential for a bidirectional process, in which insights from post-training retroactively improve the pre-trained foundation, remains unexplored. We aim to establish a self-reinforcing flywheel: a cycle in which a reinforcement learning (RL)-tuned model strengthens the base model, which in turn enhances subsequent post-training performance, requiring no specially trained teacher or reference model. To realize this, we analyze training dynamics and identify the mid-training (annealing) phase as a critical turning point for model capabilities. This phase typically occurs at the end of pre-training, utilizing high-quality corpora under a rapidly decaying learning rate. Building upon this insight, we introduce ReMiT (Reinforcement Learning-Guided Mid-Training). Specifically, ReMiT leverages the reasoning priors of RL-tuned models to dynamically reweight tokens during the mid-training phase, prioritizing those pivotal for reasoning. Empirically, ReMiT achieves an average improvement of 3% on 10 pre-training benchmarks, spanning math, code, and general reasoning, and sustains gains of over 2% throughout the post-training pipeline. These results validate an iterative feedback loop, enabling continuous and self-reinforcing evolution of LLMs.
摘要：大型语言模型 (LLM) 的标准训练流程通常是单向的，从预训练到训练后进行。然而，双向过程的潜力——训练后的见解追溯性地改善了预训练的基础——仍有待探索。我们的目标是建立一个自我强化的飞轮：在这个循环中，强化学习（RL）调整的模型强化了基础模型，从而增强了后续的训练后表现，不需要经过专门训练的教师或参考模型。为了实现这一点，我们分析了训练动态，并将训练中期（退火）阶段确定为模型能力的关键转折点。该阶段通常发生在预训练结束时，在快速衰减的学习率下利用高质量的语料库。基于这一见解，我们引入了 ReMiT（强化学习引导的中期训练）。具体来说，ReMiT 利用 RL 调整模型的推理先验，在训练中期动态重新加权令牌，优先考虑那些对于推理至关重要的令牌。根据经验，ReMiT 在 10 个训练前基准（涵盖数学、代码和一般推理）上平均提高了 3%，并在整个训练后管道中保持了超过 2% 的进步。这些结果验证了迭代反馈循环，从而实现了法学硕士的持续和自我强化的发展。

Title: AERO: Autonomous Evolutionary Reasoning Optimization via Endogenous Dual-Loop Feedback

Authors: Zhitao Gao, Jie Ma, Xuhong Li, Pengyu Li, Ning Qu, Yaqiang Wu, Hui Liu, Jun Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03084
Pdf URL: https://arxiv.org/pdf/2602.03084
Copy Paste: [[2602.03084]] AERO: Autonomous Evolutionary Reasoning Optimization via Endogenous Dual-Loop Feedback(https://arxiv.org/abs/2602.03084)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have achieved significant success in complex reasoning but remain bottlenecked by reliance on expert-annotated data and external verifiers. While existing self-evolution paradigms aim to bypass these constraints, they often fail to identify the optimal learning zone and risk reinforcing collective hallucinations and incorrect priors through flawed internal feedback. To address these challenges, we propose Autonomous Evolutionary Reasoning Optimization (AERO), an unsupervised framework that achieves autonomous reasoning evolution by internalizing self-questioning, answering, and criticism within a synergistic dual-loop system. Inspired by the Zone of Proximal Development (ZPD) theory, AERO utilizes entropy-based positioning to target the "solvability gap" and employs Independent Counterfactual Correction for robust verification. Furthermore, we introduce a Staggered Training Strategy to synchronize capability growth across functional roles and prevent curriculum collapse. Extensive evaluations across nine benchmarks spanning three domains demonstrate that AERO achieves average performance improvements of 4.57% on Qwen3-4B-Base and 5.10% on Qwen3-8B-Base, outperforming competitive baselines. Code is available at this https URL.
摘要：大型语言模型 (LLM) 在复杂推理方面取得了巨大成功，但由于依赖专家注释数据和外部验证者，仍然存在瓶颈。虽然现有的自我进化范式旨在绕过这些限制，但它们往往无法识别最佳学习区域，并且存在通过有缺陷的内部反馈强化集体幻觉和不正确先验的风险。为了应对这些挑战，我们提出了 Autonomous Evolutionary Reasoning Optimization (AERO)，这是一种无监督框架，通过在协同双环系统中内化自我质疑、回答和批评来实现自主推理进化。受最近发展区 (ZPD) 理论的启发，AERO 利用基于熵的定位来瞄准“可解性差距”，并采用独立反事实校正进行稳健验证。此外，我们引入了交错培训策略，以同步跨职能角色的能力增长并防止课程崩溃。对跨越三个领域的九个基准的广泛评估表明，AERO 在 Qwen3-4B-Base 上实现了 4.57% 的平均性能改进，在 Qwen3-8B-Base 上实现了 5.10% 的平均性能改进，优于竞争基准。代码可从此 https URL 获取。
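
One way to picture "entropy-based positioning" is to sample several answers per task and keep tasks whose answer distribution is neither unanimous nor chaotic. The Python sketch below does this. The band thresholds `low` and `high` and the function names are assumptions made for illustration, not values or APIs from the paper.

```python
import math
from collections import Counter
from typing import Sequence


def answer_entropy(answers: Sequence[str]) -> float:
    """Shannon entropy (in nats) of the empirical distribution of sampled answers."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    n = len(answers)
    counts = Counter(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def in_learning_zone(answers: Sequence[str],
                     low: float = 0.3, high: float = 1.2) -> bool:
    """Keep a task when its answer entropy falls inside an assumed band.

    Very low entropy suggests the task is already solved consistently;
    very high entropy suggests the model is guessing.
    """
    return low <= answer_entropy(answers) <= high
```

For instance, three matching answers and one dissenting answer give an entropy of about 0.56 nats, which falls inside the assumed band, whereas eight identical or eight distinct answers fall outside it.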

Title: Test-time Recursive Thinking: Self-Improvement without External Feedback

Authors: Yufan Zhuang, Chandan Singh, Liyuan Liu, Yelong Shen, Dinghuai Zhang, Jingbo Shang, Jianfeng Gao, Weizhu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03094
Pdf URL: https://arxiv.org/pdf/2602.03094
Copy Paste: [[2602.03094]] Test-time Recursive Thinking: Self-Improvement without External Feedback(https://arxiv.org/abs/2602.03094)
Keywords: language model, llm
Abstract: Modern Large Language Models (LLMs) have shown rapid improvements in reasoning capabilities, driven largely by reinforcement learning (RL) with verifiable rewards. Here, we ask whether these LLMs can self-improve without the need for additional training. We identify two core challenges for such systems: (i) efficiently generating diverse, high-quality candidate solutions, and (ii) reliably selecting correct answers in the absence of ground-truth supervision. To address these challenges, we propose Test-time Recursive Thinking (TRT), an iterative self-improvement framework that conditions generation on rollout-specific strategies, accumulated knowledge, and self-generated verification signals. Using TRT, open-source models reach 100% accuracy on AIME-25/24, and on LiveCodeBench's most difficult problems, closed-source models improve by 10.4-14.8 percentage points without external feedback.
摘要：现代大型语言模型 (LLM) 的推理能力已显示出快速提高，这主要是由具有可验证奖励的强化学习 (RL) 推动的。在这里，我们询问这些法学硕士是否可以在不需要额外培训的情况下自我提高。我们确定了此类系统的两个核心挑战：（i）有效生成多样化的高质量候选解决方案，以及（ii）在缺乏真实监督的情况下可靠地选择正确答案。为了应对这些挑战，我们提出了测试时递归思维（TRT），这是一种迭代的自我改进框架，可以根据特定于推出的策略、积累的知识和自我生成的验证信号来调节生成。使用TRT，开源模型在AIME-25/24上达到100%的准确率，在LiveCodeBench最困难的问题上，闭源模型在没有外部反馈的情况下提高了10.4-14.8个百分点。

Title: Task–Specificity Score: Measuring How Much Instructions Really Matter for Supervision

Authors: Pritam Kadasi, Abhishek Upperwal, Mayank Singh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.03103
Pdf URL: https://arxiv.org/pdf/2602.03103
Copy Paste: [[2602.03103]] Task–Specificity Score: Measuring How Much Instructions Really Matter for Supervision(https://arxiv.org/abs/2602.03103)
Keywords: language model, llm
Abstract: Instruction tuning is now the default way to train and adapt large language models, but many instruction–input–output pairs are only weakly specified: for a given input, the same output can remain plausible under several alternative instructions. This raises a simple question: does the instruction uniquely determine the target output? We propose the Task–Specificity Score (TSS) to quantify how much an instruction matters for predicting its output, by contrasting the true instruction against plausible alternatives for the same input. We further introduce TSS++, which uses hard alternatives and a small quality term to mitigate easy-negative effects. Across three instruction datasets (Alpaca, Dolly-15k, NI-20) and three open LLMs (Gemma, Llama, Qwen), we show that selecting task-specific examples improves downstream performance under tight token budgets and complements quality-based filters such as perplexity and IFD.
摘要：指令调优现在是训练和适应大型语言模型的默认方法，但许多指令–输入–输出对只是弱指定的：对于给定的输入，相同的输出在多个替代指令下仍然是合理的。这就提出了一个简单的问题：指令是否唯一确定目标输出？我们提出任务–特异性得分（TSS），通过将真实指令与相同输入的合理替代指令进行对比，来量化指令对于预测其输出的重要性。我们进一步引入 TSS++，它使用硬替代品和小的质量项来减轻容易产生的负面影响。在三个指令数据集（Alpaca、Dolly-15k、NI-20）和三个开放的 LLM（Gemma、Llama、Qwen）中，我们表明，选择特定于任务的示例可以在紧张的代币预算下提高下游性能，并补充基于质量的过滤器（例如困惑度和 IFD）。
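
The abstract describes contrasting the true instruction against alternatives but gives no formula. The Python sketch below uses one plausible contrast, stated here as an assumption: the output log-likelihood under the true instruction minus the mean log-likelihood under alternative instructions. It then greedily selects the highest-scoring examples within a token budget. All function names and the greedy selection rule are hypothetical.

```python
from statistics import mean
from typing import List, Sequence, Tuple


def task_specificity(true_ll: float, alt_lls: Sequence[float]) -> float:
    """Assumed contrast: log p(y | x, true instruction) minus the mean of
    log p(y | x, alternative instruction) over the supplied alternatives.

    Log-likelihoods are taken as inputs and could come from any language model.
    """
    if not alt_lls:
        raise ValueError("need at least one alternative instruction")
    return true_ll - mean(alt_lls)


def select_under_budget(examples: List[Tuple[str, float, int]],
                        budget: int) -> List[str]:
    """Greedily keep the highest-scoring examples that fit the token budget.

    examples: (example_id, score, n_tokens) triples.
    """
    chosen, used = [], 0
    for ex_id, _score, n_tokens in sorted(examples, key=lambda e: e[1], reverse=True):
        if used + n_tokens <= budget:
            chosen.append(ex_id)
            used += n_tokens
    return chosen


# Made-up scores and token counts for three training examples.
pool = [("a", 4.0, 50), ("b", 0.1, 30), ("c", 2.5, 60)]
picked = select_under_budget(pool, budget=100)
```

A score near zero means the output is about as likely under other instructions, so the instruction carries little supervision signal for that example.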

Title: The Mask of Civility: Benchmarking Chinese Mock Politeness Comprehension in Large Language Models

Authors: Yitong Zhang, Yuhan Xiang, Mingxuan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03107
Pdf URL: https://arxiv.org/pdf/2602.03107
Copy Paste: [[2602.03107]] The Mask of Civility: Benchmarking Chinese Mock Politeness Comprehension in Large Language Models(https://arxiv.org/abs/2602.03107)
Keywords: language model, gpt, llm, prompt
Abstract: From a pragmatic perspective, this study systematically evaluates the differences in performance among representative large language models (LLMs) in recognizing politeness, impoliteness, and mock politeness phenomena in Chinese. Addressing the existing gaps in pragmatic comprehension, the research adopts the frameworks of Rapport Management Theory and the Model of Mock Politeness to construct a three-category dataset combining authentic and simulated Chinese discourse. Six representative models, including GPT-5.1 and DeepSeek, were selected as test subjects and evaluated under four prompting conditions: zero-shot, few-shot, knowledge-enhanced, and hybrid strategies. This study serves as a meaningful attempt within the paradigm of "Great Linguistics," offering a novel approach to applying pragmatic theory in the age of technological transformation. It also responds to the contemporary question of how technology and the humanities may coexist, representing an interdisciplinary endeavor that bridges linguistic technology and humanistic reflection.
摘要：本研究从实用的角度出发，系统评估了代表性大语言模型（LLM）在识别汉语礼貌、不礼貌和假礼貌现象方面的表现差异。针对语用理解中存在的差距，本研究采用融洽管理理论和模拟礼貌模型的框架，构建了真实和模拟汉语话语相结合的三类数据集。选择GPT-5.1和DeepSeek等六种代表性模型作为测试对象，并在零样本、少样本、知识增强和混合策略四种提示条件下进行评估。这项研究是“伟大语言学”范式内的一次有意义的尝试，为技术变革时代应用实用理论提供了一种新颖的方法。它还回应了技术与人文学科如何共存的当代问题，代表了连接语言技术和人文反思的跨学科努力。

Title: ChemPro: A Progressive Chemistry Benchmark for Large Language Models

Authors: Aaditya Baranwal, Shruti Vyas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03108
Pdf URL: https://arxiv.org/pdf/2602.03108
Copy Paste: [[2602.03108]] ChemPro: A Progressive Chemistry Benchmark for Large Language Models(https://arxiv.org/abs/2602.03108)
Keywords: language model, llm
Abstract: We introduce ChemPro, a progressive benchmark with 4100 natural language question-answer pairs in Chemistry, across 4 coherent sections of difficulty designed to assess the proficiency of Large Language Models (LLMs) in a broad spectrum of general chemistry topics. We include Multiple Choice Questions and Numerical Questions spread across fine-grained information recall, long-horizon reasoning, multi-concept questions, problem-solving with nuanced articulation, and straightforward questions in a balanced ratio, effectively covering Bio-Chemistry, Inorganic-Chemistry, Organic-Chemistry and Physical-Chemistry. ChemPro is carefully designed analogous to a student's academic evaluation for basic to high-school chemistry. A gradual increase in the question difficulty rigorously tests the ability of LLMs to progress from solving basic problems to solving more sophisticated challenges. We evaluate 45+7 state-of-the-art LLMs, spanning both open-source and proprietary variants, and our analysis reveals that while LLMs perform well on basic chemistry questions, their accuracy declines with different types and levels of complexity. These findings highlight the critical limitations of LLMs in general scientific reasoning and understanding and point towards understudied dimensions of difficulty, emphasizing the need for more robust methodologies to improve LLMs.
摘要：我们推出了 ChemPro，这是一个渐进式基准测试，包含 4100 个化学领域的自然语言问答对，涵盖 4 个连贯的难度部分，旨在评估大型语言模型 (LLM) 在广泛的一般化学主题中的熟练程度。我们包括多项选择题和数字题，涵盖细粒度的信息回忆、长视野推理、多概念问题、细致入微的解决问题以及平衡比例的简单问题，有效涵盖生物化学、无机化学、有机化学和物理化学。 ChemPro 经过精心设计，类似于学生对基础到高中化学的学术评估。问题难度的逐渐增加严格考验LLM从解决基本问题到解决更复杂挑战的能力。我们评估了 45+7 个最先进的法学硕士，涵盖开源和专有变体，我们的分析表明，虽然法学硕士在基本化学问题上表现良好，但其准确性随着不同类型和复杂程度的不同而下降。这些发现强调了法学硕士在一般科学推理和理解方面的关键局限性，并指出了尚未充分研究的困难维度，强调需要更强大的方法来改进法学硕士。

Title: One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence

Authors: Bowen Jiang, Taiwei Shi, Ryo Kamoi, Yuan Yuan, Camillo J. Taylor, Longqi Yang, Pei Zhou, Sihao Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03109
Pdf URL: https://arxiv.org/pdf/2602.03109
Copy Paste: [[2602.03109]] One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence(https://arxiv.org/abs/2602.03109)
Keywords: agent
Abstract: This paper introduces OMAR: One Model, All Roles, a reinforcement learning framework that enables AI to develop social intelligence through multi-turn, multi-agent conversational self-play. Unlike traditional paradigms that rely on static, single-turn optimizations, OMAR allows a single model to role-play all participants in a conversation simultaneously, learning to achieve long-term goals and complex social norms directly from dynamic social interaction. To ensure training stability across long dialogues, we implement a hierarchical advantage estimation that calculates turn-level and token-level advantages. Evaluations in the SOTOPIA social environment and Werewolf strategy games show that our trained models develop fine-grained, emergent social intelligence, such as empathy, persuasion, and compromise seeking, demonstrating the effectiveness of learning collaboration even under competitive scenarios. While we identify practical challenges like reward hacking, our results show that rich social intelligence can emerge without human supervision. We hope this work incentivizes further research on AI social intelligence in group conversations.
摘要：本文介绍了 OMAR：一个模型，所有角色，这是一种强化学习框架，使人工智能能够通过多回合、多智能体对话自我博弈来开发社交智能。与依赖静态、单轮优化的传统范例不同，OMAR 允许单个模型同时对对话中的所有参与者进行角色扮演，学习直接从动态社交互动中实现长期目标和复杂的社会规范。为了确保长时间对话中的训练稳定性，我们实施了分层优势估计，计算回合级和令牌级优势。 SOTOPIA 社交环境和狼人策略游戏中的评估表明，我们训练的模型开发了细粒度的、新兴的社交智能，例如同理心、说服力和寻求妥协，证明了即使在竞争场景下学习协作的有效性。虽然我们发现了奖励黑客等实际挑战，但我们的结果表明，丰富的社交智能可以在没有人类监督的情况下出现。我们希望这项工作能够激励对群体对话中人工智能社交智能的进一步研究。

Title: FASA: Frequency-aware Sparse Attention

Authors: Yifei Wang, Yueqi Wang, Zhenrui Yue, Huimin Zeng, Yong Wang, Ismini Lourentzou, Zhengzhong Tu, Xiangxiang Chu, Julian McAuley
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03152
Pdf URL: https://arxiv.org/pdf/2602.03152
Copy Paste: [[2602.03152]] FASA: Frequency-aware Sparse Attention(https://arxiv.org/abs/2602.03152)
Keywords: language model, llm
Abstract: The deployment of Large Language Models (LLMs) faces a critical bottleneck when handling lengthy inputs: the prohibitive memory footprint of the Key Value (KV) cache. To address this bottleneck, the token pruning paradigm leverages attention sparsity to selectively retain a small, critical subset of tokens. However, existing approaches fall short, with static methods risking irreversible information loss and dynamic strategies employing heuristics that insufficiently capture the query-dependent nature of token importance. We propose FASA, a novel framework that achieves query-aware token eviction by dynamically predicting token importance. FASA stems from a novel insight into RoPE: the discovery of functional sparsity at the frequency-chunk (FC) level. Our key finding is that a small, identifiable subset of "dominant" FCs consistently exhibits high contextual agreement with the full attention head. This provides a robust and computationally free proxy for identifying salient tokens. Building on this insight, FASA first identifies a critical set of tokens using dominant FCs, and then performs focused attention computation solely on this pruned subset. Because it accesses only a small fraction of the KV cache, FASA drastically lowers memory bandwidth requirements and computational cost. Across a spectrum of long-context tasks, from sequence modeling to complex CoT reasoning, FASA consistently outperforms all token-eviction baselines and achieves near-oracle accuracy, demonstrating remarkable robustness even under constrained budgets. Notably, on LongBench-V1, FASA reaches nearly 100% of full-KV performance when keeping only 256 tokens, and achieves a 2.56× speedup using just 18.9% of the cache on AIME24.
摘要：大型语言模型 (LLM) 的部署在处理冗长的输入时面临着一个关键瓶颈：键值 (KV) 缓存的内存占用过高。为了解决这个瓶颈，令牌修剪范例利用注意力稀疏性来选择性地保留一小部分关键的令牌子集。然而，现有方法存在不足，静态方法存在不可逆信息丢失的风险，而动态策略采用启发式方法，无法充分捕获令牌重要性的查询依赖性质。我们提出了 FASA，这是一种新颖的框架，通过动态预测令牌重要性来实现查询感知令牌驱逐。 FASA 源于对 RoPE 的新颖见解：发现频率块 (FC) 级别的功能稀疏性。我们的主要发现是，一小部分可识别的“主导”FC 始终表现出与全注意力头的高度上下文一致性。这提供了一个强大且无需计算的代理来识别显着标记。基于这一见解，FASA 首先使用主导 FC 识别一组关键令牌，然后仅对这个修剪后的子集执行集中注意力计算。由于仅访问 KV 缓存的一小部分，FASA 大大降低了内存带宽要求和计算成本。在一系列长上下文任务中，从序列建模到复杂的 CoT 推理，FASA 始终优于所有令牌驱逐基线，并实现接近预言机的准确性，即使在预算有限的情况下也表现出卓越的稳健性。值得注意的是，在 LongBench-V1 上，FASA 在仅保留 256 个令牌时就达到了近 100% 的全 KV 性能，并且在 AIME24 上仅使用 18.9% 的缓存即可实现 2.56× 的加速。
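
The two-stage idea (score tokens with a few dominant frequency chunks, then attend only to the selected tokens) can be sketched in plain Python as follows. The sketch assumes each frequency chunk is the adjacent dimension pair (2c, 2c+1) and that the dominant chunk indices are already known. How FASA identifies dominant chunks is not described in the abstract, and all names here are hypothetical.

```python
from typing import List, Sequence


def chunk_scores(query: Sequence[float],
                 keys: Sequence[Sequence[float]],
                 dominant_chunks: Sequence[int]) -> List[float]:
    """Approximate each query-key dot product using only the selected chunks.

    Chunk c is assumed to cover dimensions 2c and 2c+1 (one rotary pair).
    """
    scores = []
    for key in keys:
        s = 0.0
        for c in dominant_chunks:
            i = 2 * c
            s += query[i] * key[i] + query[i + 1] * key[i + 1]
        scores.append(s)
    return scores


def select_tokens(query: Sequence[float],
                  keys: Sequence[Sequence[float]],
                  dominant_chunks: Sequence[int],
                  budget: int) -> List[int]:
    """Return positions of the `budget` highest-scoring tokens, in order."""
    scores = chunk_scores(query, keys, dominant_chunks)
    top = sorted(range(len(keys)), key=lambda t: scores[t], reverse=True)[:budget]
    return sorted(top)


# Toy example: head dimension 6 (three chunks), three cached tokens.
q = [1.0, 0.0, 0.0, 0.0, 2.0, 0.0]
cached_keys = [
    [1.0, 0.0, 9.0, 9.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 3.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 1.0, 0.0],
]
kept = select_tokens(q, cached_keys, dominant_chunks=[0, 2], budget=2)
```

Full attention would then be computed only over the `kept` positions, so the remaining keys and values need not be read from the cache for that step.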

Title: Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch

Authors: Hyunwoo Kim, Niloofar Mireshghallah, Michael Duan, Rui Xin, Shuyue Stella Li, Jaehun Jung, David Acuna, Qi Pang, Hanshen Xiao, G. Edward Suh, Sewoong Oh, Yulia Tsvetkov, Pang Wei Koh, Yejin Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.03183
Pdf URL: https://arxiv.org/pdf/2602.03183
Copy Paste: [[2602.03183]] Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch(https://arxiv.org/abs/2602.03183)
Keywords: language model, gpt, agent
Abstract: Research involving privacy-sensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents--such as OpenClaw and Gemini Agent--are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale fully synthetic dataset entirely built from scratch--an expansive reservoir of texts with rich and diverse private information--designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers orders-of-magnitude larger scale with quality, and far greater diversity across various document types, including medical history, legal documents, financial records, calendars, and text messages with a total of 55.1 million annotated attributes such as ethnicity, date of birth, workplace, etc. We leverage Privasis to construct a parallel corpus for text sanitization with our pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (<=4B) trained on this dataset outperform state-of-the-art large language models, such as GPT-5 and Qwen-3 235B. We plan to release data, models, and code to accelerate future research on privacy-sensitive domains and agents.
摘要：涉及隐私敏感数据的研究一直受到数据稀缺的限制，这与受益于数据扩展的其他领域形成鲜明对比。随着现代人工智能代理（例如 OpenClaw 和 Gemini Agent）被授予对高度敏感的个人信息的持续访问权限，这一挑战变得越来越紧迫。为了解决这一长期存在的瓶颈和不断上升的风险，我们推出了 Privasis（即隐私绿洲），这是第一个完全从头开始构建的百万级完全合成数据集——一个包含丰富多样的私人信息的庞大文本库——旨在扩大和加速不可避免地需要处理敏感社交数据的领域的研究。与现有数据集相比，包含 140 万条记录的 Privasis 提供了数量级更大的质量和更大的多样性，涵盖各种文档类型，包括病史、法律文件、财务记录、日历和短信，总共有 5510 万条带注释的属性，如种族、出生日期、工作场所等。我们利用 Privasis 构建一个并行语料库，通过我们的管道来分解文本并应用有针对性的清理。我们在此数据集上训练的紧凑型清理模型 (<=4B) 的性能优于最先进的大型语言模型，例如 GPT-5 和 Qwen-3 235B。我们计划发布数据、模型和代码，以加速未来对隐私敏感领域和代理的研究。

Title: ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

Authors: Zican Dong, Peiyu Liu, Junyi Li, Zhipeng Chen, Han Peng, Shuo Wang, Wayne Xin Zhao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.03203
Pdf URL: https://arxiv.org/pdf/2602.03203
Copy Paste: [[2602.03203]] ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution(https://arxiv.org/abs/2602.03203)
Keywords: language model, llm
Abstract: Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generation. We first design the Golden Eviction algorithm, which identifies the optimal KV pairs to evict at each step using future attention scores. These traces and the scores at each step are then distilled via supervised training with a Pairwise Ranking Loss. Furthermore, we formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens. Experiments on the AIME2024 and AIME2025 benchmarks with three reasoning models demonstrate that ForesightKV consistently outperforms prior methods under only half the cache budget, while benefiting synergistically from both supervised and reinforcement learning approaches.
摘要：最近，大型语言模型（LLM）通过产生长推理轨迹而显示出卓越的推理能力。然而，随着序列长度的增长，键值（KV）缓存会线性扩展，从而产生大量的内存和计算成本。现有的 KV 缓存驱逐方法通过丢弃不太重要的 KV 对来缓解这个问题，但通常无法捕获复杂的 KV 依赖关系，从而导致性能下降。为了更好地平衡效率和性能，我们引入了 ForesightKV，这是一种基于训练的 KV 缓存驱逐框架，可以学习预测在长文本生成期间要驱逐哪些 KV 对。我们首先设计了黄金驱逐算法，该算法使用未来的注意力分数来识别每一步的最佳驱逐 KV 对。然后通过带有成对排名损失的监督训练来提取每一步的痕迹和分数。此外，我们将缓存驱逐制定为马尔可夫决策过程，并应用 GRPO 算法来减轻低熵令牌上显着的语言建模损失增加。对三种推理模型的 AIME2024 和 AIME2025 基准进行的实验表明，ForesightKV 在仅一半的缓存预算下始终优于先前的方法，同时从监督学习和强化学习方法中协同受益。
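
The two training signals named in the ForesightKV abstract (an oracle eviction step that looks at future attention, and a pairwise ranking loss) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how future attention is aggregated or which ranking loss is used, so the sum aggregation, the hinge form, the data layout, and the names `golden_eviction_step` and `pairwise_ranking_loss` are all hypothetical.

```python
def golden_eviction_step(future_attention, cached, budget):
    """Pick which cached positions to keep, using future attention as an oracle.

    future_attention: dict mapping a cached position to the list of attention
        weights that later decoding steps assign to it (hypothetical layout).
    cached: positions currently held in the KV cache.
    budget: maximum number of positions to keep.
    """
    # Assumption: a KV pair's long-term contribution is the SUM of the
    # attention it receives from future steps.
    score = {pos: sum(future_attention.get(pos, [])) for pos in cached}
    # Rank by score; ties are broken in favor of more recent positions.
    ranked = sorted(cached, key=lambda pos: (score[pos], pos), reverse=True)
    keep = sorted(ranked[:budget])
    evict = sorted(set(cached) - set(keep))
    return keep, evict


def pairwise_ranking_loss(pred, target, margin=1.0):
    """Hinge-style pairwise loss (an assumed form).

    For every pair (i, j) where the oracle score of i exceeds that of j,
    penalize the predictor unless pred[i] beats pred[j] by at least `margin`.
    """
    loss, pairs = 0.0, 0
    for i in range(len(pred)):
        for j in range(len(pred)):
            if target[i] > target[j]:
                loss += max(0.0, margin - (pred[i] - pred[j]))
                pairs += 1
    return loss / pairs if pairs else 0.0


# Illustrative, made-up attention weights for four cached positions.
example_future = {0: [0.5, 0.4], 1: [0.01, 0.0], 2: [0.2, 0.3], 3: [0.05, 0.05]}
example_keep, example_evict = golden_eviction_step(example_future, [0, 1, 2, 3], budget=2)
```

A learned scorer would be trained so that its ordering of KV pairs matches the oracle ordering, since future attention is not available at inference time.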

Title: Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection

Authors: Dongwon Jo, Beomseok Kang, Jiwon Song, Jae-Joon Kim
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.03216
Pdf URL: https://arxiv.org/pdf/2602.03216
Copy Paste: [[2602.03216]] Token Sparse Attention: Efficient Long-Context Inference with Interleaved Token Selection(https://arxiv.org/abs/2602.03216)
Keywords: language model
Abstract: The quadratic complexity of attention remains the central bottleneck in long-context inference for large language models. Prior acceleration methods either sparsify the attention map with structured patterns or permanently evict tokens at specific layers, which can retain irrelevant tokens or rely on irreversible early decisions despite the layer-/head-wise dynamics of token importance. In this paper, we propose Token Sparse Attention, a lightweight and dynamic token-level sparsification mechanism that compresses per-head $Q$, $K$, $V$ to a reduced token set during attention and then decompresses the output back to the original sequence, enabling token information to be reconsidered in subsequent layers. Furthermore, Token Sparse Attention exposes a new design point at the intersection of token selection and sparse attention. Our approach is fully compatible with dense attention implementations, including Flash Attention, and can be seamlessly composed with existing sparse attention kernels. Experimental results show that Token Sparse Attention consistently improves the accuracy-latency trade-off, achieving up to 3.23$\times$ attention speedup at 128K context with less than 1% accuracy degradation. These results demonstrate that dynamic and interleaved token-level sparsification is a complementary and effective strategy for scalable long-context inference.
摘要：注意力的二次复杂度仍然是大型语言模型长上下文推理的中心瓶颈。先前的加速方法要么使用结构化模式稀疏注意力图，要么永久驱逐特定层的令牌，这可以保留不相关的令牌或依赖于不可逆的早期决策，尽管令牌重要性具有层/头动态。在本文中，我们提出了 Token Sparse Attention，这是一种轻量级、动态的 token 级稀疏机制，在注意力过程中将每个头 $Q$、$K$、$V$ 压缩为简化的 token 集，然后将输出解压缩回原始序列，从而使 token 信息能够在后续层中重新考虑。此外，令牌稀疏注意力在令牌选择和稀疏注意力的交叉点上暴露了一个新的设计点。我们的方法与密集注意力实现完全兼容，包括 Flash Attention，并且可以与现有的稀疏注意力内核无缝组合。实验结果表明，Token Sparse Attention 持续改进了准确度与延迟之间的权衡，在 128K 上下文中实现了高达 $\times$3.23 的注意力加速，而准确率下降不到 1%。这些结果表明，动态和交错的令牌级稀疏化是可扩展的长上下文推理的补充和有效策略。
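
The compress, attend, decompress pattern described in the Token Sparse Attention abstract can be sketched for a single head in plain Python. This is only an illustration under assumptions: the abstract does not specify how the reduced token set is chosen or what is written into dropped positions, so the caller-supplied `keep` indices, the zero fill, and the function names are hypothetical. Causal masking is omitted for brevity.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Dense (non-causal) scaled dot-product attention over lists of vectors."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out


def token_sparse_attention(q, k, v, keep):
    """Compress Q, K, V to the kept tokens, attend, then scatter back."""
    keep = sorted(keep)
    # Compress: gather only the selected token rows.
    q_s = [q[i] for i in keep]
    k_s = [k[i] for i in keep]
    v_s = [v[i] for i in keep]
    out_s = attention(q_s, k_s, v_s)
    # Decompress: restore the original sequence length. Assumption: dropped
    # positions receive a zero attention output, so a residual connection
    # would carry their hidden state unchanged to the next layer.
    out = [[0.0] * len(v[0]) for _ in q]
    for row, i in zip(out_s, keep):
        out[i] = row
    return out


# Tiny made-up inputs: 3 tokens, head dimension 2.
q_ex = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k_ex = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v_ex = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
sparse_out = token_sparse_attention(q_ex, k_ex, v_ex, keep=[0, 2])
```

Because the output keeps the full sequence length, a later layer can select a different token set, which is the property the abstract contrasts with permanent eviction.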

Title: ATACompressor: Adaptive Task-Aware Compression for Efficient Long-Context Processing in LLMs

Authors: Xuancheng Li, Haitao Li, Yujia Zhou, Qingyao Ai, Yiqun Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.03226
Pdf URL: https://arxiv.org/pdf/2602.03226
Copy Paste: [[2602.03226]] ATACompressor: Adaptive Task-Aware Compression for Efficient Long-Context Processing in LLMs(https://arxiv.org/abs/2602.03226)
Keywords: language model, llm, long context
Abstract: Long-context inputs in large language models (LLMs) often suffer from the "lost in the middle" problem, where critical information becomes diluted or ignored due to excessive length. Context compression methods aim to address this by reducing input size, but existing approaches struggle with balancing information preservation and compression efficiency. We propose Adaptive Task-Aware Compressor (ATACompressor), which dynamically adjusts compression based on the specific requirements of the task. ATACompressor employs a selective encoder that compresses only the task-relevant portions of long contexts, ensuring that essential information is preserved while reducing unnecessary content. Its adaptive allocation controller perceives the length of relevant content and adjusts the compression rate accordingly, optimizing resource utilization. We evaluate ATACompressor on three QA datasets (HotpotQA, MSMARCO, and SQUAD), showing that it outperforms existing methods in terms of both compression efficiency and task performance. Our approach provides a scalable solution for long-context processing in LLMs. Furthermore, we perform a range of ablation studies and analysis experiments to gain deeper insights into the key components of ATACompressor.
摘要：大型语言模型 (LLM) 中的长上下文输入通常会遇到“中间丢失”问题，即关键信息由于长度过长而被稀释或忽略。上下文压缩方法旨在通过减少输入大小来解决这个问题，但现有方法很难平衡信息保存和压缩效率。我们提出了自适应任务感知压缩器（ATACompressor），它可以根据任务的具体要求动态调整压缩。 ATACompressor 采用选择性编码器，仅压缩长上下文中与任务相关的部分，确保保留重要信息，同时减少不必要的内容。其自适应分配控制器感知相关内容的长度并相应调整压缩率，优化资源利用率。我们在三个 QA 数据集上评估 ATACompressor：HotpotQA、MSMARCO 和 SQUAD，表明它在压缩效率和任务性能方面都优于现有方法。我们的方法为法学硕士中的长上下文处理提供了可扩展的解决方案。此外，我们还进行了一系列消融研究和分析实验，以更深入地了解 ATACompressor 的关键组件。

Title: POP: Prefill-Only Pruning for Efficient Large Model Inference

Authors: Junhui He, Zhihui Fu, Jun Wang, Qingan Li
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2602.03295
Pdf URL: https://arxiv.org/pdf/2602.03295
Copy Paste: [[2602.03295]] POP: Prefill-Only Pruning for Efficient Large Model Inference(https://arxiv.org/abs/2602.03295)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated remarkable capabilities. However, their deployment is hindered by significant computational costs. Existing structured pruning methods, while hardware-efficient, often suffer from significant accuracy degradation. In this paper, we argue that this failure stems from a stage-agnostic pruning approach that overlooks the asymmetric roles between the prefill and decode stages. By introducing a virtual gate mechanism, our importance analysis reveals that deep layers are critical for next-token prediction (decode) but largely redundant for context encoding (prefill). Leveraging this insight, we propose Prefill-Only Pruning (POP), a stage-aware inference strategy that safely omits deep layers during the computationally intensive prefill stage while retaining the full model for the sensitive decode stage. To enable the transition between stages, we introduce independent Key-Value (KV) projections to maintain cache integrity, and a boundary handling strategy to ensure the accuracy of the first generated token. Extensive experiments on Llama-3.1, Qwen3-VL, and Gemma-3 across diverse modalities demonstrate that POP achieves up to 1.37$\times$ speedup in prefill latency with minimal performance loss, effectively overcoming the accuracy-efficiency trade-off limitations of existing structured pruning methods.
摘要：大型语言模型 (LLM) 和视觉语言模型 (VLM) 已展现出非凡的功能。然而，它们的部署受到巨大的计算成本的阻碍。现有的结构化剪枝方法虽然硬件效率高，但常常会遭受严重的准确性下降。在本文中，我们认为这种失败源于与阶段无关的修剪方法，该方法忽略了预填充和解码阶段之间的不对称作用。通过引入虚拟门机制，我们的重要性分析表明深层对于下一个令牌预测（解码）至关重要，但对于上下文编码（预填充）来说很大程度上是多余的。利用这一见解，我们提出了仅预填充修剪（POP），这是一种阶段感知推理策略，可以在计算密集型预填充阶段安全地省略深层，同时保留敏感解码阶段的完整模型。为了实现阶段之间的转换，我们引入了独立的键值（KV）投影来维护缓存完整性，并引入边界处理策略来确保第一个生成的令牌的准确性。对 Llama-3.1、Qwen3-VL 和 Gemma-3 跨不同模式的广泛实验表明，POP 在预填充延迟方面实现了高达 1.37$\times$ 的加速，同时性能损失最小，有效克服了现有结构化剪枝方法的准确性-效率权衡限制。

Title: MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research

Authors: Yifan Shi, Jialong Shi, Jiayi Wang, Ye Fan, Jianyong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03318
Pdf URL: https://arxiv.org/pdf/2602.03318
Copy Paste: [[2602.03318]] MIRROR: A Multi-Agent Framework with Iterative Adaptive Revision and Hierarchical Retrieval for Optimization Modeling in Operations Research(https://arxiv.org/abs/2602.03318)
Keywords: language model, llm, agent
Abstract: Operations Research (OR) relies on expert-driven modeling, a slow and fragile process ill-suited to novel scenarios. While large language models (LLMs) can automatically translate natural language into optimization models, existing approaches either rely on costly post-training or employ multi-agent frameworks, yet most still lack reliable collaborative error correction and task-specific retrieval, often leading to incorrect outputs. We propose MIRROR, a fine-tuning-free, end-to-end multi-agent framework that directly translates natural language optimization problems into mathematical models and solver code. MIRROR integrates two core mechanisms: (1) execution-driven iterative adaptive revision for automatic error correction, and (2) hierarchical retrieval to fetch relevant modeling and coding exemplars from a carefully curated exemplar library. Experiments show that MIRROR outperforms existing methods on standard OR benchmarks, with notable results on complex industrial datasets such as IndustryOR and Mamo-ComplexLP. By combining precise external knowledge infusion with systematic error correction, MIRROR provides non-expert users with an efficient and reliable OR modeling solution, overcoming the fundamental limitations of general-purpose LLMs in expert optimization tasks.
摘要：运筹学 (OR) 依赖于专家驱动的建模，这是一个缓慢且脆弱的过程，不适合新的场景。虽然大型语言模型（LLM）可以自动将自然语言转换为优化模型，但现有方法要么依赖昂贵的后期训练，要么采用多智能体框架，但大多数方法仍然缺乏可靠的协作纠错和特定于任务的检索，常常导致错误的输出。我们提出了 MIRROR，一种无需微调的端到端多智能体框架，可直接将自然语言优化问题转化为数学模型和求解器代码。 MIRROR 集成了两个核心机制：(1) 执行驱动的迭代自适应修订，用于自动纠错；(2) 分层检索，从精心策划的示例库中获取相关建模和编码示例。实验表明，MIRROR 在标准 OR 基准上优于现有方法，在复杂的工业数据集（例如 IndustryOR 和 Mamo-ComplexLP）上取得了显着的结果。通过将精确的外部知识注入与系统误差校正相结合，MIRROR为非专家用户提供了高效可靠的OR建模解决方案，克服了通用LLM在专家优化任务中的基本局限性。

Title: Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention

Authors: Rakshith Vasudev, Melisa Russak, Dan Bikel, Waseem Alshikh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.03338
Pdf URL: https://arxiv.org/pdf/2602.03338
Copy Paste: [[2602.03338]] Accurate Failure Prediction in Agents Does Not Imply Effective Failure Prevention(https://arxiv.org/abs/2602.03338)
Keywords: llm, agent
Abstract: Proactive interventions by LLM critic models are often assumed to improve reliability, yet their effects at deployment time are poorly understood. We show that a binary LLM critic with strong offline accuracy (AUROC 0.94) can nevertheless cause severe performance degradation, inducing a 26 percentage point (pp) collapse on one model while affecting another by near zero pp. This variability demonstrates that LLM critic accuracy alone is insufficient to determine whether intervention is safe. We identify a disruption-recovery tradeoff: interventions may recover failing trajectories but also disrupt trajectories that would have succeeded. Based on this insight, we propose a pre-deployment test that uses a small pilot of 50 tasks to estimate whether intervention is likely to help or harm, without requiring full deployment. Across benchmarks, the test correctly anticipates outcomes: intervention degrades performance on high-success tasks (0 to -26 pp), while yielding a modest improvement on the high-failure ALFWorld benchmark (+2.8 pp, p=0.014). The primary value of our framework is therefore identifying when not to intervene, preventing severe regressions before deployment.
摘要：通常认为法学硕士批评模型的主动干预可以提高可靠性，但人们对它们在部署时的影响知之甚少。我们表明，具有很强离线准确性（AUROC 0.94）的二元 LLM 批评家仍然会导致严重的性能下降，导致一个模型崩溃 26 个百分点 (pp)，同时影响另一个模型接近零 pp。这种可变性表明，仅 LLM 批评家准确性不足以确定干预是否安全。我们确定了破坏与恢复的权衡：干预措施可能会恢复失败的轨迹，但也会破坏原本会成功的轨迹。基于这一见解，我们提出了一项部署前测试，该测试使用 50 项任务的小型试点来估计干预是否可能有所帮助或有害，而不需要全面部署。在各个基准中，测试正确地预测了结果：干预会降低高成功任务的性能（0 至 -26 pp），同时对高失败 ALFWorld 基准产生适度的改进（+2.8 pp，p = 0.014）。因此，我们框架的主要价值是确定何时不进行干预，防止部署前出现严重的倒退。
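
The disruption-recovery tradeoff in the abstract above comes down to simple arithmetic, sketched here. This is an assumption-laden illustration, not the paper's pre-deployment test: it supposes each pilot task has a paired outcome (one run without the critic and one with it intervening), and the names `PilotOutcome` and `estimate_net_effect` as well as all numbers in the example are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PilotOutcome:
    baseline_success: bool    # did the task succeed without intervention?
    intervened_success: bool  # did it succeed with the critic intervening?


def estimate_net_effect(pilot):
    """Estimate whether intervention helps or harms on a small pilot.

    recovered: tasks that failed at baseline but succeed with intervention.
    disrupted: tasks that succeeded at baseline but fail with intervention.
    The net change in success rate is (recovered - disrupted) / n.
    """
    n = len(pilot)
    recovered = sum(1 for o in pilot if not o.baseline_success and o.intervened_success)
    disrupted = sum(1 for o in pilot if o.baseline_success and not o.intervened_success)
    return {
        "recovery_rate": recovered / n,
        "disruption_rate": disrupted / n,
        "net_pp": 100.0 * (recovered - disrupted) / n,
    }


# Made-up pilot of 50 tasks: 40 succeed at baseline (6 of them disrupted),
# 10 fail at baseline (2 of them recovered).
pilot = (
    [PilotOutcome(True, True)] * 34
    + [PilotOutcome(True, False)] * 6
    + [PilotOutcome(False, True)] * 2
    + [PilotOutcome(False, False)] * 8
)
effect = estimate_net_effect(pilot)
# net_pp is -8.0 here: 2 recovered minus 6 disrupted, out of 50 tasks.
```

The example shows why a high baseline success rate is risky for intervention: there are many successful trajectories to disrupt and few failing ones to recover, so the net effect can be negative even when the critic is accurate.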

Title: PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning

Authors: Yunzhi Shen, Hao Zhou, Xin Huang, Xue Han, Junlan Feng, Shujian Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03352
Pdf URL: https://arxiv.org/pdf/2602.03352
Copy Paste: [[2602.03352]] PEGRL: Improving Machine Translation by Post-Editing Guided Reinforcement Learning(https://arxiv.org/abs/2602.03352)
Keywords: llm
Abstract: Reinforcement learning (RL) has shown strong promise for LLM-based machine translation, with recent methods such as GRPO demonstrating notable gains; nevertheless, translation-oriented RL remains challenged by noisy learning signals arising from Monte Carlo return estimation, as well as a large trajectory space that favors global exploration over fine-grained local optimization. We introduce PEGRL, a two-stage RL framework that uses post-editing as an auxiliary task to stabilize training and guide overall optimization. At each iteration, translation outputs are sampled to construct post-editing inputs, allowing return estimation in the post-editing stage to benefit from conditioning on the current translation behavior, while jointly supporting both global exploration and fine-grained local optimization. A task-specific weighting scheme further balances the contributions of translation and post-editing objectives, yielding a biased yet more sample-efficient estimator. Experiments on English$\to$Finnish, English$\to$Turkish, and English$\leftrightarrow$Chinese show consistent gains over RL baselines, and for English$\to$Turkish, performance on COMET-KIWI is comparable to advanced LLM-based systems (DeepSeek-V3.2).
摘要：强化学习 (RL) 在基于 LLM 的机器翻译方面显示出强大的前景，最近的 GRPO 等方法显示出显着的成果；然而，面向翻译的强化学习仍然受到蒙特卡罗返回估计产生的噪声学习信号的挑战，以及有利于全局探索而不是细粒度局部优化的大轨迹空间。我们引入了 \textbf{PEGRL}，一个 \textit{两阶段} RL 框架，它使用后期编辑作为辅助任务来稳定训练并指导整体优化。在每次迭代中，对翻译输出进行采样以构建后期编辑输入，从而允许后期编辑阶段的回报估计受益于对当前翻译行为的调节，同时共同支持全局探索和细粒度局部优化。特定于任务的加权方案进一步平衡翻译和译后编辑目标的贡献，产生有偏差但样本效率更高的估计器。在英语$\to$芬兰语、英语$\to$土耳其语和英语$\leftrightarrow$中文上的实验显示出相对于 RL 基线的一致增益，而对于英语$\to$土耳其语，COMET-KIWI 上的性能与基于 LLM 的高级系统 (DeepSeek-V3.2) 相当。

Title: Pursuing Best Industrial Practices for Retrieval-Augmented Generation in the Medical Domain

Authors: Wei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03368
Pdf URL: https://arxiv.org/pdf/2602.03368
Copy Paste: [[2602.03368]] Pursuing Best Industrial Practices for Retrieval-Augmented Generation in the Medical Domain(https://arxiv.org/abs/2602.03368)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: While retrieval augmented generation (RAG) has been swiftly adopted in industrial applications based on large language models (LLMs), there is no consensus on the best practices for building a RAG system: which components it should include, how to organize them, and how to implement each one for industrial applications, especially in the medical domain. In this work, we first carefully analyze each component of the RAG system and propose practical alternatives for each component. Then, we conduct systematic evaluations on three types of tasks, revealing the best practices for improving the RAG system and how LLM-based RAG systems make trade-offs between performance and efficiency.
摘要：虽然检索增强生成（RAG）已在基于大语言模型（LLM）的工业应用中迅速采用，但对于构建 RAG 系统的最佳实践，包括哪些组件、如何组织这些组件以及如何为工业应用（尤其是在医疗领域）实现每个组件，尚未达成共识。在这项工作中，我们首先仔细分析 RAG 系统的每个组件，并为每个组件提出实用的替代方案。然后，我们对三类任务进行系统评估，揭示改进 RAG 系统的最佳实践以及基于 LLM 的 RAG 系统如何在性能和效率之间进行权衡。

Title: Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective

Authors: Hao Fang, Tianyi Zhang, Tianqu Zhuang, Jiawei Kong, Kuofeng Gao, Bin Chen, Leqi Liang, Shu-Tao Xia, Ke Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03396
Pdf URL: https://arxiv.org/pdf/2602.03396
Copy Paste: [[2602.03396]] Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective(https://arxiv.org/abs/2602.03396)
Keywords: language model, llm
Abstract: Proprietary large language models (LLMs) embody substantial economic value and are generally exposed only as black-box APIs, yet adversaries can still exploit their outputs to extract knowledge via distillation. Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. In this work, we analyze this problem and present an effective solution from an information-theoretic perspective. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. This quantity captures contextual information beneficial for model extraction, motivating us to defend against distillation via CMI minimization. Guided by our theoretical analysis, we propose learning a transformation matrix that purifies the original outputs to enhance distillation resistance. We further derive a CMI-inspired anti-distillation objective to optimize this transformation, which effectively removes distillation-relevant information while preserving output utility. Extensive experiments across multiple LLMs and strong distillation algorithms demonstrate that the proposed method significantly degrades distillation performance while preserving task accuracy, effectively protecting models' intellectual property.
摘要：专有的大语言模型 (LLM) 体现了巨大的经济价值，通常仅作为黑盒 API 公开，但对手仍然可以利用其输出通过蒸馏提取知识。现有的防御措施仅专注于基于文本的蒸馏，而重要的基于逻辑的蒸馏在很大程度上尚未得到探索。在这项工作中，我们从信息论的角度分析这个问题并提出有效的解决方案。我们使用教师逻辑和以真实标签为条件的输入查询之间的条件互信息（CMI）来表征教师输出中的蒸馏相关信息。这个数量捕获了有利于模型提取的上下文信息，激励我们通过 CMI 最小化来保护蒸馏。在我们的理论分析的指导下，我们建议学习一个变换矩阵，该矩阵可以净化原始输出以增强蒸馏阻力。我们进一步推导出受 CMI 启发的反蒸馏目标来优化此转换，从而有效地删除与蒸馏相关的信息，同时保留输出效用。跨多个法学硕士和强大的蒸馏算法的大量实验表明，所提出的方法在保持任务准确性的同时显着降低了蒸馏性能，有效保护了模型的知识产权。
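
For reference, the conditional mutual information named in the abstract above has the standard textbook definition below. The symbols are assumptions introduced here for illustration and do not come from the paper: $X$ is the input query, $Z$ the teacher logits, and $Y$ the ground-truth label. The paper's actual anti-distillation objective is not given in the abstract and is not reproduced here.

```latex
I(Z; X \mid Y)
  = \mathbb{E}_{p(x, y, z)}\left[ \log \frac{p(z \mid x, y)}{p(z \mid y)} \right]
  = H(Z \mid Y) - H(Z \mid X, Y)
```

The quantity is zero exactly when the logits carry no information about the query beyond what the label already determines, which is consistent with the abstract's description of CMI as the distillation-relevant part of the teacher output.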

Title: Verified Critical Step Optimization for LLM Agents

Authors: Mukai Li, Qingcheng Zeng, Tianqing Fang, Zhenwen Liang, Linfeng Song, Qi Liu, Haitao Mi, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03412
Pdf URL: https://arxiv.org/pdf/2602.03412
Copy Paste: [[2602.03412]] Verified Critical Step Optimization for LLM Agents(https://arxiv.org/abs/2602.03412)
Keywords: language model, llm, agent
Abstract: As large language model agents tackle increasingly complex long-horizon tasks, effective post-training becomes critical. Prior work faces fundamental challenges: outcome-only rewards fail to precisely attribute credit to intermediate steps, estimated step-level rewards introduce systematic noise, and Monte Carlo sampling approaches for step reward estimation incur prohibitive computational cost. Inspired by findings that only a small fraction of high-entropy tokens drive effective RL for reasoning, we propose Critical Step Optimization (CSO), which focuses preference learning on verified critical steps, decision points where alternate actions demonstrably flip task outcomes from failure to success. Crucially, our method starts from failed policy trajectories rather than expert demonstrations, directly targeting the policy model's weaknesses. We use a process reward model (PRM) to identify candidate critical steps, leverage expert models to propose high-quality alternatives, then continue execution from these alternatives using the policy model itself until task completion. Only alternatives that the policy successfully executes to correct outcomes are verified and used as DPO training data, ensuring both quality and policy reachability. This yields fine-grained, verifiable supervision at critical decisions while avoiding trajectory-level coarseness and step-level noise. Experiments on GAIA-Text-103 and XBench-DeepSearch show that CSO achieves 37% and 26% relative improvement over the SFT baseline and substantially outperforms other post-training methods, while requiring supervision at only 16% of trajectory steps. This demonstrates the effectiveness of selective verification-based learning for agent post-training.
摘要：随着大型语言模型代理处理日益复杂的长期任务，有效的后期培训变得至关重要。先前的工作面临着根本性的挑战：仅结果奖励无法精确地将功劳归因于中间步骤，估计的步骤级奖励引入系统噪声，以及用于步骤奖励估计的蒙特卡罗采样方法会产生过高的计算成本。受到只有一小部分高熵令牌驱动有效 RL 推理的发现的启发，我们提出了关键步骤优化 (CSO)，它将偏好学习重点放在经过验证的关键步骤和决策点上，在这些决策点上，替代行动可以明显地将任务结果从失败转变为成功。至关重要的是，我们的方法从失败的政策轨迹而不是专家论证出发，直接针对政策模型的弱点。我们使用过程奖励模型（PRM）来识别候选关键步骤，利用专家模型提出高质量的替代方案，然后使用策略模型本身继续执行这些替代方案，直到任务完成。只有成功执行策略以纠正结果的替代方案才会被验证并用作 DPO 培训数据，以确保质量和策略的可达性。这可以在关键决策中产生细粒度、可验证的监督，同时避免轨迹级粗糙度和步进级噪声。 GAIA-Text-103 和 XBench-DeepSearch 上的实验表明，CSO 比 SFT 基线实现了 37% 和 26% 的相对改进，并且大大优于其他训练后方法，同时仅需要 16% 的轨迹步骤进行监督。这证明了基于选择性验证的学习对于代理后培训的有效性。

Title: FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding

Authors: Yingli Shen, Wen Lai, Jie Zhou, Xueren Zhang, Yudong Wang, Kangyang Luo, Shuo Wang, Ge Gao, Alexander Fraser, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03417
Pdf URL: https://arxiv.org/pdf/2602.03417
Copy Paste: [[2602.03417]] FactNet: A Billion-Scale Knowledge Graph for Multilingual Factual Grounding(https://arxiv.org/abs/2602.03417)
Keywords: llm, hallucination
Abstract: While LLMs exhibit remarkable fluency, their utility is often compromised by factual hallucinations and a lack of traceable provenance. Existing resources for grounding mitigate this but typically enforce a dichotomy: they offer either structured knowledge without textual context (e.g., knowledge bases) or grounded text with limited scale and linguistic coverage. To bridge this gap, we introduce FactNet, a massive, open-source resource designed to unify 1.7 billion atomic assertions with 3.01 billion auditable evidence pointers derived exclusively from 316 Wikipedia editions. Unlike recent synthetic approaches, FactNet employs a strictly deterministic construction pipeline, ensuring that every evidence unit is recoverable with byte-level precision. Extensive auditing confirms a high grounding precision of 92.1%, even in long-tail languages. Furthermore, we establish FactNet-Bench, a comprehensive evaluation suite for Knowledge Graph Completion, Question Answering, and Fact Checking. FactNet provides the community with a foundational, reproducible resource for training and evaluating trustworthy, verifiable multilingual systems.
摘要：虽然法学硕士表现出非凡的流畅性，但其效用常常因事实幻觉和缺乏可追溯的出处而受到损害。现有的接地资源可以缓解这种情况，但通常会强制执行二分法：它们要么提供没有文本上下文（例如知识库）的结构化知识，要么提供规模和语言覆盖范围有限的接地文本。为了弥补这一差距，我们引入了 FactNet，这是一个庞大的开源资源，旨在将 17 亿个原子断言与专门源自 316 个维基百科版本的 30.1 亿个可审计证据指针统一起来。与最近的合成方法不同，FactNet 采用严格确定性的构建管道，确保每个证据单元都可以字节级精度恢复。广泛的审核证实，即使在长尾语言中，接地精度也高达 92.1%。此外，我们还建立了 FactNet-Bench，这是一个用于知识图谱补全、问答和事实检查的综合评估套件。 FactNet 为社区提供了基础的、可复制的资源，用于培训和评估值得信赖、可验证的多语言系统。

Title: A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces

Authors: Mingxuan Du, Benfeng Xu, Chiwei Zhu, Shaohan Wang, Pengyu Wang, Xiaorui Wang, Zhendong Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03442
Pdf URL: https://arxiv.org/pdf/2602.03442
Copy Paste: [[2602.03442]] A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces(https://arxiv.org/abs/2602.03442)
Keywords: language model, prompt, retrieval-augmented generation, agent
Abstract: Frontier language models have demonstrated strong reasoning and long-horizon tool-use capabilities. However, existing RAG systems fail to leverage these capabilities. They still rely on two paradigms: (1) designing an algorithm that retrieves passages in a single shot and concatenates them into the model's input, or (2) predefining a workflow and prompting the model to execute it step-by-step. Neither paradigm allows the model to participate in retrieval decisions, preventing efficient scaling with model improvements. In this paper, we introduce A-RAG, an Agentic RAG framework that exposes hierarchical retrieval interfaces directly to the model. A-RAG provides three retrieval tools: keyword search, semantic search, and chunk read, enabling the agent to adaptively search and retrieve information across multiple granularities. Experiments on multiple open-domain QA benchmarks show that A-RAG consistently outperforms existing approaches with comparable or lower retrieved tokens, demonstrating that A-RAG effectively leverages model capabilities and dynamically adapts to different RAG tasks. We further systematically study how A-RAG scales with model size and test-time compute. We will release our code and evaluation suite to facilitate future research. Code and evaluation suite are available at this https URL.
摘要：前沿语言模型展现了强大的推理能力和远景工具使用能力。然而，现有的 RAG 系统无法利用这些功能。他们仍然依赖两种范例：（1）设计一种算法，在一次中检索段落并将它们连接到模型的输入中，或者（2）预定义工作流程并提示模型逐步执行。这两种范式都不允许模型参与检索决策，从而阻碍了模型改进的有效扩展。在本文中，我们介绍了 A-RAG，这是一种 Agentic RAG 框架，它直接向模型公开分层检索接口。 A-RAG 提供三种检索工具：关键字搜索、语义搜索和块读取，使代理能够跨多个粒度自适应地搜索和检索信息。对多个开放域 QA 基准的实验表明，A-RAG 在检索到的令牌相当或更低的情况下始终优于现有方法，这表明 A-RAG 有效地利用了模型功能并动态适应不同的 RAG 任务。我们进一步系统地研究 A-RAG 如何随模型大小和测试时间计算进行扩展。我们将发布我们的代码和评估套件以促进未来的研究。代码和评估套件可从此 https URL 获取。

Title: Preferences for Idiomatic Language are Acquired Slowly -- and Forgotten Quickly: A Case Study on Swedish

Authors: Jenny Kunz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03484
Pdf URL: https://arxiv.org/pdf/2602.03484
Copy Paste: [[2602.03484]] Preferences for Idiomatic Language are Acquired Slowly -- and Forgotten Quickly: A Case Study on Swedish(https://arxiv.org/abs/2602.03484)
Keywords: language model
Abstract: In this study, we investigate how language models develop preferences for idiomatic as compared to linguistically acceptable Swedish, both during pretraining and when adapting a model from English to Swedish. To do so, we train models on Swedish from scratch and by fine-tuning English-pretrained models, probing their preferences at various checkpoints using minimal pairs that differ in linguistic acceptability or idiomaticity. For linguistic acceptability, we adapt existing benchmarks into a minimal-pair format. To assess idiomaticity, we introduce two novel datasets: one contrasting conventionalized idioms with plausible variants, and another contrasting idiomatic Swedish with Translationese. Our findings suggest that idiomatic competence emerges more slowly than other linguistic abilities, including grammatical and lexical correctness. While longer training yields diminishing returns for most tasks, idiom-related performance continues to improve, particularly in the largest model tested (8B). However, instruction tuning on data machine-translated from English -- the common approach for languages with little or no native instruction data -- causes models to rapidly lose their preference for idiomatic language.
摘要：在这项研究中，我们研究了在预训练期间以及将模型从英语调整为瑞典语时，语言模型如何与 \textit{语言上可接受的} 瑞典语相比，对 \textit{惯用} 产生偏好。为此，我们从头开始训练瑞典语模型，并微调英语预训练模型，使用语言可接受性或惯用性不同的最小对来探索它们在不同检查点的偏好。为了语言的可接受性，我们将现有基准调整为最小对格式。为了评估惯用性，我们引入了两个新颖的数据集：一个将常规习语与可能的变体进行对比，另一个将惯用瑞典语与翻译语进行对比。我们的研究结果表明，惯用能力比其他语言能力（包括语法和词汇正确性）出现得更慢。虽然对于大多数任务来说，较长的训练时间会带来收益递减，但与习语相关的性能却在持续提高，尤其是在测试的最大模型中 (8B)。然而，对从英语机器翻译的数据进行指令调整（这是本地指令数据很少或没有的语言的常见方法）会导致模型迅速失去对惯用语言的偏好。

Title: Self-Verification Dilemma: Experience-Driven Suppression of Overused Checking in LLM Reasoning

Authors: Quanyu Long, Kai Jie Jiang, Jianda Chen, Xu Guo, Leilei Gan, Wenya Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.03485
Pdf URL: https://arxiv.org/pdf/2602.03485
Copy Paste: [[2602.03485]] Self-Verification Dilemma: Experience-Driven Suppression of Overused Checking in LLM Reasoning(https://arxiv.org/abs/2602.03485)
Keywords: llm
Abstract: Large Reasoning Models (LRMs) achieve strong performance by generating long reasoning traces with reflection. Through a large-scale empirical analysis, we find that a substantial fraction of reflective steps are self-verification (recheck) steps that repeatedly confirm intermediate results. These rechecks occur frequently across models and benchmarks, yet the vast majority are confirmatory rather than corrective, rarely identifying errors or altering reasoning outcomes. This reveals a mismatch between how often self-verification is activated and how often it is actually useful. Motivated by this, we propose a novel, experience-driven test-time framework that reduces overused verification. Our method detects the activation of recheck behavior, consults an offline experience pool of past verification outcomes, and estimates whether a recheck is likely unnecessary via efficient retrieval. When historical experience suggests that a recheck is unnecessary, a suppression signal redirects the model to proceed. Across multiple models and benchmarks, our approach reduces token usage by up to 20.3% while maintaining accuracy, and on some datasets even yields accuracy improvements.
摘要：大型推理模型 (LRM) 通过利用反射生成长推理轨迹来实现强大的性能。通过大规模的实证分析，我们发现反思步骤的很大一部分是由反复确认中间结果的自我验证（复查）组成的。这些重新检查在模型和基准测试中频繁发生，但绝大多数是确认性的而不是纠正性的，很少发现错误并改变推理结果。这揭示了自我验证的激活频率与实际有用的频率之间的不匹配。受此启发，我们提出了一种新颖的、经验驱动的测试时间框架，可以减少过度使用的验证。我们的方法检测重新检查行为的激活，查阅过去验证结果的离线经验池，并通过有效检索来估计是否可能不需要重新检查。当历史经验表明不必要时，抑制信号会重定向模型以继续进行。在多个模型和基准测试中，我们的方法在保持准确性的同时将令牌使用量减少了高达 20.3%，并且在某些数据集中甚至提高了准确性。

Title: Learning to Reason Faithfully through Step-Level Faithfulness Maximization

Authors: Runquan Gui, Yafu Li, Xiaoye Qu, Ziyan Liu, Yeqiu Cheng, Yu Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03507
Pdf URL: https://arxiv.org/pdf/2602.03507
Copy Paste: [[2602.03507]] Learning to Reason Faithfully through Step-Level Faithfulness Maximization(https://arxiv.org/abs/2602.03507)
Keywords: language model, llm, hallucination
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has markedly improved the performance of Large Language Models (LLMs) on tasks requiring multi-step reasoning. However, most RLVR pipelines rely on sparse outcome-based rewards, providing little supervision over intermediate steps and thus encouraging over-confidence and spurious reasoning, which in turn increases hallucinations. To address this, we propose FaithRL, a general reinforcement learning framework that directly optimizes reasoning faithfulness. We formalize a faithfulness-maximization objective and theoretically show that optimizing it mitigates over-confidence. To instantiate this objective, we introduce a geometric reward design and a faithfulness-aware advantage modulation mechanism that assigns step-level credit by penalizing unsupported steps while preserving valid partial derivations. Across diverse backbones and benchmarks, FaithRL consistently reduces hallucination rates while maintaining (and often improving) answer correctness. Further analysis confirms that FaithRL increases step-wise reasoning faithfulness and generalizes robustly. Our code is available at this https URL.
摘要：带有可验证奖励的强化学习（RLVR）显着提高了大型语言模型（LLM）在需要多步推理的任务上的性能。然而，大多数 RLVR 管道依赖于稀疏的基于结果的奖励，对中间步骤提供很少的监督，从而鼓励过度自信和虚假推理，进而增加幻觉。为了解决这个问题，我们提出了 FaithRL，一种通用的强化学习框架，可以直接优化推理的可信度。我们形式化了忠实度最大化目标，并从理论上证明优化它可以减轻过度自信。为了实例化这一目标，我们引入了几何奖励设计和忠诚意识优势调制机制，该机制通过惩罚不受支持的步骤来分配步骤级信用，同时保留有效的偏导数。在不同的主干和基准中，FaithRL 持续降低幻觉率，同时保持（并且经常提高）答案的正确性。进一步的分析证实，FaithRL 提高了逐步推理的可信度并具有稳健的泛化能力。我们的代码可以在这个 https URL 上找到。

Title: Can Large Language Models Generalize Procedures Across Representations?

Authors: Fangru Lin, Valentin Hofmann, Xingchen Wan, Weixing Wang, Zifeng Ding, Anthony G. Cohn, Janet B. Pierrehumbert
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.03542
Pdf URL: https://arxiv.org/pdf/2602.03542
Copy Paste: [[2602.03542]] Can Large Language Models Generalize Procedures Across Representations?(https://arxiv.org/abs/2602.03542)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are trained and tested extensively on symbolic representations such as code and graphs, yet real-world user tasks are often specified in natural language. To what extent can LLMs generalize across these representations? Here, we approach this question by studying isomorphic tasks involving procedures represented in code, graphs, and natural language (e.g., scheduling steps in planning). We find that training LLMs with popular post-training methods on graphs or code data alone does not reliably generalize to corresponding natural language tasks, while training solely on natural language can lead to inefficient performance gains. To address this gap, we propose a two-stage data curriculum that first trains on symbolic, then natural language data. The curriculum substantially improves model performance across model families and tasks. Remarkably, a 1.5B Qwen model trained by our method can closely match zero-shot GPT-4o in naturalistic planning. Finally, our analysis suggests that successful cross-representation generalization can be interpreted as a form of generative analogy, which our curriculum effectively encourages.
摘要：大型语言模型 (LLM) 在代码和图形等符号表示上进行了广泛的训练和测试，但现实世界的用户任务通常是用自然语言指定的。法学硕士可以在多大程度上概括这些表述？在这里，我们通过研究涉及用代码、图形和自然语言表示的过程的同构任务来解决这个问题（例如，规划中的调度步骤）。我们发现，仅在图或代码数据上使用流行的后训练方法来训练法学硕士并不能可靠地推广到相应的自然语言任务，而仅在自然语言上进行训练可能会导致低效的性能提升。为了解决这一差距，我们提出了一个两阶段的数据课程，首先训练符号数据，然后训练自然语言数据。该课程极大地提高了模型系列和任务的模型性能。值得注意的是，用我们的方法训练的 1.5B Qwen 模型可以与自然规划中的零样本 GPT-4o 非常接近。最后，我们的分析表明，成功的交叉表示泛化可以解释为生成类比的一种形式，我们的课程有效地鼓励这种形式。

Title: SEAD: Self-Evolving Agent for Multi-Turn Service Dialogue

Authors: Yuqin Dai, Ning Gao, Wei Zhang, Jie Wang, Zichen Luo, Jinpeng Wang, Yujie Wang, Ruiyuan Wu, Chaozheng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03548
Pdf URL: https://arxiv.org/pdf/2602.03548
Copy Paste: [[2602.03548]] SEAD: Self-Evolving Agent for Multi-Turn Service Dialogue(https://arxiv.org/abs/2602.03548)
Keywords: language model, agent
Abstract: Large Language Models have demonstrated remarkable capabilities in open-domain dialogues. However, current methods exhibit suboptimal performance in service dialogues, as they rely on noisy, low-quality human conversation data. This limitation arises from data scarcity and the difficulty of simulating authentic, goal-oriented user behaviors. To address these issues, we propose SEAD (Self-Evolving Agent for Service Dialogue), a framework that enables agents to learn effective strategies without large-scale human annotations. SEAD decouples user modeling into two components: a Profile Controller that generates diverse user states to manage training curriculum, and a User Role-play Model that focuses on realistic role-playing. This design ensures the environment provides adaptive training scenarios rather than acting as an unfair adversary. Experiments demonstrate that SEAD significantly outperforms Open-source Foundation Models and Closed-source Commercial Models, improving task completion rate by 17.6% and dialogue efficiency by 11.1%. Code is available at: this https URL.
摘要：大型语言模型在开放领域对话中表现出了卓越的能力。然而，当前的方法在服务对话中表现出次优的性能，因为它们依赖于嘈杂、低质量的人类对话数据。这种限制源于数据稀缺以及模拟真实的、以目标为导向的用户行为的困难。为了解决这些问题，我们提出了 SEAD（服务对话的自我进化代理），这是一个框架，使代理能够在没有大规模人工注释的情况下学习有效的策略。 SEAD 将用户建模分解为两个组件：生成不同用户状态以管理培训课程的配置文件控制器，以及专注于现实角色扮演的用户角色扮演模型。这种设计确保环境提供适应性训练场景，而不是充当不公平的对手。实验表明，SEAD 显着优于开源基础模型和闭源商业模型，任务完成率提高了 17.6%，对话效率提高了 11.1%。代码可在以下位置获得：此 https URL。

Title: Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models

Authors: Vitalii Hirak, Jaap Jumelet, Arianna Bisazza
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03551
Pdf URL: https://arxiv.org/pdf/2602.03551
Copy Paste: [[2602.03551]] Assessing the Impact of Typological Features on Multilingual Machine Translation in the Age of Large Language Models(https://arxiv.org/abs/2602.03551)
Keywords: language model
Abstract: Despite major advances in multilingual modeling, large quality disparities persist across languages. Besides the obvious impact of uneven training resources, typological properties have also been proposed to determine the intrinsic difficulty of modeling a language. The existing evidence, however, is mostly based on small monolingual language models or bilingual translation models trained from scratch. We expand on this line of work by analyzing two large pre-trained multilingual translation models, NLLB-200 and Tower+, which are state-of-the-art representatives of encoder-decoder and decoder-only machine translation, respectively. Based on a broad set of languages, we find that target language typology drives translation quality of both models, even after controlling for more trivial factors, such as data resourcedness and writing script. Additionally, languages with certain typological properties benefit more from a wider search of the output space, suggesting that such languages could profit from alternative decoding strategies beyond the standard left-to-right beam search. To facilitate further research in this area, we release a set of fine-grained typological properties for 212 languages of the FLORES+ MT evaluation benchmark.
摘要：尽管多语言建模取得了重大进展，但不同语言之间仍然存在巨大的质量差异。除了训练资源不均匀的明显影响之外，还提出了类型学属性来确定语言建模的内在难度。然而，现有的证据大多基于从头开始训练的小型单语语言模型或双语翻译模型。我们通过分析两个大型预训练多语言翻译模型 NLLB-200 和 Tower+ 来扩展这一工作线，它们分别是编码器-解码器和仅解码器机器翻译的最先进代表。基于广泛的语言，我们发现目标语言类型决定了两种模型的翻译质量，即使在控制了更琐碎的因素（例如数据资源和编写脚本）之后也是如此。此外，具有某些类型学属性的语言从更广泛的输出空间搜索中受益更多，这表明此类语言可以从标准从左到右波束搜索之外的替代解码策略中受益。为了促进该领域的进一步研究，我们发布了 FLORES+ MT 评估基准的 212 种语言的一组细粒度类型属性。

Title: Use Graph When It Needs: Efficiently and Adaptively Integrating Retrieval-Augmented Generation with Graphs

Authors: Su Dong, Qinggang Zhang, Yilin Xiao, Shengyuan Chen, Chuang Zhou, Xiao Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.03578
Pdf URL: https://arxiv.org/pdf/2602.03578
Copy Paste: [[2602.03578]] Use Graph When It Needs: Efficiently and Adaptively Integrating Retrieval-Augmented Generation with Graphs(https://arxiv.org/abs/2602.03578)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) often struggle with knowledge-intensive tasks due to hallucinations and outdated parametric knowledge. While Retrieval-Augmented Generation (RAG) addresses this by integrating external corpora, its effectiveness is limited by fragmented information in unstructured domain documents. Graph-augmented RAG (GraphRAG) emerged to enhance contextual reasoning through structured knowledge graphs, yet paradoxically underperforms vanilla RAG in real-world scenarios, exhibiting significant accuracy drops and prohibitive latency despite gains on complex queries. We identify the rigid application of GraphRAG to all queries, regardless of complexity, as the root cause. To resolve this, we propose an efficient and adaptive GraphRAG framework called EA-GraphRAG that dynamically integrates RAG and GraphRAG paradigms through syntax-aware complexity analysis. Our approach introduces: (i) a syntactic feature constructor that parses each query and extracts a set of structural features; (ii) a lightweight complexity scorer that maps these features to a continuous complexity score; and (iii) a score-driven routing policy that selects dense RAG for low-score queries, invokes graph-based retrieval for high-score queries, and applies complexity-aware reciprocal rank fusion to handle borderline cases. Extensive experiments on a comprehensive benchmark, consisting of two single-hop and two multi-hop QA benchmarks, demonstrate that our EA-GraphRAG significantly improves accuracy, reduces latency, and achieves state-of-the-art performance in handling mixed scenarios involving both simple and complex queries.
摘要：由于幻觉和过时的参数知识，大型语言模型（LLM）经常难以应对知识密集型任务。虽然检索增强生成（RAG）通过集成外部语料库解决了这个问题，但其有效性受到非结构化领域文档中的碎片信息的限制。图增强 RAG (GraphRAG) 的出现是为了通过结构化知识图来增强上下文推理，但矛盾的是，在现实场景中，它的表现不如普通 RAG，尽管在复杂查询方面有所提高，但仍表现出显着的准确性下降和令人望而却步的延迟。我们将 GraphRAG 严格应用于所有查询（无论复杂程度如何）视为根本原因。为了解决这个问题，我们提出了一种高效且自适应的 GraphRAG 框架，称为 EA-GraphRAG，它通过语法感知的复杂性分析动态集成 RAG 和 GraphRAG 范式。我们的方法引入了：（i）一个语法特征构造函数，用于解析每个查询并提取一组结构特征； (ii) 一个轻量级复杂度评分器，将这些特征映射到连续的复杂度评分； (iii) 分数驱动的路由策略，为低分查询选择密集的 RAG，为高分查询调用基于图的检索，并应用复杂性感知的倒数排名融合来处理边界情况。对由两个单跳和两个多跳 QA 基准组成的综合基准进行的大量实验表明，我们的 EA-GraphRAG 显着提高了准确性，减少了延迟，并在处理涉及简单和复杂查询的混合场景时实现了最先进的性能。
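
The score-driven routing policy in item (iii) can be sketched as follows. This is an illustration under stated assumptions, not the EA-GraphRAG implementation: the thresholds `low` and `high`, the RRF constant `k=60`, and the choice to weight each ranking by the complexity score are all hypothetical.

```python
def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse ranked lists: each document gains weight / (k + rank) per list."""
    weights = weights or [1.0] * len(rankings)
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(scores, key=lambda doc: (-scores[doc], doc))


def route(complexity, dense_ranking, graph_ranking, low=0.3, high=0.7):
    """Pick a retrieval path from a complexity score assumed to lie in [0, 1].

    Low scores use dense retrieval only, high scores use graph retrieval
    only, and borderline scores fuse both rankings, weighting the graph
    ranking more heavily as the score grows (an assumed reading of
    "complexity-aware" fusion).
    """
    if complexity < low:
        return list(dense_ranking)
    if complexity > high:
        return list(graph_ranking)
    return reciprocal_rank_fusion(
        [dense_ranking, graph_ranking],
        weights=[1.0 - complexity, complexity],
    )
```

For example, `route(0.5, ["a", "b"], ["c", "a"])` fuses both lists with equal weight, so the document that appears in both rankings comes first.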

Title: $V_0$: A Generalist Value Model for Any Policy at State Zero

Authors: Yi-Kai Zhang, Zhiyuan Yao, Hongyan Hao, Yueqing Sun, Qi Gu, Hui Su, Xunliang Cai, De-Chuan Zhan, Han-Jia Ye
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.03584
Pdf URL: https://arxiv.org/pdf/2602.03584
Copy Paste: [[2602.03584]] $V_0$: A Generalist Value Model for Any Policy at State Zero(https://arxiv.org/abs/2602.03584)
Keywords: language model, llm, prompt
Abstract: Policy gradient methods rely on a baseline to measure the relative advantage of an action, ensuring the model reinforces behaviors that outperform its current average capability. In the training of Large Language Models (LLMs) using Actor-Critic methods (e.g., PPO), this baseline is typically estimated by a Value Model (Critic) often as large as the policy model itself. However, as the policy continuously evolves, the value model requires expensive, synchronous incremental training to accurately track the shifting capabilities of the policy. To avoid this overhead, Group Relative Policy Optimization (GRPO) eliminates the coupled value model by using the average reward of a group of rollouts as the baseline; yet, this approach necessitates extensive sampling to maintain estimation stability. In this paper, we propose $V_0$, a Generalist Value Model capable of estimating the expected performance of any model on unseen prompts without requiring parameter updates. We reframe value estimation by treating the policy's dynamic capability as an explicit context input; specifically, we leverage a history of instruction-performance pairs to dynamically profile the model, departing from the traditional paradigm that relies on parameter fitting to perceive capability shifts. Focusing on value estimation at State Zero (i.e., the initial prompt, hence $V_0$), our model serves as a critical resource scheduler. During GRPO training, $V_0$ predicts success rates prior to rollout, allowing for efficient sampling budget allocation; during deployment, it functions as a router, dispatching instructions to the most cost-effective and suitable model. Empirical results demonstrate that $V_0$ significantly outperforms heuristic budget allocation and achieves a Pareto-optimal trade-off between performance and cost in LLM routing tasks.
摘要：策略梯度方法依靠基线来衡量行动的相对优势，确保模型强化优于其当前平均能力的行为。在使用 Actor-Critic 方法（例如 PPO）训练大型语言模型（LLM）时，该基线通常由通常与策略模型本身一样大的价值模型（Critic）估计。然而，随着政策的不断发展，价值模型需要昂贵的同步增量训练来准确跟踪政策的变化能力。为了避免这种开销，组相对策略优化（GRPO）通过使用一组推出的平均奖励作为基线来消除耦合价值模型；然而，这种方法需要进行大量采样以保持估计的稳定性。在本文中，我们提出了 $V_0$，这是一种通才价值模型，能够估计任何模型在未见过的提示上的预期性能，而无需更新参数。我们通过将策略的动态能力视为显式上下文输入来重新构建价值估计；具体来说，我们利用指令-性能对的历史记录来动态分析模型，这与依赖参数拟合来感知能力变化的传统范例不同。我们的模型专注于状态零（即初始提示，因此 $V_0$）的价值估计，充当关键资源调度程序。在 GRPO 培训期间，$V_0$ 在推出之前预测成功率，从而实现高效的抽样预算分配；在部署过程中，它充当路由器，将指令发送到最具成本效益和最合适的模型。实证结果表明，$V_0$ 显着优于启发式预算分配，并在 LLM 路由任务中实现了性能和成本之间的帕累托最优权衡。
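
The group baseline that the abstract attributes to GRPO can be written in a few lines. This is a minimal sketch: the function name is hypothetical, and dividing by the group standard deviation is a commonly used variant that the abstract does not mention, so it is exposed as an optional flag.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, normalize=True, eps=1e-8):
    """Advantage of each rollout relative to its group's average reward.

    rewards holds the scalar rewards of several rollouts for one prompt.
    The baseline is the group mean, so no separate value model is needed.
    """
    baseline = mean(rewards)
    centered = [r - baseline for r in rewards]
    if not normalize:
        return centered
    scale = pstdev(rewards)  # assumed normalization, not stated in the abstract
    return [c / (scale + eps) for c in centered]
```

With few rollouts the group mean is a noisy baseline, which is the sampling cost the abstract refers to. A prompt whose rollouts all receive the same reward yields zero advantages, so a success-rate prediction made before rollout could help decide where sampling budget is best spent.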

Title: CL-bench: A Benchmark for Context Learning

Authors: Shihan Dou, Ming Zhang, Zhangyue Yin, Chenhao Huang, Yujiong Shen, Junzhe Wang, Jiayi Chen, Yuchen Ni, Junjie Ye, Cheng Zhang, Huaibing Xie, Jianglu Hu, Shaolei Wang, Weichao Wang, Yanling Xiao, Yiting Liu, Zenan Xu, Zhen Guo, Pluto Zhou, Tao Gui, Zuxuan Wu, Xipeng Qiu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang, Di Wang, Shunyu Yao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03587
Pdf URL: https://arxiv.org/pdf/2602.03587
Copy Paste: [[2602.03587]] CL-bench: A Benchmark for Context Learning(https://arxiv.org/abs/2602.03587)
Keywords: language model, gpt, prompt
Abstract: Current language models (LMs) excel at reasoning over prompts using pre-trained knowledge. However, real-world tasks are far more complex and context-dependent: models must learn from task-specific context and leverage new knowledge beyond what is learned during pre-training to reason and resolve tasks. We term this capability context learning, a crucial ability that humans naturally possess but that has been largely overlooked. To this end, we introduce CL-bench, a real-world benchmark consisting of 500 complex contexts, 1,899 tasks, and 31,607 verification rubrics, all crafted by experienced domain experts. Each task is designed such that the new content required to resolve it is contained within the corresponding context. Resolving tasks in CL-bench requires models to learn from the context, ranging from new domain-specific knowledge, rule systems, and complex procedures to laws derived from empirical data, all of which are absent from pre-training. This goes far beyond long-context tasks that primarily test retrieval or reading comprehension, and in-context learning tasks, where models learn simple task patterns via instructions and demonstrations. Our evaluations of ten frontier LMs find that models solve only 17.2% of tasks on average. Even the best-performing model, GPT-5.1, solves only 23.7%, revealing that LMs have yet to achieve effective context learning, which poses a critical bottleneck for tackling real-world, complex context-dependent tasks. CL-bench represents a step towards building LMs with this fundamental capability, making them more intelligent and advancing their deployment in real-world scenarios.
摘要：当前的语言模型（LM）擅长使用预先训练的知识对提示进行推理。然而，现实世界的任务要复杂得多并且依赖于上下文：模型必须从特定于任务的上下文中学习，并利用预训练期间学到的新知识来推理和解决任务。我们将这种能力称为情境学习，这是人类自然拥有但在很大程度上被忽视的一项重要能力。为此，我们引入了 CL-bench，这是一个真实世界的基准测试，由 500 个复杂的上下文、1,899 个任务和 31,607 个验证标准组成，所有这些均由经验丰富的领域专家精心设计。每个任务的设计都使得解决该任务所需的新内容包含在相应的上下文中。解决 CL-bench 中的任务需要模型从上下文中学习，从新的特定领域知识、规则系统和复杂的程序到从经验数据得出的规律，所有这些都是预训练中所缺少的。这远远超出了主要测试检索或阅读理解的长上下文任务，以及上下文学习任务，其中模型通过指令和演示学习简单的任务模式。我们对 10 个前沿 LM 的评估发现，模型平均只能解决 17.2% 的任务。即使是性能最好的模型 GPT-5.1，也只能解决 23.7% 的问题，这表明 LM 尚未实现有效的上下文学习，这为处理现实世界中复杂的上下文相关任务构成了关键瓶颈。 CL-bench 代表着朝着构建具有这一基本功能的 LM 迈出的一步，使它们更加智能并推进它们在现实场景中的部署。

Title: Controlling Output Rankings in Generative Engines for LLM-based Search

Authors: Haibo Jin, Ruoxi Chen, Peiyan Zhang, Yifeng Luo, Huimin Zeng, Man Luo, Haohan Wang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2602.03608
Pdf URL: https://arxiv.org/pdf/2602.03608
Copy Paste: [[2602.03608]] Controlling Output Rankings in Generative Engines for LLM-based Search(https://arxiv.org/abs/2602.03608)
Keywords: language model, gpt, llm
Abstract: The way customers search for and choose products is changing with the rise of large language models (LLMs). LLM-based search, or generative engines, provides direct product recommendations to users, rather than traditional online search results that require users to explore options themselves. However, these recommendations are strongly influenced by the initial retrieval order of LLMs, which disadvantages small businesses and independent creators by limiting their visibility. In this work, we propose CORE, an optimization method that Controls Output Rankings in gEnerative Engines for LLM-based search. Since the LLM's interactions with the search engine are black-box, CORE targets the content returned by search engines as the primary means of influencing output rankings. Specifically, CORE optimizes retrieved content by appending strategically designed optimization content to steer the ranking of outputs. We introduce three types of optimization content: string-based, reasoning-based, and review-based, demonstrating their effectiveness in shaping output rankings. To evaluate CORE in realistic settings, we introduce ProductBench, a large-scale benchmark with 15 product categories and 200 products per category, where each product is associated with its top-10 recommendations collected from Amazon's search interface. Extensive experiments on four LLMs with search capabilities (GPT-4o, Gemini-2.5, Claude-4, and Grok-3) demonstrate that CORE achieves an average Promotion Success Rate of 91.4% @Top-5, 86.6% @Top-3, and 80.3% @Top-1, across 15 product categories, outperforming existing ranking manipulation methods while preserving the fluency of optimized content.
摘要：随着大型语言模型 (LLM) 的兴起，客户搜索和选择产品的方式正在发生变化。基于法学硕士的搜索或生成引擎向用户提供直接的产品推荐，而不是需要用户自己探索选项的传统在线搜索结果。然而，这些建议受到法学硕士初始检索顺序的强烈影响，这限制了小型企业和独立创作者的可见性，从而对他们不利。在这项工作中，我们提出了 CORE，一种优化方法，用于控制基于 LLM 搜索的 g\textbf{E} 生成引擎中的 \textbf{O} 输出 \textbf{R}ankings。由于LLM与搜索引擎的交互是黑盒的，CORE将搜索引擎返回的内容作为影响输出排名的主要手段。具体来说，CORE 通过附加战略设计的优化内容来优化检索的内容，以引导输出的排名。我们引入了三种类型的优化内容：基于字符串、基于推理和基于评论，展示了它们在塑造输出排名方面的有效性。为了在现实环境中评估 CORE，我们引入了 ProductBench，这是一个包含 15 个产品类别和每个类别 200 个产品的大型基准，其中每个产品都与其从亚马逊搜索界面收集的前 10 名推荐相关联。对四个具有搜索功能的法学硕士（GPT-4o、Gemini-2.5、Claude-4 和 Grok-3）进行的广泛实验表明，CORE 在 15 个产品类别中实现了 \textbf{91.4\% @Top-5}、\textbf{86.6\% @Top-3} 和 \textbf{80.3\% @Top-1} 的平均促销成功率，优于现有的排名操作方法，同时保持优化内容的流畅性。

Title: Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

Authors: Changze Lv, Jie Zhou, Wentao Zhao, Jingwen Xu, Zisu Huang, Muzhao Tian, Shihan Dou, Tao Gui, Le Tian, Xiao Zhou, Xiaoqing Zheng, Xuanjing Huang, Jie Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03619
Pdf URL: https://arxiv.org/pdf/2602.03619
Copy Paste: [[2602.03619]] Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation(https://arxiv.org/abs/2602.03619)
Keywords: llm, agent
Abstract: Nowadays, training and evaluating DeepResearch-generated reports remain challenging due to the lack of verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our proposed rubric generators deliver more discriminative and better human-aligned supervision than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on the DeepResearch Bench and achieve performance comparable to that of leading closed-source models.
摘要：如今，由于缺乏可验证的奖励信号，训练和评估 DeepResearch 生成的报告仍然具有挑战性。因此，基于标准的评估已成为一种常见做法。然而，现有的方法要么依赖于缺乏足够粒度的粗略、预定义的规则，要么依赖于手动构建的特定于查询的规则，这些规则成本高昂且难以扩展。在本文中，我们提出了一个管道来训练与人类偏好一致的特定于查询的标题生成器，这些生成器是为 DeepResearch 报告生成而定制的。我们首先构建一个 DeepResearch 风格的查询数据集，通过配对报告标注人类偏好，并通过强化学习和结合人类偏好监督和基于 LLM 的评分标准评估的混合奖励来训练评分生成器。为了更好地处理长期推理，我们进一步引入了用于报告生成的多智能体马尔可夫状态（MaMs）工作流程。我们的经验表明，与现有的标题设计策略相比，我们提出的标题生成器提供了更具辨别力和更好的人性化监督。此外，当集成到 MaMs 训练框架中时，配备我们的 rubric 生成器的 DeepResearch 系统始终优于 DeepResearch Bench 上的所有开源基线，并实现与领先的闭源模型相当的性能。

Title: BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish

Authors: Burak Aktaş, Mehmet Can Baytekin, Süha Kağan Köse, Ömer İlbilgi, Elif Özge Yılmaz, Çağrı Toraman, Bilge Kaan Görür
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2602.03633
Pdf URL: https://arxiv.org/pdf/2602.03633
Copy Paste: [[2602.03633]] BIRDTurk: Adaptation of the BIRD Text-to-SQL Dataset to Turkish(https://arxiv.org/abs/2602.03633)
Keywords: llm, prompt, agent
Abstract: Text-to-SQL systems have achieved strong performance on English benchmarks, yet their behavior in morphologically rich, low-resource languages remains largely unexplored. We introduce BIRDTurk, the first Turkish adaptation of the BIRD benchmark, constructed through a controlled translation pipeline that adapts schema identifiers to Turkish while strictly preserving the logical structure and execution semantics of SQL queries and databases. Translation quality is validated on a sample size determined by the Central Limit Theorem to ensure 95% confidence, achieving 98.15% accuracy on human-evaluated samples. Using BIRDTurk, we evaluate inference-based prompting, agentic multi-stage reasoning, and supervised fine-tuning. Our results reveal that Turkish introduces consistent performance degradation, driven by both structural linguistic divergence and underrepresentation in LLM pretraining, while agentic reasoning demonstrates stronger cross-lingual robustness. Supervised fine-tuning remains challenging for standard multilingual baselines but scales effectively with modern instruction-tuned models. BIRDTurk provides a controlled testbed for cross-lingual Text-to-SQL evaluation under realistic database conditions. We release the training and development splits to support future research.
摘要：文本到 SQL 系统在英语基准测试中取得了出色的性能，但它们在形态丰富、资源匮乏的语言中的行为在很大程度上仍未得到探索。我们引入了 BIRDTurk，这是 BIRD 基准的第一个土耳其语版本，通过受控翻译管道构建，该管道将模式标识符适应土耳其语，同时严格保留 SQL 查询和数据库的逻辑结构和执行语义。翻译质量根据中心极限定理确定的样本量进行验证，以确保 95% 的置信度，在人工评估样本上实现 98.15% 的准确度。使用 BIRDTurk，我们评估基于推理的提示、代理多阶段推理和监督微调。我们的结果表明，由于 LLM 预训练中的结构语言分歧和代表性不足，土耳其语引入了一致的性能下降，而代理推理则表现出更强的跨语言鲁棒性。监督微调对于标准多语言基线仍然具有挑战性，但可以通过现代指令调整模型有效扩展。 BIRDTurk 提供了一个受控测试平台，用于在实际数据库条件下进行跨语言文本到 SQL 评估。我们发布培训和开发拆分以支持未来的研究。

Title: TRE: Encouraging Exploration in the Trust Region

Authors: Chao Huang, Yujing Lu, Quangang Li, Shenghe Wang, Yan Wang, Yueyang Zhang, Long Xia, Jiashu Zhao, Zhiyuan Sun, Daiting Shi, Tingwen Liu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.03635
Pdf URL: https://arxiv.org/pdf/2602.03635
Copy Paste: [[2602.03635]] TRE: Encouraging Exploration in the Trust Region(https://arxiv.org/abs/2602.03635)
Keywords: language model, llm
Abstract: Entropy regularization is a standard technique in reinforcement learning (RL) to enhance exploration, yet it yields negligible effects or even degrades performance in Large Language Models (LLMs). We attribute this failure to the cumulative tail risk inherent to LLMs with massive vocabularies and long generation horizons. In such environments, standard global entropy maximization indiscriminately dilutes probability mass into the vast tail of invalid tokens rather than focusing on plausible candidates, thereby disrupting coherent reasoning. To address this, we propose Trust Region Entropy (TRE), a method that encourages exploration strictly within the model's trust region. Extensive experiments across mathematical reasoning (MATH), combinatorial search (Countdown), and preference alignment (HH) tasks demonstrate that TRE consistently outperforms vanilla PPO, standard entropy regularization, and other exploration baselines. Our code is available at this https URL.
摘要：熵正则化是强化学习 (RL) 中增强探索的标准技术，但它产生的影响可以忽略不计，甚至会降低大型语言模型 (LLM) 的性能。我们将这种失败归因于具有大量词汇量和长世代视野的法学硕士固有的累积尾部风险。在这种环境中，标准的全局熵最大化不加区别地将概率质量稀释到无效标记的巨大尾部，而不是专注于看似合理的候选者，从而破坏了连贯的推理。为了解决这个问题，我们提出了信任区域熵（TRE），这是一种鼓励严格在模型信任区域内进行探索的方法。数学推理 (MATH)、组合搜索 (倒计时) 和偏好对齐 (HH) 任务的广泛实验表明，TRE 始终优于普通 PPO、标准熵正则化和其他探索基线。我们的代码可以在这个 https URL 上找到。

Title: RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish

Authors: Süha Kağan Köse, Mehmet Can Baytekin, Burak Aktaş, Bilge Kaan Görür, Evren Ayberk Munis, Deniz Yılmaz, Muhammed Yusuf Kartal, Çağrı Toraman
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2602.03652
Pdf URL: https://arxiv.org/pdf/2602.03652
Copy Paste: [[2602.03652]] RAGTurk: Best Practices for Retrieval Augmented Generation in Turkish(https://arxiv.org/abs/2602.03652)
Keywords: llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) enhances LLM factuality, yet design guidance remains English-centric, limiting insights for morphologically rich languages like Turkish. We address this by constructing a comprehensive Turkish RAG dataset derived from Turkish Wikipedia and CulturaX, comprising question-answer pairs and relevant passage chunks. We benchmark seven stages of the RAG pipeline, from query transformation and reranking to answer refinement, without task-specific fine-tuning. Our results show that complex methods like HyDE maximize accuracy (85%), considerably higher than the baseline (78.70%). A Pareto-optimal configuration using Cross-encoder Reranking and Context Augmentation also achieves comparable performance (84.60%) at much lower cost. We further demonstrate that over-stacking generative modules can degrade performance by distorting morphological cues, whereas simple query clarification with robust reranking offers an effective solution.
摘要：检索增强生成（RAG）增强了法学硕士的事实性，但设计指导仍然以英语为中心，限制了对土耳其语等形态丰富的语言的洞察。我们通过构建源自土耳其语维基百科和 CulturaX 的综合土耳其语 RAG 数据集来解决这个问题，其中包括问答对和相关段落块。我们对 RAG 管道的七个阶段进行了基准测试，从查询转换和重新排名到答案细化，无需针对特定任务进行微调。我们的结果表明，像 HyDE 这样的复杂方法可以最大限度地提高准确率 (85%)，远高于基线 (78.70%)。此外，使用交叉编码器重排序和上下文增强的帕累托最优配置以低得多的成本实现了可比的性能 (84.60%)。我们进一步证明，过度堆叠的生成模块会通过扭曲形态线索来降低性能，而简单的查询澄清和强大的重新排名提供了有效的解决方案。

Title: Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration

Authors: Yu Zhang, Mufan Xu, Xuefeng Bai, Kehai Chen, Pengfei Zhang, Yang Xiang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03677
Pdf URL: https://arxiv.org/pdf/2602.03677
Copy Paste: [[2602.03677]] Instruction Anchors: Dissecting the Causal Dynamics of Modality Arbitration(https://arxiv.org/abs/2602.03677)
Keywords: language model, llm
Abstract: Modality following is the capacity of multimodal large language models (MLLMs) to selectively utilize multimodal contexts based on user instructions. It is fundamental to ensuring safety and reliability in real-world deployments. However, the underlying mechanisms governing this decision-making process remain poorly understood. In this paper, we investigate its working mechanism through an information flow lens. Our findings reveal that instruction tokens function as structural anchors for modality arbitration: shallow attention layers perform non-selective information transfer, routing multimodal cues to these anchors as a latent buffer; modality competition is resolved within deep attention layers guided by the instruction intent, while MLP layers exhibit semantic inertia, acting as an adversarial force. Furthermore, we identify a sparse set of specialized attention heads that drive this arbitration. Causal interventions demonstrate that manipulating a mere 5% of these critical heads can decrease the modality-following ratio by 60% through blocking, or increase it by 60% through targeted amplification of failed samples. Our work provides a substantial step toward model transparency and offers a principled framework for the orchestration of multimodal information in MLLMs.
摘要：模态跟随是多模态大语言模型（MLLM）根据用户指令有选择地利用多模态上下文的能力。它对于确保现实部署中的安全性和可靠性至关重要。然而，人们对控制这一决策过程的基本机制仍然知之甚少。在本文中，我们通过信息流镜头研究其工作机制。我们的研究结果表明，指令标记充当模态仲裁的结构锚点：浅层注意力层执行非选择性信息传输，将多模态线索路由到这些锚点作为潜在缓冲区；模态竞争是在指令意图引导的深层注意力层内解决的，而 MLP 层表现出语义惰性，充当对抗力量。此外，我们还确定了一组稀疏的专门关注头来推动这一仲裁。因果干预表明，操纵这些关键头中的仅仅 $5\%$ 就可以通过阻止将模态遵循率降低 $60\%$，或者通过有针对性地放大失败样本将模态遵循率提高 $60\%$。我们的工作为模型透明度迈出了实质性的一步，并为 MLLM 中多模式信息的编排提供了原则框架。

Title: Rethinking the Reranker: Boundary-Aware Evidence Selection for Robust Retrieval-Augmented Generation

Authors: Jiashuo Sun, Pengcheng Jiang, Saizhuo Wang, Jiajun Fan, Heng Wang, Siru Ouyang, Ming Zhong, Yizhu Jiao, Chengsong Huang, Xueqiang Xu, Pengrui Han, Peiran Li, Jiaxin Huang, Ge Liu, Heng Ji, Jiawei Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.03689
Pdf URL: https://arxiv.org/pdf/2602.03689
Copy Paste: [[2602.03689]] Rethinking the Reranker: Boundary-Aware Evidence Selection for Robust Retrieval-Augmented Generation(https://arxiv.org/abs/2602.03689)
Keywords: retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems remain brittle under realistic retrieval noise, even when the required evidence appears in the top-K results. A key reason is that retrievers and rerankers optimize solely for relevance, often selecting either trivial, answer-revealing passages or evidence that lacks the critical information required to answer the question, without considering whether the evidence is suitable for the generator. We propose BAR-RAG, which reframes the reranker as a boundary-aware evidence selector that targets the generator's Goldilocks Zone -- evidence that is neither trivially easy nor fundamentally unanswerable for the generator, but is challenging yet sufficient for inference and thus provides the strongest learning signal. BAR-RAG trains the selector with reinforcement learning using generator feedback, and adopts a two-stage pipeline that fine-tunes the generator under the induced evidence distribution to mitigate the distribution mismatch between training and inference. Experiments on knowledge-intensive question answering benchmarks show that BAR-RAG consistently improves end-to-end performance under noisy retrieval, achieving an average gain of 10.3 percent over strong RAG and reranking baselines while substantially improving robustness. Code is publicly available at this https URL.
摘要：即使所需的证据出现在前 K 个结果中，检索增强生成 (RAG) 系统在现实的检索噪声下仍然很脆弱。一个关键原因是检索器和重新排序器仅针对相关性进行优化，通常选择琐碎的、揭示答案的段落或缺乏回答问题所需的关键信息的证据，而不考虑证据是否适合生成器。我们提出了 BAR-RAG，它将重新排序器重新构建为一个边界感知的证据选择器，以生成器的金发姑娘区为目标——对于生成器来说，这些证据既不是简单的，也不是根本上无法回答的，但具有挑战性，但足以进行推理，从而提供最强的学习信号。 BAR-RAG 使用生成器反馈通过强化学习来训练选择器，并采用两阶段管道在诱导证据分布下微调生成器，以减轻训练和推理之间的分布不匹配。知识密集型问答基准实验表明，BAR-RAG 持续提高了噪声检索下的端到端性能，与强 RAG 和重新排序基线相比，平均增益提高了 10.3%，同时大幅提高了鲁棒性。代码可通过此 https URL 公开获取。

Title: OCRTurk: A Comprehensive OCR Benchmark for Turkish

Authors: Deniz Yılmaz, Evren Ayberk Munis, Çağrı Toraman, Süha Kağan Köse, Burak Aktaş, Mehmet Can Baytekin, Bilge Kaan Görür
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.03693
Pdf URL: https://arxiv.org/pdf/2602.03693
Copy Paste: [[2602.03693]] OCRTurk: A Comprehensive OCR Benchmark for Turkish(https://arxiv.org/abs/2602.03693)
Keywords: retrieval-augmented generation
Abstract: Document parsing is now widely used in applications, such as large-scale document digitization, retrieval-augmented generation, and domain-specific pipelines in healthcare and education. Benchmarking these models is crucial for assessing their reliability and practical robustness. Existing benchmarks mostly target high-resource languages and provide limited coverage for low-resource settings, such as Turkish. Moreover, existing studies on Turkish document parsing lack a standardized benchmark that reflects real-world scenarios and document diversity. To address this gap, we introduce OCRTurk, a Turkish document parsing benchmark covering multiple layout elements and document categories at three difficulty levels. OCRTurk consists of 180 Turkish documents drawn from academic articles, theses, slide decks, and non-academic articles. We evaluate seven OCR models on OCRTurk using element-wise metrics. Across difficulty levels, PaddleOCR achieves the strongest overall results, leading most element-wise metrics except figures and attaining high Normalized Edit Distance scores in easy, medium, and hard subsets. We also observe performance variation by document type. Models perform well on non-academic documents, while slideshows become the most challenging.
摘要：文档解析现已广泛应用于大规模文档数字化、检索增强生成以及医疗保健和教育领域的特定领域管道等应用中。对这些模型进行基准测试对于评估其可靠性和实际稳健性至关重要。现有的基准测试主要针对高资源语言，对低资源环境（例如土耳其语）的覆盖范围有限。此外，现有的土耳其文档解析研究缺乏反映现实世界场景和文档多样性的标准化基准。为了解决这一差距，我们引入了 OCRTurk，这是一个土耳其文档解析基准，涵盖三个难度级别的多个布局元素和文档类别。 OCRTurk 包含 180 份土耳其语文档，这些文档取自学术文章、论文、幻灯片和非学术文章。我们使用逐元素指标评估 OCRTurk 上的七个 OCR 模型。在各个难度级别上，PaddleOCR 取得了最强的整体结果，领先于除数字之外的大多数元素指标，并在简单、中等和困难子集中获得了较高的归一化编辑距离分数。我们还观察了不同文档类型的性能差异。模型在非学术文档上表现良好，而幻灯片则成为最具挑战性的。

Title: Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models

Authors: Yu Tian, Linh Huynh, Katerina Christhilf, Shubham Chakraborty, Micah Watanabe, Tracy Arner, Danielle McNamara
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.03704
Pdf URL: https://arxiv.org/pdf/2602.03704
Copy Paste: [[2602.03704]] Cognitively Diverse Multiple-Choice Question Generation: A Hybrid Multi-Agent Framework with Large Language Models(https://arxiv.org/abs/2602.03704)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Recent advances in large language models (LLMs) have made automated multiple-choice question (MCQ) generation increasingly feasible; however, reliably producing items that satisfy controlled cognitive demands remains a challenge. To address this gap, we introduce ReQUESTA, a hybrid, multi-agent framework for generating cognitively diverse MCQs that systematically target text-based, inferential, and main idea comprehension. ReQUESTA decomposes MCQ authoring into specialized subtasks and coordinates LLM-powered agents with rule-based components to support planning, controlled generation, iterative evaluation, and post-processing. We evaluated the framework in a large-scale reading comprehension study using academic expository texts, comparing ReQUESTA-generated MCQs with those produced by a single-pass GPT-5 zero-shot baseline. Psychometric analyses of learner responses assessed item difficulty and discrimination, while expert raters evaluated question quality across multiple dimensions, including topic relevance and distractor quality. Results showed that ReQUESTA-generated items were consistently more challenging, more discriminative, and more strongly aligned with overall reading comprehension performance. Expert evaluations further indicated stronger alignment with central concepts and superior distractor linguistic consistency and semantic plausibility, particularly for inferential questions. These findings demonstrate that hybrid, agentic orchestration can systematically improve the reliability and controllability of LLM-based generation, highlighting workflow design as a key lever for structured artifact generation beyond single-pass prompting.
摘要：大型语言模型 (LLM) 的最新进展使得自动多项选择题 (MCQ) 生成变得越来越可行；然而，可靠地生产满足受控认知需求的物品仍然是一个挑战。为了解决这一差距，我们引入了 ReQUESTA，这是一种混合的多智能体框架，用于生成认知多样化的 MCQ，系统地针对基于文本、推理和主要思想的理解。 ReQUESTA 将 MCQ 创作分解为专门的子任务，并协调由 LLM 驱动的代理与基于规则的组件，以支持规划、受控生成、迭代评估和后处理。我们使用学术说明性文本在大规模阅读理解研究中评估了该框架，将 ReQUESTA 生成的 MCQ 与单通道 GPT-5 零样本基线生成的MCQ 进行了比较。对学习者反应的心理测量分析评估了项目难度和歧视性，而专家评分者则从多个维度评估问题质量，包括主题相关性和干扰质量。结果表明，ReQUESTA 生成的项目始终更具挑战性、更具辨别力，并且与整体阅读理解表现更加一致。专家评估进一步表明，与中心概念的一致性更强，干扰项语言一致性和语义可信度更高，特别是对于推理问题。这些发现表明，混合、代理编排可以系统地提高基于 LLM 的生成的可靠性和可控性，强调工作流设计是超越单遍提示的结构化工件生成的关键杠杆。

Title: OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering

Authors: Yifan Zhu, Xinyu Mu, Tao Feng, Zhonghong Ou, Yuning Gong, Haoran Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03707
Pdf URL: https://arxiv.org/pdf/2602.03707
Copy Paste: [[2602.03707]] OmniRAG-Agent: Agentic Omnimodal Reasoning for Low-Resource Long Audio-Video Question Answering(https://arxiv.org/abs/2602.03707)
Keywords: llm, retrieval-augmented generation, agent
Abstract: Long-horizon omnimodal question answering answers questions by reasoning over text, images, audio, and video. Despite recent progress on OmniLLMs, low-resource long audio-video QA still suffers from costly dense encoding, weak fine-grained retrieval, limited proactive planning, and no clear end-to-end this http URL. To address these issues, we propose OmniRAG-Agent, an agentic omnimodal QA method for budgeted long audio-video reasoning. It builds an image-audio retrieval-augmented generation module that lets an OmniLLM fetch short, relevant frames and audio snippets from external banks. Moreover, it uses an agent loop that plans, calls tools across turns, and merges retrieved evidence to answer complex queries. Furthermore, we apply group relative policy optimization to jointly improve tool use and answer quality over time. Experiments on OmniVideoBench, WorldSense, and Daily-Omni show that OmniRAG-Agent consistently outperforms prior methods under low-resource settings and achieves strong results, with ablations validating each component.
摘要：长视野全模态问答通过文本、图像、音频和视频推理来回答问题。尽管 OmniLLM 最近取得了进展，但低资源长音视频 QA 仍然受到昂贵的密集编码、薄弱的细粒度检索、有限的主动规划以及没有明确的端到端此 http URL 解决这些问题的困扰，我们提出 OmniRAG-Agent，一种用于预算长音视频推理的代理全模态 QA 方法。它构建了一个图像音频检索增强生成模块，让 OmniLLM 从外部库获取简短的相关帧和音频片段。此外，它使用一个代理循环来计划、跨轮调用工具并合并检索到的证据来回答复杂的查询。此外，我们应用组相关策略优化来随着时间的推移共同提高工具的使用和答案质量。在 OmniVideoBench、WorldSense 和 Daily-Omni 上进行的实验表明，OmniRAG-Agent 在低资源设置下始终优于先前的方法，并取得了出色的结果，并通过消融验证了每个组件。

Title: Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States

Authors: Ximing Dong, Shaowei Wang, Dayi Lin, Boyuan Chen, Ahmed E. Hassan
Subjects: cs.CL, cs.PF
Abstract URL: https://arxiv.org/abs/2602.03708
Pdf URL: https://arxiv.org/pdf/2602.03708
Copy Paste: [[2602.03708]] Beyond Tokens: Semantic-Aware Speculative Decoding for Efficient Inference by Probing Internal States(https://arxiv.org/abs/2602.03708)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) achieve strong performance across many tasks but suffer from high inference latency due to autoregressive decoding. The issue is exacerbated in Large Reasoning Models (LRMs), which generate lengthy chains of thought. While speculative decoding accelerates inference by drafting and verifying multiple tokens in parallel, existing methods operate at the token level and ignore semantic equivalence (i.e., different token sequences expressing the same meaning), leading to inefficient rejections. We propose SemanticSpec, a semantic-aware speculative decoding framework that verifies entire semantic sequences instead of tokens. SemanticSpec introduces a semantic probability estimation mechanism that probes the model's internal hidden states to assess the likelihood of generating sequences with specific this http URL on four benchmarks show that SemanticSpec achieves up to 2.7x speedup on DeepSeekR1-32B and 2.1x on QwQ-32B, consistently outperforming token-level and sequence-level baselines in both efficiency and effectiveness.
摘要：大型语言模型 (LLM) 在许多任务中实现了强大的性能，但由于自回归解码而存在高推理延迟。这个问题在大型推理模型（LRM）中更加严重，它会产生冗长的思维链。虽然推测解码通过并行起草和验证多个令牌来加速推理，但现有方法在令牌级别运行并忽略语义等价性（即表达相同含义的不同令牌序列），导致低效的拒绝。我们提出了 SemanticSpec，一种语义感知推测解码框架，用于验证整个语义序列而不是标记。 SemanticSpec 引入了一种语义概率估计机制，可探测模型的内部隐藏状态，以评估使用特定此 http URL 生成序列的可能性。在四个基准测试中，SemanticSpec 在 DeepSeekR1-32B 上实现了高达 2.7 倍的加速，在 QwQ-32B 上实现了 2.1 倍的加速，在效率和有效性方面始终优于令牌级和序列级基线。

Title: No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding

Authors: Vynska Amalia Permadi, Xingwei Tan, Nafise Sadat Moosavi, Nikos Aletras
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03709
Pdf URL: https://arxiv.org/pdf/2602.03709
Copy Paste: [[2602.03709]] No Shortcuts to Culture: Indonesian Multi-hop Question Answering for Complex Cultural Understanding(https://arxiv.org/abs/2602.03709)
Keywords: language model, llm
Abstract: Understanding culture requires reasoning across context, tradition, and implicit social knowledge, far beyond recalling isolated facts. Yet most culturally focused question answering (QA) benchmarks rely on single-hop questions, which may allow models to exploit shallow cues rather than demonstrate genuine cultural reasoning. In this work, we introduce ID-MoCQA, the first large-scale multi-hop QA dataset for assessing the cultural understanding of large language models (LLMs), grounded in Indonesian traditions and available in both English and Indonesian. We present a new framework that systematically transforms single-hop cultural questions into multi-hop reasoning chains spanning six clue types (e.g., commonsense, temporal, geographical). Our multi-stage validation pipeline, combining expert review and LLM-as-a-judge filtering, ensures high-quality question-answer pairs. Our evaluation across state-of-the-art models reveals substantial gaps in cultural reasoning, particularly in tasks requiring nuanced inference. ID-MoCQA provides a challenging and essential benchmark for advancing the cultural competency of LLMs.
摘要：理解文化需要跨背景、传统和隐性社会知识进行推理，而不仅仅是回忆孤立的事实。然而，大多数以文化为中心的问答 (QA) 基准依赖于单跳问题，这可能允许模型利用浅层线索，而不是展示真正的文化推理。在这项工作中，我们介绍了 ID-MoCQA，这是第一个用于评估大语言模型 (LLM) 文化理解的大规模多跳 QA 数据集，该数据集植根于印度尼西亚传统，并提供英语和印度尼西亚语版本。我们提出了一个新的框架，系统地将单跳文化问题转化为跨越六种线索类型（例如常识、时间、地理）的多跳推理链。我们的多阶段验证流程结合了专家评审和法学硕士法官筛选，确保了高质量的问答对。我们对最先进模型的评估揭示了文化推理方面的巨大差距，特别是在需要细致入微的推理的任务中。 ID-MoCQA 为提升法学硕士的文化能力提供了具有挑战性且重要的基准。

Title: Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling

Authors: Yubao Zhao, Weiquan Huang, Sudong Wang, Ruochen Zhao, Chen Chen, Yao Shu, Chengwei Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03719
Pdf URL: https://arxiv.org/pdf/2602.03719
Copy Paste: [[2602.03719]] Training Multi-Turn Search Agent via Contrastive Dynamic Branch Sampling(https://arxiv.org/abs/2602.03719)
Keywords: language model, agent
Abstract: Agentic reinforcement learning has enabled large language models to perform complex multi-turn planning and tool use. However, learning in long-horizon settings remains challenging due to sparse, trajectory-level outcome rewards. While prior tree-based methods attempt to mitigate this issue, they often suffer from high variance and computational inefficiency. Through empirical analysis of search agents, we identify a common pattern: performance diverges mainly due to decisions near the tail. Motivated by this observation, we propose Branching Relative Policy Optimization (BranPO), a value-free method that provides step-level contrastive supervision without dense rewards. BranPO truncates trajectories near the tail and resamples alternative continuations to construct contrastive suffixes over shared prefixes, reducing credit ambiguity in long-horizon rollouts. To further boost efficiency and stabilize training, we introduce difficulty-aware branch sampling to adapt branching frequency across tasks, and redundant step masking to suppress uninformative actions. Extensive experiments on various question answering benchmarks demonstrate that BranPO consistently outperforms strong baselines, achieving significant accuracy gains on long-horizon tasks without increasing the overall training budget. Our code is available at this https URL.
摘要：代理强化学习使大型语言模型能够执行复杂的多轮规划和工具使用。然而，由于轨迹级结果奖励稀疏，长期环境中的学习仍然具有挑战性。虽然先前的基于树的方法试图缓解这个问题，但它们经常遭受高方差和计算效率低下的困扰。通过对搜索代理的实证分析，我们发现了一个共同的模式：性能差异主要是由于尾部附近的决策造成的。受这一观察的启发，我们提出了分支相对策略优化（BranPO），这是一种无价值方法，可以提供步骤级对比监督，而无需密集奖励。 BranPO 在尾部附近截断轨迹，并对替代延续进行重新采样，以在共享前缀上构建对比后缀，从而减少长期部署中的信用模糊性。为了进一步提高效率和稳定训练，我们引入了难度感知分支采样来适应跨任务的分支频率，并引入冗余步骤屏蔽来抑制无信息行为。对各种问答基准的大量实验表明，BranPO 始终优于强大的基线，在不增加总体训练预算的情况下，在长期任务上实现了显着的准确性提升。我们的代码位于 \href{此 https URL}{code}。

Title: CUBO: Self-Contained Retrieval-Augmented Generation on Consumer Laptops 10 GB Corpora, 16 GB RAM, Single-Device Deployment

Authors: Paolo Astrino
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03731
Pdf URL: https://arxiv.org/pdf/2602.03731
Copy Paste: [[2602.03731]] CUBO: Self-Contained Retrieval-Augmented Generation on Consumer Laptops 10 GB Corpora, 16 GB RAM, Single-Device Deployment(https://arxiv.org/abs/2602.03731)
Keywords: retrieval-augmented generation
Abstract: Organizations handling sensitive documents face a tension: cloud-based AI risks GDPR violations, while local systems typically require 18-32 GB RAM. This paper presents CUBO, a systems-oriented RAG platform for consumer laptops with 16 GB shared memory. CUBO's novelty lies in engineering integration of streaming ingestion (O(1) buffer overhead), tiered hybrid retrieval, and hardware-aware orchestration that enables competitive Recall@10 (0.48-0.97 across BEIR domains) within a hard 15.5 GB RAM ceiling. The 37,000-line codebase achieves retrieval latencies of 185 ms (p50) on €1,300 laptops while maintaining data minimization through local-only processing aligned with GDPR Art. 5(1)(c). Evaluation on BEIR benchmarks validates practical deployability for small-to-medium professional archives. The codebase is publicly available at this https URL.
摘要：处理敏感文档的组织面临着紧张局势：基于云的人工智能存在违反 GDPR 的风险，而本地系统通常需要 18-32 GB RAM。本文介绍了 CUBO，这是一个面向系统的 RAG 平台，适用于具有 16 GB 共享内存的消费笔记本电脑。 CUBO 的新颖之处在于流式摄取（O(1) 缓冲区开销）、分层混合检索和硬件感知编排的工程集成，可在 15.5 GB 硬 RAM 上限内实现具有竞争力的 Recall@10（跨 BEIR 域为 0.48-0.97）。 37,000 行代码库在 C1,300 笔记本电脑上实现了 185 毫秒 (p50) 的检索延迟，同时通过符合 GDPR Art 的本地处理保持数据最小化。 5(1)(c)。 BEIR 基准评估验证了中小型专业档案的实际部署能力。代码库可通过此 https URL 公开获取。

Title: Context Compression via Explicit Information Transmission

Authors: Jiangnan Ye, Hanqi Yan, Zhenyi Shen, Heng Chang, Ye Mao, Yulan He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03784
Pdf URL: https://arxiv.org/pdf/2602.03784
Copy Paste: [[2602.03784]] Context Compression via Explicit Information Transmission(https://arxiv.org/abs/2602.03784)
Keywords: language model, llm, long context
Abstract: Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches, motivating context compression. In this work, we study soft context compression, where a long context is condensed into a small set of continuous representations. Existing methods typically re-purpose the LLM itself as a trainable compressor, relying on layer-by-layer self-attention to iteratively aggregate information. We argue that this paradigm suffers from two structural limitations: (i) progressive representation overwriting across layers and (ii) uncoordinated allocation of compression capacity across tokens. We propose ComprExIT (Context Compression via Explicit Information Transmission), a lightweight framework that formulates soft compression into a new paradigm: explicit information transmission over frozen LLM hidden states. This decouples compression from the model's internal self-attention dynamics. ComprExIT performs (i) depth-wise transmission to selectively transmit multi-layer information into token anchors, mitigating progressive overwriting, and (ii) width-wise transmission to aggregate anchors into a small number of slots via a globally optimized transmission plan, ensuring coordinated allocation of information. Across six question-answering benchmarks, ComprExIT consistently outperforms state-of-the-art context compression methods while introducing only ~1% additional parameters, demonstrating that explicit and coordinated information transmission enables more effective and robust long-context compression.
摘要：由于二次关注和不断增长的键值缓存，刺激了上下文压缩，使用大型语言模型 (LLM) 的长上下文推理成本高昂。在这项工作中，我们研究了软上下文压缩，其中长上下文被压缩为一小组连续表示。现有方法通常将 LLM 本身重新用作可训练的压缩器，依靠逐层自注意力来迭代聚合信息。我们认为这种范式存在两个结构性限制：（i）跨层的渐进式表示覆盖（ii）跨令牌的压缩容量分配不协调。我们提出了 ComprExIT（通过显式信息传输进行上下文压缩），这是一个轻量级框架，它将软压缩制定为一种新的范式：通过冻结的 LLM 隐藏状态进行显式信息传输。这将压缩与模型的内部自注意力动态分离。 ComprExIT 执行 (i) 深度方向传输，选择性地将多层信息传输到令牌锚中，减轻渐进式覆盖；(ii) 宽度方向传输，通过全局优化的传输计划将锚聚合到少量时隙中，确保信息的协调分配。在六个问答基准测试中，ComprExIT 始终优于最先进的上下文压缩方法，同时仅引入约 1% 的额外参数，这表明显式且协调的信息传输可以实现更有效、更稳健的长上下文压缩。

Title: They Said Memes Were Harmless-We Found the Ones That Hurt: Decoding Jokes, Symbols, and Cultural References

Authors: Sahil Tripathi, Gautam Siddharth Kashyap, Mehwish Nasim, Jian Yang, Jiechao Gao, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03822
Pdf URL: https://arxiv.org/pdf/2602.03822
Copy Paste: [[2602.03822]] They Said Memes Were Harmless-We Found the Ones That Hurt: Decoding Jokes, Symbols, and Cultural References(https://arxiv.org/abs/2602.03822)
Keywords: language model
Abstract: Meme-based social abuse detection is challenging because harmful intent often relies on implicit cultural symbolism and subtle cross-modal incongruence. Prior approaches, from fusion-based methods to in-context learning with Large Vision-Language Models (LVLMs), have made progress but remain limited by three factors: i) cultural blindness (missing symbolic context), ii) boundary ambiguity (satire vs. abuse confusion), and iii) lack of interpretability (opaque model reasoning). We introduce CROSS-ALIGN+, a three-stage framework that systematically addresses these limitations: (1) Stage I mitigates cultural blindness by enriching multimodal representations with structured knowledge from ConceptNet, Wikidata, and Hatebase; (2) Stage II reduces boundary ambiguity through parameter-efficient LoRA adapters that sharpen decision boundaries; and (3) Stage III enhances interpretability by generating cascaded explanations. Extensive experiments on five benchmarks and eight LVLMs demonstrate that CROSS-ALIGN+ consistently outperforms state-of-the-art methods, achieving up to 17% relative F1 improvement while providing interpretable justifications for each decision.
摘要：基于模因的社会虐待检测具有挑战性，因为有害意图通常依赖于隐含的文化象征和微妙的跨模式不一致。先前的方法，从基于融合的方法到使用大视觉语言模型（LVLM）的上下文学习，已经取得了进展，但仍然受到三个因素的限制：i）文化盲目性（缺少符号上下文），ii）边界模糊（讽刺与滥用混淆），以及iii）缺乏可解释性（不透明的模型推理）。我们引入了 CROSS-ALIGN+，这是一个三阶段框架，系统地解决了这些局限性：（1）第一阶段通过利用来自 ConceptNet、Wikidata 和 Hatebase 的结构化知识丰富多模态表示来减轻文化盲目性； (2) 第二阶段通过参数高效的 LoRA 适配器来锐化决策边界，减少边界模糊性； (3) 第三阶段通过生成级联解释来增强可解释性。对五个基准和八个 LVLM 的广泛实验表明，CROSS-ALIGN+ 始终优于最先进的方法，实现高达 17% 的相对 F1 改进，同时为每个决策提供可解释的理由。

Title: Accelerating Scientific Research with Gemini: Case Studies and Common Techniques

Authors: David P. Woodruff, Vincent Cohen-Addad, Lalit Jain, Jieming Mao, Song Zuo, MohammadHossein Bateni, Simina Branzei, Michael P. Brenner, Lin Chen, Ying Feng, Lance Fortnow, Gang Fu, Ziyi Guan, Zahra Hadizadeh, Mohammad T. Hajiaghayi, Mahdi JafariRaviz, Adel Javanmard, Karthik C. S., Ken-ichi Kawarabayashi, Ravi Kumar, Silvio Lattanzi, Euiwoong Lee, Yi Li, Ioannis Panageas, Dimitris Paparas, Benjamin Przybocki, Bernardo Subercaseaux, Ola Svensson, Shayan Taherijam, Xuan Wu, Eylon Yogev, Morteza Zadimoghaddam, Samson Zhou, Vahab Mirrokni
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.03837
Pdf URL: https://arxiv.org/pdf/2602.03837
Copy Paste: [[2602.03837]] Accelerating Scientific Research with Gemini: Case Studies and Common Techniques(https://arxiv.org/abs/2602.03837)
Keywords: language model, llm, chat
Abstract: Recent advances in large language models (LLMs) have opened new avenues for accelerating scientific research. While models are increasingly capable of assisting with routine tasks, their ability to contribute to novel, expert-level mathematical discovery is less understood. We present a collection of case studies demonstrating how researchers have successfully collaborated with advanced AI models, specifically Google's Gemini-based models (in particular Gemini Deep Think and its advanced variants), to solve open problems, refute conjectures, and generate new proofs across diverse areas in theoretical computer science, as well as other areas such as economics, optimization, and physics. Based on these experiences, we extract common techniques for effective human-AI collaboration in theoretical research, such as iterative refinement, problem decomposition, and cross-disciplinary knowledge transfer. While the majority of our results stem from this interactive, conversational methodology, we also highlight specific instances that push beyond standard chat interfaces. These include deploying the model as a rigorous adversarial reviewer to detect subtle flaws in existing proofs, and embedding it within a "neuro-symbolic" loop that autonomously writes and executes code to verify complex derivations. Together, these examples highlight the potential of AI not just as a tool for automation, but as a versatile, genuine partner in the creative process of scientific discovery.
摘要：大语言模型 (LLM) 的最新进展为加速科学研究开辟了新途径。虽然模型协助日常任务的能力越来越强，但它们为新颖的、专家级数学发现做出贡献的能力却鲜为人知。我们提供了一系列案例研究，展示研究人员如何成功地与先进的人工智能模型，特别是基于 Google 的 Gemini 模型（特别是 Gemini Deep Think 及其高级变体）合作，解决理论计算机科学不同领域以及经济学、优化和物理学等其他领域的开放问题、反驳猜想并生成新的证明。基于这些经验，我们提取了理论研究中有效的人机协作的常用技术，例如迭代细化、问题分解和跨学科知识转移。虽然我们的大部分结果都源于这种交互式对话方法，但我们也强调了超越标准聊天界面的特定实例。其中包括将模型部署为严格的对抗性审查者，以检测现有证明中的细微缺陷，并将其嵌入“神经符号”循环中，该循环自动编写和执行代码以验证复杂的推导。这些例子共同凸显了人工智能的潜力，它不仅作为自动化工具，而且作为科学发现创造性过程中的多功能、真正的合作伙伴。