2026-02-03

Title: PPoGA: Predictive Plan-on-Graph with Action for Knowledge Graph Question Answering

Authors: MinGyu Jeon, SuWan Cho, JaeYoung Shu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2602.00007
Pdf URL: https://arxiv.org/pdf/2602.00007
Copy Paste: [[2602.00007]] PPoGA: Predictive Plan-on-Graph with Action for Knowledge Graph Question Answering(https://arxiv.org/abs/2602.00007)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) augmented with Knowledge Graphs (KGs) have advanced complex question answering, yet they often remain susceptible to failure when their initial high-level reasoning plan is flawed. This limitation, analogous to cognitive functional fixedness, prevents agents from restructuring their approach, leading them to pursue unworkable solutions. To address this, we propose PPoGA (Predictive Plan-on-Graph with Action), a novel KGQA framework inspired by human cognitive control and problem-solving. PPoGA incorporates a Planner-Executor architecture to separate high-level strategy from low-level execution and leverages a Predictive Processing mechanism to anticipate outcomes. The core innovation of our work is a self-correction mechanism that empowers the agent to perform not only Path Correction for local execution errors but also Plan Correction by identifying, discarding, and reformulating the entire plan when it proves ineffective. We conduct extensive experiments on three challenging multi-hop KGQA benchmarks: GrailQA, CWQ, and WebQSP. The results demonstrate that PPoGA achieves state-of-the-art performance, significantly outperforming existing methods. Our work highlights the critical importance of metacognitive abilities like problem restructuring for building more robust and flexible AI reasoning systems.

Title: Unlocking Electronic Health Records: A Hybrid Graph RAG Approach to Safe Clinical AI for Patient QA

Authors: Samuel Thio, Matthew Lewis, Spiros Denaxas, Richard JB Dobson
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2602.00009
Pdf URL: https://arxiv.org/pdf/2602.00009
Copy Paste: [[2602.00009]] Unlocking Electronic Health Records: A Hybrid Graph RAG Approach to Safe Clinical AI for Patient QA(https://arxiv.org/abs/2602.00009)
Keywords: language model, llm, hallucination
Abstract: Electronic health record (EHR) systems present clinicians with vast repositories of clinical information, creating a significant cognitive burden where critical details are easily overlooked. While Large Language Models (LLMs) offer transformative potential for data processing, they face significant limitations in clinical settings, particularly regarding context grounding and hallucinations. Current solutions typically isolate retrieval methods, focusing either on structured data (SQL/Cypher) or on unstructured semantic search, but fail to integrate both simultaneously. This work presents MediGRAF (Medical Graph Retrieval Augmented Framework), a novel hybrid Graph RAG system that bridges this gap. By uniquely combining Neo4j Text2Cypher capabilities for structured relationship traversal with vector embeddings for unstructured narrative retrieval, MediGRAF enables natural language querying of the complete patient journey. Using 10 patients from the MIMIC-IV dataset, we built a graph of 5,973 nodes and 5,963 relationships, sufficient for patient-level question answering (QA), and evaluated this architecture across varying query complexities. The system demonstrated 100% recall for factual queries, meaning all relevant information was retrieved and present in the output, while complex inference tasks achieved a mean expert quality score of 4.25/5 with zero safety violations. These results demonstrate that hybrid graph-grounding significantly advances clinical information retrieval, offering a safer, more comprehensive alternative to standard LLM deployments.

Title: G-MemLLM: Gated Latent Memory Augmentation for Long-Context Reasoning in Large Language Models

Authors: Xun Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00015
Pdf URL: https://arxiv.org/pdf/2602.00015
Copy Paste: [[2602.00015]] G-MemLLM: Gated Latent Memory Augmentation for Long-Context Reasoning in Large Language Models(https://arxiv.org/abs/2602.00015)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, yet they remain constrained by the finite capacity of their context windows and the inherent difficulty of maintaining long-term factual consistency during multi-hop reasoning. While existing methods utilize context compression or recurrent tokens, they often suffer from "context rot" or the dilution of information over long horizons. In this paper, we propose G-MemLLM, a memory-augmented architecture that integrates a frozen LLM backbone with a trainable Latent Memory Bank. Our key innovation is a GRU-style gated update logic that allows the model to selectively update, preserve, or overwrite latent memory slots, preventing the vanishing gradients of knowledge common in recurrent systems. We evaluate G-MemLLM across scales, from GPT-2 (124M) to Llama 3.1 (8B), on the HotpotQA and Zero-Shot Relation Extraction (ZsRE) benchmarks. Our results demonstrate that G-MemLLM significantly enhances multi-hop reasoning and relational precision, achieving a 13.3% accuracy boost on ZsRE for Llama 3.1-8B, and it also yields improvements across model scales, boosting Answer F1 by 8.56 points for GPT-2 and increasing Supporting Fact F1 by 6.89 points for Llama 3.1-8B on HotpotQA.
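The GRU-style gated update mentioned in the G-MemLLM abstract can be illustrated with a minimal sketch. The abstract does not give the actual equations, so everything here is an assumption: the function name `gated_update`, the elementwise sigmoid gate, and the fact that gate logits and candidate values are passed in directly instead of being computed by learned projections.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_update(memory, candidate, gate_logits):
    """Blend each memory slot with a candidate value through a sigmoid gate.

    All three arguments are lists of equal-length float lists (one per slot).
    A gate near 0 preserves the old slot value; a gate near 1 overwrites it.
    In a trained model the candidate and gate logits would come from learned
    layers; here they are plain inputs (an assumption for illustration).
    """
    updated = []
    for slot, cand, logits in zip(memory, candidate, gate_logits):
        gates = [sigmoid(g) for g in logits]
        updated.append(
            [(1.0 - z) * m + z * c for m, c, z in zip(slot, cand, gates)]
        )
    return updated


# One slot with two dimensions: the first is preserved, the second overwritten.
new_memory = gated_update([[1.0, 1.0]], [[0.0, 0.0]], [[-50.0, 50.0]])
```

With a strongly negative gate logit the first dimension stays at its old value, and with a strongly positive one the second dimension takes the candidate value, which is the "preserve or overwrite" behavior the abstract describes.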

Title: PTCBENCH: Benchmarking Contextual Stability of Personality Traits in LLM Systems

Authors: Jiongchi Yu, Yuhan Ma, Xiaoyu Zhang, Junjie Wang, Qiang Hu, Chao Shen, Xiaofei Xie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00016
Pdf URL: https://arxiv.org/pdf/2602.00016
Copy Paste: [[2602.00016]] PTCBENCH: Benchmarking Contextual Stability of Personality Traits in LLM Systems(https://arxiv.org/abs/2602.00016)
Keywords: language model, llm, agent
Abstract: With the increasing deployment of large language models (LLMs) in affective agents and AI systems, maintaining a consistent and authentic LLM personality becomes critical for user trust and engagement. However, existing work overlooks a fundamental psychological consensus that personality traits are dynamic and context-dependent. To bridge this gap, we introduce PTCBENCH, a systematic benchmark designed to quantify the consistency of LLM personalities under controlled situational contexts. PTCBENCH subjects models to 12 distinct external conditions spanning diverse location contexts and life events, and rigorously assesses the personality using the NEO Five-Factor Inventory. Our study on 39,240 personality trait records reveals that certain external scenarios (e.g., "Unemployment") can trigger significant personality changes of LLMs, and even alter their reasoning capabilities. Overall, PTCBENCH establishes an extensible framework for evaluating personality consistency in realistic, evolving environments, offering actionable insights for developing robust and psychologically aligned AI systems.

Title: SafeTalkCoach: Diversity-Driven Multi-Agent Simulation for Parent-Teen Health Conversations

Authors: Benyamin Tabarsi, Wenbo Li, Tahreem Yasir, Aryan Santhosh Kumar, Laura Widman, Dongkuan Xu, Tiffany Barnes
Subjects: cs.CL, cs.AI, cs.CY, cs.MA
Abstract URL: https://arxiv.org/abs/2602.00017
Pdf URL: https://arxiv.org/pdf/2602.00017
Copy Paste: [[2602.00017]] SafeTalkCoach: Diversity-Driven Multi-Agent Simulation for Parent-Teen Health Conversations(https://arxiv.org/abs/2602.00017)
Keywords: llm, agent
Abstract: The importance of effective parent-child communication about sexual health is widely acknowledged, but real-world data on these conversations is scarce and challenging to collect, due to their private and sensitive nature. Although LLMs have been widely adopted in dialogue generation, they may deviate from best practices and frequently lack realism and diversity. We introduce SafeTalkCoach, a diversity-driven multi-agent dialogue generation framework that simulates parent-child conversations about sexual health, and present an accompanying dataset. SafeTalkCoach integrates crowd-sourced and synthesized scenarios, established sexual health guidelines, evidence-based personas, adaptive control modules, and hierarchical diversification. Through evaluations, we demonstrate that SafeTalkCoach generates diverse conversations while maintaining realism, communication quality, and controllability in practice. Our goal is that the SafeTalkCoach framework and the dataset support both AI research and health communications practices.

Title: Reversible Diffusion Decoding for Diffusion Language Models

Authors: Xinyun Wang, Min Zhang, Sen Cui, Zhikang Chen, Bo Jiang, Kun Kuang, Mingbao Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00150
Pdf URL: https://arxiv.org/pdf/2602.00150
Copy Paste: [[2602.00150]] Reversible Diffusion Decoding for Diffusion Language Models(https://arxiv.org/abs/2602.00150)
Keywords: language model
Abstract: Diffusion language models enable parallel token generation through block-wise decoding, but their irreversible commitments can lead to stagnation, where the reverse diffusion process fails to make further progress under a suboptimal [...]. We propose Reversible Diffusion Decoding (RDD), a decoding framework that introduces reversibility into block-wise diffusion generation. RDD detects stagnation as a state-dependent failure of the reverse process and enables efficient backtracking to earlier blocks without recomputation via cached model states. To avoid repeated failure trajectories, RDD applies confidence-guided re-masking to selectively reinitialize uncertain tokens while preserving reliable [...]. This reversible formulation allows decoding to recover from early commitment errors while maintaining the parallel efficiency of diffusion-based generation. Experiments show that RDD improves generation robustness and quality over baselines with minimal computational overhead.
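The confidence-guided re-masking step in the RDD abstract can be pictured with a small sketch. The abstract does not say how confidence is computed or thresholded, so the `MASK` placeholder, the fixed `threshold`, and the function name below are all hypothetical.

```python
MASK = "<mask>"  # hypothetical mask placeholder, not taken from the paper


def remask_low_confidence(tokens, confidences, threshold=0.5):
    """Return a copy of `tokens` with low-confidence positions reset to MASK.

    Positions whose confidence is at least `threshold` are kept as-is, so
    reliable tokens survive the backtracking step; the rest are re-masked
    and would be regenerated by later denoising steps.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    return [
        tok if conf >= threshold else MASK
        for tok, conf in zip(tokens, confidences)
    ]


block = remask_low_confidence(["the", "cat", "sat"], [0.9, 0.2, 0.5])
# block == ["the", "<mask>", "sat"]
```

A fixed threshold is only the simplest possible rule; a real decoder might rank positions or adapt the threshold per block.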

Title: DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking

Authors: Tianyi Hu, Niket Tandon, Akhil Arora
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.00238
Pdf URL: https://arxiv.org/pdf/2602.00238
Copy Paste: [[2602.00238]] DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking(https://arxiv.org/abs/2602.00238)
Keywords: llm, chat, retrieval-augmented generation, agent
Abstract: Existing retrieval-augmented generation (RAG) systems are primarily designed under the assumption that each query has a single correct answer. This overlooks common information-seeking scenarios with multiple plausible answers, where diversity is essential to avoid collapsing to a single dominant response, thereby constraining creativity and compromising fair and inclusive information access. Our analysis reveals a commonly overlooked limitation of standard RAG systems: they underutilize retrieved context diversity, such that increasing retrieval diversity alone does not yield diverse generations. To address this limitation, we propose DIVERGE, a plug-and-play agentic RAG framework with novel reflection-guided generation and memory-augmented iterative refinement, which promotes diverse viewpoints while preserving answer quality. We introduce novel metrics tailored to evaluating the diversity-quality trade-off in open-ended questions, and show that they correlate well with human judgments. We demonstrate that DIVERGE achieves the best diversity-quality trade-off compared to competitive baselines and previous state-of-the-art methods on the real-world Infinity-Chat dataset, substantially improving diversity while maintaining quality. More broadly, our results reveal a systematic limitation of current LLM-based systems for open-ended information-seeking and show that explicitly modeling diversity can mitigate it. Our code is available at: this https URL

Title: Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering

Authors: Philip Müller, Nicholas Popovič, Michael Färber, Peter Steinbach
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.00279
Pdf URL: https://arxiv.org/pdf/2602.00279
Copy Paste: [[2602.00279]] Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering(https://arxiv.org/abs/2602.00279)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are commonly used in Question Answering (QA) settings, increasingly in the natural sciences if not science at large. Reliable Uncertainty Quantification (UQ) is critical for the trustworthy uptake of generated answers. Existing UQ approaches remain weakly validated in scientific QA, a domain relying on fact-retrieval and reasoning capabilities. We introduce the first large-scale benchmark for evaluating UQ metrics in reasoning-demanding QA studying calibration of UQ methods, providing an extensible open-source framework to reproducibly assess calibration. Our study spans up to 20 large language models of base, instruction-tuned and reasoning variants. Our analysis covers seven scientific QA datasets, including both multiple-choice and arithmetic question answering tasks, using prompting to emulate an open question answering setting. We evaluate and compare methods representative of prominent approaches on a total of 685,000 long-form responses, spanning different reasoning complexities representative of domain-specific tasks. At the token level, we find that instruction tuning induces strong probability mass polarization, reducing the reliability of token-level confidences as estimates of uncertainty. Models further fine-tuned for reasoning are exposed to the same effect, but the reasoning process appears to mitigate it depending on the provider. At the sequence level, we show that verbalized approaches are systematically biased and poorly correlated with correctness, while answer frequency (consistency across samples) yields the most reliable calibration. In the wake of our analysis, we study and report the misleading effect of relying exclusively on ECE as a sole measure for judging performance of UQ methods on benchmark datasets. Our findings expose critical limitations of current UQ methods for LLMs and standard practices in benchmarking thereof.
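Two quantities named in the abstract above, answer frequency and ECE (expected calibration error), are easy to make concrete. The sketch below uses the common equal-width-bin form of ECE; the benchmark's own binning and answer-matching rules are not stated in the abstract, so the function names, the exact-string answer matching, and the bin count are assumptions.

```python
from collections import Counter


def answer_frequency_confidence(samples):
    """Return (majority answer, fraction of sampled answers that agree with it)."""
    counts = Counter(samples)
    answer, count = counts.most_common(1)[0]
    return answer, count / len(samples)


def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - avg_conf)
    return ece


answer, conf = answer_frequency_confidence(["4", "4", "5", "4"])
# answer == "4", conf == 0.75
```

Because ECE averages over bins, a method can score well while being uninformative within bins, which is one reason a single ECE number can mislead.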

Title: Faithful-Patchscopes: Understanding and Mitigating Model Bias in Hidden Representations Explanation of Large Language Models

Authors: Xilin Gong, Shu Yang, Zehua Cao, Lynne Billard, Di Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00300
Pdf URL: https://arxiv.org/pdf/2602.00300
Copy Paste: [[2602.00300]] Faithful-Patchscopes: Understanding and Mitigating Model Bias in Hidden Representations Explanation of Large Language Models(https://arxiv.org/abs/2602.00300)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated strong capabilities for hidden representation interpretation through Patchscopes, a framework that uses LLMs themselves to generate human-readable explanations by decoding from internal hidden representations. However, our work shows that LLMs tend to rely on inherent linguistic patterns, which can override contextual information encoded in the hidden representations during decoding. For example, even when a hidden representation encodes the contextual attribute "purple" for "broccoli", LLMs still generate "green" in their explanations, reflecting a strong prior association. This behavior reveals a systematic unfaithfulness in Patchscopes. To systematically study this issue, we first designed a dataset to evaluate the faithfulness of Patchscopes under biased cases, and our results show that there is an 18.84% faithfulness decrease on average. We then propose Bias Alignment through Logit Recalibration (BALOR), which treats the output logits from an unpatched prompt as capturing model bias and contrasts them with logits obtained under patched contextual information. By recalibrating the logit distribution through this contrast, BALOR suppresses model bias and amplifies contextual information during generation. Experiments across multiple LLMs demonstrate that BALOR consistently outperforms existing baselines, achieving up to 33% relative performance improvement.
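The contrast between patched and unpatched logits can be sketched with a generic contrastive adjustment. This is not BALOR's actual formula, which the abstract does not give; the linear form, the `alpha` weight, and the toy logit values for the broccoli example are all assumptions chosen for illustration.

```python
def recalibrate_logits(patched, unpatched, alpha=1.0):
    """Push patched logits away from the unpatched (bias-only) logits.

    `unpatched` is read as the model's prior; the difference patched - unpatched
    is read as the contribution of the patched context, and is amplified by alpha.
    """
    return [p + alpha * (p - u) for p, u in zip(patched, unpatched)]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


vocab = ["green", "purple"]
unpatched = [3.0, 0.0]  # toy prior: strongly prefers "green"
patched = [2.0, 1.8]    # toy patched run: context helps "purple", prior still wins
adjusted = recalibrate_logits(patched, unpatched)
print(vocab[argmax(patched)], vocab[argmax(adjusted)])  # green purple
```

In this toy case the prior association wins before recalibration and the contextual attribute wins after it, mirroring the failure and the intended fix described in the abstract.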

Title: MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes

Authors: Rodrigo Batista, Luís Filipe Cunha, Purificação Silvano, Nuno Guimarães, Alípio Jorge, Evelin Amorim, Ricardo Campos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00316
Pdf URL: https://arxiv.org/pdf/2602.00316
Copy Paste: [[2602.00316]] MiNER: A Two-Stage Pipeline for Metadata Extraction from Municipal Meeting Minutes(https://arxiv.org/abs/2602.00316)
Keywords: llm
Abstract: Municipal meeting minutes are official documents of local governance, exhibiting heterogeneous formats and writing styles. Effective information retrieval (IR) requires identifying metadata such as meeting number, date, location, participants, and start/end times, elements that are rarely standardized or easy to extract automatically. Existing named entity recognition (NER) models are ill-suited to this task, as they are not adapted to such domain-specific categories. In this paper, we propose a two-stage pipeline for metadata extraction from municipal minutes. First, a question answering (QA) model identifies the opening and closing text segments containing metadata. Transformer-based models (BERTimbau and XLM-RoBERTa with and without a CRF layer) are then applied for fine-grained entity extraction and enhanced through delexicalization. To evaluate our proposed pipeline, we benchmark both open-weight (Phi) and closed-weight (Gemini) LLMs, assessing predictive performance, inference cost, and carbon footprint. Our results demonstrate strong in-domain performance, better than larger general-purpose LLMs. However, cross-municipality evaluation reveals reduced generalization, reflecting the variability and linguistic complexity of municipal records. This work establishes the first benchmark for metadata extraction from municipal meeting minutes, providing a solid foundation for future research in this domain.

Title: Detecting AI-Generated Content in Academic Peer Reviews

Authors: Siyuan Shen, Kai Wang
Subjects: cs.CL, cs.AI, cs.LG, cs.SI
Abstract URL: https://arxiv.org/abs/2602.00319
Pdf URL: https://arxiv.org/pdf/2602.00319
Copy Paste: [[2602.00319]] Detecting AI-Generated Content in Academic Peer Reviews(https://arxiv.org/abs/2602.00319)
Keywords: language model, llm
Abstract: The growing availability of large language models (LLMs) has raised questions about their role in academic peer review. This study examines the temporal emergence of AI-generated content in peer reviews by applying a detection model trained on historical reviews to later review cycles at International Conference on Learning Representations (ICLR) and Nature Communications (NC). We observe minimal detection of AI-generated content before 2022, followed by a substantial increase through 2025, with approximately 20% of ICLR reviews and 12% of Nature Communications reviews classified as AI-generated in 2025. The most pronounced growth of AI-generated reviews in NC occurs between the third and fourth quarter of 2024. Together, these findings provide suggestive evidence of a rapidly increasing presence of AI-assisted content in peer review and highlight the need for further study of its implications for scholarly evaluation.

Title: DETOUR: An Interactive Benchmark for Dual-Agent Search and Reasoning

Authors: Li Siyan, Darshan Deshpande, Anand Kannappan, Rebecca Qian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00352
Pdf URL: https://arxiv.org/pdf/2602.00352
Copy Paste: [[2602.00352]] DETOUR: An Interactive Benchmark for Dual-Agent Search and Reasoning(https://arxiv.org/abs/2602.00352)
Keywords: prompt, agent
Abstract: When recalling information in conversation, people often arrive at the recollection after multiple turns. However, existing benchmarks for evaluating agent capabilities in such tip-of-the-tongue search processes are restricted to single-turn settings. To more realistically simulate tip-of-the-tongue search, we introduce Dual-agent based Evaluation Through Obscure Under-specified Retrieval (DETOUR), a dual-agent evaluation benchmark containing 1,011 prompts. The benchmark design involves a Primary Agent, which is the subject of evaluation, tasked with identifying the recollected entity through querying a Memory Agent that is held consistent across evaluations. Our results indicate that current state-of-the-art models still struggle with our benchmark, only achieving 36% accuracy when evaluated on all modalities (text, image, audio, and video), highlighting the importance of enhancing capabilities in underspecified scenarios.

Title: DecompressionLM: Deterministic, Diagnostic, and Zero-Shot Concept Graph Extraction from Language Models

Authors: Zhaochen Hong, Jiaxuan You
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00377
Pdf URL: https://arxiv.org/pdf/2602.00377
Copy Paste: [[2602.00377]] DecompressionLM: Deterministic, Diagnostic, and Zero-Shot Concept Graph Extraction from Language Models(https://arxiv.org/abs/2602.00377)
Keywords: language model, gpt, hallucination
Abstract: Existing knowledge probing methods rely on pre-defined queries, limiting extraction to known concepts. We introduce DecompressionLM, a stateless framework for zero-shot concept graph extraction that discovers what language models encode without pre-specified queries or shared cross-sequence state. Our method targets three limitations of common decoding-based probing approaches: cross-sequence coupling that concentrates probability mass on high-frequency prefixes, competitive decoding effects that suppress long-tail concepts, and scalability constraints arising from sequential exploration. Using Van der Corput low-discrepancy sequences with arithmetic decoding, DecompressionLM enables deterministic, embarrassingly parallel generation without shared state across sequences. Across two model families and five quantization variants, we find that activation-aware quantization (AWQ-4bit) expands concept coverage by 30-170%, while uniform quantization (GPTQ-Int4) induces 71-86% coverage collapse -- divergent behaviors not reliably reflected by explanation-level perplexity. Corpus-based verification further reveals a 17-point hallucination gap between top- and bottom-ranked MMLU-Pro Law models. DecompressionLM establishes concept coverage as a complementary evaluation dimension for assessing knowledge breadth and factual grounding in compressed models useful for their deployment.
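The combination of Van der Corput sequences with arithmetic decoding can be sketched as follows: each sequence index yields a point in [0, 1), and that point is mapped to a token sequence by repeatedly subdividing the unit interval according to next-token probabilities. The toy uniform model `toy_model`, the function names, and the fixed `max_len` are hypothetical; the paper's actual decoder is not described in the abstract.

```python
def van_der_corput(n, base=2):
    """Return the n-th term of the base-`base` Van der Corput sequence in [0, 1)."""
    value, denom = 0.0, 1.0
    while n > 0:
        n, digit = divmod(n, base)
        denom *= base
        value += digit / denom
    return value


def decode_point(u, next_token_probs, max_len=3):
    """Map a point u in [0, 1) to tokens by repeated interval subdivision.

    `next_token_probs(prefix)` returns a list of (token, probability) pairs
    with positive probabilities summing to 1.
    """
    tokens = []
    for _ in range(max_len):
        dist = next_token_probs(tokens)
        low = 0.0
        for i, (token, prob) in enumerate(dist):
            is_last = i == len(dist) - 1  # fallback guards against rounding error
            if u < low + prob or is_last:
                tokens.append(token)
                # Rescale u to its relative position inside the chosen interval.
                u = min((u - low) / prob, 1.0 - 1e-12)
                break
            low += prob
    return tokens


def toy_model(prefix):
    """Hypothetical stand-in for a language model: uniform over two tokens."""
    return [("a", 0.5), ("b", 0.5)]


sequences = [decode_point(van_der_corput(n), toy_model, max_len=2) for n in range(4)]
# sequences == [["a", "a"], ["b", "a"], ["a", "b"], ["b", "b"]]
```

Each index is decoded independently, with no state shared between sequences, so the loop over `n` could run in parallel and always yields the same outputs for the same model.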

Title: Clause-Internal or Clause-External? Testing Turkish Reflexive Binding in Adapted versus Chain of Thought Large Language Models

Authors: Sercan Karakaş
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00380
Pdf URL: https://arxiv.org/pdf/2602.00380
Copy Paste: [[2602.00380]] Clause-Internal or Clause-External? Testing Turkish Reflexive Binding in Adapted versus Chain of Thought Large Language Models(https://arxiv.org/abs/2602.00380)
Keywords: language model, llm, chain-of-thought
Abstract: This study evaluates whether state-of-the-art large language models capture the binding relations of Turkish reflexive pronouns. We construct a balanced set of 100 sentences that pit local against non-local antecedents for the reflexives kendi and kendisi, and test two contrasting systems: an OpenAI chain-of-thought model designed for multi-step reasoning and Trendyol-LLM-7B-base-v0.1, a LLaMA-2-derived model extensively fine-tuned on Turkish data. Antecedent choice is assessed using a combined sentence-level perplexity and forced-choice paradigm. Trendyol-LLM favours local bindings in approximately 70% of trials, exhibiting a strong locality bias, whereas o1 Mini distributes its choices almost evenly between local and long-distance readings, revealing a marked contrast in binding behaviour across the two systems.
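A perplexity-based comparison of two readings, as used for antecedent choice above, might look like the sketch below. It assumes natural-log per-token log-probabilities are already available for each candidate sentence; how the study actually combines perplexity with the forced-choice prompt is not specified in the abstract, and the function names and labels are hypothetical.

```python
import math


def perplexity(token_logprobs):
    """Sentence-level perplexity from natural-log token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def forced_choice(local_logprobs, nonlocal_logprobs):
    """Pick the reading whose sentence has the lower perplexity (ties go to local)."""
    if perplexity(local_logprobs) <= perplexity(nonlocal_logprobs):
        return "local"
    return "non-local"


choice = forced_choice([-1.0, -1.0], [-2.0, -2.0])
# choice == "local"
```

Counting how often `forced_choice` returns "local" over a balanced sentence set gives a locality preference rate of the kind reported in the abstract.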

Title: When Agents "Misremember" Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems

Authors: Naen Xu, Hengyu An, Shuo Shi, Jinghuai Zhang, Chunyi Zhou, Changjiang Li, Tianyu Du, Zhihui Fu, Jun Wang, Shouling Ji
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2602.00428
Pdf URL: https://arxiv.org/pdf/2602.00428
Copy Paste: [[2602.00428]] When Agents "Misremember" Collectively: Exploring the Mandela Effect in LLM-based Multi-Agent Systems(https://arxiv.org/abs/2602.00428)
Keywords: language model, llm, prompt, agent
Abstract: Recent advancements in large language models (LLMs) have significantly enhanced the capabilities of collaborative multi-agent systems, enabling them to address complex challenges. However, within these multi-agent systems, the susceptibility of agents to collective cognitive biases remains an underexplored issue. A compelling example is the Mandela effect, a phenomenon where groups collectively misremember past events as a result of false details reinforced through social influence and internalized misinformation. This vulnerability limits our understanding of memory bias in multi-agent systems and raises ethical concerns about the potential spread of misinformation. In this paper, we conduct a comprehensive study on the Mandela effect in LLM-based multi-agent systems, focusing on its existence, causing factors, and mitigation strategies. We propose MANBENCH, a novel benchmark designed to evaluate agent behaviors across four common task types that are susceptible to the Mandela effect, using five interaction protocols that vary in agent roles and memory timescales. We evaluate agents powered by several LLMs on MANBENCH to quantify the Mandela effect and analyze how different factors affect it. Moreover, we propose strategies to mitigate this effect, including prompt-level defenses (e.g., cognitive anchoring and source scrutiny) and model-level alignment-based defense, achieving an average 74.40% reduction in the Mandela effect compared to the baseline. Our findings provide valuable insights for developing more resilient and ethically aligned collaborative multi-agent systems.
摘要：大型语言模型（LLM）的最新进展显着增强了协作多智能体系统的能力，使它们能够应对复杂的挑战。然而，在这些多智能体系统中，智能体对集体认知偏差的敏感性仍然是一个尚未充分研究的问题。一个令人信服的例子是曼德拉效应，这种现象是由于社会影响和内在错误信息强化了虚假细节，群体集体错误地记住了过去的事件。此漏洞限制了我们对多代理系统中记忆偏差的理解，并引发了对错误信息潜在传播的道德担忧。在本文中，我们对基于LLM的多智能体系统中的曼德拉效应进行了全面的研究，重点关注其存在、影响因素和缓解策略。我们提出了 MANBENCH，这是一种新颖的基准，旨在使用五种在代理角色和内存时间尺度上有所不同的交互协议来评估易受曼德拉效应影响的四种常见任务类型的代理行为。我们在 MANBENCH 上评估由多个法学硕士支持的代理，以量化曼德拉效应并分析不同因素如何影响它。此外，我们提出了减轻这种影响的策略，包括即时级防御（例如认知锚定和源审查）和基于模型级对齐的防御，与基线相比，曼德拉效应平均降低了 74.40%。我们的研究结果为开发更具弹性和符合道德的协作多智能体系统提供了宝贵的见解。

Title: What Matters to an LLM? Behavioral and Computational Evidences from Summarization

Authors: Yongxin Zhou, Changshun Wu, Philippe Mulhem, Didier Schwab, Maxime Peyrard
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00459
Pdf URL: https://arxiv.org/pdf/2602.00459
Copy Paste: [[2602.00459]] What Matters to an LLM? Behavioral and Computational Evidences from Summarization(https://arxiv.org/abs/2602.00459)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are now state-of-the-art at summarization, yet the internal notion of importance that drives their information selections remains hidden. We propose to investigate this by combining behavioral and computational analyses. Behaviorally, we generate a series of length-controlled summaries for each document and derive empirical importance distributions based on how often each information unit is selected. These reveal that LLMs converge on consistent importance patterns, sharply different from pre-LLM baselines, and that LLMs cluster more by family than by size. Computationally, we identify that certain attention heads align well with empirical importance distributions, and that middle-to-late layers are strongly predictive of importance. Together, these results provide initial insights into what LLMs prioritize in summarization and how this priority is internally represented, opening a path toward interpreting and ultimately controlling information selection in these models.
摘要：大型语言模型（LLM）现在在总结方面是最先进的，但驱动其信息选择的内部重要性概念仍然隐藏。我们建议通过结合行为和计算分析来研究这一点。在行为上，我们为每个文档生成一系列长度控制的摘要，并根据选择每个信息单元的频率得出经验重要性分布。这些表明，法学硕士收敛于一致的重要性模式，与法学硕士之前的基线截然不同，而且法学硕士更多地按家庭而不是规模聚集。通过计算，我们发现某些注意力头与经验重要性分布很好地吻合，并且中后期层可以强烈预测重要性。总之，这些结果为法学硕士在总结中优先考虑的事项以及该优先事项如何在内部表示提供了初步见解，为解释和最终控制这些模型中的信息选择开辟了一条道路。
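
The behavioral analysis in this abstract (deriving importance from how often each information unit is selected across length-controlled summaries) can be illustrated with a minimal sketch. It assumes each summary has already been reduced to a set of information-unit ids; the ids, the example data, and the function name `empirical_importance` are hypothetical, and the paper's actual unit extraction and normalization may differ.

```python
from collections import Counter


def empirical_importance(summaries):
    """Return, for each information unit, the fraction of summaries selecting it."""
    counts = Counter()
    for units in summaries:
        counts.update(set(units))  # count each unit at most once per summary
    n = len(summaries)
    return {unit: count / n for unit, count in counts.items()}


# Hypothetical length-controlled summaries of one document, shortest first.
summaries = [
    {"u1"},
    {"u1", "u2"},
    {"u1", "u2", "u3"},
    {"u1", "u3", "u4"},
]
importance = empirical_importance(summaries)
```

Units that survive even in the shortest summaries receive the highest scores, which is the intuition behind reading selection frequency as importance.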

Title: Intention-Adaptive LLM Fine-Tuning for Text Revision Generation

Authors: Zhexiong Liu, Diane Litman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00477
Pdf URL: https://arxiv.org/pdf/2602.00477
Copy Paste: [[2602.00477]] Intention-Adaptive LLM Fine-Tuning for Text Revision Generation(https://arxiv.org/abs/2602.00477)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved impressive capabilities in various context-based text generation tasks, such as summarization and reasoning; however, their applications in intention-based generation tasks remain underexplored. One such example is revision generation, which requires the generated text to explicitly reflect the writer's actual intentions. Identifying intentions and generating desirable revisions are challenging due to their complex and diverse nature. Although prior work has employed LLMs to generate revisions with few-shot learning, these approaches struggle to handle entangled multi-intent scenarios. While fine-tuning LLMs using intention-based instructions appears promising, it demands large amounts of annotated data, which is expensive and scarce in the revision community. To address these challenges, we propose Intention-Tuning, an intention-adaptive layer-wise LLM fine-tuning framework that dynamically selects a subset of LLM layers to learn the intentions and subsequently transfers their representations to revision generation. Experimental results suggest that Intention-Tuning is effective and efficient on small revision corpora, outperforming several PEFT baselines.
摘要：大型语言模型（LLM）在各种基于上下文的文本生成任务中取得了令人印象深刻的能力，例如摘要和推理；然而，它们在基于意图的生成任务中的应用仍未得到充分探索。其中一个例子是修订生成，它要求生成的文本明确反映作者的实际意图。由于其复杂性和多样性，识别意图并生成所需的修订具有挑战性。尽管之前的工作已经使用法学硕士通过少量学习来生成修订，但它们在处理纠缠的多意图场景方面遇到了困难。虽然使用基于意图的指令对法学硕士进行微调似乎很有希望，但它需要大量带注释的数据，而这在修订社区中是昂贵且稀缺的。为了应对这些挑战，我们提出了意图调整，这是一种意图自适应分层 LLM 微调框架，它动态选择 LLM 层的子集来学习意图，然后将其表示转移到修订生成中。实验结果表明，意图调整在小型修订语料库上是有效且高效的，优于多个 PEFT 基线。

Title: From Knowledge to Inference: Scaling Laws of Specialized Reasoning on GlobalHealthAtlas

Authors: Zhaokun Yan, Zhaohan Liu, Wuzheng Dong, Lijie Feng, Chengxiao Dai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00491
Pdf URL: https://arxiv.org/pdf/2602.00491
Copy Paste: [[2602.00491]] From Knowledge to Inference: Scaling Laws of Specialized Reasoning on GlobalHealthAtlas(https://arxiv.org/abs/2602.00491)
Keywords: language model, llm
Abstract: Public health reasoning requires population-level inference grounded in scientific evidence, expert consensus, and safety constraints. However, it remains underexplored as a structured machine learning problem with limited supervised signals and benchmarks. We introduce GlobalHealthAtlas, a large-scale multilingual dataset of 280,210 instances spanning 15 public health domains and 17 languages, stratified into three difficulty levels from health literacy to epidemiological and policy reasoning. Instances are derived from openly available public health sources and labeled by language, domain, and difficulty to support supervised learning and slice-based evaluation. We further propose a large language model (LLM)-assisted construction and quality-control pipeline with retrieval, duplication, evidence grounding checks, and label validation to improve consistency at scale. Finally, we present a domain-aligned evaluator distilled from high-confidence judgments of diverse LLMs to assess outputs along six dimensions: Accuracy, Reasoning, Completeness, Consensus Alignment, Terminology Norms, and Insightfulness. Together, these contributions enable reproducible training and evaluation of LLMs for safety-critical public health reasoning beyond conventional QA benchmarks.
摘要：公共卫生推理需要基于科学证据、专家共识和安全约束的人口水平推断。然而，作为一个监督信号和基准有限的结构化机器学习问题，它仍然没有得到充分探索。我们引入了 GlobalHealthAtlas，这是一个包含 280,210 个实例的大型多语言数据集，涵盖 15 个公共卫生领域和 17 种语言，分为从健康素养到流行病学和政策推理的三个难度级别。实例来自公开可用的公共卫生资源，并按语言、领域和难度进行标记，以支持监督学习和基于切片的评估。我们进一步提出大语言模型（LLM）辅助构建和质量控制管道，包括检索、复制、证据基础检查和标签验证，以提高大规模的一致性。最后，我们提出了一个从不同法学硕士的高置信度判断中提炼出来的领域对齐评估器，以评估六个维度的输出：准确性、推理、完整性、共识一致性、术语规范和洞察力。这些贡献共同实现了法学硕士的可重复培训和评估，以超越传统的质量保证基准，进行安全关键的公共卫生推理。

Title: Culturally-Grounded Governance for Multilingual Language Models: Rights, Data Boundaries, and Accountable AI Design

Authors: Hanjing Shi, Dominic DiFranzo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00497
Pdf URL: https://arxiv.org/pdf/2602.00497
Copy Paste: [[2602.00497]] Culturally-Grounded Governance for Multilingual Language Models: Rights, Data Boundaries, and Accountable AI Design(https://arxiv.org/abs/2602.00497)
Keywords: language model, llm
Abstract: Multilingual large language models (MLLMs) are increasingly deployed across cultural, linguistic, and political contexts, yet existing governance frameworks largely assume English-centric data, homogeneous user populations, and abstract notions of fairness. This creates systematic risks for low-resource languages and culturally marginalized communities, where data practices, model behavior, and accountability mechanisms often fail to align with local norms, rights, and expectations. Drawing on cross-cultural perspectives in human-centered computing and AI governance, this paper synthesizes existing evidence on multilingual model behavior, data asymmetries, and sociotechnical harm, and articulates a culturally grounded governance framework for MLLMs. We identify three interrelated governance challenges: cultural and linguistic inequities in training data and evaluation practices; misalignment between global deployment and locally situated norms, values, and power structures; and limited accountability mechanisms for addressing harms experienced by marginalized language communities. Rather than proposing new technical benchmarks, we contribute a conceptual agenda that reframes multilingual AI governance as a sociocultural and rights-based problem. We outline design and policy implications for data stewardship, transparency, and participatory accountability, and argue that culturally grounded governance is essential for ensuring that multilingual language models do not reproduce existing global inequalities under the guise of scale and neutrality.
摘要：多语言大语言模型 (MLLM) 越来越多地在文化、语言和政治背景下部署，但现有的治理框架在很大程度上假设以英语为中心的数据、同质的用户群体和抽象的公平概念。这给资源匮乏的语言和文化边缘化社区带来了系统性风险，这些社区的数据实践、模型行为和问责机制往往无法符合当地规范、权利和期望。本文借鉴以人为中心的计算和人工智能治理中的跨文化视角，综合了有关多语言模型行为、数据不对称和社会技术危害的现有证据，并阐明了基于文化的 MLLM 治理框架。我们确定了三个相互关联的治理挑战：培训数据和评估实践中的文化和语言不平等，全球部署与当地规范、价值观和权力结构之间的不一致，以及解决边缘化语言社区所经历的伤害的有限问责机制。我们没有提出新的技术基准，而是提出了一个概念议程，将多语言人工智能治理重新定义为一个社会文化和基于权利的问题。我们概述了数据管理、透明度和参与性问责制的设计和政策影响，并认为基于文化的治理对于确保多语言模型不会在规模和中立的幌子下重现现有的全球不平等至关重要。

Title: Reasoning by Commented Code for Table Question Answering

Authors: Seho Pyo, Jiheon Seok, Jaejin Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00543
Pdf URL: https://arxiv.org/pdf/2602.00543
Copy Paste: [[2602.00543]] Reasoning by Commented Code for Table Question Answering(https://arxiv.org/abs/2602.00543)
Keywords: language model, llm
Abstract: Table Question Answering (TableQA) poses a significant challenge for large language models (LLMs) because conventional linearization of tables often disrupts the two-dimensional relationships intrinsic to structured data. Existing methods, which depend on end-to-end answer generation or single-line program queries, typically exhibit limited numerical accuracy and reduced interpretability. This work introduces a commented, step-by-step code-generation framework that incorporates explicit reasoning into the Python program-generation process. The approach decomposes TableQA reasoning into multi-line executable programs with concise natural language comments, thereby promoting clearer reasoning and increasing the likelihood of generating correct code. On the WikiTableQuestions benchmark, the proposed method achieves 70.9% accuracy using Qwen2.5-Coder-7B-Instruct, surpassing the Repanda baseline (67.6%). Integrating the proposed framework with a robust end-to-end TableQA model via a lightweight answer-selection mechanism yields further improvements. This combined approach achieves up to 84.3% accuracy on the WikiTableQuestions benchmark.
摘要：表问答 (TableQA) 对大型语言模型 (LLM) 提出了重大挑战，因为表的传统线性化通常会破坏结构化数据固有的二维关系。现有的方法依赖于端到端答案生成或单行程序查询，通常表现出有限的数值准确性和降低的可解释性。这项工作引入了一个带注释的分步代码生成框架，它将显式推理合并到 Python 程序生成过程中。该方法将 TableQA 推理分解为具有简洁自然语言注释的多行可执行程序，从而促进更清晰的推理并增加生成正确代码的可能性。在 WikiTableQuestions 基准上，所提出的方法使用 Qwen2.5-Coder-7B-Instruct 实现了 70.9% 的准确率，超过了 Repanda 基线 (67.6%)。通过轻量级答案选择机制将所提出的框架与强大的端到端 TableQA 模型集成，可以产生进一步的改进。这种组合方法在 WikiTableQuestions 基准测试中的准确率高达 84.3%。

Title: The French Drama Revolution: Political Economy and Literary Production, 1700-1900

Authors: Thiago Dumont Oliveira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00588
Pdf URL: https://arxiv.org/pdf/2602.00588
Copy Paste: [[2602.00588]] The French Drama Revolution: Political Economy and Literary Production, 1700-1900(https://arxiv.org/abs/2602.00588)
Keywords: prompt
Abstract: This paper investigates the changing nature of French drama between 1700 and 1900 using Latent Dirichlet Allocation and Jensen-Shannon Divergence. Results indicate that the topical distribution of French drama changed profoundly after the French Revolution, particularly between 1789 and 1850. From the late 18th century onward, bourgeois themes emerged among the most prevalent topics. To assess the coevolution of drama and economic growth, I plot the yearly prevalence of topics alongside French GDP between 1700 and 1900, and discuss these changes in light of the political and economic changes prompted by the French Revolution and the industrialization of the country.
摘要：本文利用潜在狄利克雷分配和詹森-香农分歧研究了 1700 年至 1900 年间法国戏剧性质的变化。结果表明，法国大革命后，特别是 1789 年至 1850 年间，法国戏剧的主题分布发生了深刻变化。资产阶级主题成为 18 世纪末以来最流行的主题之一。为了评估戏剧和经济增长的共同演化，我绘制了 1700 年至 1900 年间法国 GDP 的年度流行度，并根据法国大革命和国家工业化引发的政治和经济变化来讨论这些变化。
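
To make the measure named in this abstract concrete, here is a minimal sketch of the Jensen-Shannon divergence between two topic distributions. The two distributions (`before`, `after`) are invented for illustration and are not the paper's data; base-2 logarithms are an assumption, chosen so that the value lies in [0, 1].

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q against their midpoint m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical topic proportions for two periods (each sums to 1).
before = [0.6, 0.3, 0.1]
after = [0.2, 0.3, 0.5]
shift = js_divergence(before, after)
```

Unlike plain KL divergence, this quantity is symmetric and stays finite when a topic is absent from one period, which makes it convenient for comparing topic mixtures over time.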

Title: Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling

Authors: Zhijie Huang, Stephen McIntosh, Daisuke Saito, Nobuaki Minematsu
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2602.00594
Pdf URL: https://arxiv.org/pdf/2602.00594
Copy Paste: [[2602.00594]] Kanade: A Simple Disentangled Tokenizer for Spoken Language Modeling(https://arxiv.org/abs/2602.00594)
Keywords: language model
Abstract: A good language model starts with a good tokenizer. Tokenization is especially important for speech modeling, which must handle continuous signals that mix linguistic and non-linguistic information. A speech tokenizer should extract phonetics and prosody, suppress linguistically irrelevant information like speaker identity, and enable high-quality synthesis. We present Kanade, a single-layer disentangled speech tokenizer that realizes this ideal. Kanade separates out acoustic constants to create a single stream of tokens that captures rich phonetics and prosody. It does so without the need for auxiliary methods that existing disentangled codecs often rely on. Experiments show that Kanade achieves state-of-the-art speaker disentanglement and lexical availability, while maintaining excellent reconstruction quality.
摘要：一个好的语言模型始于一个好的分词器。标记化对于语音建模尤其重要，它必须处理混合语言和非语言信息的连续信号。语音分词器应该提取语音和韵律，抑制语言上不相关的信息（例如说话者身份），并实现高质量的合成。我们推出了 Kanade，一种单层解缠结语音分词器，可以实现这一理想。 Kanade 分离出声学常数，创建一个单一的标记流，捕获丰富的语音和韵律。它不需要现有解缠结编解码器通常依赖的辅助方法来实现这一点。实验表明，Kanade 实现了最先进的说话人解缠和词汇可用性，同时保持出色的重建质量。

Title: Hermes the Polyglot: A Unified Framework to Enhance Expressiveness for Multimodal Interlingual Subtitling

Authors: Chaoqun Cui, Shijing Wang, Liangbin Huang, Qingqing Gu, Zhaolong Huang, Xiao Zeng, Wenji Mao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00597
Pdf URL: https://arxiv.org/pdf/2602.00597
Copy Paste: [[2602.00597]] Hermes the Polyglot: A Unified Framework to Enhance Expressiveness for Multimodal Interlingual Subtitling(https://arxiv.org/abs/2602.00597)
Keywords: language model, llm
Abstract: Interlingual subtitling, which translates subtitles of visual media into a target language, is essential for entertainment localization but has not yet been explored in machine translation. Although Large Language Models (LLMs) have significantly advanced the general capabilities of machine translation, the distinctive characteristics of subtitle texts pose persistent challenges in interlingual subtitling, particularly regarding semantic coherence, pronoun and terminology translation, and translation expressiveness. To address these issues, we present Hermes, an LLM-based automated subtitling framework. Hermes integrates three modules: Speaker Diarization, Terminology Identification, and Expressiveness Enhancement, which effectively tackle the above challenges. Experiments demonstrate that Hermes achieves state-of-the-art diarization performance and generates expressive, contextually coherent translations, thereby advancing research in interlingual subtitling.
摘要：语际字幕将视觉媒体的字幕翻译成目标语言，对于娱乐本地化至关重要，但尚未在机器翻译中得到探索。尽管大型语言模型（LLM）显着提高了机器翻译的一般能力，但字幕文本的独特特征对语间字幕提出了持续的挑战，特别是在语义连贯性、代词和术语翻译以及翻译表现力方面。为了解决这些问题，我们推出了 Hermes，一个基于法学硕士的自动字幕框架。 Hermes集成了说话人分类、术语识别和表达力增强三个模块，有效解决了上述挑战。实验表明，Hermes 实现了最先进的二值化性能，并生成富有表现力、上下文连贯的翻译，从而推进了语间字幕的研究。

Title: Lookahead-then-Verify: Reliable Constrained Decoding for Diffusion LLMs under Context-Free Grammars

Authors: Yitong Zhang, Yongmin Li, Yuetong Liu, Jia Li, Xiaoran Jia, Zherui Li, Ge Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00612
Pdf URL: https://arxiv.org/pdf/2602.00612
Copy Paste: [[2602.00612]] Lookahead-then-Verify: Reliable Constrained Decoding for Diffusion LLMs under Context-Free Grammars(https://arxiv.org/abs/2602.00612)
Keywords: language model, llm
Abstract: Diffusion Large Language Models (dLLMs) have demonstrated promising generative capabilities and are increasingly used to produce formal languages defined by context-free grammars, such as source code and chemical expressions. However, as probabilistic models, they still struggle to generate syntactically valid outputs reliably. A natural and promising direction to address this issue is to adapt constrained decoding techniques to enforce grammatical correctness during generation. However, applying these techniques faces two primary obstacles. On the one hand, the non-autoregressive nature of dLLMs renders most existing constrained decoding approaches inapplicable. On the other hand, current approaches specifically designed for dLLMs may allow intermediate outputs that are impossible to complete into valid sentences, which significantly limits their reliability in practice. To address these challenges, we present LAVE, a constrained decoding approach specifically designed for dLLMs. Our approach leverages a key property of dLLMs, namely their ability to predict token distributions for all positions in parallel during each forward pass. Whenever a new token is proposed by the model, LAVE performs lookahead using these distributions to efficiently and reliably verify the validity of the proposed token. This design ensures reliable constraint enforcement by preserving the potential for intermediate outputs to be extended into valid sentences. Extensive experiments across four widely used dLLMs and three representative benchmarks demonstrate that LAVE consistently outperforms existing baselines and achieves substantial improvements in syntactic correctness, while incurring negligible runtime overhead.
摘要：扩散大型语言模型 (dLLM) 已展现出有前景的生成能力，并且越来越多地用于生成由上下文无关语法定义的形式语言，例如源代码和化学表达式。然而，作为概率模型，它们仍然难以可靠地生成语法上有效的输出。解决这个问题的一个自然而有前途的方向是采用约束解码技术来强制生成过程中的语法正确性。然而，应用这些技术面临两个主要障碍。一方面，dLLM 的非自回归性质使得大多数现有的约束解码方法不适用。另一方面，当前专为 dLLM 设计的方法可能允许无法完成有效句子的中间输出，这极大地限制了它们在实践中的可靠性。为了应对这些挑战，我们提出了 LAVE，这是一种专为 dLLM 设计的受限解码方法。我们的方法利用了 dLLM 的一个关键属性，即它们在每次前向传递期间并行预测所有位置的代币分布的能力。每当模型提出新令牌时，LAVE 都会使用这些分布执行前瞻，以高效可靠地验证所提议令牌的有效性。这种设计通过可靠地保留中间输出扩展到有效句子的潜力来确保可靠的约束。跨越四个广泛使用的 dLLM 和三个代表性基准的大量实验表明，LAVE 始终优于现有基线，并在语法正确性方面实现了实质性改进，同时产生的运行时开销可以忽略不计。

Title: Transformer-Based Model for Multilingual Hope Speech Detection

Authors: Nsrin Ashraf, Mariam Labib, Hamada Nayel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00613
Pdf URL: https://arxiv.org/pdf/2602.00613
Copy Paste: [[2602.00613]] Transformer-Based Model for Multilingual Hope Speech Detection(https://arxiv.org/abs/2602.00613)
Keywords: language model
Abstract: This paper describes a system that has been submitted to the "PolyHope-M" at RANLP2025. In this work various transformers have been implemented and evaluated for hope speech detection for English and German. RoBERTa has been implemented for English, while the multilingual model XLM-RoBERTa has been implemented for both English and German languages. The proposed system using RoBERTa reported a weighted F1-score of 0.818 and an accuracy of 81.8% for English. On the other hand, XLM-RoBERTa achieved a weighted F1-score of 0.786 and an accuracy of 78.5%. These results reflect the importance of improving pre-trained large language models and how these models enhance the performance of different natural language processing tasks.
摘要：本文描述了一个已在 RANLP2025 上提交给“PolyHope-M”的系统。在这项工作中，已经实现并评估了各种变压器，用于英语和德语的希望语音检测。 RoBERTa 已针对英语实施，而多语言模型 XLM-RoBERTa 已针对英语和德语实施。使用 RoBERTa 的拟议系统报告的英语加权 f1 分数为 0.818，准确率为 81.8%。另一方面，XLM-RoBERTa 的加权 f1 得分为 0.786，准确度为 78.5%。这些结果反映了改进预训练大型语言模型的重要性以及这些模型如何提高不同自然语言处理任务的性能。

Title: Jailbreaking LLMs via Calibration

Authors: Yuxuan Lu, Yongkang Guo, Yuqing Kong
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2602.00619
Pdf URL: https://arxiv.org/pdf/2602.00619
Copy Paste: [[2602.00619]] Jailbreaking LLMs via Calibration(https://arxiv.org/abs/2602.00619)
Keywords: language model, gpt, llm
Abstract: Safety alignment in Large Language Models (LLMs) often creates a systematic discrepancy between a model's aligned output and the underlying pre-aligned data distribution. We propose a framework in which the effect of safety alignment on next-token prediction is modeled as a systematic distortion of a pre-alignment distribution. We cast Weak-to-Strong Jailbreaking as a forecast aggregation problem and derive an optimal aggregation strategy characterized by a Gradient Shift in the loss-induced dual space. We show that logit-arithmetic jailbreaking methods are a special case of this framework under cross-entropy loss, and derive a broader family of aggregation rules corresponding to other proper losses. We also propose a new hybrid aggregation rule. Evaluations across red-teaming benchmarks and math utility tasks using frontier models demonstrate that our approach achieves superior Attack Success Rates and lower "Jailbreak Tax" compared with existing methods, especially on the safety-hardened gpt-oss-120b.
摘要：大型语言模型 (LLM) 中的安全对齐通常会在模型的对齐输出和底层预对齐数据分布之间产生系统差异。我们提出了一个框架，其中安全对齐对下一个令牌预测的影响被建模为预对齐分布的系统失真。我们将弱到强的越狱视为一个预测聚合问题，并推导出一个最优聚合策略，其特征是损失引起的对偶空间中的梯度转移。我们证明了对数算术越狱方法是该框架在交叉熵损失下的一个特例，并导出了与其他适当损失相对应的更广泛的聚合规则族。我们还提出了一种新的混合聚合规则。使用前沿模型对红队基准和数学实用任务进行的评估表明，与现有方法相比，我们的方法实现了卓越的攻击成功率和更低的“越狱税”，特别是在安全强化的 gpt-oss-120b 上。

Title: Formal Semantic Control over Language Models

Authors: Yingji Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00638
Pdf URL: https://arxiv.org/pdf/2602.00638
Copy Paste: [[2602.00638]] Formal Semantic Control over Language Models(https://arxiv.org/abs/2602.00638)
Keywords: language model
Abstract: This thesis advances semantic representation learning to render language representations or models more semantically and geometrically interpretable, and to enable localised, quasi-symbolic, compositional control through deliberate shaping of their latent space geometry. We pursue this goal within a VAE framework, exploring two complementary research directions: (i) Sentence-level learning and control: disentangling and manipulating specific semantic features in the latent space to guide sentence generation, with explanatory text serving as the testbed; and (ii) Reasoning-level learning and control: isolating and steering inference behaviours in the latent space to control NLI. In this direction, we focus on Explanatory NLI tasks, in which two premises (explanations) are provided to infer a conclusion. The overarching objective is to move toward language models whose internal semantic representations can be systematically interpreted, precisely structured, and reliably directed. We introduce a set of novel theoretical frameworks and practical methodologies, together with corresponding experiments, to demonstrate that our approaches enhance both the interpretability and controllability of latent spaces for natural language across the thesis.
摘要：本论文推进了语义表示学习，使语言表示或模型在语义和几何上更具可解释性，并通过有意塑造其潜在空间几何形状来实现局部的、准符号的组合控制。我们在 VAE 框架内追求这一目标，探索两个互补的研究方向：（i）句子级学习和控制：解开和操纵潜在空间中的特定语义特征以指导句子生成，以解释性文本作为测试平台； (ii) 推理级学习和控制：隔离和引导潜在空间中的推理行为以控制 NLI。在这个方向上，我们专注于解释性 NLI 任务，其中提供两个前提（解释）来推断结论。总体目标是建立内部语义表示可以被系统解释、精确结构化和可靠引导的语言模型。我们引入了一套新颖的理论框架和实用方法，以及相应的实验，以证明我们的方法增强了整个论文中自然语言潜在空间的可解释性和可控性。

Title: LegalOne: A Family of Foundation Models for Reliable Legal Reasoning

Authors: Haitao Li, Yifan Chen, Shuo Miao, Qian Dong, Jia Chen, Yiran Hu, Junjie Chen, Minghao Qin, Qingyao Ai, Yiqun Liu, Cheng Luo, Quan Zhou, Ya Zhang, Jikun Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00642
Pdf URL: https://arxiv.org/pdf/2602.00642
Copy Paste: [[2602.00642]] LegalOne: A Family of Foundation Models for Reliable Legal Reasoning(https://arxiv.org/abs/2602.00642)
Keywords: language model, llm, agent
Abstract: While Large Language Models (LLMs) have demonstrated impressive general capabilities, their direct application in the legal domain is often hindered by a lack of precise domain knowledge and the complexity of performing rigorous multi-step judicial reasoning. To address this gap, we present LegalOne, a family of foundational models specifically tailored for the Chinese legal domain. LegalOne is developed through a comprehensive three-phase pipeline designed to master legal reasoning. First, during the mid-training phase, we propose Plasticity-Adjusted Sampling (PAS) to address the challenge of domain adaptation. This perplexity-based scheduler strikes a balance between the acquisition of new knowledge and the retention of original capabilities, effectively establishing a robust legal foundation. Second, during supervised fine-tuning, we employ Legal Agentic CoT Distillation (LEAD) to distill explicit reasoning from raw legal texts. Unlike naive distillation, LEAD utilizes an agentic workflow to convert complex judicial processes into structured reasoning trajectories, thereby enforcing factual grounding and logical rigor. Finally, we implement a Curriculum Reinforcement Learning (RL) strategy. Through a progressive reinforcement process spanning memorization, understanding, and reasoning, LegalOne evolves from simple pattern matching to autonomous and reliable legal reasoning. Experimental results demonstrate that LegalOne achieves state-of-the-art performance across a wide range of legal tasks, surpassing general-purpose LLMs with vastly larger parameter counts through enhanced knowledge density and efficiency. We publicly release the LegalOne weights and the LegalKit evaluation framework to advance the field of Legal AI, paving the way for deploying trustworthy and interpretable foundation models in high-stakes judicial applications.
摘要：虽然大型语言模型（LLM）已经表现出令人印象深刻的通用能力，但它们在法律领域的直接应用往往因缺乏精确的领域知识和执行严格的多步骤司法推理的复杂性而受到阻碍。为了解决这一差距，我们推出了 LegalOne，这是专门为中国法律领域量身定制的一系列基础模型。 LegalOne 是通过旨在掌握法律推理的综合三阶段管道开发的。首先，在训练中期阶段，我们提出可塑性调整采样（PAS）来解决领域适应的挑战。这种基于困惑的调度器在获取新知识和保留原有能力之间取得了平衡，有效地建立了坚实的法律基础。其次，在监督微调过程中，我们采用法律代理 CoT 蒸馏 (LEAD) 从原始法律文本中提取明确的推理。与朴素蒸馏不同，LEAD 利用代理工作流程将复杂的司法流程转换为结构化推理轨迹，从而强化事实基础和逻辑严谨性。最后，我们实施课程强化学习（RL）策略。通过跨越记忆、理解和推理的渐进强化过程，LegalOne 从简单的模式匹配演变为自主且可靠的法律推理。实验结果表明，LegalOne 在广泛的法律任务中实现了最先进的性能，通过提高知识密度和效率，超越了参数数量大得多的通用法学硕士。我们公开发布LegalOne权重和LegalKit评估框架，以推进法律人工智能领域的发展，为在高风险司法应用中部署值得信赖和可解释的基础模型铺平道路。

Title: Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation

Authors: Lakshan Cooray, Deshan Sumanathilaka, Pattigadapa Venkatesh Raju
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00665
Pdf URL: https://arxiv.org/pdf/2602.00665
Copy Paste: [[2602.00665]] Can Small Language Models Handle Context-Summarized Multi-Turn Customer-Service QA? A Synthetic Data-Driven Comparative Evaluation(https://arxiv.org/abs/2602.00665)
Keywords: language model, llm
Abstract: Customer-service question answering (QA) systems increasingly rely on conversational language understanding. While Large Language Models (LLMs) achieve strong performance, their high computational cost and deployment constraints limit practical use in resource-constrained environments. Small Language Models (SLMs) provide a more efficient alternative, yet their effectiveness for multi-turn customer-service QA remains underexplored, particularly in scenarios requiring dialogue continuity and contextual understanding. This study investigates instruction-tuned SLMs for context-summarized multi-turn customer-service QA, using a history summarization strategy to preserve essential conversational state. We also introduce a conversation stage-based qualitative analysis to evaluate model behavior across different phases of customer-service interactions. Nine instruction-tuned low-parameterized SLMs are evaluated against three commercial LLMs using lexical and semantic similarity metrics alongside qualitative assessments, including human evaluation and LLM-as-a-judge methods. Results show notable variation across SLMs, with some models demonstrating near-LLM performance, while others struggle to maintain dialogue continuity and contextual alignment. These findings highlight both the potential and current limitations of low-parameterized language models for real-world customer-service QA systems.
摘要：客户服务问答 (QA) 系统越来越依赖对话语言理解。虽然大型语言模型 (LLM) 实现了强大的性能，但其高计算成本和部署限制限制了在资源受限环境中的实际使用。小语言模型 (SLM) 提供了一种更有效的替代方案，但其多轮客户服务 QA 的有效性仍未得到充分探索，特别是在需要对话连续性和上下文理解的场景中。本研究研究了针对上下文总结的多轮客户服务 QA 的指令调整 SLM，使用历史总结策略来保留基本的对话状态。我们还引入了基于对话阶段的定性分析，以评估客户服务交互不同阶段的模型行为。使用词汇和语义相似性指标以及定性评估（包括人工评估和法学硕士作为法官的方法），针对三个商业法学硕士对九个指令调整的低参数化 SLM 进行了评估。结果显示，SLM 之间存在显着差异，一些模型表现出接近 LLM 的性能，而其他模型则难以保持对话连续性和上下文一致性。这些发现凸显了低参数化语言模型对于现实世界的客户服务 QA 系统的潜在和当前局限性。

Title: ExperienceWeaver: Optimizing Small-sample Experience Learning for LLM-based Clinical Text Improvement

Authors: Ziyan Xiao, Yinghao Zhu, Liang Peng, Lequan Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00740
Pdf URL: https://arxiv.org/pdf/2602.00740
Copy Paste: [[2602.00740]] ExperienceWeaver: Optimizing Small-sample Experience Learning for LLM-based Clinical Text Improvement(https://arxiv.org/abs/2602.00740)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Clinical text improvement is vital for healthcare efficiency but remains difficult due to limited high-quality data and the complex constraints of medical documentation. While Large Language Models (LLMs) show promise, current approaches struggle in small-sample settings: supervised fine-tuning is data-intensive and costly, while retrieval-augmented generation often provides superficial corrections without capturing the reasoning behind revisions. To address these limitations, we propose ExperienceWeaver, a hierarchical framework that shifts the focus from data retrieval to experience learning. Instead of simply recalling past examples, ExperienceWeaver distills noisy, multi-dimensional feedback into structured, actionable knowledge: specifically, error-specific Tips and high-level Strategies. By injecting this distilled experience into an agentic pipeline, the model learns "how to revise" rather than just "what to revise". Extensive evaluations across four clinical datasets demonstrate that ExperienceWeaver consistently improves performance, surpassing state-of-the-art models such as Gemini-3 Pro in small-sample settings.
摘要：临床文本改进对于医疗保健效率至关重要，但由于高质量数据有限和医疗文档的复杂限制，临床文本改进仍然很困难。虽然大型语言模型（LLM）显示出希望，但当前的方法在小样本环境中举步维艰：监督微调需要大量数据且成本高昂，而检索增强生成通常提供肤浅的更正，而无法捕获修订背后的推理。为了解决这些限制，我们提出了 ExperienceWeaver，这是一个分层框架，可将重点从数据检索转移到体验学习。 ExperienceWeaver 不是简单地回忆过去的例子，而是将嘈杂的多维反馈提炼成结构化的、可操作的知识。具体来说，是针对特定错误的提示和高级策略。通过将这种经过提炼的经验注入代理管道中，模型学习“如何修改”而不仅仅是“修改什么”。对四个临床数据集的广泛评估表明，ExperienceWeaver 不断提高性能，在小样本设置中超越了 Gemini-3 Pro 等最先进的模型。

Title: CURP: Codebook-based Continuous User Representation for Personalized Generation with LLMs

Authors: Liang Wang, Xinyi Mou, Xiaoyou Liu, Xuanjing Huang, Zhongyu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00742
Pdf URL: https://arxiv.org/pdf/2602.00742
Copy Paste: [[2602.00742]] CURP: Codebook-based Continuous User Representation for Personalized Generation with LLMs(https://arxiv.org/abs/2602.00742)
Keywords: language model, llm, prompt
Abstract: User modeling characterizes individuals through their preferences and behavioral patterns to enable personalized simulation and generation with Large Language Models (LLMs) in contemporary approaches. However, existing methods, whether prompt-based or training-based, face challenges in balancing personalization quality against computational and data efficiency. We propose a novel framework, CURP, which employs a bidirectional user encoder and a discrete prototype codebook to extract multi-dimensional user traits. This design enables plug-and-play personalization with a small number of trainable parameters (about 20M parameters, about 0.2% of the total model size). Through extensive experiments on variant generation tasks, we show that CURP achieves superior performance and generalization compared to strong baselines, while offering better interpretability and scalability. The code is available at this https URL
摘要：用户建模通过个人的偏好和行为模式来表征个人，从而以现代方法利用大型语言模型 (LLM) 实现个性化模拟和生成。然而，现有的方法，无论是基于提示的方法还是基于训练的方法，都面临着平衡个性化质量与计算和数据效率的挑战。我们提出了一种新颖的框架 CURP，它采用双向用户编码器和离散原型码本来提取多维用户特征。这种设计可以通过少量的可训练参数（约 20M 个参数，约占总模型大小的 0.2%）实现即插即用的个性化。通过对变体生成任务的大量实验，我们表明，与强大的基线相比，CURP 实现了卓越的性能和泛化，同时提供了更好的可解释性和可扩展性。代码可在此 https URL 获取

Title: Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training

Authors: Shengrui Li, Fei Zhao, Kaiyan Zhao, Jieying Ye, Haifeng Liu, Fangcheng Shi, Zheyong Xie, Yao Hu, Shaosheng Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00747
Pdf URL: https://arxiv.org/pdf/2602.00747
Copy Paste: [[2602.00747]] Decouple Searching from Training: Scaling Data Mixing via Model Merging for Large Language Model Pre-training(https://arxiv.org/abs/2602.00747)
Keywords: language model, llm
Abstract: Determining an effective data mixture is a key factor in Large Language Model (LLM) pre-training, where models must balance general competence with proficiency on hard tasks such as math and code. However, identifying an optimal mixture remains an open challenge, as existing approaches either rely on unreliable tiny-scale proxy experiments or require prohibitively expensive large-scale exploration. To address this, we propose Decouple Searching from Training Mix (DeMix), a novel framework that leverages model merging to predict optimal data ratios. Instead of training proxy models for every sampled mixture, DeMix trains component models on candidate datasets at scale and derives data mixture proxies via weighted model merging. This paradigm decouples search from training costs, enabling evaluation of unlimited sampled mixtures without extra training burden and thus facilitating better mixture discovery through more search trials. Extensive experiments demonstrate that DeMix breaks the trade-off between sufficiency, accuracy and efficiency, obtaining the optimal mixture with higher benchmark performance at lower search cost. Additionally, we release the DeMix Corpora, a comprehensive 22T-token dataset comprising high-quality pre-training data with validated mixtures to facilitate open research. Our code and DeMix Corpora are available at this https URL.
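The DeMix abstract describes deriving a proxy for a data mixture by weighted merging of component models. The sketch below shows one simple reading of that idea: a normalized weighted average of parameters, with each component's weight standing in for its dataset's share of the mixture. The exact merging rule used in the paper is not given in the abstract, so the linear average, the function name `merge_models`, and the toy parameters are assumptions.

```python
def merge_models(component_params, weights):
    """Weighted average of parameter dicts (name -> list of floats).

    Assumption: weights play the role of candidate mixture ratios and are
    normalized to sum to 1 before merging.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    norm = [w / total for w in weights]
    merged = {}
    for name in component_params[0]:
        size = len(component_params[0][name])
        merged[name] = [
            sum(w * params[name][i] for w, params in zip(norm, component_params))
            for i in range(size)
        ]
    return merged

# Two hypothetical component models, e.g. one trained on math, one on code.
math_model = {"w": [0.0, 2.0]}
code_model = {"w": [4.0, 6.0]}
proxy = merge_models([math_model, code_model], weights=[1, 3])
print(proxy)
```

Because merging needs no gradient steps, many candidate weight vectors can be scored by merging and evaluating, which is the decoupling of search from training that the abstract refers to.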

Title: Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Case Study from Retrospective Forecasting

Authors: Ali El Lahib, Ying-Jieh Xia, Zehan Li, Yuxuan Wang, Xinyu Pi
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.00758
Pdf URL: https://arxiv.org/pdf/2602.00758
Copy Paste: [[2602.00758]] Temporal Leakage in Search-Engine Date-Filtered Web Retrieval: A Case Study from Retrospective Forecasting(https://arxiv.org/abs/2602.00758)
Keywords: language model, gpt, llm
Abstract: Search-engine date filters are widely used to enforce pre-cutoff retrieval in retrospective evaluations of search-augmented forecasters. We show this approach is unreliable: auditing Google Search with a before: filter, 71% of questions return at least one page containing strong post-cutoff leakage, and for 41%, at least one page directly reveals the answer. Using a large language model (LLM), gpt-oss-120b, to forecast with these leaky documents, we demonstrate an inflated prediction accuracy (Brier score 0.108 vs. 0.242 with leak-free documents). We characterize common leakage mechanisms, including updated articles, related-content modules, unreliable metadata/timestamps, and absence-based signals, and argue that date-restricted search is insufficient for temporal evaluation. We recommend stronger retrieval safeguards or evaluation on frozen, time-stamped web snapshots to ensure credible retrospective forecasting.
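The abstract above reports forecasting quality as a Brier score, where lower is better. For binary questions this is the mean squared difference between the forecast probability and the 0/1 outcome. The short sketch below computes it; the forecasts and outcomes are made-up numbers for illustration, not data from the paper.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical forecasts for two questions that resolved yes (1) and no (0).
score = brier_score([0.9, 0.2], [1, 0])
print(round(score, 3))
```

A forecaster that always answers 0.5 scores 0.25, so the reported 0.242 with leak-free documents sits close to that uninformed baseline, while 0.108 with leaky documents looks far more skilled than it is.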

Title: Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning

Authors: Zhipeng Chen, Xiaobo Qin, Wayne Xin Zhao, Youbin Wu, Ji-Rong Wen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00759
Pdf URL: https://arxiv.org/pdf/2602.00759
Copy Paste: [[2602.00759]] Adaptive Ability Decomposing for Unlocking Large Reasoning Model Effective Reinforcement Learning(https://arxiv.org/abs/2602.00759)
Keywords: language model, llm
Abstract: Reinforcement learning with verifiable rewards (RLVR) has shown great potential to enhance the reasoning ability of large language models (LLMs). However, due to the limited amount of information provided during the RLVR process, the model can only engage in largely blind exploration, which often results in failure on challenging problems. To provide additional information for the RLVR process without relying on a teacher model, we propose A$^2$D, an Adaptive Ability Decomposing method for enhancing the effectiveness of RLVR. Specifically, we first train a decomposer via RLVR without distillation, enabling it to decompose complex questions into a set of simpler sub-questions. Next, we use this decomposer to annotate sub-questions for each question in the training dataset, and then train the reasoner under RLVR with sub-question guidance. To better understand A$^2$D, we first compare its performance with competitive baselines, showing its effectiveness. Next, we observe that our method functions as a plug-and-play module that can be applied to different RLVR algorithms. Furthermore, we conduct an analysis of the decomposer, revealing how the RLVR process affects its performance and behavior, and which type of guidance is better suited for enhancing the reasoner's exploration and exploitation abilities.

Title: WordCraft: Scaffolding the Keyword Method for L2 Vocabulary Learning with Multimodal LLMs

Authors: Yuheng Shao, Junjie Xiong, Chaoran Wu, Xiyuan Wang, Ziyu Zhou, Yang Ouyang, Qinyi Tao, Quan Li
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2602.00762
Pdf URL: https://arxiv.org/pdf/2602.00762
Copy Paste: [[2602.00762]] WordCraft: Scaffolding the Keyword Method for L2 Vocabulary Learning with Multimodal LLMs(https://arxiv.org/abs/2602.00762)
Keywords: language model, llm
Abstract: Applying the keyword method for vocabulary memorization remains a significant challenge for L1 Chinese-L2 English learners. They frequently struggle to generate phonologically appropriate keywords, construct coherent associations, and create vivid mental imagery to aid long-term retention. Existing approaches, including fully automated keyword generation and outcome-oriented mnemonic aids, either compromise learner engagement or lack adequate process-oriented guidance. To address these limitations, we conducted a formative study with L1 Chinese-L2 English learners and educators (N=18), which revealed key difficulties and requirements in applying the keyword method to vocabulary learning. Building on these insights, we introduce WordCraft, a learner-centered interactive tool powered by Multimodal Large Language Models (MLLMs). WordCraft scaffolds the keyword method by guiding learners through keyword selection, association construction, and image formation, thereby enhancing the effectiveness of vocabulary memorization. Two user studies demonstrate that WordCraft not only preserves the generation effect but also achieves high levels of effectiveness and usability.

Title: Eliciting Trustworthiness Priors of Large Language Models via Economic Games

Authors: Siyu Yan, Lusha Zhu, Jian-Qiao Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00769
Pdf URL: https://arxiv.org/pdf/2602.00769
Copy Paste: [[2602.00769]] Eliciting Trustworthiness Priors of Large Language Models via Economic Games(https://arxiv.org/abs/2602.00769)
Keywords: language model, gpt, llm, agent
Abstract: One critical aspect of building human-centered, trustworthy artificial intelligence (AI) systems is maintaining calibrated trust: appropriate reliance on AI systems outperforms both overtrust (e.g., automation bias) and undertrust (e.g., disuse). A fundamental challenge, however, is how to characterize the level of trust exhibited by an AI system itself. Here, we propose a novel elicitation method based on iterated in-context learning (Zhu and Griffiths, 2024a) and apply it to elicit trustworthiness priors using the Trust Game from behavioral game theory. The Trust Game is particularly well suited for this purpose because it operationalizes trust as voluntary exposure to risk based on beliefs about another agent, rather than self-reported attitudes. Using our method, we elicit trustworthiness priors from several leading large language models (LLMs) and find that GPT-4.1's trustworthiness priors closely track those observed in humans. Building on this result, we further examine how GPT-4.1 responds to different player personas in the Trust Game, providing an initial characterization of how such models differentiate trust across agent characteristics. Finally, we show that variation in elicited trustworthiness can be well predicted by a stereotype-based model grounded in perceived warmth and competence.

Title: Reasoning as State Transition: A Representational Analysis of Reasoning Evolution in Large Language Models

Authors: Siyuan Zhang, Jialian Li, Yichi Zhang, Xiao Yang, Yinpeng Dong, Hang Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00770
Pdf URL: https://arxiv.org/pdf/2602.00770
Copy Paste: [[2602.00770]] Reasoning as State Transition: A Representational Analysis of Reasoning Evolution in Large Language Models(https://arxiv.org/abs/2602.00770)
Keywords: language model
Abstract: Large Language Models have achieved remarkable performance on reasoning tasks, motivating research into how this ability evolves during training. Prior work has primarily analyzed this evolution via explicit generation outcomes, treating the reasoning process as a black box and obscuring internal changes. To address this opacity, we introduce a representational perspective to investigate the dynamics of the model's internal states. Through comprehensive experiments across models at various training stages, we discover that post-training yields only limited improvement in static initial representation quality. Furthermore, we reveal that, distinct from non-reasoning tasks, reasoning involves a significant continuous distributional shift in representations during generation. Comparative analysis indicates that post-training empowers models to drive this transition toward a better distribution for task solving. To clarify the relationship between internal states and external outputs, statistical analysis confirms a high correlation between generation correctness and the final representations; while counterfactual experiments identify the semantics of the generated tokens, rather than additional computation during inference or intrinsic parameter differences, as the dominant driver of the transition. Collectively, we offer a novel understanding of the reasoning process and the effect of training on reasoning enhancement, providing valuable insights for future model analysis and optimization.

Title: HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference

Authors: Xuan Ai, Qingqing Yang, Peng Wang, Lei Deng, Lin Zhang, Renhai Chen, Gong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00777
Pdf URL: https://arxiv.org/pdf/2602.00777
Copy Paste: [[2602.00777]] HyLRA: Hybrid Layer Reuse Attention for Efficient Long-Context Inference(https://arxiv.org/abs/2602.00777)
Keywords: language model, llm
Abstract: Long-context inference in Large Language Models (LLMs) is bottlenecked by the quadratic computation complexity of attention and the substantial memory footprint of Key-Value (KV) caches. While existing sparse attention mechanisms attempt to mitigate this by exploiting inherent sparsity, they often rely on rigid patterns or aggressive pruning, failing to achieve an optimal balance between efficiency and accuracy. In this paper, we introduce HyLRA (Hybrid Layer Reuse Attention), a novel framework driven by layer-wise sparsity profiling. Our empirical analysis uncovers a dual characteristic in attention mechanics: intra-layer sensitivity, where specific layers necessitate full attention to prevent feature distortion, and inter-layer similarity, where consecutive layers share substantial critical tokens. Based on these observations, HyLRA employs an offline dynamic programming approach to derive an optimal layer-wise policy. This hybrid strategy retains full attention for sensitive layers to ensure robustness, while enabling tolerant layers to bypass quadratic calculations by directly reusing top-k indices from preceding layers. This approach allows LLMs to restrict computation to the most critical tokens, effectively overcoming the quadratic bottleneck of dense attention. Extensive evaluations demonstrate that HyLRA improves inference throughput by 6%-46% while maintaining comparable performance (with <1% accuracy degradation), consistently outperforming state-of-the-art sparse attention methods. HyLRA is open source at this https URL (/r/unified-cache-management-CF80/).
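The layer-reuse idea in the HyLRA abstract can be sketched as follows: "full" layers attend to every token and record their top-k indices, while "reuse" layers skip the dense scoring and attend only to the indices cached by the most recent full layer. This is a toy illustration, not the released implementation. The per-layer policy is written by hand here, whereas the paper derives it with offline dynamic programming, and the names `topk_indices` and `plan_attention` and the score values are hypothetical.

```python
def topk_indices(scores, k):
    """Indices of the k largest scores, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def plan_attention(layer_scores, policy, k):
    """Return, per layer, the token indices that layer attends to."""
    attended, cached = [], None
    for scores, mode in zip(layer_scores, policy):
        if mode == "reuse" and cached is not None:
            attended.append(cached)                    # sparse: reuse cached top-k
        else:
            attended.append(list(range(len(scores))))  # dense: all tokens
            cached = topk_indices(scores, k)           # refresh cache for later layers
    return attended

# Hypothetical attention scores for 3 layers over 4 tokens.
layer_scores = [
    [0.1, 0.7, 0.2, 0.9],
    [0.3, 0.3, 0.3, 0.3],      # never consulted: this layer reuses indices
    [0.8, 0.1, 0.05, 0.05],
]
plan = plan_attention(layer_scores, ["full", "reuse", "full"], k=2)
print(plan)
```

The saving comes from the reuse layers, which touch k tokens instead of all of them; the accuracy risk is that the cached indices may be stale, which is why sensitive layers keep full attention.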

Title: Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis

Authors: Zicheng Kong, Dehua Ma, Zhenbo Xu, Alven Yang, Yiwei Ru, Haoran Wang, Zixuan Zhou, Fuqing Bie, Liuyu Xiang, Huijia Wu, Jian Zhao, Zhaofeng He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00846
Pdf URL: https://arxiv.org/pdf/2602.00846
Copy Paste: [[2602.00846]] Omni-RRM: Advancing Omni Reward Modeling via Automatic Rubric-Grounded Preference Synthesis(https://arxiv.org/abs/2602.00846)
Keywords: language model, gpt, llm
Abstract: Multimodal large language models (MLLMs) have shown remarkable capabilities, yet their performance is often capped by the coarse nature of existing alignment techniques. A critical bottleneck remains the lack of effective reward models (RMs): existing RMs are predominantly vision-centric, return opaque scalar scores, and rely on costly human annotations. We introduce Omni-RRM, the first open-source rubric-grounded reward model that produces structured, multi-dimension preference judgments with dimension-wise justifications across text, image, video, and audio. At the core of our approach is Omni-Preference, a large-scale dataset built via a fully automated pipeline: we synthesize candidate response pairs by contrasting models of different capabilities, and use strong teacher models to reconcile and filter preferences while providing a modality-aware rubric-grounded rationale for each pair. This eliminates the need for human-labeled training preferences. Omni-RRM is trained in two stages: supervised fine-tuning to learn the rubric-grounded outputs, followed by reinforcement learning (GRPO) to sharpen discrimination on difficult, low-contrast pairs. Comprehensive evaluations show that Omni-RRM achieves state-of-the-art accuracy on video (80.2% on ShareGPT-V) and audio (66.8% on Audio-HH-RLHF) benchmarks, and substantially outperforms existing open-source RMs on image tasks, with a 17.7% absolute gain over its base model on overall accuracy. Omni-RRM also improves downstream performance via Best-of-N selection and transfers to text-only preference benchmarks. Our data, code, and models are available at this https URL.

Title: Factuality on Demand: Controlling the Factuality-Informativeness Trade-off in Text Generation

Authors: Ziwei Gong, Yanda Chen, Julia Hirschberg, Chen Zhao, He He, Zhou Yu, Kathleen Mckeown
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00848
Pdf URL: https://arxiv.org/pdf/2602.00848
Copy Paste: [[2602.00848]] Factuality on Demand: Controlling the Factuality-Informativeness Trade-off in Text Generation(https://arxiv.org/abs/2602.00848)
Keywords: language model, llm
Abstract: Large language models (LLMs) encode knowledge with varying degrees of confidence. When responding to queries, models face an inherent trade-off: they can generate responses that are less informative but highly factual, or more informative but potentially less accurate. Different applications demand different balances between informativeness and factuality. We introduce Factuality-Controlled Generation (FCG), a framework that enables users to specify factuality constraints alongside their queries. We propose to evaluate FCG performance on two dimensions: adherence to factuality constraints and response informativeness. We propose to train models on the FCG task using synthetic data, and show that our synthetic training significantly improves models' ability to both respect factuality requirements and maintain informativeness in their outputs.

Title: Unifying Adversarial Robustness and Training Across Text Scoring Models

Authors: Manveer Singh Tamber, Hosna Oyarhoseini, Jimmy Lin
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.00857
Pdf URL: https://arxiv.org/pdf/2602.00857
Copy Paste: [[2602.00857]] Unifying Adversarial Robustness and Training Across Text Scoring Models(https://arxiv.org/abs/2602.00857)
Keywords: language model, llm
Abstract: Research on adversarial robustness in language models is currently fragmented across applications and attacks, obscuring shared vulnerabilities. In this work, we propose unifying the study of adversarial robustness in text scoring models spanning dense retrievers, rerankers, and reward models. This motivates adapting both attacks and adversarial training methods across model roles. Unlike open-ended generation, text scoring failures are directly testable: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. Using this principled lens of text scoring, we demonstrate that current adversarial training formulations for language models are often short-sighted, failing to effectively generalize across attacks. To address this, we introduce multiple adversarial training methods for text scoring models and show that combining complementary training methods can yield strong robustness while also improving task effectiveness. We also highlight the practical value of our approach for RLHF, showing that our adversarially trained reward models mitigate reward hacking and support the training of better-aligned LLMs. We provide our code and models for further study.
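The success criterion stated in the abstract above (an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one) is easy to express as a check over any scoring function. The sketch below uses a deliberately naive term-count scorer as a stand-in for a retriever, reranker, or reward model; the scorer, the function names, and the example texts are hypothetical and are not taken from the paper.

```python
def term_count_score(query, text):
    """Toy scorer: number of occurrences of query terms in the text."""
    words = text.lower().split()
    return sum(words.count(term) for term in set(query.lower().split()))

def attack_succeeds(score_fn, query, relevant, adversarial):
    """True when the adversarial text strictly outscores the relevant text."""
    return score_fn(query, adversarial) > score_fn(query, relevant)

query = "cheap flights paris"
relevant = "book cheap flights to paris today"
# Keyword-stuffed text that does not answer the query.
adversarial = "cheap cheap flights flights paris paris casino"

result = attack_succeeds(term_count_score, query, relevant, adversarial)
print(result)
```

Because the check depends only on a pair of scores, the same harness applies unchanged across the three model roles named in the abstract, which is what makes a unified robustness study practical.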

Title: ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laypeople

Authors: Shounak Paul, Raghav Dogra, Pawan Goyal, Saptarshi Ghosh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00881
Pdf URL: https://arxiv.org/pdf/2602.00881
Copy Paste: [[2602.00881]] ILSIC: Corpora for Identifying Indian Legal Statutes from Queries by Laypeople(https://arxiv.org/abs/2602.00881)
Keywords: retrieval-augmented generation
Abstract: Legal Statute Identification (LSI) for a given situation is one of the most fundamental tasks in Legal NLP. This task has traditionally been modeled using facts from court judgments as input queries, due to their abundance. However, in practical settings, the input queries are likely to be informal and asked by laypersons, or non-professionals. While a few laypeople LSI datasets exist, there has been little research exploring the differences between court and laypeople data for LSI. In this work, we create ILSIC, a corpus of laypeople queries covering 500+ statutes from Indian law. The corpus also contains court case judgements, enabling researchers to compare court and laypeople data for LSI effectively. We conducted extensive experiments on our corpus, including benchmarking over the laypeople dataset using zero- and few-shot inference, retrieval-augmented generation and supervised fine-tuning. We observe that models trained purely on court judgements are ineffective when tested on laypeople queries, while transfer learning from court to laypeople data can be beneficial in certain scenarios. We also conducted fine-grained analyses of our results in terms of categories of queries and frequency of statutes.

Title: EffGen: Enabling Small Language Models as Capable Autonomous Agents

Authors: Gaurav Srivastava, Aafiya Hussain, Chi Wang, Yingyan Celine Lin, Xuan Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.00887
Pdf URL: https://arxiv.org/pdf/2602.00887
Copy Paste: [[2602.00887]] EffGen: Enabling Small Language Models as Capable Autonomous Agents(https://arxiv.org/abs/2602.00887)
Keywords: language model, gpt, prompt, agent
Abstract: Most existing language model agentic systems today are built and optimized for large language models (e.g., GPT, Claude, Gemini) via API calls. While powerful, this approach faces several limitations including high token costs and privacy concerns for sensitive applications. We introduce effGen, an open-source agentic framework optimized for small language models (SLMs) that enables effective, efficient, and secure local deployment (pip install effgen). effGen makes four major contributions: (1) Enhanced tool-calling with prompt optimization that compresses contexts by 70-80% while preserving task semantics, (2) Intelligent task decomposition that breaks complex queries into parallel or sequential subtasks based on dependencies, (3) Complexity-based routing using five factors to make smart pre-execution decisions, and (4) Unified memory system combining short-term, long-term, and vector-based storage. Additionally, effGen unifies multiple agent protocols (MCP, A2A, ACP) for cross-protocol communication. Results on 13 benchmarks show effGen outperforms LangChain, AutoGen, and Smolagents with higher success rates, faster execution, and lower memory. Our results reveal that prompt optimization and complexity routing have complementary scaling behavior: optimization benefits SLMs more (11.2% gain at 1.5B vs 2.4% at 32B), while routing benefits large models more (3.6% at 1.5B vs 7.9% at 32B), providing consistent gains across all scales when combined. effGen (this https URL) is released under the MIT License, ensuring broad accessibility for research and commercial use. Our framework code is publicly available at this https URL.

Title: Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? When Hard Gating Hurts

Authors: Víctor Yeste, Paolo Rosso
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.00913
Pdf URL: https://arxiv.org/pdf/2602.00913
Copy Paste: [[2602.00913]] Do Schwartz Higher-Order Values Help Sentence-Level Human Value Detection? When Hard Gating Hurts(https://arxiv.org/abs/2602.00913)
Keywords: llm
Abstract: Sentence-level human value detection is typically framed as multi-label classification over Schwartz values, but it remains unclear whether Schwartz higher-order (HO) categories provide usable structure. We study this under a strict compute-frugal budget (single 8 GB GPU) on ValueEval'24 / ValuesML (74K English sentences). We compare (i) direct supervised transformers, (ii) HO→values pipelines that enforce the hierarchy with hard masks, and (iii) Presence→HO→values cascades, alongside low-cost add-ons (lexica, short context, topics), label-wise threshold tuning, small instruction-tuned LLM baselines (≤10B), QLoRA, and simple ensembles. HO categories are learnable from single sentences (e.g., the easiest bipolar pair reaches Macro-F1 ≈ 0.58), but hard hierarchical gating is not a reliable win: it often reduces end-task Macro-F1 via error compounding and recall suppression. In contrast, label-wise threshold tuning is a high-leverage knob (up to +0.05 Macro-F1), and small transformer ensembles provide the most consistent additional gains (up to +0.02 Macro-F1). Small LLMs lag behind supervised encoders as stand-alone systems, yet can contribute complementary errors in cross-family ensembles. Overall, HO structure is useful descriptively, but enforcing it with hard gates hurts sentence-level value detection; robust improvements come from calibration and lightweight ensembling.
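Label-wise threshold tuning, which the abstract above singles out as a high-leverage knob, can be sketched as an independent grid search per label: for each label, pick the decision threshold that maximizes F1 on held-out data instead of using 0.5 everywhere. The grid, the toy probabilities, and the function names below are assumptions for illustration; the paper's actual tuning procedure is not described in the abstract.

```python
def f1(y_true, y_pred):
    """Binary F1; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)

def tune_thresholds(probs, labels, grid):
    """For each label column, return the grid threshold with the best F1."""
    thresholds = []
    for j in range(len(probs[0])):
        col_p = [row[j] for row in probs]
        col_y = [row[j] for row in labels]
        best = max(grid, key=lambda t: f1(col_y, [p >= t for p in col_p]))
        thresholds.append(best)
    return thresholds

# Hypothetical validation set: 4 sentences, 2 value labels.
probs = [[0.9, 0.2], [0.6, 0.4], [0.4, 0.35], [0.1, 0.1]]
labels = [[1, 0], [1, 1], [0, 1], [0, 0]]
thresholds = tune_thresholds(probs, labels, grid=[0.3, 0.5, 0.7])
print(thresholds)
```

In this toy case the second label is systematically under-confident, so a lower threshold recovers its positives; a single global threshold of 0.5 would miss them all.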

Title: Neural FOXP2 -- Language Specific Neuron Steering for Targeted Language Improvement in LLMs

Authors: Anusa Saha, Tanmay Joshi, Vinija Jain, Aman Chadha, Amitava Das
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00945
Pdf URL: https://arxiv.org/pdf/2602.00945
Copy Paste: [[2602.00945]] Neural FOXP2 -- Language Specific Neuron Steering for Targeted Language Improvement in LLMs(https://arxiv.org/abs/2602.00945)
Keywords: llm
Abstract: LLMs are multilingual by training, yet their lingua franca is often English, reflecting English language dominance in pretraining. Other languages remain in parametric memory but are systematically suppressed. We argue that language defaultness is governed by a sparse, low-rank control circuit, language neurons, that can be mechanistically isolated and safely steered. We introduce Neural FOXP2, which makes a chosen language (Hindi or Spanish) primary in a model by steering language-specific neurons. Neural FOXP2 proceeds in three stages: (i) Localize: We train per-layer SAEs so each activation decomposes into a small set of active feature components. For every feature, we quantify English vs. Hindi/Spanish selectivity overall logit-mass lift toward the target-language token set. Tracing the top-ranked features back to their strongest contributing units yields a compact language-neuron set. (ii) Steering directions: We localize controllable language-shift geometry via a spectral low-rank analysis. For each layer, we build English-to-target activation-difference matrices and perform layerwise SVD to extract the dominant singular directions governing language change. The eigengap and effective-rank spectra identify a compact steering subspace and an empirically chosen intervention window (where these directions are strongest and most stable). (iii) Steer: We apply a signed, sparse activation shift targeted to the language neurons. Concretely, within low to mid layers we add a positive steering along the target-language dominant directions and a compensating negative shift toward the null space for the English neurons, yielding controllable target-language defaultness.

Title: Verification Required: The Impact of Information Credibility on AI Persuasion

Authors: Saaduddin Mahmud, Eugene Bagdasarian, Shlomo Zilberstein
Subjects: cs.CL, cs.GT
Abstract URL: https://arxiv.org/abs/2602.00970
Pdf URL: https://arxiv.org/pdf/2602.00970
Copy Paste: [[2602.00970]] Verification Required: The Impact of Information Credibility on AI Persuasion(https://arxiv.org/abs/2602.00970)
Keywords: language model, llm, agent
Abstract: Agents powered by large language models (LLMs) are increasingly deployed in settings where communication shapes high-stakes decisions, making a principled understanding of strategic communication essential. Prior work largely studies either unverifiable cheap-talk or fully verifiable disclosure, failing to capture realistic domains in which information has probabilistic credibility. We introduce MixTalk, a strategic communication game for LLM-to-LLM interaction that models information credibility. In MixTalk, a sender agent strategically combines verifiable and unverifiable claims to communicate private information, while a receiver agent allocates a limited budget to costly verification and infers the underlying state from prior beliefs, claims, and verification outcomes. We evaluate state-of-the-art LLM agents in large-scale tournaments across three realistic deployment settings, revealing their strengths and limitations in reasoning about information credibility and the explicit behavior that shapes these interactions. Finally, we propose Tournament Oracle Policy Distillation (TOPD), an offline method that distills tournament oracle policy from interaction logs and deploys it in-context at inference time. Our results show that TOPD significantly improves receiver robustness to persuasion.
摘要：由大型语言模型 (LLM) 支持的代理越来越多地部署在通信影响高风险决策的环境中，因此对战略通信的原则性理解至关重要。先前的工作主要研究无法验证的廉价言论或完全可验证的披露，未能捕获信息具有概率可信度的现实领域。我们推出 MixTalk，这是一款用于 LLM 与 LLM 互动的战略沟通游戏，可对信息可信度进行建模。在 MixTalk 中，发送方代理策略性地结合可验证和不可验证的声明来传达私人信息，而接收方代理将有限的预算分配给昂贵的验证，并根据先前的信念、声明和验证结果推断潜在状态。我们在三种实际部署环境的大型锦标赛中评估最先进的 LLM 代理，揭示它们在推理信息可信度以及塑造这些交互的显式行为方面的优势和局限性。最后，我们提出了锦标赛预言机策略蒸馏（TOPD），这是一种离线方法，可以从交互日志中提取锦标赛预言机策略，并在推理时将其部署在上下文中。我们的结果表明，TOPD 显着提高了接收者的说服鲁棒性。

Title: Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals

Authors: Pengyue Yang, Jiawen Wen, Haolin Jin, Linghan Huang, Huaming Chen, Ling Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.00977
Pdf URL: https://arxiv.org/pdf/2602.00977
Copy Paste: [[2602.00977]] Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals(https://arxiv.org/abs/2602.00977)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) are increasingly deployed in domains where errors carry high social, scientific, or safety costs. Yet standard confidence estimators, such as token likelihood, semantic similarity and multi-sample consistency, remain brittle under distribution shift, domain-specialised text, and compute limits. In this work, we present Structural Confidence, a single-pass, model-agnostic framework that enhances output correctness prediction based on multi-scale structural signals derived from a model's final-layer hidden-state trajectory. By combining spectral, local-variation, and global shape descriptors, our method captures internal stability patterns that are missed by probabilities and sentence embeddings. We conduct extensive, cross-domain evaluation across four heterogeneous benchmarks: FEVER (fact verification), SciFact (scientific claims), WikiBio-hallucination (biographical consistency), and TruthfulQA (truthfulness-oriented QA). Our Structural Confidence framework demonstrates strong performance compared with established baselines in terms of AUROC and AUPR. More importantly, unlike sampling-based consistency methods which require multiple stochastic generations and an auxiliary model, our approach uses a single deterministic forward pass, offering a practical basis for efficient, robust post-hoc confidence estimation in socially impactful, resource-constrained LLM applications.
摘要：大型语言模型 (LLM) 越来越多地部署在错误带来较高社会、科学或安全成本的领域。然而，标准置信度估计器（例如标记似然、语义相似性和多样本一致性）在分布偏移、领域专用文本和计算限制下仍然很脆弱。在这项工作中，我们提出了结构置信度（Structural Confidence），这是一种单通道、模型无关的框架，它基于从模型的最终层隐藏状态轨迹导出的多尺度结构信号来增强输出正确性预测。通过结合光谱、局部变化和全局形状描述符，我们的方法捕获了概率和句子嵌入遗漏的内部稳定性模式。我们对四个异构基准进行广泛的跨领域评估：FEVER（事实验证）、SciFact（科学主张）、WikiBio-hallucination（传记一致性）和 TruthfulQA（面向真实的 QA）。与 AUROC 和 AUPR 方面的既定基线相比，我们的结构置信度框架表现出强劲的性能。更重要的是，与需要多个随机生成和辅助模型的基于采样的一致性方法不同，我们的方法使用单个确定性前向传递，为在具有社会影响力、资源有限的 LLM 应用中高效、稳健的事后置信度估计提供了实用基础。

Title: MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA

Authors: Yutong Song, Shiva Shrestha, Chenhan Lyu, Elahe Khatibi, Pengfei Zhang, Honghui Xu, Nikil Dutt, Amir Rahmani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00981
Pdf URL: https://arxiv.org/pdf/2602.00981
Copy Paste: [[2602.00981]] MedSpeak: A Knowledge Graph-Aided ASR Error Correction Framework for Spoken Medical QA(https://arxiv.org/abs/2602.00981)
Keywords: llm
Abstract: Spoken question-answering (SQA) systems relying on automatic speech recognition (ASR) often struggle with accurately recognizing medical terminology. To this end, we propose MedSpeak, a novel knowledge graph-aided ASR error correction framework that refines noisy transcripts and improves downstream answer prediction by leveraging both semantic relationships and phonetic information encoded in a medical knowledge graph, together with the reasoning power of LLMs. Comprehensive experimental results on benchmarks demonstrate that MedSpeak significantly improves the accuracy of medical term recognition and overall medical SQA performance, establishing MedSpeak as a state-of-the-art solution for medical SQA. The code is available at this https URL.
摘要：依赖自动语音识别 (ASR) 的语音问答 (SQA) 系统通常难以准确识别医学术语。为此，我们提出了 MedSpeak，一种新颖的知识图辅助 ASR 纠错框架，它通过利用医学知识图中编码的语义关系和语音信息以及法学硕士的推理能力来细化噪声转录本并改进下游答案预测。基准测试的综合实验结果表明，MedSpeak 显着提高了医学术语识别的准确性和整体医学 SQA 性能，使 MedSpeak 成为最先进的医学 SQA 解决方案。该代码可从此 https URL 获取。

Title: DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning

Authors: Batuhan K. Karaman, Aditya Rawal, Suhaila Shakiah, Mohammad Ghavamzadeh, Mingyi Hong, Arijit Biswas, Ruida Zhou
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.00983
Pdf URL: https://arxiv.org/pdf/2602.00983
Copy Paste: [[2602.00983]] DISPO: Enhancing Training Efficiency and Stability in Reinforcement Learning for Large Language Model Mathematical Reasoning(https://arxiv.org/abs/2602.00983)
Keywords: language model
Abstract: Reinforcement learning with verifiable rewards has emerged as a promising paradigm for enhancing the reasoning capabilities of large language models, particularly in mathematics. Current approaches in this domain present a clear trade-off: PPO-style methods (e.g., GRPO/DAPO) offer training stability but exhibit slow learning trajectories due to their trust-region constraints on policy updates, while REINFORCE-style approaches (e.g., CISPO) demonstrate improved learning efficiency but suffer from performance instability as they clip importance sampling weights while still permitting non-zero gradients outside the trust-region. To address these limitations, we introduce DISPO, a simple yet effective REINFORCE-style algorithm that decouples the up-clipping and down-clipping of importance sampling weights for correct and incorrect responses, yielding four controllable policy update regimes. Through targeted ablations, we uncover how each regime impacts training: for correct responses, weights >1 increase the average token entropy (i.e., exploration) while weights <1 decrease it (i.e., distillation); both are beneficial but cause gradual performance degradation when excessive. For incorrect responses, overly restrictive clipping triggers sudden performance collapse through repetitive outputs (when weights >1) or vanishing response lengths (when weights <1). By separately tuning these four clipping parameters, DISPO maintains the exploration-distillation balance while preventing catastrophic failures, achieving 61.04% on AIME'24 (vs. 55.42% CISPO and 50.21% DAPO) with similar gains across various benchmarks and models.
摘要：具有可验证奖励的强化学习已成为增强大型语言模型（尤其是数学模型）推理能力的有前途的范例。该领域当前的方法呈现出明显的权衡：PPO 式方法（例如 GRPO/DAPO）提供训练稳定性，但由于其信任域对策略更新的限制而表现出缓慢的学习轨迹，而 REINFORCE 式方法（例如 CISPO）表现出提高的学习效率，但由于它们修剪重要性采样权重而仍然允许信任域外的非零梯度而受到性能不稳定的影响。为了解决这些限制，我们引入了 DISPO，这是一种简单而有效的 REINFORCE 式算法，可以解耦正确和错误响应的重要性采样权重的上剪裁和下剪裁，从而产生四种可控的策略更新机制。通过有针对性的消融，我们揭示了每种机制如何影响训练：对于正确的响应，权重 >1 会增加平均令牌熵（即探索），而权重 <1 会减少平均令牌熵（即蒸馏）——两者都是有益的，但过度时会导致性能逐渐下降。对于不正确的响应，过度限制性的裁剪会通过重复输出（当权重 >1 时）或消失的响应长度（当权重 <1 时）触发性能突然崩溃。通过分别调整这四个裁剪参数，DISPO 保持了探索-蒸馏平衡，同时防止了灾难性故障，在 AIME'24 上实现了 61.04%（相对于 55.42% CISPO 和 50.21% DAPO），在各种基准和模型上都有相似的收益。
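
The decoupled clipping described above can be sketched as follows. This is only an illustration of the idea of four independent bounds: the names `ClipBounds` and `clip_weight` and every numeric value are hypothetical, and the abstract does not say how gradients are treated outside the clipping range, so that part is left out.

```python
from dataclasses import dataclass


@dataclass
class ClipBounds:
    # Hypothetical values; the abstract does not report the tuned settings.
    correct_low: float = 0.8     # down-clip for correct responses
    correct_high: float = 1.3    # up-clip for correct responses
    incorrect_low: float = 0.5   # down-clip for incorrect responses
    incorrect_high: float = 1.1  # up-clip for incorrect responses


def clip_weight(weight: float, is_correct: bool, bounds: ClipBounds) -> float:
    """Clip an importance-sampling weight with bounds chosen by correctness."""
    if is_correct:
        low, high = bounds.correct_low, bounds.correct_high
    else:
        low, high = bounds.incorrect_low, bounds.incorrect_high
    return min(max(weight, low), high)


bounds = ClipBounds()
# The same raw weight is clipped differently depending on response correctness.
clipped_correct = clip_weight(2.0, True, bounds)
clipped_incorrect = clip_weight(2.0, False, bounds)
```

Each of the four fields corresponds to one of the four regimes named in the abstract (weights above or below 1, for correct or incorrect responses), so each can be tuned without moving the others.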

Title: Sparse Reward Subsystem in Large Language Models

Authors: Guowei Xu, Mert Yuksekgonul, James Zou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00986
Pdf URL: https://arxiv.org/pdf/2602.00986
Copy Paste: [[2602.00986]] Sparse Reward Subsystem in Large Language Models(https://arxiv.org/abs/2602.00986)
Keywords: language model, llm
Abstract: In this paper, we identify a sparse reward subsystem within the hidden states of Large Language Models (LLMs), drawing an analogy to the biological reward subsystem in the human brain. We demonstrate that this subsystem contains value neurons that represent the model's internal expectation of state value, and through intervention experiments, we establish the importance of these neurons for reasoning. Our experiments reveal that these value neurons are robust across diverse datasets, model scales, and architectures; furthermore, they exhibit significant transferability across different datasets and models fine-tuned from the same base model. By examining cases where value predictions and actual rewards diverge, we identify dopamine neurons within the reward subsystem which encode reward prediction errors (RPE). These neurons exhibit high activation when the reward is higher than expected and low activation when the reward is lower than expected.
摘要：在本文中，我们在大型语言模型（LLM）的隐藏状态中识别出一个稀疏奖励子系统，与人脑中的生物奖励子系统进行类比。我们证明该子系统包含代表模型对状态值的内部期望的值神经元，并通过干预实验，我们确定了这些神经元对于推理的重要性。我们的实验表明，这些价值神经元在不同的数据集、模型规模和架构中都很稳健；此外，它们在不同数据集和从同一基本模型微调的模型之间表现出显着的可转移性。通过检查价值预测和实际奖励出现分歧的情况，我们识别了奖励子系统中编码奖励预测错误（RPE）的多巴胺神经元。当奖励高于预期时，这些神经元表现出高激活；当奖励低于预期时，这些神经元表现出低激活。

Title: DeALOG: Decentralized Multi-Agents Log-Mediated Reasoning Framework

Authors: Abhijit Chakraborty, Ashish Raj Shekhar, Shiven Agarwal, Vivek Gupta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.00996
Pdf URL: https://arxiv.org/pdf/2602.00996
Copy Paste: [[2602.00996]] DeALOG: Decentralized Multi-Agents Log-Mediated Reasoning Framework(https://arxiv.org/abs/2602.00996)
Keywords: agent
Abstract: Complex question answering across text, tables and images requires integrating diverse information sources. A framework supporting specialized processing with coordination and interpretability is needed. We introduce DeALOG, a decentralized multi-agent framework for multimodal question answering. It uses specialized agents (Table, Context, Visual, Summarizing, and Verification) that communicate through a shared natural-language log serving as persistent memory. This log-based approach enables collaborative error detection and verification without central control, improving robustness. Evaluations on FinQA, TAT-QA, CRT-QA, WikiTableQuestions, FeTaQA, and MultiModalQA show competitive performance. Analysis confirms the importance of the shared log, agent specialization, and verification for accuracy. DeALOG provides a scalable approach through modular components using natural-language communication.
摘要：跨文本、表格和图像的复杂问题回答需要整合不同的信息源。需要一个支持具有协调性和可解释性的专门处理的框架。我们介绍 DeALOG，一个用于多模式问答的去中心化多代理框架。它使用专门的代理：表格、上下文、视觉、总结和验证，通过作为持久内存的共享自然语言日志进行通信。这种基于日志的方法无需中央控制即可实现协作式错误检测和验证，从而提高了稳健性。对 FinQA、TAT-QA、CRT-QA、WikiTableQuestions、FeTaQA 和 MultiModalQA 的评估显示出具有竞争力的性能。分析证实了共享日志、代理专业化和准确性验证的重要性。 DeALOG 通过使用自然语言通信的模块化组件提供了一种可扩展的方法。

Title: Reliable Use of Lemmas via Eligibility Reasoning and Section-Aware Reinforcement Learning

Authors: Zhikun Xu, Xiaodong Yu, Ben Zhou, Jiang Liu, Jialian Wu, Ze Wang, Ximeng Sun, Hao Chen, Zicheng Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.00998
Pdf URL: https://arxiv.org/pdf/2602.00998
Copy Paste: [[2602.00998]] Reliable Use of Lemmas via Eligibility Reasoning and Section-Aware Reinforcement Learning(https://arxiv.org/abs/2602.00998)
Keywords: language model, llm
Abstract: Recent large language models (LLMs) perform strongly on mathematical benchmarks yet often misapply lemmas, importing conclusions without validating assumptions. We formalize lemma-judging as a structured prediction task: given a statement and a candidate lemma, the model must output a precondition check and a conclusion-utility check, from which a usefulness decision is derived. We present RULES, which encodes this specification via a two-section output and trains with reinforcement learning plus section-aware loss masking to assign penalty to the section responsible for errors. Training and evaluation draw on diverse natural language and formal proof corpora; robustness is assessed with a held-out perturbation suite; and end-to-end evaluation spans competition-style, perturbation-aligned, and theorem-based problems across various LLMs. Results show consistent in-domain gains over both a vanilla model and a single-label RL baseline, larger improvements on applicability-breaking perturbations, and parity or modest gains on end-to-end tasks; ablations indicate that the two-section outputs and section-aware reinforcement are both necessary for robustness.
摘要：最近的大型语言模型（LLM）在数学基准上表现强劲，但经常误用引理，在没有验证假设的情况下导入结论。我们将引理$-$判断形式化为结构化预测任务：给定一个陈述和候选引理，模型必须输出前提条件检查和结论$-$效用检查，从中得出有用性决策。我们提出了规则，它通过两个 $-$section 输出对这个规范进行编码，并使用强化学习加上section$-$aware 损失掩码进行训练，以将惩罚分配给负责错误的部分。训练和评估利用不同的自然语言和形式证明语料库；鲁棒性通过持有$-$out扰动套件进行评估； end$-$to$-$end 评估涵盖各种法学硕士的竞争$-$风格、扰动$-$对齐以及基于定理$-$的问题。结果显示，相对于普通模型和单 $-$ 标签 RL 基线，在 $-$ 域中获得一致的收益，在适用性 $-$ 破坏性扰动方面有更大的改进，并且在最终 $-$ 到 $-$end 任务上获得同等或适度的收益；消融表明，两个$-$section 输出和section$-$aware 强化对于鲁棒性都是必要的。

Title: Distilling Token-Trained Models into Byte-Level Models

Authors: Zishuo Bao, Jiaqi Leng, Junxiong Wang, Bowen Peng, Yucheng Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01007
Pdf URL: https://arxiv.org/pdf/2602.01007
Copy Paste: [[2602.01007]] Distilling Token-Trained Models into Byte-Level Models(https://arxiv.org/abs/2602.01007)
Keywords: language model, llm
Abstract: Byte Language Models (BLMs) have emerged as a promising direction for scaling language models beyond tokenization. However, existing BLMs typically require training from scratch on trillions of bytes, making them prohibitively expensive. In this paper, we propose an efficient distillation recipe that converts existing token-trained LLMs into BLMs while retaining comparable capabilities. Our recipe follows a two-stage curriculum: (1) Progressive Knowledge Distillation, which aligns byte-level representations with the embeddings of the token-trained teacher model; and (2) Byte-Level Supervised Fine-Tuning, which enables end-to-end generation entirely in the byte space. We validate our approach across multiple model families, including Llama, Qwen, and OLMo, and demonstrate that the distilled BLMs retain most of the teacher models' performance using only approximately 125B bytes.
摘要：字节语言模型 (BLM) 已成为扩展语言模型超越标记化的一个有前途的方向。然而，现有的 BLM 通常需要从头开始进行数万亿字节的训练，这使得它们的成本过高。在本文中，我们提出了一种有效的蒸馏方法，将现有的代币训练的 LLM 转换为 BLM，同时保留可比较的功能。我们的方案遵循两个阶段的课程：（1）渐进式知识蒸馏，它将字节级表示与经过令牌训练的教师模型的嵌入对齐； (2) 字节级监督微调，完全在字节空间中实现端到端生成。我们在多个模型系列（包括 Llama、Qwen 和 OLMo）中验证了我们的方法，并证明经过提炼的 BLM 仅使用大约 125B 字节就保留了大部分教师模型的性能。

Title: Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident

Authors: Conrad Borchers, Jill-Jênn Vie, Roger Azevedo
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2602.01015
Pdf URL: https://arxiv.org/pdf/2602.01015
Copy Paste: [[2602.01015]] Large Language Models as Students Who Think Aloud: Overly Coherent, Verbose, and Confident(https://arxiv.org/abs/2602.01015)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are increasingly embedded in AI-based tutoring systems. Can they faithfully model novice reasoning and metacognitive judgments? Existing evaluations emphasize problem-solving accuracy, overlooking the fragmented and imperfect reasoning that characterizes human learning. We evaluate LLMs as novices using 630 think-aloud utterances from multi-step chemistry tutoring problems with problem-solving logs of student hint use, attempts, and problem context. We compare LLM-generated reasoning to human learner utterances under minimal and extended contextual prompting, and assess the models' ability to predict step-level learner success. Although GPT-4.1 generates fluent and contextually appropriate continuations, its reasoning is systematically over-coherent, verbose, and less variable than human think-alouds. These effects intensify with a richer problem-solving context during prompting. Learner performance was consistently overestimated. These findings highlight epistemic limitations of simulating learning with LLMs. We attribute these limitations to LLM training data, including expert-like solutions devoid of expressions of affect and working memory constraints during problem solving. Our evaluation framework can guide future design of adaptive systems that more faithfully support novice learning and self-regulation using generative artificial intelligence.
摘要：大型语言模型 (LLM) 越来越多地嵌入基于人工智能的辅导系统中。他们能否忠实地模拟新手推理和元认知判断？现有的评估强调解决问题的准确性，忽视了人类学习特征的碎片化和不完善的推理。我们使用来自多步骤化学辅导问题的 630 个有声思考话语以及学生提示使用、尝试和问题背景的问题解决日志来评估法学硕士作为新手。我们将法学硕士生成的推理与人类学习者在最小和扩展的上下文提示下的话语进行比较，并评估模型预测阶梯学习者成功的能力。尽管 GPT-4.1 生成流畅且适合上下文的延续，但其推理在系统上过于连贯、冗长，并且比人类有声思考的变量更少。在提示过程中，随着解决问题的背景变得更加丰富，这些效果会增强。学习者的表现始终被高估。这些发现凸显了法学硕士模拟学习的认知局限性。我们将这些限制归因于法学硕士培训数据，包括在问题解决过程中缺乏情感表达和工作记忆限制的专家式解决方案。我们的评估框架可以指导自适应系统的未来设计，更忠实地支持使用生成人工智能的新手学习和自我调节。

Title: Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations

Authors: Sheng-Lun Wei, Yu-Ling Liao, Yen-Hua Chang, Hen-Hsen Huang, Hsin-Hsi Chen
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2602.01030
Pdf URL: https://arxiv.org/pdf/2602.01030
Copy Paste: [[2602.01030]] Bias in the Ear of the Listener: Assessing Sensitivity in Audio Language Models Across Linguistic, Demographic, and Positional Variations(https://arxiv.org/abs/2602.01030)
Keywords: language model, llm
Abstract: This work presents the first systematic investigation of speech bias in multilingual MLLMs. We construct and release the BiasInEar dataset, a speech-augmented benchmark based on Global MMLU Lite, spanning English, Chinese, and Korean, balanced by gender and accent, and totaling 70.8 hours (≈4,249 minutes) of speech with 11,200 questions. Using four complementary metrics (accuracy, entropy, APES, and Fleiss' κ), we evaluate nine representative models under linguistic (language and accent), demographic (gender), and structural (option order) perturbations. Our findings reveal that MLLMs are relatively robust to demographic factors but highly sensitive to language and option order, suggesting that speech can amplify existing structural biases. Moreover, architectural design and reasoning strategy substantially affect robustness across languages. Overall, this study establishes a unified framework for assessing fairness and robustness in speech-integrated LLMs, bridging the gap between text- and speech-based evaluation. The resources can be found at this https URL.
摘要：这项工作首次对多语言 MLLM 中的言语偏见进行系统研究。我们构建并发布了 BiasInEar 数据集，这是一个基于 Global MMLU Lite 的语音增强基准，涵盖英语、中文和韩语，按性别和口音进行平衡，总计 70.8 小时（约 4,249 分钟）的语音，包含 11,200 个问题。使用四个互补指标（准确性、熵、APES 和 Fleiss'$\kappa$），我们评估了语言（语言和口音）、人口统计（性别）和结构（选项顺序）扰动下的九个代表性模型。我们的研究结果表明，MLLM 对人口因素相对稳健，但对语言和选项顺序高度敏感，这表明言论可以放大现有的结构性偏见。此外，架构设计和推理策略极大地影响跨语言的鲁棒性。总体而言，本研究建立了一个统一的框架来评估语音集成法学硕士的公平性和鲁棒性，弥合了基于文本和基于语音的评估之间的差距。可以在此 https URL 找到资源。
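
Of the four metrics listed above, Fleiss' kappa has a standard closed form, sketched below. How the paper maps model outputs onto "raters" is not stated in the abstract; treating each perturbed variant of a question as one rater is an assumption made here, and the function name `fleiss_kappa` and the toy counts are illustrative.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[i][j] = number of raters that put
    item i in category j. Every item must have the same number of raters."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(counts[0])
    # Observed agreement, averaged over items.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items
    # Chance agreement from the overall category proportions.
    total = n_items * n_raters
    p_e = sum(
        (sum(row[j] for row in counts) / total) ** 2 for j in range(n_categories)
    )
    if p_e == 1.0:
        return float("nan")  # undefined when every rating falls in one category
    return (p_bar - p_e) / (1.0 - p_e)


# Two questions, three variants each (assumed setup), two answer options.
kappa_perfect = fleiss_kappa([[3, 0], [0, 3]])
kappa_mixed = fleiss_kappa([[2, 1], [1, 2]])
```

A value of 1 means the answers never change across variants; values near or below 0 mean agreement no better than chance.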

Title: Personality Expression Across Contexts: Linguistic and Behavioral Variation in LLM Agents

Authors: Bin Han, Deuksin Kwon, Jonathan Gratch
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01063
Pdf URL: https://arxiv.org/pdf/2602.01063
Copy Paste: [[2602.01063]] Personality Expression Across Contexts: Linguistic and Behavioral Variation in LLM Agents(https://arxiv.org/abs/2602.01063)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) can be conditioned with explicit personality prompts, yet their behavioral realization often varies depending on context. This study examines how identical personality prompts lead to distinct linguistic, behavioral, and emotional outcomes across four conversational settings: ice-breaking, negotiation, group decision, and empathy tasks. Results show that contextual cues systematically influence both personality expression and emotional tone, suggesting that the same traits are expressed differently depending on social and affective demands. This raises an important question for LLM-based dialogue agents: whether such variations reflect inconsistency or context-sensitive adaptation akin to human behavior. Viewed through the lens of Whole Trait Theory, these findings highlight that LLMs exhibit context-sensitive rather than fixed personality expression, adapting flexibly to social interaction goals and affective conditions.
摘要：大型语言模型（LLM）可以通过明确的个性提示来调节，但它们的行为实现通常会根据上下文而变化。这项研究探讨了相同的人格提示如何在四种对话环境中导致不同的语言、行为和情感结果：破冰、谈判、群体决策和同理心任务。结果表明，情境线索系统地影响个性表达和情绪基调，这表明相同的特征根据社会和情感需求的不同而表达不同。这给基于法学硕士的对话代理提出了一个重要问题：这种变化是否反映了类似于人类行为的不一致或上下文敏感的适应。从整体特质理论的角度来看，这些发现强调法学硕士表现出情境敏感而非固定的个性表达，灵活地适应社会互动目标和情感条件。

Title: Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs

Authors: Ruihan Jin, Pengpeng Shao, Zhengqi Wen, Jinyang Wu, Mingkuan Feng, Shuo Yang, Chu Yuan Zhang, Jianhua Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01064
Pdf URL: https://arxiv.org/pdf/2602.01064
Copy Paste: [[2602.01064]] Exploring Knowledge Purification in Multi-Teacher Knowledge Distillation for LLMs(https://arxiv.org/abs/2602.01064)
Keywords: language model, llm
Abstract: Knowledge distillation has emerged as a pivotal technique for transferring knowledge from stronger large language models (LLMs) to smaller, more efficient models. However, traditional distillation approaches face challenges related to knowledge conflicts and high resource demands, particularly when leveraging multiple teacher models. In this paper, we introduce the concept of Knowledge Purification, which consolidates the rationales from multiple teacher LLMs into a single rationale, thereby mitigating conflicts and enhancing efficiency. To investigate the effectiveness of knowledge purification, we further propose five purification methods from various perspectives. Our experiments demonstrate that these methods not only improve the performance of the distilled model but also effectively alleviate knowledge conflicts. Moreover, router-based methods exhibit robust generalization capabilities, underscoring the potential of innovative purification techniques in optimizing multi-teacher distillation and facilitating the practical deployment of powerful yet lightweight models.
摘要：知识蒸馏已成为将知识从更强大的大型语言模型 (LLM) 转移到更小、更高效的模型的关键技术。然而，传统的蒸馏方法面临着与知识冲突和高资源需求相关的挑战，特别是在利用多个教师模型时。在本文中，我们引入了\textbf{知识纯化}的概念，它将多个法学硕士教师的基本原理合并为一个基本原理，从而减轻冲突并提高效率。为了研究知识净化的有效性，我们从不同角度进一步提出了五种净化方法。我们的实验表明，这些方法不仅提高了蒸馏模型的性能，而且有效缓解了知识冲突。此外，基于路由器的方法表现出强大的泛化能力，强调了创新纯化技术在优化多教师蒸馏和促进强大而轻量级模型的实际部署方面的潜力。

Title: From Utterance to Vividity: Training Expressive Subtitle Translation LLM via Adaptive Local Preference Optimization

Authors: Chaoqun Cui, Shijing Wang, Liangbin Huang, Qingqing Gu, Zhaolong Huang, Xiao Zeng, Wenji Mao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01068
Pdf URL: https://arxiv.org/pdf/2602.01068
Copy Paste: [[2602.01068]] From Utterance to Vividity: Training Expressive Subtitle Translation LLM via Adaptive Local Preference Optimization(https://arxiv.org/abs/2602.01068)
Keywords: language model, llm
Abstract: The rapid development of Large Language Models (LLMs) has significantly enhanced the general capabilities of machine translation. However, as application scenarios become more complex, the limitations of LLMs in vertical domain translations are gradually becoming apparent. In this study, we focus on how to construct translation LLMs that meet the needs of domain customization. We take visual media subtitle translation as our topic and explore how to train expressive and vivid translation LLMs. We investigate literal and liberal translation in subtitle translation and other domains, verifying the reliability of LLMs as reward models and evaluators for translation. Additionally, to train an expressive translation LLM, we construct and release a multidirectional subtitle parallel corpus dataset and propose the Adaptive Local Preference Optimization (ALPO) method to address fine-grained preference alignment. Experimental results demonstrate that ALPO achieves outstanding performance in multidimensional evaluation of translation quality.
摘要：大型语言模型（LLM）的快速发展显着增强了机器翻译的通用能力。然而，随着应用场景变得更加复杂，LLM在垂直领域翻译方面的局限性逐渐显现。在本研究中，我们关注如何构建满足领域定制需求的翻译法学硕士。我们以视觉媒体字幕翻译为主题，探讨如何培养富有表现力和生动性的翻译法学硕士。我们调查了字幕翻译以及直译和自由翻译的其他领域的情况，验证了LLM作为翻译奖励模型和评估器的可靠性。此外，为了训练表达性翻译法学硕士，我们构建并发布了多向字幕并行语料库数据集，并提出了自适应局部偏好优化（ALPO）方法来解决细粒度偏好对齐问题。实验结果表明，ALPO在翻译质量的多维度评价中取得了优异的表现。

Title: What If We Allocate Test-Time Compute Adaptively?

Authors: Ahsan Bilal, Ahmed Mohsin, Muhammad Umer, Ali Subhan, Hassan Rizwan, Ayesha Mohsin, Dean Hougen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01070
Pdf URL: https://arxiv.org/pdf/2602.01070
Copy Paste: [[2602.01070]] What If We Allocate Test-Time Compute Adaptively?(https://arxiv.org/abs/2602.01070)
Keywords: agent
Abstract: Test-time compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. In contrast, we propose a verifier-guided adaptive framework treating reasoning as iterative trajectory generation and selection. For each problem, the agent runs multiple inference iterations. In each iteration, it optionally produces a high-level plan, selects a set of reasoning tools and a compute strategy together with an exploration parameter, and then generates a candidate reasoning trajectory. A process reward model (PRM) serves as a unified control signal: within each iteration, step-level PRM scores are aggregated to guide pruning and expansion during generation, and across iterations, aggregated trajectory rewards are used to select the final response. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench. We characterize efficiency using theoretical FLOPs and a compute intensity metric penalizing wasted generation and tool overhead, demonstrating that verification-guided allocation concentrates computation on high-utility reasoning paths.
摘要：测试时计算扩展统一分配推理计算，使用固定采样策略，并且仅对重新排名应用验证。相比之下，我们提出了一种验证者引导的自适应框架，将推理视为迭代轨迹生成和选择。对于每个问题，代理都会运行多次推理迭代。在每次迭代中，它选择性地生成一个高级计划，选择一组推理工具和计算策略以及探索参数，然后生成候选推理轨迹。过程奖励模型（PRM）充当统一的控制信号：在每次迭代中，步骤级PRM分数被聚合以指导生成期间的修剪和扩展，并且在迭代期间，聚合的轨迹奖励用于选择最终响应。在整个数据集中，我们的动态 PRM 引导方法始终优于直接测试时间缩放，在 MATH-500 上产生了巨大的收益，并在 AIME24 和 AMO-Bench 等更难的基准测试上实现了数倍的改进。我们使用理论 FLOP 和惩罚浪费的生成和工具开销的计算强度指标来表征效率，证明验证引导的分配将计算集中在高实用性推理路径上。
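
The cross-iteration selection step described above can be sketched in a few lines. The abstract does not say how step-level PRM scores are aggregated, so the `min` and `mean` options below are assumptions, as are the function names and the toy scores.

```python
from statistics import mean


def aggregate(step_scores, how="min"):
    """Collapse step-level PRM scores into one trajectory reward.

    'min' punishes a single weak step; 'mean' averages over all steps.
    Both are assumed choices, not taken from the paper.
    """
    if how == "min":
        return min(step_scores)
    if how == "mean":
        return mean(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def select_response(candidates, how="min"):
    """candidates: list of (answer, step_scores); return the best answer."""
    return max(candidates, key=lambda c: aggregate(c[1], how))[0]


# One candidate trajectory per inference iteration (toy PRM scores).
candidates = [
    ("A", [0.9, 0.4, 0.9]),  # strong overall, one shaky step
    ("B", [0.7, 0.6, 0.7]),  # uniformly decent
]
best_by_min = select_response(candidates, how="min")
best_by_mean = select_response(candidates, how="mean")
```

The two aggregations disagree on this input, which shows why the choice of aggregation matters for which trajectory becomes the final response.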

Title: Logic-Oriented Retriever Enhancement via Contrastive Learning

Authors: Wenxuan Zhang, Yuan-Hao Jiang, Changyong Qi, Rui Jia, Yonghe Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01116
Pdf URL: https://arxiv.org/pdf/2602.01116
Copy Paste: [[2602.01116]] Logic-Oriented Retriever Enhancement via Contrastive Learning(https://arxiv.org/abs/2602.01116)
Keywords: language model, llm
Abstract: Large language models (LLMs) struggle in knowledge-intensive tasks, as retrievers often overfit to surface similarity and fail on queries involving complex logical relations. The capacity for logical analysis is inherent in model representations but remains underutilized in standard training. LORE (Logic ORiented Retriever Enhancement) introduces fine-grained contrastive learning to activate this latent capacity, guiding embeddings toward evidence aligned with logical structure rather than shallow similarity. LORE requires no external supervision, resources, or pre-retrieval analysis, remains index-compatible, and consistently improves retrieval utility and downstream generation while maintaining efficiency. The datasets and code are publicly available at this https URL.
摘要：大型语言模型（LLM）在知识密集型任务中举步维艰，因为检索器经常过度适应表面相似性，并且在涉及复杂逻辑关系的查询上失败。逻辑分析能力是模型表示所固有的，但在标准训练中仍未得到充分利用。 LORE（面向逻辑的检索器增强）引入了细粒度的对比学习来激活这种潜在能力，引导嵌入与逻辑结构而不是浅层相似性相一致的证据。 LORE 不需要外部监督、资源或检索前分析，保持索引兼容，并在保持效率的同时持续改进检索实用性和下游生成。数据集和代码可通过此 https URL 公开获取。
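
The abstract does not specify LORE's loss. As background only, a generic InfoNCE-style contrastive loss for retriever embeddings is sketched below; it is a common formulation and not necessarily the authors' objective. The function name `info_nce`, the temperature value, and the two-dimensional toy vectors are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def info_nce(query, positive, negatives, temperature=0.05):
    """Negative log-probability of the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


query = [1.0, 0.0]
# Positive aligned with the query and negative orthogonal: low loss.
loss_good = info_nce(query, [1.0, 0.0], [[0.0, 1.0]])
# Roles swapped: the retriever prefers the wrong passage, so loss is high.
loss_bad = info_nce(query, [0.0, 1.0], [[1.0, 0.0]])
```

In a logic-oriented setting, the negatives would presumably be passages that look similar on the surface but do not match the query's logical structure; that choice of negatives is an assumption here.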

Title: Tendem: A Hybrid AI+Human Platform

Authors: Konstantin Chernyshev, Ekaterina Artemova, Viacheslav Zhukov, Maksim Nerush, Mariia Fedorova, Iryna Repik, Olga Shapovalova, Aleksey Sukhorosov, Vladimir Dobrovolskii, Natalia Mikhailova, Sergei Tilga
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01119
Pdf URL: https://arxiv.org/pdf/2602.01119
Copy Paste: [[2602.01119]] Tendem: A Hybrid AI+Human Platform(https://arxiv.org/abs/2602.01119)
Keywords: agent
Abstract: Tendem is a hybrid system where AI handles structured, repeatable work and Human Experts step in when the models fail or to verify results. Each result undergoes a comprehensive quality review before delivery to the Client. To assess Tendem's performance, we conducted a series of in-house evaluations on 94 real-world tasks, comparing it with AI-only agents and human-only workflows carried out by Upwork freelancers. The results show that Tendem consistently delivers higher-quality outputs with faster turnaround times. At the same time, its operational costs remain comparable to human-only execution. On third-party agentic benchmarks, Tendem's AI Agent (operating autonomously, without human involvement) performs near state-of-the-art on web browsing and tool-use tasks while demonstrating strong results in frontier domain knowledge and reasoning.
摘要：Tendem 是一个混合系统，其中人工智能处理结构化、可重复的工作，当模型失败或验证结果时，人类专家会介入。每个结果在交付给客户之前都会经过全面的质量审查。为了评估 Tendem 的性能，我们对 94 项实际任务进行了一系列内部评估，将其与 Upwork 自由职业者执行的纯人工智能代理和纯人类工作流程进行了比较。结果表明，Tendem 始终能够以更快的周转时间提供更高质量的输出。与此同时，其运营成本仍然与纯人力执行相当。在第三方代理基准测试中，Tendem 的 AI 代理（自主运行，无需人工参与）在网页浏览和工具使用任务方面表现接近最先进，同时在前沿领域知识和推理方面展现出强劲的成果。

Title: Long-range Modeling and Processing of Multimodal Event Sequences

Authors: Jichu Li, Yilun Zhong, Zhiting Li, Feng Zhou, Quyu Kong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.01125
Pdf URL: https://arxiv.org/pdf/2602.01125
Copy Paste: [[2602.01125]] Long-range Modeling and Processing of Multimodal Event Sequences(https://arxiv.org/abs/2602.01125)
Keywords: llm
Abstract: Temporal point processes (TPPs) have emerged as powerful tools for modeling asynchronous event sequences. While recent advances have extended TPPs to handle textual information, existing approaches are limited in their ability to generate rich, multimodal content and reason about event dynamics. A key challenge is that incorporating multimodal data dramatically increases sequence length, hindering the ability of attention-based models to generate coherent, long-form textual descriptions that require long-range understanding. In this paper, we propose a novel framework that extends LLM-based TPPs to the visual modality, positioning text generation as a core capability alongside time and type prediction. Our approach addresses the long-context problem through an adaptive sequence compression mechanism based on temporal similarity, which reduces sequence length while preserving essential patterns. We employ a two-stage paradigm of pre-training on compressed sequences followed by supervised fine-tuning for downstream tasks. Extensive experiments, including on the challenging DanmakuTPP-QA benchmark, demonstrate that our method outperforms state-of-the-art baselines in both predictive accuracy and the quality of its generated textual analyses.

Title: Don't Judge a Book by its Cover: Testing LLMs' Robustness Under Logical Obfuscation

Authors: Abhilekh Borah, Shubhra Ghosh, Kedar Joshi, Aditya Kumar Guru, Kripabandhu Ghosh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01132
Pdf URL: https://arxiv.org/pdf/2602.01132
Copy Paste: [[2602.01132]] Don't Judge a Book by its Cover: Testing LLMs' Robustness Under Logical Obfuscation(https://arxiv.org/abs/2602.01132)
Keywords: language model, gpt, llm
Abstract: Tasks such as solving arithmetic equations, evaluating truth tables, and completing syllogisms are handled well by large language models (LLMs) in their standard form, but the models often fail when the same problems are posed in logically equivalent yet obfuscated formats. To study this vulnerability, we introduce Logifus, a structure-preserving logical obfuscation framework, and, utilizing this, we present LogiQAte, a first-of-its-kind diagnostic benchmark with 1,108 questions across four reasoning tasks: (i) Obfus FOL (first-order logic entailment under equivalence-preserving rewrites), (ii) Obfus Blood Relation (family-graph entailment under indirect relational chains), (iii) Obfus Number Series (pattern induction under symbolic substitutions), and (iv) Obfus Direction Sense (navigation reasoning under altered directions and reference frames). Across all the tasks, evaluating six state-of-the-art models, we find that obfuscation severely degrades zero-shot performance, with performance dropping on average by 47% for GPT-4o, 27% for GPT-5, and 22% for the reasoning model o4-mini. Our findings reveal that current LLMs parse questions without deep understanding, highlighting the urgency of building models that genuinely comprehend and preserve meaning beyond surface form.

Title: Beyond Training for Cultural Awareness: The Role of Dataset Linguistic Structure in Large Language Models

Authors: Reem I. Masoud, Chen Feng, Shunta Asano, Saied Alshahrani, Philip Colin Treleaven, Miguel R. D. Rodrigues
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01161
Pdf URL: https://arxiv.org/pdf/2602.01161
Copy Paste: [[2602.01161]] Beyond Training for Cultural Awareness: The Role of Dataset Linguistic Structure in Large Language Models(https://arxiv.org/abs/2602.01161)
Keywords: language model, llm
Abstract: The global deployment of large language models (LLMs) has raised concerns about cultural misalignment, yet the linguistic properties of fine-tuning datasets used for cultural adaptation remain poorly understood. We adopt a dataset-centric view of cultural alignment and ask which linguistic properties of fine-tuning data are associated with cultural performance, whether these properties are predictive prior to training, and how these effects vary across models. We compute lightweight linguistic, semantic, and structural metrics for Arabic, Chinese, and Japanese datasets and apply principal component analysis separately within each language. This design ensures that the resulting components capture variation among datasets written in the same language rather than differences between languages. The resulting components correspond to broadly interpretable axes related to semantic coherence, surface-level lexical and syntactic diversity, and lexical or structural richness, though their composition varies across languages. We fine-tune three major LLM families (LLaMA, Mistral, DeepSeek) and evaluate them on benchmarks of cultural knowledge, values, and norms. While PCA components correlate with downstream performance, these associations are strongly model-dependent. Through controlled subset interventions, we show that lexical-oriented components (PC3) are the most robust, yielding more consistent performance across models and benchmarks, whereas emphasizing semantic or diversity extremes (PC1-PC2) is often neutral or harmful.

Title: Typologically-Informed Candidate Reranking for LLM-based Translation into Low-Resource Languages

Authors: Nipuna Abeykoon, Ashen Weerathunga, Pubudu Wijesinghe, Parameswari Krishnamurthy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01162
Pdf URL: https://arxiv.org/pdf/2602.01162
Copy Paste: [[2602.01162]] Typologically-Informed Candidate Reranking for LLM-based Translation into Low-Resource Languages(https://arxiv.org/abs/2602.01162)
Keywords: language model, llm
Abstract: Large language models trained predominantly on high-resource languages exhibit systematic biases toward dominant typological patterns, leading to structural non-conformance when translating into typologically divergent low-resource languages. We present a framework that leverages linguistic typology to improve translation quality without parallel training data or model retraining. The framework consists of two components: the Universal Metalinguistic Framework (UMF), which represents languages as structured profiles across 16 typological dimensions with divergence-weighted scoring, and the Computational Engine, which operates through linguistic disambiguation during generation and typological compliance scoring during selection. Evaluation across nine language pairs demonstrates intervention rates strongly correlating with typological distance from English. In experiments on 341 English sentences each having different morphological and syntactic phenomena, the framework shows an intervention precision of 48.16% for conservatively treated languages, 28.15% for morphologically dense languages, and 86.26% for structurally profiled languages. The framework requires no parallel training data and operates with any LLM capable of producing multiple candidate outputs, enabling practical deployment for under-resourced languages.

Title: PedagoSense: A Pedology Grounded LLM System for Pedagogical Strategy Detection and Contextual Response Generation in Learning Dialogues

Authors: Shahem Sultan, Shahem Fadi, Yousef Melhim, Ibrahim Alsarraj, Besher Hassan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01169
Pdf URL: https://arxiv.org/pdf/2602.01169
Copy Paste: [[2602.01169]] PedagoSense: A Pedology Grounded LLM System for Pedagogical Strategy Detection and Contextual Response Generation in Learning Dialogues(https://arxiv.org/abs/2602.01169)
Keywords: language model, llm
Abstract: This paper addresses the challenge of improving interaction quality in dialogue-based learning by detecting and recommending effective pedagogical strategies in tutor-student conversations. We introduce PedagoSense, a pedology-grounded system that combines a two-stage strategy classifier with large language model generation. The system first detects whether a pedagogical strategy is present using a binary classifier, then performs fine-grained classification to identify the specific strategy. In parallel, it recommends an appropriate strategy from the dialogue context and uses an LLM to generate a response aligned with that strategy. We evaluate on human-annotated tutor-student dialogues, augmented with additional non-pedagogical conversations for the binary task. Results show high performance for pedagogical strategy detection and consistent gains when using data augmentation, while analysis highlights where fine-grained classes remain challenging. Overall, PedagoSense bridges pedagogical theory and practical LLM-based response generation for more adaptive educational technologies.

Title: Bridging Lexical Ambiguity and Vision: A Mini Review on Visual Word Sense Disambiguation

Authors: Shashini Nilukshi, Deshan Sumanathilaka
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2602.01193
Pdf URL: https://arxiv.org/pdf/2602.01193
Copy Paste: [[2602.01193]] Bridging Lexical Ambiguity and Vision: A Mini Review on Visual Word Sense Disambiguation(https://arxiv.org/abs/2602.01193)
Keywords: language model, llm, prompt
Abstract: This paper offers a mini review of Visual Word Sense Disambiguation (VWSD), which is a multimodal extension of traditional Word Sense Disambiguation (WSD). VWSD helps tackle lexical ambiguity in vision-language tasks. While conventional WSD depends only on text and lexical resources, VWSD uses visual cues to find the right meaning of ambiguous words with minimal text input. The review looks at developments from early multimodal fusion methods to new frameworks that use contrastive models like CLIP, diffusion-based text-to-image generation, and large language model (LLM) support. Studies from 2016 to 2025 are examined to show the growth of VWSD through feature-based, graph-based, and contrastive embedding techniques. The review focuses on prompt engineering, fine-tuning, and adapting to multiple languages. Quantitative results show that CLIP-based fine-tuned models and LLM-enhanced VWSD systems consistently perform better than zero-shot baselines, achieving gains of up to 6-8% in Mean Reciprocal Rank (MRR). However, challenges still exist, such as limitations in context, model bias toward common meanings, a lack of multilingual datasets, and the need for better evaluation frameworks. The analysis highlights the growing overlap of CLIP alignment, diffusion generation, and LLM reasoning as the future path for strong, context-aware, and multilingual disambiguation systems.
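
For readers unfamiliar with the metric quoted in this abstract, the following is a minimal sketch of how Mean Reciprocal Rank is commonly computed for a ranking task such as VWSD, where each query has one correct candidate image. The image identifiers and the helper name `mean_reciprocal_rank` are hypothetical and are not taken from the reviewed studies.

```python
from typing import Sequence


def mean_reciprocal_rank(ranked_lists: Sequence[Sequence[str]],
                         gold: Sequence[str]) -> float:
    """Average of 1/rank of the gold item; a query scores 0 if gold is absent."""
    total = 0.0
    for candidates, target in zip(ranked_lists, gold):
        for rank, cand in enumerate(candidates, start=1):
            if cand == target:
                total += 1.0 / rank
                break
    return total / len(gold)


# Hypothetical rankings: gold image is ranked 1st for the first query
# and 2nd for the second query.
rankings = [["img_a", "img_b", "img_c"], ["img_x", "img_y", "img_z"]]
gold = ["img_a", "img_y"]
mrr = mean_reciprocal_rank(rankings, gold)  # (1/1 + 1/2) / 2 = 0.75
```

Because each query contributes at most 1, a gain of a few points in MRR typically means the correct image moves up by roughly one position on a noticeable share of queries.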

Title: Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

Authors: Zizhuo Fu, Wenxuan Zeng, Runsheng Wang, Meng Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.01203
Pdf URL: https://arxiv.org/pdf/2602.01203
Copy Paste: [[2602.01203]] Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse(https://arxiv.org/abs/2602.01203)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the relationship among these attention mechanisms is lacking. In this work, we provide both theoretical and empirical evidence demonstrating that the sink in Vanilla Attention and Sink Attention naturally construct a Mixture-of-Experts (MoE) mechanism within attention layers. This insight explains the head collapse phenomenon observed in prior work, where only a fixed subset of attention heads contributes to generation. To mitigate head collapse, we propose a sink-aware training algorithm with an auxiliary load balancing loss designed for attention layers. Extensive experiments show that our method achieves effective head load balancing and improves model performance across Vanilla Attention, Sink Attention, and Gated Attention. We hope this study offers a new perspective on attention mechanisms and encourages further exploration of the inherent MoE structure within attention layers.

Title: ASTER: Agentic Scaling with Tool-integrated Extended Reasoning

Authors: Xuqin Zhang, Quan He, Zhenrui Zheng, Zongzhang Zhang, Xu He, Dong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01204
Pdf URL: https://arxiv.org/pdf/2602.01204
Copy Paste: [[2602.01204]] ASTER: Agentic Scaling with Tool-integrated Extended Reasoning(https://arxiv.org/abs/2602.01204)
Keywords: language model, llm, agent
Abstract: Reinforcement learning (RL) has emerged as a dominant paradigm for eliciting long-horizon reasoning in Large Language Models (LLMs). However, scaling Tool-Integrated Reasoning (TIR) via RL remains challenging due to interaction collapse: a pathological state where models fail to sustain multi-turn tool usage, instead degenerating into heavy internal reasoning with only trivial, post-hoc code verification. We systematically study three questions: (i) how cold-start SFT induces an agentic, tool-using behavioral prior, (ii) how the interaction density of cold-start trajectories shapes exploration and downstream RL outcomes, and (iii) how the RL interaction budget affects learning dynamics and generalization under varying inference-time budgets. We then introduce ASTER (Agentic Scaling with Tool-integrated Extended Reasoning), a framework that circumvents this collapse through a targeted cold-start strategy prioritizing interaction-dense trajectories. We find that a small expert cold-start set of just 4K interaction-dense trajectories yields the strongest downstream performance, establishing a robust prior that enables superior exploration during extended RL training. Extensive evaluations demonstrate that ASTER-4B achieves state-of-the-art results on competitive mathematical benchmarks, reaching 90.0% on AIME 2025, surpassing leading frontier open-source models, including DeepSeek-V3.2-Exp.

Title: Chronos: Learning Temporal Dynamics of Reasoning Chains for Test-Time Scaling

Authors: Kai Zhang, Jiayi Liao, Chengpeng Li, Ziyuan Xie, Sihang Li, Xiang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01208
Pdf URL: https://arxiv.org/pdf/2602.01208
Copy Paste: [[2602.01208]] Chronos: Learning Temporal Dynamics of Reasoning Chains for Test-Time Scaling(https://arxiv.org/abs/2602.01208)
Keywords: language model, llm
Abstract: Test-Time Scaling (TTS) has emerged as an effective paradigm for improving the reasoning performance of large language models (LLMs). However, existing methods -- most notably majority voting and heuristic token-level scoring -- treat reasoning traces or tokens equally, thereby being susceptible to substantial variations in trajectory quality and localized logical failures. In this work, we introduce Chronos, a lightweight and plug-and-play chronological reasoning scorer that models each trajectory as a time series. Specifically, Chronos learns to capture trajectory features of token probabilities, assigns quality scores accordingly, and employs a weighted voting mechanism. Extensive evaluations on both in-domain and out-of-domain benchmarks demonstrate that Chronos consistently delivers substantial gains across a variety of models, with negligible computational overhead. Notably, Chronos@128 achieves relative improvements of 34.21% over Pass@1 and 22.70% over Maj@128 on HMMT25 using Qwen3-4B-Thinking-2507, highlighting its effectiveness.
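
The contrast this abstract draws between majority voting and score-weighted voting can be illustrated with a small sketch. It is not the authors' scorer: the per-trajectory quality scores below are made-up numbers standing in for whatever a learned scorer would output, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Every trajectory counts equally."""
    counts = defaultdict(float)
    for answer in answers:
        counts[answer] += 1.0
    return max(counts, key=counts.get)


def weighted_vote(answers: Sequence[str], scores: Sequence[float]) -> str:
    """Each trajectory's vote is scaled by its quality score."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


# Five sampled trajectories; scores are illustrative assumptions only.
answers = ["12", "12", "12", "7", "7"]
scores = [0.1, 0.2, 0.1, 0.9, 0.8]

plain = majority_vote(answers)             # "12" wins 3 votes to 2
weighted = weighted_vote(answers, scores)  # "7" wins 1.7 to 0.4
```

The example shows the failure mode the abstract points at: three low-quality trajectories outvote two high-quality ones under equal weighting, while weighting by score reverses the outcome.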

Title: Inferential Question Answering

Authors: Jamshid Mozafari, Hamed Zamani, Guido Zuccon, Adam Jatowt
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.01239
Pdf URL: https://arxiv.org/pdf/2602.01239
Copy Paste: [[2602.01239]] Inferential Question Answering(https://arxiv.org/abs/2602.01239)
Keywords: llm
Abstract: Despite extensive research on a wide range of question answering (QA) systems, most existing work focuses on answer containment, i.e., assuming that answers can be directly extracted and/or generated from documents in the corpus. However, some questions require inference, i.e., deriving answers that are not explicitly stated but can be inferred from the available information. We introduce Inferential QA -- a new task that challenges models to infer answers from answer-supporting passages which provide only clues. To study this problem, we construct the QUIT (QUestions requiring Inference from Texts) dataset, comprising 7,401 questions and 2.4M passages built from high-convergence human- and machine-authored hints, labeled across three relevance levels using LLM-based answerability and human verification. Through comprehensive evaluation of retrievers, rerankers, and LLM-based readers, we show that methods effective on traditional QA tasks struggle in inferential QA: retrievers underperform, rerankers offer limited gains, and fine-tuning provides inconsistent improvements. Even reasoning-oriented LLMs fail to outperform smaller general-purpose models. These findings reveal that current QA pipelines are not yet ready for inference-based reasoning. Inferential QA thus establishes a new class of QA tasks that move towards understanding and reasoning from indirect textual evidence.

Title: Minimizing Mismatch Risk: A Prototype-Based Routing Framework for Zero-shot LLM-generated Text Detection

Authors: Ke Sun, Guangsheng Bao, Han Cui, Yue Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01240
Pdf URL: https://arxiv.org/pdf/2602.01240
Copy Paste: [[2602.01240]] Minimizing Mismatch Risk: A Prototype-Based Routing Framework for Zero-shot LLM-generated Text Detection(https://arxiv.org/abs/2602.01240)
Keywords: llm
Abstract: Zero-shot methods detect LLM-generated text by computing statistical signatures using a surrogate model. Existing approaches typically employ a fixed surrogate for all inputs regardless of the unknown source. We systematically examine this design and find that detection performance varies substantially depending on surrogate-source alignment. We observe that while no single surrogate achieves optimal performance universally, a well-matched surrogate typically exists within a diverse pool for any given input. This finding transforms robust detection into a routing problem: selecting the most appropriate surrogate for each input. We propose DetectRouter, a prototype-based framework that learns text-detector affinity through two-stage training. The first stage constructs discriminative prototypes from white-box models; the second generalizes to black-box sources by aligning geometric distances with observed detection scores. Experiments on EvoBench and MAGE benchmarks demonstrate consistent improvements across multiple detection criteria and model families.
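
The routing idea in this abstract, picking one surrogate per input instead of using a fixed one, can be sketched as a nearest-prototype lookup. This is only an illustration under strong assumptions: the abstract does not specify the embedding, the distance, or how prototypes are trained, so the two-dimensional vectors, the Euclidean distance, and the names `route`, `surrogate_a`, and `surrogate_b` are all hypothetical.

```python
import math
from typing import Dict, Sequence


def route(embedding: Sequence[float],
          prototypes: Dict[str, Sequence[float]]) -> str:
    """Return the surrogate whose prototype is closest to the input embedding."""
    return min(prototypes,
               key=lambda name: math.dist(embedding, prototypes[name]))


# Hypothetical prototypes, one per candidate surrogate model.
prototypes = {
    "surrogate_a": [0.0, 0.0],
    "surrogate_b": [1.0, 1.0],
}

# Hypothetical embedding of an input text; it lies nearer to surrogate_b.
chosen = route([0.9, 0.8], prototypes)
```

Once a surrogate is chosen, the usual zero-shot detection statistic would be computed with that surrogate only, which is what turns detection into a per-input routing decision.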

Title: Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments

Authors: Siwei Wu, Yizhi Li, Yuyang Song, Wei Zhang, Yang Wang, Riza Batista-Navarro, Xian Yang, Mingjie Tang, Bryan Dai, Jian Yang, Chenghua Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01244
Pdf URL: https://arxiv.org/pdf/2602.01244
Copy Paste: [[2602.01244]] Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments(https://arxiv.org/abs/2602.01244)
Keywords: agent
Abstract: Training agentic models for terminal-based tasks critically depends on high-quality terminal trajectories that capture realistic long-horizon interactions across diverse domains. However, constructing such data at scale remains challenging due to two key requirements: Executability, since each instance requires a suitable and often distinct Docker environment; and Verifiability, because heterogeneous task outputs preclude unified, standardized verification. To address these challenges, we propose TerminalTraj, a scalable pipeline that (i) filters high-quality repositories to construct Dockerized execution environments, (ii) generates Docker-aligned task instances, and (iii) synthesizes agent trajectories with executable validation code. Using TerminalTraj, we curate 32K Docker images and generate 50,733 verified terminal trajectories across eight domains. Models trained on this data with the Qwen2.5-Coder backbone achieve consistent performance improvements on TerminalBench (TB), with gains of up to 20% on TB 1.0 and 10% on TB 2.0 over their respective backbones. Notably, TerminalTraj-32B achieves strong performance among models with fewer than 100B parameters, reaching 35.30% on TB 1.0 and 22.00% on TB 2.0, and demonstrates improved test-time scaling behavior. All code and data are available at this https URL.

Title: PARSE: An Open-Domain Reasoning Question Answering Benchmark for Persian

Authors: Jamshid Mozafari, Seyed Parsa Mousavinasab, Adam Jatowt
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.01246
Pdf URL: https://arxiv.org/pdf/2602.01246
Copy Paste: [[2602.01246]] PARSE: An Open-Domain Reasoning Question Answering Benchmark for Persian(https://arxiv.org/abs/2602.01246)
Keywords: language model, llm, prompt
Abstract: Reasoning-focused Question Answering (QA) has advanced rapidly with Large Language Models (LLMs), yet high-quality benchmarks for low-resource languages remain scarce. Persian, spoken by roughly 130 million people, lacks a comprehensive open-domain resource for evaluating reasoning-capable QA systems. We introduce PARSE, the first open-domain Persian reasoning QA benchmark, containing 10,800 questions across Boolean, multiple-choice, and factoid formats, with diverse reasoning types, difficulty levels, and answer structures. The benchmark is built via a controlled LLM-based generation pipeline and validated through human evaluation. We also ensure linguistic and factual quality through multi-stage filtering, annotation, and consistency checks. We benchmark multilingual and Persian LLMs under multiple prompting strategies and show that Persian prompts and structured prompting (CoT for Boolean/multiple-choice; few-shot for factoid) improve performance. Fine-tuning further boosts results, especially for Persian-specialized models. These findings highlight how PARSE supports both fair comparison and practical model adaptation. PARSE fills a critical gap in Persian QA research and provides a strong foundation for developing and evaluating reasoning-capable LLMs in low-resource settings.

Title: PACER: Blockwise Pre-verification for Speculative Decoding with Adaptive Length

Authors: Situo Zhang, Yifan Zhang, Zichen Zhu, Hankun Wang, Da Ma, Danyang Zhang, Lu Chen, Kai Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01274
Pdf URL: https://arxiv.org/pdf/2602.01274
Copy Paste: [[2602.01274]] PACER: Blockwise Pre-verification for Speculative Decoding with Adaptive Length(https://arxiv.org/abs/2602.01274)
Keywords: language model, llm
Abstract: Speculative decoding (SD) is a powerful technique for accelerating the inference process of large language models (LLMs) without sacrificing accuracy. Typically, SD employs a small draft model to generate a fixed number of draft tokens, which are then verified in parallel by the target model. However, our experiments reveal that the optimal draft length varies significantly across different decoding steps. This variation suggests that using a fixed draft length limits the potential for further improvements in decoding speed. To address this challenge, we propose Pacer, a novel approach that dynamically controls draft length using a lightweight, trainable pre-verification layer. This layer pre-verifies draft tokens blockwise before they are sent to the target model, allowing the draft model to stop token generation if the blockwise pre-verification fails. We implement Pacer on multiple SD model pairs and evaluate its performance across various benchmarks. Our results demonstrate that Pacer achieves up to 2.66x speedup over autoregressive decoding and consistently outperforms standard speculative decoding. Furthermore, when integrated with Ouroboros, Pacer attains up to 3.09x speedup.
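
The control flow described in this abstract, drafting in blocks and stopping early when a pre-verification check fails, can be sketched with toy stand-ins. This is not the Pacer implementation: `draft_next` and `preverify` are hypothetical callables replacing the draft model and the trained pre-verification layer, the block size and threshold are arbitrary, and keeping the failing block for the target model to judge is an assumption the abstract does not confirm.

```python
from typing import Callable, List


def draft_adaptive(draft_next: Callable[[List[int]], int],
                   preverify: Callable[[List[int]], float],
                   prefix: List[int],
                   block_size: int = 2,
                   max_blocks: int = 4,
                   threshold: float = 0.5) -> List[int]:
    """Draft tokens block by block; stop once a block fails pre-verification."""
    drafted: List[int] = []
    for _ in range(max_blocks):
        block: List[int] = []
        for _ in range(block_size):
            block.append(draft_next(prefix + drafted + block))
        drafted.extend(block)  # assumption: the failing block is still passed on
        if preverify(block) < threshold:
            break              # stop drafting early instead of using a fixed length
    return drafted


# Toy stand-ins: the "draft model" emits the current context length as a token,
# and the "pre-verifier" accepts a block only if all of its tokens are below 5.
def toy_draft_next(context: List[int]) -> int:
    return len(context)


def toy_preverify(block: List[int]) -> float:
    return 1.0 if all(token < 5 for token in block) else 0.0


drafted = draft_adaptive(toy_draft_next, toy_preverify, prefix=[0, 1])
```

With these stand-ins the loop drafts two blocks (four tokens) and then stops, rather than always producing the maximum of eight; the drafted tokens would then go to the target model for the usual parallel verification.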

Title: EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models

Authors: Chuanrui Hu, Tong Li, Xingze Gao, Hongda Chen, Dannong Xu, Yi Bai, Tianwei Lin, Xinda Zhao, Xiaohong Li, Jiaqi An, Yunyun Han, Jian Pei, Yafeng Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01313
Pdf URL: https://arxiv.org/pdf/2602.01313
Copy Paste: [[2602.01313]] EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models(https://arxiv.org/abs/2602.01313)
Keywords: language model, llm
Abstract: Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26%; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.

Title: DreamOn: Diffusion Language Models For Code Infilling Beyond Fixed-size Canvas

Authors: Zirui Wu, Lin Zheng, Zhihui Xie, Jiacheng Ye, Jiahui Gao, Shansan Gong, Yansong Feng, Zhenguo Li, Wei Bi, Guorui Zhou, Lingpeng Kong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01326
Pdf URL: https://arxiv.org/pdf/2602.01326
Copy Paste: [[2602.01326]] DreamOn: Diffusion Language Models For Code Infilling Beyond Fixed-size Canvas(https://arxiv.org/abs/2602.01326)
Keywords: language model, prompt
Abstract: Diffusion Language Models (DLMs) present a compelling alternative to autoregressive models, offering flexible, any-order infilling without specialized prompting design. However, their practical utility is blocked by a critical limitation: the requirement of a fixed-length masked sequence for generation. This constraint severely degrades code infilling performance when the predefined mask size mismatches the ideal completion length. To address this, we propose DreamOn, a novel diffusion framework that enables dynamic, variable-length generation. DreamOn augments the diffusion process with two length control states, allowing the model to autonomously expand or contract the output length based solely on its own predictions. We integrate this mechanism into existing DLMs with minimal modifications to the training objective and no architectural changes. Built upon Dream-Coder-7B and DiffuCoder-7B, DreamOn achieves infilling performance on par with state-of-the-art autoregressive models on HumanEval-Infilling and SantaCoder-FIM and matches oracle performance achieved with ground-truth length. Our work removes a fundamental barrier to the practical deployment of DLMs, significantly advancing their flexibility and applicability for variable-length generation. Our code is available at this https URL.

Title: CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering

Authors: Yu Liu, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Weizhuo Chen, Cheng Hu, Pin Xu, Yuling Yang, Kun Peng, Diandian Guo, Qiang Sun, Yanbing Liu, Jin B. Hong, Zhiyuan Ma
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.01348
Pdf URL: https://arxiv.org/pdf/2602.01348
Copy Paste: [[2602.01348]] CRAFT: Calibrated Reasoning with Answer-Faithful Traces via Reinforcement Learning for Multi-Hop Question Answering(https://arxiv.org/abs/2602.01348)
Keywords: language model, llm, retrieval-augmented generation, chain-of-thought
Abstract: Retrieval-augmented generation (RAG) is widely used to ground Large Language Models (LLMs) for multi-hop question answering. Recent work has mainly focused on improving answer accuracy via fine-tuning and structured or reinforcement-based optimization. However, reliable reasoning in response generation faces three challenges: 1) Reasoning collapse. Reasoning in multi-hop QA is inherently complex due to multi-hop composition and is further destabilized by noisy retrieval. 2) Reasoning-answer inconsistency. Due to the intrinsic uncertainty of LLM generation and exposure to evidence-distractor mixtures, models may produce correct answers that are not faithfully supported by their intermediate reasoning or evidence. 3) Loss of format control. Traditional chain-of-thought generation often deviates from required structured output formats, leading to incomplete or malformed structured content. To address these challenges, we propose CRAFT (Calibrated Reasoning with Answer-Faithful Traces), a Group Relative Policy Optimization (GRPO) based reinforcement learning framework that trains models to perform faithful reasoning during response generation. CRAFT employs dual reward mechanisms to optimize multi-hop reasoning: deterministic rewards ensure structural correctness while judge-based rewards verify semantic faithfulness. This optimization framework supports controllable trace variants that enable systematic analysis of how structure and scale affect reasoning performance and faithfulness. Experiments on three multi-hop QA benchmarks show that CRAFT improves both answer accuracy and reasoning faithfulness across model scales, with the CRAFT 7B model achieving competitive performance with closed-source LLMs across multiple reasoning trace settings.

Title: Balancing Understanding and Generation in Discrete Diffusion Models

Authors: Yue Liu, Yuzhong Zhao, Zheyong Xie, Qixiang Ye, Jianbin Jiao, Yao Hu, Shaosheng Cao, Yunfan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01362
Pdf URL: https://arxiv.org/pdf/2602.01362
Copy Paste: [[2602.01362]] Balancing Understanding and Generation in Discrete Diffusion Models(https://arxiv.org/abs/2602.01362)
Keywords: language model
Abstract: In discrete generative modeling, two dominant paradigms demonstrate divergent capabilities: Masked Diffusion Language Models (MDLM) excel at semantic understanding and zero-shot generalization, whereas Uniform-noise Diffusion Language Models (UDLM) achieve strong few-step generation quality, yet neither attains balanced performance across both dimensions. To address this, we propose XDLM, which bridges the two paradigms via a stationary noise kernel. XDLM offers two key contributions: (1) it provides a principled theoretical unification of MDLM and UDLM, recovering each paradigm as a special case; and (2) it alleviates the memory bottleneck through an algebraic simplification of the posterior probabilities. Experiments demonstrate that XDLM advances the Pareto frontier between understanding capability and generation quality. Quantitatively, XDLM surpasses UDLM by 5.4 points on zero-shot text benchmarks and outperforms MDLM in few-step image generation (FID 54.1 vs. 80.8). When scaled to tune an 8B-parameter large language model, XDLM achieves 15.0 MBPP in just 32 steps, effectively doubling the baseline performance. Finally, analysis of training dynamics reveals XDLM's superior potential for long-term scaling. Code is available at this https URL

Title: Context Dependence and Reliability in Autoregressive Language Models

Authors: Poushali Sengupta, Shashi Raj Pandey, Sabita Maharjan, Frank Eliassen
Subjects: cs.CL, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2602.01378
Pdf URL: https://arxiv.org/pdf/2602.01378
Copy Paste: [[2602.01378]] Context Dependence and Reliability in Autoregressive Language Models(https://arxiv.org/abs/2602.01378)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) generate outputs by utilizing extensive context, which often includes redundant information from prompts, retrieved passages, and interaction history. In critical applications, it is vital to identify which context elements actually influence the output, as standard explanation methods struggle with redundancy and overlapping context. Minor changes in input can lead to unpredictable shifts in attribution scores, undermining interpretability and raising concerns about risks like prompt injection. This work addresses the challenge of distinguishing essential context elements from correlated ones. We introduce RISE (Redundancy-Insensitive Scoring of Explanation), a method that quantifies the unique influence of each input relative to others, minimizing the impact of redundancies and providing clearer, stable attributions. Experiments demonstrate that RISE offers more robust explanations than traditional methods, emphasizing the importance of conditional information for trustworthy LLM explanations and monitoring.

Title: On the Power of (Approximate) Reward Models for Inference-Time Scaling

Authors: Youheng Zhu, Yiping Lu
Subjects: cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2602.01381
Pdf URL: https://arxiv.org/pdf/2602.01381
Copy Paste: [[2602.01381]] On the Power of (Approximate) Reward Models for Inference-Time Scaling(https://arxiv.org/abs/2602.01381)
Keywords: language model, llm
Abstract: Inference-time scaling has recently emerged as a powerful paradigm for improving the reasoning capability of large language models. Among various approaches, Sequential Monte Carlo (SMC) has become a particularly important framework, enabling iterative generation, evaluation, rejection, and resampling of intermediate reasoning trajectories. A central component in this process is the reward model, which evaluates partial solutions and guides the allocation of computation during inference. However, in practice, true reward models are never available. All deployed systems rely on approximate reward models, raising a fundamental question: Why and when do approximate reward models suffice for effective inference-time scaling? In this work, we provide a theoretical answer. We identify the Bellman error of the approximate reward model as the key quantity governing the effectiveness of SMC-based inference-time scaling. For a reasoning process of length $T$, we show that if the Bellman error of the approximate reward model is bounded by $O(1/T)$, then combining this reward model with SMC reduces the computational complexity of reasoning from exponential in $T$ to polynomial in $T$. This yields an exponential improvement in inference efficiency despite using only approximate rewards.
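
The generate, evaluate, and resample loop that the abstract above attributes to Sequential Monte Carlo can be sketched in a few lines. This is a toy illustration, not the paper's algorithm or analysis: `smc_search`, `propose`, and `reward` are hypothetical names, the "reasoning steps" are random bits, and the reward is a made-up approximate score that penalises each wrong step. It does not demonstrate the Bellman-error bound itself.

```python
import random


def smc_search(propose, reward, num_particles, horizon, rng):
    """Sequential Monte Carlo over partial reasoning trajectories.

    propose(prefix, rng) -> next step for a trajectory (hypothetical interface).
    reward(prefix) -> positive score from an approximate reward model (hypothetical).
    """
    particles = [[] for _ in range(num_particles)]
    for _ in range(horizon):
        # Generation: extend every partial trajectory by one step.
        particles = [p + [propose(p, rng)] for p in particles]
        # Evaluation: score partial solutions with the approximate reward model.
        weights = [reward(p) for p in particles]
        # Rejection and resampling: draw survivors in proportion to their weights.
        survivors = rng.choices(particles, weights=weights, k=num_particles)
        particles = [list(p) for p in survivors]  # copy so duplicates evolve independently
    return particles


def propose(prefix, rng):
    return rng.randint(0, 1)       # toy step: one random bit, 1 counts as "correct"


def reward(prefix):
    return 0.2 ** prefix.count(0)  # toy approximate reward: each wrong step costs 5x


rng = random.Random(0)
final = smc_search(propose, reward, num_particles=32, horizon=6, rng=rng)
```

Resampling after every step is what redirects computation toward promising prefixes instead of enumerating all $2^T$ trajectories in this toy setting.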

Title: Rethinking Selective Knowledge Distillation

Authors: Almog Tavor, Itay Ebenspanger, Neil Cnaan, Mor Geva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01395
Pdf URL: https://arxiv.org/pdf/2602.01395
Copy Paste: [[2602.01395]] Rethinking Selective Knowledge Distillation(https://arxiv.org/abs/2602.01395)
Keywords: language model, llm
Abstract: Growing efforts to improve knowledge distillation (KD) in large language models (LLMs) replace dense teacher supervision with selective distillation, which uses a subset of token positions, vocabulary classes, or training samples for supervision. However, it remains unclear which importance signals, selection policies, and their interplay are most effective. In this work, we revisit where and how to distill in autoregressive LLMs. We disentangle selective KD along the position, class, and sample axes and systematically compare importance signals and selection policies. Then, guided by this analysis, we identify underexplored opportunities and introduce student-entropy-guided position selection (SE-KD). Across a suite of benchmarks, SE-KD often improves accuracy, downstream task adherence, and memory efficiency over dense distillation. Extending this approach across the class and sample axes (SE-KD 3X) yields complementary efficiency gains that make offline teacher caching feasible. In practice, this reduces wall time by 70% and peak memory by 18%, while cutting storage usage by 80% over prior methods without sacrificing performance.
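
A minimal sketch of position-level selective distillation, in the spirit of the student-entropy-guided selection described above. Several details are assumptions made for illustration and are not confirmed by the abstract: that the highest-entropy positions are the ones kept, that the loss is a forward KL from teacher to student, and the names `selective_kd_loss` and `keep_ratio`.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def kl(p, q):
    """KL(p || q) for two distributions over the same vocabulary (q must be positive)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def selective_kd_loss(teacher_probs, student_probs, keep_ratio):
    """Average teacher-to-student KL over the positions where the student is most uncertain."""
    n = len(student_probs)
    k = max(1, int(n * keep_ratio))
    # Rank token positions by student entropy, most uncertain first (assumed policy).
    ranked = sorted(range(n), key=lambda i: entropy(student_probs[i]), reverse=True)
    selected = sorted(ranked[:k])
    loss = sum(kl(teacher_probs[i], student_probs[i]) for i in selected) / k
    return loss, selected


# Three token positions over a three-word vocabulary (made-up numbers).
student = [[0.98, 0.01, 0.01], [0.40, 0.30, 0.30], [0.34, 0.33, 0.33]]
teacher = [[0.90, 0.05, 0.05], [0.70, 0.20, 0.10], [0.10, 0.80, 0.10]]
loss, selected = selective_kd_loss(teacher, student, keep_ratio=0.5)
```

Because only the selected positions need teacher distributions, a scheme like this would also shrink what has to be cached offline, which is consistent with the storage savings the abstract reports.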

Title: From Pragmas to Partners: A Symbiotic Evolution of Agentic High-Level Synthesis

Authors: Niansong Zhang, Sunwoo Kim, Shreesha Srinath, Zhiru Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01401
Pdf URL: https://arxiv.org/pdf/2602.01401
Copy Paste: [[2602.01401]] From Pragmas to Partners: A Symbiotic Evolution of Agentic High-Level Synthesis(https://arxiv.org/abs/2602.01401)
Keywords: language model, agent
Abstract: The rise of large language models has sparked interest in AI-driven hardware design, raising the question: does high-level synthesis (HLS) still matter in the agentic era? We argue that HLS remains essential. While we expect mature agentic hardware systems to leverage both HLS and RTL, this paper focuses on HLS and its role in enabling agentic optimization. HLS offers faster iteration cycles, portability, and design permutability that make it a natural layer for agentic optimization. This position paper makes three contributions. First, we explain why HLS serves as a practical abstraction layer and a golden reference for agentic hardware design. Second, we identify key limitations of current HLS tools, namely inadequate performance feedback, rigid interfaces, and limited debuggability, that agents are uniquely positioned to address. Third, we propose a taxonomy for the symbiotic evolution of agentic HLS, clarifying how responsibility shifts from human designers to AI agents as systems advance from copilots to autonomous design partners.

Title: Understanding QA generation: Extracting Parametric and Contextual Knowledge with CQA for Low Resource Bangla Language

Authors: Umme Abira Azmary, MD Ikramul Kayes, Swakkhar Shatabda, Farig Yousuf Sadeque
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01451
Pdf URL: https://arxiv.org/pdf/2602.01451
Copy Paste: [[2602.01451]] Understanding QA generation: Extracting Parametric and Contextual Knowledge with CQA for Low Resource Bangla Language(https://arxiv.org/abs/2602.01451)
Keywords: llm, prompt, chain-of-thought
Abstract: Question-Answering (QA) models for low-resource languages like Bangla face challenges due to limited annotated data and linguistic complexity. A key issue is determining whether models rely more on pre-encoded (parametric) knowledge or contextual input during answer generation, as existing Bangla QA datasets lack the structure required for such analysis. We introduce BanglaCQA, the first Counterfactual QA dataset in Bangla, by extending a Bangla dataset while integrating counterfactual passages and answerability annotations. In addition, we propose fine-tuned pipelines for encoder-decoder language-specific and multilingual baseline models, and prompting-based pipelines for decoder-only LLMs to disentangle parametric and contextual knowledge in both factual and counterfactual scenarios. Furthermore, we apply LLM-based and human evaluation techniques that measure answer quality based on semantic similarity. We also present a detailed analysis of how models perform across different QA settings in low-resource languages, and show that Chain-of-Thought (CoT) prompting reveals a uniquely effective mechanism for extracting parametric knowledge in counterfactual scenarios, particularly in decoder-only LLMs. Our work not only introduces a novel framework for analyzing knowledge sources in Bangla QA but also uncovers critical findings that open up broader directions for counterfactual reasoning in low-resource language settings.

Title: ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure

Authors: Jie Deng, Shining Liang, Jun Li, Hongzhi Li, Yutao Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01472
Pdf URL: https://arxiv.org/pdf/2602.01472
Copy Paste: [[2602.01472]] ConPress: Learning Efficient Reasoning from Multi-Question Contextual Pressure(https://arxiv.org/abs/2602.01472)
Keywords: prompt, chain-of-thought
Abstract: Large reasoning models (LRMs) typically solve reasoning-intensive tasks by generating long chain-of-thought (CoT) traces, leading to substantial inference overhead. We identify a reproducible inference-time phenomenon, termed Self-Compression: when multiple independent and answerable questions are presented within a single prompt, the model spontaneously produces shorter reasoning traces for each question. This phenomenon arises from multi-question contextual pressure during generation and consistently manifests across models and benchmarks. Building on this observation, we propose ConPress (Learning from Contextual Pressure), a lightweight self-supervised fine-tuning approach. ConPress constructs multi-question prompts to induce self-compression, samples the resulting model outputs, and parses and filters per-question traces to obtain concise yet correct reasoning trajectories. These trajectories are directly used for supervised fine-tuning, internalizing compressed reasoning behavior in single-question settings without external teachers, manual pruning, or reinforcement learning. With only 8k fine-tuning examples, ConPress reduces reasoning token usage by 59% on MATH500 and 33% on AIME25, while maintaining competitive accuracy.
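
The data-construction steps named in the abstract above (build a multi-question prompt, parse per-question traces, filter for correctness) could look roughly like the sketch below. The prompt wording, the `Q<k>:` and `Answer:` markers, and all function names are hypothetical conventions chosen for this example, and the model output is a hand-written string rather than a real sample.

```python
import re


def build_multi_question_prompt(questions):
    """Pack several independent questions into one prompt (hypothetical format)."""
    lines = ["Solve each question. Start question k with 'Q<k>:' and end it with 'Answer: <value>'."]
    lines += [f"Question {k}: {q}" for k, q in enumerate(questions, start=1)]
    return "\n".join(lines)


def split_traces(output):
    """Split one model output into per-question traces keyed by question number."""
    # re.split with one capture group yields [prefix, num, text, num, text, ...].
    parts = re.split(r"^Q(\d+):", output, flags=re.MULTILINE)
    return {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}


def keep_correct(traces, gold):
    """Keep only the traces whose final 'Answer:' matches the reference answer."""
    kept = {}
    for k, trace in traces.items():
        match = re.search(r"Answer:\s*(.+)\s*$", trace)
        if match and match.group(1).strip() == gold.get(k):
            kept[k] = trace
    return kept


prompt = build_multi_question_prompt(["What is 2+2?", "What is 3*3?"])
fake_output = "Q1: 2+2 is 4. Answer: 4\nQ2: 3*3 is 6. Answer: 6\n"  # hand-written stand-in
traces = split_traces(fake_output)
sft_examples = keep_correct(traces, gold={1: "4", 2: "9"})
```

The surviving traces would then be paired with their single-question prompts for supervised fine-tuning, so the compressed style is learned without the multi-question context.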

Title: Ebisu: Benchmarking Large Language Models in Japanese Finance

Authors: Xueqing Peng, Ruoyu Xiang, Fan Zhang, Mingzi Song, Mingyang Jiang, Yan Wang, Lingfei Qian, Taiki Hara, Yuqing Guo, Jimin Huang, Junichi Tsujii, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01479
Pdf URL: https://arxiv.org/pdf/2602.01479
Copy Paste: [[2602.01479]] Ebisu: Benchmarking Large Language Models in Japanese Finance(https://arxiv.org/abs/2602.01479)
Keywords: language model, llm
Abstract: Japanese finance combines agglutinative, head-final linguistic structure, mixed writing systems, and high-context communication norms that rely on indirect expression and implicit commitment, posing a substantial challenge for LLMs. We introduce Ebisu, a benchmark for native Japanese financial language understanding, comprising two linguistically and culturally grounded, expert-annotated tasks: JF-ICR, which evaluates implicit commitment and refusal recognition in investor-facing Q&A, and JF-TE, which assesses hierarchical extraction and ranking of nested financial terminology from professional disclosures. We evaluate a diverse set of open-source and proprietary LLMs spanning general-purpose, Japanese-adapted, and financial models. Results show that even state-of-the-art systems struggle on both tasks. While increased model scale yields limited improvements, language- and domain-specific adaptation does not reliably improve performance, leaving substantial gaps unresolved. Ebisu provides a focused benchmark for advancing linguistically and culturally grounded financial NLP. All datasets and evaluation scripts are publicly released.

Title: Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training

Authors: Ran Xu, Tianci Liu, Zihan Dong, Tony You, Ilgee Hong, Carl Yang, Linjun Zhang, Tao Zhao, Haoyu Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.01511
Pdf URL: https://arxiv.org/pdf/2602.01511
Copy Paste: [[2602.01511]] Alternating Reinforcement Learning for Rubric-Based Reward Modeling in Non-Verifiable LLM Post-Training(https://arxiv.org/abs/2602.01511)
Keywords: llm
Abstract: Standard reward models typically predict scalar scores that fail to capture the multifaceted nature of response quality in non-verifiable domains, such as creative writing or open-ended instruction following. To address this limitation, we propose Rubric-ARM, a framework that jointly optimizes a rubric generator and a judge using reinforcement learning from preference feedback. Unlike existing methods that rely on static rubrics or disjoint training pipelines, our approach treats rubric generation as a latent action learned to maximize judgment accuracy. We introduce an alternating optimization strategy to mitigate the non-stationarity of simultaneous updates, providing theoretical analysis that demonstrates how this schedule reduces gradient variance during training. Extensive experiments show that Rubric-ARM achieves state-of-the-art performance among baselines on multiple benchmarks and significantly improves downstream policy alignment in both offline and online reinforcement learning settings.

Title: Argument Rarity-based Originality Assessment for AI-Assisted Writing

Authors: Keito Inoshita, Michiaki Omura, Tsukasa Yamanaka, Go Maeda, Kentaro Tsuji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01560
Pdf URL: https://arxiv.org/pdf/2602.01560
Copy Paste: [[2602.01560]] Argument Rarity-based Originality Assessment for AI-Assisted Writing(https://arxiv.org/abs/2602.01560)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) have become capable of effortlessly generating high-quality text, traditional quality-focused writing assessment is losing its significance. If the essential goal of education is to foster critical thinking and original perspectives, assessment must also shift its paradigm from quality to originality. This study proposes Argument Rarity-based Originality Assessment (AROA), a framework for automatically evaluating argumentative originality in student essays. AROA defines originality as rarity within a reference corpus and evaluates it through four complementary components: structural rarity, claim rarity, evidence rarity, and cognitive depth. The framework quantifies the rarity of each component using density estimation and integrates them with a quality adjustment mechanism, thereby treating quality and originality as independent evaluation axes. Experiments using human essays and AI-generated essays revealed a strong negative correlation between quality and claim rarity, demonstrating a quality-originality trade-off where higher-quality texts tend to rely on typical claim patterns. Furthermore, while AI essays achieved comparable levels of structural complexity to human essays, their claim rarity was substantially lower than that of humans, indicating that LLMs can reproduce the form of argumentation but have limitations in the originality of content.

Title: FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents

Authors: Chiwei Zhu, Benfeng Xu, Mingxuan Du, Shaohan Wang, Xiaorui Wang, Zhendong Mao, Yongdong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01566
Pdf URL: https://arxiv.org/pdf/2602.01566
Copy Paste: [[2602.01566]] FS-Researcher: Test-Time Scaling for Long-Horizon Research Tasks with File-System-Based Agents(https://arxiv.org/abs/2602.01566)
Keywords: language model, llm, agent
Abstract: Deep research is emerging as a representative long-horizon task for large language model (LLM) agents. However, long trajectories in deep research often exceed model context limits, compressing token budgets for both evidence collection and report writing, and preventing effective test-time scaling. We introduce FS-Researcher, a file-system-based, dual-agent framework that scales deep research beyond the context window via a persistent workspace. Specifically, a Context Builder agent acts as a librarian which browses the internet, writes structured notes, and archives raw sources into a hierarchical knowledge base that can grow far beyond context length. A Report Writer agent then composes the final report section by section, treating the knowledge base as the source of facts. In this framework, the file system serves as a durable external memory and a shared coordination medium across agents and sessions, enabling iterative refinement beyond the context window. Experiments on two open-ended benchmarks (DeepResearch Bench and DeepConsult) show that FS-Researcher achieves state-of-the-art report quality across different backbone models. Further analyses demonstrate a positive correlation between final report quality and the computation allocated to the Context Builder, validating effective test-time scaling under the file-system paradigm. The code and data are anonymously open-sourced at this https URL.

Title: LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States

Authors: Yeqin Zhang, Yunfei Wang, Jiaxuan Chen, Ke Qin, Yizheng Zhao, Cam-Tu Nguyen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.01572
Pdf URL: https://arxiv.org/pdf/2602.01572
Copy Paste: [[2602.01572]] LLM-based Embeddings: Attention Values Encode Sentence Semantics Better Than Hidden States(https://arxiv.org/abs/2602.01572)
Keywords: language model, llm, prompt
Abstract: Sentence representations are foundational to many Natural Language Processing (NLP) applications. While recent methods leverage Large Language Models (LLMs) to derive sentence representations, most rely on final-layer hidden states, which are optimized for next-token prediction and thus often fail to capture global, sentence-level semantics. This paper introduces a novel perspective, demonstrating that attention value vectors capture sentence semantics more effectively than hidden states. We propose Value Aggregation (VA), a simple method that pools token values across multiple layers and token indices. In a training-free setting, VA outperforms other LLM-based embeddings and even matches or surpasses the ensemble-based MetaEOL. Furthermore, we demonstrate that when paired with suitable prompts, the layer attention outputs can be interpreted as aligned weighted value vectors. Specifically, the attention scores of the last token function as the weights, while the output projection matrix ($W_O$) aligns these weighted value vectors with the common space of the LLM residual stream. This refined method, termed Aligned Weighted VA (AlignedWVA), achieves state-of-the-art performance among training-free LLM-based embeddings, outperforming the high-cost MetaEOL by a substantial margin. Finally, we highlight the potential of obtaining strong LLM embedding models through fine-tuning Value Aggregation.
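
The two pooling schemes named in the abstract above can be written out on tiny hand-made vectors. This is a sketch under simplifying assumptions, not the authors' code: a single attention head, plain mean pooling for VA, nested Python lists instead of real model activations, and hypothetical function names.

```python
def value_aggregation(values):
    """Mean-pool value vectors over layers and token positions.

    values[layer][token] is a list of floats, all of the same dimension.
    """
    dim = len(values[0][0])
    total = [0.0] * dim
    count = 0
    for layer in values:
        for vec in layer:
            for d in range(dim):
                total[d] += vec[d]
            count += 1
    return [t / count for t in total]


def aligned_weighted_values(layer_values, last_token_attention, w_o):
    """Weight each token's value vector by the last token's attention score,
    then project the pooled vector with the output matrix W_O."""
    dim = len(layer_values[0])
    pooled = [
        sum(a * v[d] for a, v in zip(last_token_attention, layer_values))
        for d in range(dim)
    ]
    return [sum(pooled[d] * w_o[d][j] for d in range(dim)) for j in range(len(w_o[0]))]


# Two layers, two tokens, two-dimensional values (made-up numbers).
values = [[[1.0, 0.0], [3.0, 2.0]], [[0.0, 2.0], [0.0, 0.0]]]
va_embedding = value_aggregation(values)

w_o = [[0.0, 1.0], [1.0, 0.0]]  # toy projection that swaps the two coordinates
awva_embedding = aligned_weighted_values(values[0], [0.25, 0.75], w_o)
```

The second function is just the usual single-head attention output for the last token, which is why the abstract can read attention outputs as "aligned weighted value vectors".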

Title: Provable Defense Framework for LLM Jailbreaks via Noise-Augumented Alignment

Authors: Zehua Cheng, Jianwei Yang, Wei Dai, Jiahao Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01587
Pdf URL: https://arxiv.org/pdf/2602.01587
Copy Paste: [[2602.01587]] Provable Defense Framework for LLM Jailbreaks via Noise-Augumented Alignment(https://arxiv.org/abs/2602.01587)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks that easily bypass empirical defenses like GCG. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into immutable structural prompts and mutable payloads to derive rigorous $\ell_0$-norm guarantees using the Hypergeometric distribution. To resolve performance degradation on sparse contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which transforms the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines which degrade utility to 74.3%. This framework provides a deterministic certificate of safety, ensuring that a model remains robust against all adversarial variants within a provable radius.
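
The role of the Hypergeometric distribution in an ablation-based certificate can be shown with a short calculation. The sketch below uses the standard counting argument for randomized ablation: if an attacker controls a fixed number of payload tokens and a random subset is kept, the chance that the subset avoids every attacked token is a ratio of binomial coefficients. How the paper stratifies prompts and payloads and turns this into its final bound is not stated in the abstract, so the function name and the numbers here are illustrative assumptions.

```python
from math import comb


def clean_sample_probability(n_tokens, n_adversarial, n_kept):
    """P(a uniformly random subset of n_kept payload tokens contains no adversarial token).

    This is the hypergeometric probability of drawing zero "bad" items.
    """
    return comb(n_tokens - n_adversarial, n_kept) / comb(n_tokens, n_kept)


# 10 payload tokens, attacker edits 2 of them, each ablated sample keeps 3.
p_clean = clean_sample_probability(10, 2, 3)  # C(8,3) / C(10,3) = 56 / 120
delta = 1.0 - p_clean                         # fraction of samples the attacker can influence
```

In randomized-ablation analyses, a prediction is typically certified when the vote margin of the ensemble exceeds what the attacker could shift through that `delta` fraction of samples.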

Title: Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles

Authors: Shaohan Wang, Benfeng Xu, Licheng Zhang, Mingxuan Du, Chiwei Zhu, Xiaorui Wang, Zhendong Mao, Yongdong Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01590
Pdf URL: https://arxiv.org/pdf/2602.01590
Copy Paste: [[2602.01590]] Wiki Live Challenge: Challenging Deep Research Agents with Expert-Level Wikipedia Articles(https://arxiv.org/abs/2602.01590)
Keywords: llm, agent
Abstract: Deep Research Agents (DRAs) have demonstrated remarkable capabilities in autonomous information retrieval and report generation, showing great potential to assist humans in complex research tasks. Current evaluation frameworks primarily rely on LLM-generated references or LLM-derived evaluation dimensions. While these approaches offer scalability, they often lack the reliability of expert-verified content and struggle to provide objective, fine-grained assessments of critical dimensions. To bridge this gap, we introduce Wiki Live Challenge (WLC), a live benchmark that leverages the newest Wikipedia Good Articles (GAs) as expert-level references. Wikipedia's strict standards for neutrality, comprehensiveness, and verifiability serve as a great challenge for DRAs, with GAs representing the pinnacle of which. We curate a dataset of 100 recent Good Articles and propose Wiki Eval, a comprehensive evaluation framework comprising a fine-grained evaluation method with 39 criteria for writing quality and rigorous metrics for factual verifiability. Extensive experiments on various DRA systems demonstrate a significant gap between current DRAs and human expert-level Wikipedia articles, validating the effectiveness of WLC in advancing agent research. We release our benchmark at this https URL

Title: The Art of Socratic Inquiry: A Framework for Proactive Template-Guided Therapeutic Conversation Generation

Authors: Mingwen Zhang, Minqiang Yang, Changsheng Ma, Yang Yu, Hui Bai, Chen Xu, Xiangzhen Kong, Bin Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01598
Pdf URL: https://arxiv.org/pdf/2602.01598
Copy Paste: [[2602.01598]] The Art of Socratic Inquiry: A Framework for Proactive Template-Guided Therapeutic Conversation Generation(https://arxiv.org/abs/2602.01598)
Keywords: language model, llm
Abstract: Proactive questioning, where therapists deliberately initiate structured, cognition-guiding inquiries, is a cornerstone of cognitive behavioral therapy (CBT). Yet, current psychological large language models (LLMs) remain overwhelmingly reactive, defaulting to empathetic but superficial responses that fail to surface latent beliefs or guide behavioral change. To bridge this gap, we propose the Socratic Inquiry Framework (SIF), a lightweight, plug-and-play therapeutic intent planner that transforms LLMs from passive listeners into active cognitive guides. SIF decouples when to ask (via Strategy Anchoring) from what to ask (via Template Retrieval), enabling context-aware, theory-grounded questioning without end-to-end retraining. Complementing SIF, we introduce Socratic-QA, a high-quality dataset of strategy-aligned Socratic sequences that provides explicit supervision for proactive reasoning. Experiments show that SIF significantly enhances proactive questioning frequency, conversational depth, and therapeutic alignment, marking a clear shift from reactive comfort to proactive exploration. Our work establishes a new paradigm for psychologically informed LLMs: not just to respond, but to guide.

Title: SEA-Guard: Culturally Grounded Multilingual Safeguard for Southeast Asia

Authors: Panuthep Tasawong, Jian Gang Ngui, Alham Fikri Aji, Trevor Cohn, Peerat Limkonchotiwat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01618
Pdf URL: https://arxiv.org/pdf/2602.01618
Copy Paste: [[2602.01618]] SEA-Guard: Culturally Grounded Multilingual Safeguard for Southeast Asia(https://arxiv.org/abs/2602.01618)
Keywords: agent
Abstract: Culturally aware safeguards are crucial for AI alignment in real-world settings, where safety extends beyond common sense and encompasses diverse local values, norms, and region-specific regulations. However, building large-scale, culturally grounded datasets is challenging due to limited resources and a scarcity of native annotators. Consequently, many safeguard models rely on machine translation of English datasets, often missing regional and cultural nuances. We present a novel agentic data-generation framework to scalably create authentic, region-specific safety datasets for Southeast Asia (SEA). On this foundation, we introduce the SEA-Guard family, the first multilingual safeguard models grounded in SEA cultural contexts. Evaluated across multiple benchmarks and cultural variants, SEA-Guard consistently outperforms existing safeguards at detecting regionally sensitive or harmful content while maintaining strong general safety performance.

Title: A2Eval: Agentic and Automated Evaluation for Embodied Brain

Authors: Shuai Zhang, Jiayu Hu, Zijie Chen, Zeyuan Ding, Yi Zhang, Yingji Zhang, Ziyi Zhou, Junwei Liao, Shengjie Zhou, Yong Dai, Zhenzhong Lan, Xiaozhu Ju
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01640
Pdf URL: https://arxiv.org/pdf/2602.01640
Copy Paste: [[2602.01640]] A2Eval: Agentic and Automated Evaluation for Embodied Brain(https://arxiv.org/abs/2602.01640)
Keywords: agent
Abstract: Current embodied VLM evaluation relies on static, expert-defined, manually annotated benchmarks that exhibit severe redundancy and coverage imbalance. This labor intensive paradigm drains computational and annotation resources, inflates costs, and distorts model rankings, ultimately stifling iterative development. To address this, we propose Agentic Automatic Evaluation (A2Eval), the first agentic framework that automates benchmark curation and evaluation through two collaborative agents. The Data Agent autonomously induces capability dimensions and assembles a balanced, compact evaluation suite, while the Eval Agent synthesizes and validates executable evaluation pipelines, enabling fully autonomous, high-fidelity assessment. Evaluated across 10 benchmarks and 13 models, A2Eval compresses evaluation suites by 85%, reduces overall computational costs by 77%, and delivers a 4.6x speedup while preserving evaluation quality. Crucially, A2Eval corrects systematic ranking biases, improves human alignment to Spearman's rho=0.85, and maintains high ranking fidelity (Kendall's tau=0.81), establishing a new standard for high-fidelity, low-cost embodied assessment. Our code and data will be public soon.

Title: Steering Vector Fields for Context-Aware Inference-Time Control in Large Language Models

Authors: Jiaqian Li, Yanshu Li, Kuan-Hao Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01654
Pdf URL: https://arxiv.org/pdf/2602.01654
Copy Paste: [[2602.01654]] Steering Vector Fields for Context-Aware Inference-Time Control in Large Language Models(https://arxiv.org/abs/2602.01654)
Keywords: language model, llm, prompt
Abstract: Steering vectors (SVs) offer a lightweight way to control large language models (LLMs) at inference time by shifting hidden activations, providing a practical middle ground between prompting and fine-tuning. Yet SVs can be unreliable in practice. Some concepts are unsteerable, and even when steering helps on average it can backfire for a non-trivial fraction of inputs. Reliability also degrades in long-form generation and multi-attribute steering. We take a geometric view of these failures. A static SV applies the same update vector everywhere in representation space, implicitly assuming that the concept-improving direction is constant across contexts. When the locally effective direction varies with the current activation, a single global vector can become misaligned, which yields weak or reversed effects. Guided by this perspective, we propose Steering Vector Fields (SVF), which learns a differentiable concept scoring function whose local gradient defines the steering direction at each activation, making interventions explicitly context-dependent. This formulation supports coordinated multi-layer interventions in a shared, aligned concept space, and enables efficient long-form and multi-attribute control within a unified framework. Across multiple LLMs and steering tasks, SVF delivers stronger and more reliable control, improving the practicality of inference-time steering.
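The contrast between a static steering vector and a gradient-defined steering field can be illustrated with a toy sketch. It assumes a hand-written quadratic scoring function whose gradient is known in closed form; in the abstract the scoring function is learned, so `score`, `score_grad`, `steer`, and the 2-D activations below are hypothetical stand-ins rather than the authors' implementation.

```python
import math

def score(h, center):
    # Toy concept score (assumption): higher when h is closer to `center`.
    return -sum((a - c) ** 2 for a, c in zip(h, center))

def score_grad(h, center):
    # Analytic gradient of the toy score with respect to the activation h.
    return [-2.0 * (a - c) for a, c in zip(h, center)]

def steer(h, center, alpha=0.5):
    # Context-dependent update: step of length alpha along the local gradient.
    g = score_grad(h, center)
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return list(h)
    return [a + alpha * x / norm for a, x in zip(h, g)]

center = [0.0, 0.0]
h_left, h_right = [-2.0, 0.0], [2.0, 0.0]

# A single global vector that happens to help h_left.
static_vector = [0.5, 0.0]
static_right = [a + b for a, b in zip(h_right, static_vector)]

field_left = steer(h_left, center)
field_right = steer(h_right, center)
```

In this toy setting the static vector moves `h_right` away from the concept center and lowers its score, while the gradient-based update raises the score for both activations because its direction depends on where the activation currently sits.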

Title: Scaling Search-Augmented LLM Reasoning via Adaptive Information Control

Authors: Siheng Xiong, Oguzhan Gungordu, Blair Johnson, James C. Kerce, Faramarz Fekri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01672
Pdf URL: https://arxiv.org/pdf/2602.01672
Copy Paste: [[2602.01672]] Scaling Search-Augmented LLM Reasoning via Adaptive Information Control(https://arxiv.org/abs/2602.01672)
Keywords: llm, agent
Abstract: Search-augmented reasoning agents interleave multi-step reasoning with external information retrieval, but uncontrolled retrieval often leads to redundant evidence, context saturation, and unstable learning. Existing approaches rely on outcome-based reinforcement learning (RL), which provides limited guidance for regulating information acquisition. We propose DeepControl, a framework for adaptive information control based on a formal notion of information utility, which measures the marginal value of retrieved evidence under a given reasoning state. Building on this utility, we introduce retrieval continuation and granularity control mechanisms that selectively regulate when to continue and stop retrieval, and how much information to expand. An annealed control strategy enables the agent to internalize effective information acquisition behaviors during training. Extensive experiments across seven benchmarks demonstrate that our method consistently outperforms strong baselines. In particular, our approach achieves average performance improvements of 9.4% and 8.6% on Qwen2.5-7B and Qwen2.5-3B, respectively, over strong outcome-based RL baselines, and consistently outperforms both retrieval-free and retrieval-based reasoning methods without explicit information control. These results highlight the importance of adaptive information control for scaling search-augmented reasoning agents to complex, real-world information environments.

Title: Counting Hypothesis: Potential Mechanism of In-Context Learning

Authors: Jung H. Lee, Sujith Vijayan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01687
Pdf URL: https://arxiv.org/pdf/2602.01687
Copy Paste: [[2602.01687]] Counting Hypothesis: Potential Mechanism of In-Context Learning(https://arxiv.org/abs/2602.01687)
Keywords: language model, llm, prompt
Abstract: In-Context Learning (ICL) indicates that large language models (LLMs) pretrained on a massive amount of data can learn specific tasks from the examples in input prompts. ICL is notable for two reasons. First, it does not need modification of LLMs' internal structure. Second, it enables LLMs to perform a wide range of tasks/functions with a few examples demonstrating a desirable task. ICL opens up new ways to utilize LLMs in more domains, but its underlying mechanisms remain poorly understood, making error correction and diagnosis extremely challenging. Thus, it is imperative that we better understand the limitations of ICL and how exactly LLMs support ICL. Inspired by ICL properties and LLMs' functional modules, we propose 'the counting hypothesis' of ICL, which suggests that LLMs' encoding strategy may underlie ICL, and provide supporting evidence.

Title: Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory

Authors: Langyuan Cui, Chun Kai Ling, Hwee Tou Ng
Subjects: cs.CL, cs.AI, cs.GT
Abstract URL: https://arxiv.org/abs/2602.01708
Pdf URL: https://arxiv.org/pdf/2602.01708
Copy Paste: [[2602.01708]] Game of Thought: Robust Information Seeking with Large Language Models Using Game Theory(https://arxiv.org/abs/2602.01708)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly deployed in real-world scenarios where they may lack sufficient information to complete a given task. In such settings, the ability to actively seek out missing information becomes a critical capability. Existing approaches to enhancing this ability often rely on simplifying assumptions that degrade worst-case performance. This is an issue with serious implications in high-stakes applications. In this work, we use the game of Twenty Questions to evaluate the information-seeking ability of LLMs. We introduce and formalize its adversarial counterpart, the Strategic Language Search (SLS) problem along with its variants as a two-player zero-sum extensive form game. We propose Game of Thought (GoT), a framework that applies game-theoretic techniques to approximate a Nash equilibrium (NE) strategy for the restricted variant of the game. Empirical results demonstrate that our approach consistently improves worst-case performance compared to (1) direct prompting-based methods and (2) heuristic-guided search methods across all tested settings.

Title: ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation

Authors: Xingshan Zeng, Lingzhi Wang, Weiwen Liu, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01709
Pdf URL: https://arxiv.org/pdf/2602.01709
Copy Paste: [[2602.01709]] ARTIS: Agentic Risk-Aware Test-Time Scaling via Iterative Simulation(https://arxiv.org/abs/2602.01709)
Keywords: language model, llm, agent
Abstract: Current test-time scaling (TTS) techniques enhance large language model (LLM) performance by allocating additional computation at inference time, yet they remain insufficient for agentic settings, where actions directly interact with external environments and their effects can be irreversible and costly. We propose ARTIS (Agentic Risk-Aware Test-Time Scaling via Iterative Simulation), a framework that decouples exploration from commitment by enabling test-time exploration through simulated interactions prior to real-world execution. This design allows extending inference-time computation to improve action-level reliability and robustness without incurring environmental risk. We further show that naive LLM-based simulators struggle to capture rare but high-impact failure modes, substantially limiting their effectiveness for agentic decision making. To address this limitation, we introduce a risk-aware tool simulator that emphasizes fidelity on failure-inducing actions via targeted data generation and rebalanced training. Experiments on multi-turn and multi-step agentic benchmarks demonstrate that iterative simulation substantially improves agent reliability, and that risk-aware simulation is essential for consistently realizing these gains across models and tasks.

Title: MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark

Authors: Mouath Abu-Daoud, Leen Kharouf, Omar El Hajj, Dana El Samad, Mariam Al-Omari, Jihad Mallat, Khaled Saleh, Nizar Habash, Farah E. Shamout
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01714
Pdf URL: https://arxiv.org/pdf/2602.01714
Copy Paste: [[2602.01714]] MedAraBench: Large-Scale Arabic Medical Question Answering Dataset and Benchmark(https://arxiv.org/abs/2602.01714)
Keywords: language model, gpt, llm
Abstract: Arabic remains one of the most underrepresented languages in natural language processing research, particularly in medical applications, due to the limited availability of open-source data and benchmarks. The lack of resources hinders efforts to evaluate and advance the multilingual capabilities of Large Language Models (LLMs). In this paper, we introduce MedAraBench, a large-scale dataset consisting of Arabic multiple-choice question-answer pairs across various medical specialties. We constructed the dataset by manually digitizing a large repository of academic materials created by medical professionals in the Arabic-speaking region. We then conducted extensive preprocessing and split the dataset into training and test sets to support future research efforts in the area. To assess the quality of the data, we adopted two frameworks, namely expert human evaluation and LLM-as-a-judge. Our dataset is diverse and of high quality, spanning 19 specialties and five difficulty levels. For benchmarking purposes, we assessed the performance of eight state-of-the-art open-source and proprietary models, such as GPT-5, Gemini 2.0 Flash, and Claude 4-Sonnet. Our findings highlight the need for further domain-specific enhancements. We release the dataset and evaluation scripts to broaden the diversity of medical data benchmarks, expand the scope of evaluation suites for LLMs, and enhance the multilingual capabilities of models for deployment in clinical settings.

Title: Mechanistic Indicators of Steering Effectiveness in Large Language Models

Authors: Mehdi Jafari, Hao Xue, Flora Salim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01716
Pdf URL: https://arxiv.org/pdf/2602.01716
Copy Paste: [[2602.01716]] Mechanistic Indicators of Steering Effectiveness in Large Language Models(https://arxiv.org/abs/2602.01716)
Keywords: language model, llm
Abstract: Activation-based steering enables Large Language Models (LLMs) to exhibit targeted behaviors by intervening on intermediate activations without retraining. Despite its widespread use, the mechanistic factors that govern when steering succeeds or fails remain poorly understood, as prior work has relied primarily on black-box outputs or LLM-based judges. In this study, we investigate whether the reliability of steering can be diagnosed using internal model signals. We focus on two information-theoretic measures: the entropy-derived Normalized Branching Factor (NBF), and the Kullback-Leibler (KL) divergence between steered activations and targeted concepts in the vocabulary space. We hypothesize that effective steering corresponds to structured entropy preservation and coherent KL alignment across decoding steps. Building on a reliability study demonstrating high inter-judge agreement between two architecturally distinct LLMs, we use LLM-generated annotations as ground truth and show that these mechanistic signals provide meaningful predictive power for identifying successful steering and estimating failure probability. We further introduce a stronger evaluation baseline for Contrastive Activation Addition (CAA) and Sparse Autoencoder-based steering, the two most widely adopted activation-steering methods.
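The two information-theoretic quantities named above can be sketched on small probability vectors. The KL divergence follows its standard definition. The abstract does not give the formula for the Normalized Branching Factor, so the version here (exponentiated entropy divided by vocabulary size) is only one plausible normalization and should be read as an assumption; all function names and the example distributions are hypothetical.

```python
import math

def entropy(p):
    # Shannon entropy in nats; zero-probability entries contribute nothing.
    return -sum(x * math.log(x) for x in p if x > 0)

def normalized_branching_factor(p):
    # ASSUMPTION: exp(entropy) / vocabulary size, so a uniform distribution
    # scores about 1 and a one-hot distribution scores 1 / len(p).
    return math.exp(entropy(p)) / len(p)

def kl_divergence(p, q, eps=1e-12):
    # Standard KL(p || q); q is floored at eps to avoid division by zero.
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

uniform = [0.25, 0.25, 0.25, 0.25]       # no preference among four tokens
peaked = [0.97, 0.01, 0.01, 0.01]        # strongly committed to one token

nbf_uniform = normalized_branching_factor(uniform)
nbf_peaked = normalized_branching_factor(peaked)
kl_same = kl_divergence(uniform, uniform)
kl_shifted = kl_divergence(peaked, uniform)
```

Tracking such values across decoding steps is one way to operationalize the "internal signals" the abstract refers to; how the authors aggregate them is not stated here.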

Title: COMI: Coarse-to-fine Context Compression via Marginal Information Gain

Authors: Jiwei Tang, Shilei Liu, Zhicheng Zhang, Yujin Yuan, Libin Zheng, Wenbo Su, Bo Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01719
Pdf URL: https://arxiv.org/pdf/2602.01719
Copy Paste: [[2602.01719]] COMI: Coarse-to-fine Context Compression via Marginal Information Gain(https://arxiv.org/abs/2602.01719)
Keywords: language model, llm, long context
Abstract: Large Language Models (LLMs) have demonstrated exceptional capabilities across diverse tasks. However, their deployment in long context scenarios remains hindered by computational inefficiency and information redundancy. Context compression methods address these challenges by significantly reducing input length and eliminating redundancy. We propose COMI, a coarse-to-fine adaptive context compression framework that jointly optimizes for semantic relevance and diversity under high compression rates. We introduce Marginal Information Gain (MIG), a metric defined as the relevance of a unit to the input query minus its semantic redundancy with other units, guiding the compression process to prioritize information that is both relevant and low redundant. The framework operates in two stages: (1) Coarse-Grained Group Reallocation, where the context is partitioned into groups and dynamically assigned compression rates based on inter-group MIG, ensuring compression budgets align with information value distribution; and (2) Fine-Grained Token Merging, where tokens within each group are fused via an intra-group MIG-based weighting mechanism, thereby preserving key semantics while avoiding the accumulation of redundancy. Extensive experiments across question-answering (e.g., NaturalQuestions, 2WikiMQA, HotpotQA and NarrativeQA), summarization (e.g., MultiNews) with various backbones (e.g., LLaMA-2-7B, Qwen2-7B) show that COMI outperforms existing baselines by a large margin, e.g., approximately 25-point Exact Match (EM) improvement under 32x compression constraint with Qwen2-7B on NaturalQuestions.
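The definition of Marginal Information Gain given above (relevance to the query minus semantic redundancy with other units) can be sketched as follows. This is a minimal illustration that assumes cosine similarity for both terms and takes redundancy as the maximum similarity to any other unit; the abstract does not specify either choice, and the function names and toy embeddings are hypothetical.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def marginal_information_gain(units, query):
    # MIG_i = relevance(unit_i, query) - redundancy(unit_i, other units).
    scores = []
    for i, u in enumerate(units):
        relevance = cosine(u, query)
        others = [cosine(u, v) for j, v in enumerate(units) if j != i]
        # ASSUMPTION: redundancy is the highest similarity to any other unit.
        redundancy = max(others) if others else 0.0
        scores.append(relevance - redundancy)
    return scores

query = [1.0, 1.0, 0.0]
units = [
    [1.0, 0.0, 0.0],  # relevant, but duplicated by the next unit
    [1.0, 0.0, 0.0],  # exact duplicate of the first unit
    [0.0, 1.0, 0.0],  # equally relevant and not redundant
]
mig = marginal_information_gain(units, query)
```

Under these assumptions the two duplicated units receive the same negative score, while the third unit keeps its full relevance, which matches the stated goal of prioritizing information that is both relevant and low in redundancy.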

Title: SafePred: A Predictive Guardrail for Computer-Using Agents via World Models

Authors: Yurun Chen, Zeyi Liao, Ping Yin, Taotao Xie, Keting Yin, Shengyu Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.01725
Pdf URL: https://arxiv.org/pdf/2602.01725
Copy Paste: [[2602.01725]] SafePred: A Predictive Guardrail for Computer-Using Agents via World Models(https://arxiv.org/abs/2602.01725)
Keywords: agent
Abstract: With the widespread deployment of Computer-using Agents (CUAs) in complex real-world environments, prevalent long-term risks often lead to severe and irreversible consequences. Most existing guardrails for CUAs adopt a reactive approach, constraining agent behavior only within the current observation space. While these guardrails can prevent immediate short-term risks (e.g., clicking on a phishing link), they cannot proactively avoid long-term risks: seemingly reasonable actions can lead to high-risk consequences that emerge with a delay (e.g., cleaning logs leads to future audits being untraceable), which reactive guardrails cannot identify within the current observation space. To address these limitations, we propose a predictive guardrail approach, with the core idea of aligning predicted future risks with current decisions. Based on this approach, we present SafePred, a predictive guardrail framework for CUAs that establishes a risk-to-decision loop to ensure safe agent behavior. SafePred supports two key abilities: (1) Short- and long-term risk prediction: by using safety policies as the basis for risk prediction, SafePred leverages the prediction capability of the world model to generate semantic representations of both short-term and long-term risks, thereby identifying and pruning actions that lead to high-risk states; (2) Decision optimization: translating predicted risks into actionable safe decision guidances through step-level interventions and task-level re-planning. Extensive experiments show that SafePred significantly reduces high-risk behaviors, achieving over 97.6% safety performance and improving task utility by up to 21.4% compared with reactive baselines.

Title: Enhancing Automated Essay Scoring with Three Techniques: Two-Stage Fine-Tuning, Score Alignment, and Self-Training

Authors: Hongseok Choi, Serynn Kim, Wencke Liermann, Jin Seong, Jin-Xia Huang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.01747
Pdf URL: https://arxiv.org/pdf/2602.01747
Copy Paste: [[2602.01747]] Enhancing Automated Essay Scoring with Three Techniques: Two-Stage Fine-Tuning, Score Alignment, and Self-Training(https://arxiv.org/abs/2602.01747)
Keywords: prompt
Abstract: Automated Essay Scoring (AES) plays a crucial role in education by providing scalable and efficient assessment tools. However, in real-world settings, the extreme scarcity of labeled data severely limits the development and practical adoption of robust AES systems. This study proposes a novel approach to enhance AES performance in both limited-data and full-data settings by introducing three key techniques. First, we introduce a Two-Stage fine-tuning strategy that leverages low-rank adaptations to better adapt an AES model to target prompt essays. Second, we introduce a Score Alignment technique to improve consistency between predicted and true score distributions. Third, we employ uncertainty-aware self-training using unlabeled data, effectively expanding the training set with pseudo-labeled samples while mitigating label noise propagation. We implement these three techniques on DualBERT and conduct extensive experiments on the ASAP++ dataset. In the 32-data setting, all three techniques improve performance, and their integration achieves 91.2% of the full-data performance trained on approximately 1,000 labeled samples. In addition, the proposed Score Alignment technique consistently improves performance in both limited-data and full-data settings: e.g., it achieves state-of-the-art results in the full-data setting when integrated into DualBERT.

Title: WorldCup Sampling for Multi-bit LLM Watermarking

Authors: Yidan Wang, Yubing Ren, Yanan Cao, Li Guo
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2602.01752
Pdf URL: https://arxiv.org/pdf/2602.01752
Copy Paste: [[2602.01752]] WorldCup Sampling for Multi-bit LLM Watermarking(https://arxiv.org/abs/2602.01752)
Keywords: language model, llm
Abstract: As large language models (LLMs) generate increasingly human-like text, watermarking offers a promising solution for reliable attribution beyond mere detection. While multi-bit watermarking enables richer provenance encoding, existing methods largely extend zero-bit schemes through seed-driven steering, leading to indirect information flow, limited effective capacity, and suboptimal decoding. In this paper, we propose WorldCup, a multi-bit watermarking framework for LLMs that treats sampling as a natural communication channel and embeds message bits directly into token selection via a hierarchical competition mechanism guided by complementary signals. Moreover, WorldCup further adopts entropy-aware modulation to preserve generation quality and supports robust message recovery through confidence-aware decoding. Comprehensive experiments show that WorldCup achieves a strong balance across capacity, detectability, robustness, text quality, and decoding efficiency, consistently outperforming prior baselines and laying a solid foundation for future LLM watermarking studies.

Title: Zero2Text: Zero-Training Cross-Domain Inversion Attacks on Textual Embeddings

Authors: Doohyun Kim, Donghwa Kang, Kyungjae Lee, Hyeongboo Baek, Brent Byunghoon Kang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.01757
Pdf URL: https://arxiv.org/pdf/2602.01757
Copy Paste: [[2602.01757]] Zero2Text: Zero-Training Cross-Domain Inversion Attacks on Textual Embeddings(https://arxiv.org/abs/2602.01757)
Keywords: llm, retrieval-augmented generation
Abstract: The proliferation of retrieval-augmented generation (RAG) has established vector databases as critical infrastructure, yet they introduce severe privacy risks via embedding inversion attacks. Existing paradigms face a fundamental trade-off: optimization-based methods require computationally prohibitive queries, while alignment-based approaches hinge on the unrealistic assumption of accessible in-domain training data. These constraints render them ineffective in strict black-box and cross-domain settings. To dismantle these barriers, we introduce Zero2Text, a novel training-free framework based on recursive online alignment. Unlike methods relying on static datasets, Zero2Text synergizes LLM priors with a dynamic ridge regression mechanism to iteratively align generation to the target embedding on-the-fly. We further demonstrate that standard defenses, such as differential privacy, fail to effectively mitigate this adaptive threat. Extensive experiments across diverse benchmarks validate Zero2Text; notably, on MS MARCO against the OpenAI victim model, it achieves 1.8x higher ROUGE-L and 6.4x higher BLEU-2 scores compared to baselines, recovering sentences from unknown domains without a single leaked data pair.
摘要：检索增强生成（RAG）的激增已将矢量数据库确立为关键基础设施，但它们通过嵌入反转攻击引入了严重的隐私风险。现有的范例面临着一个基本的权衡：基于优化的方法需要计算上禁止的查询，而基于对齐的方法取决于可访问的域内训练数据的不切实际的假设。这些限制使得它们在严格的黑盒和跨域设置中无效。为了消除这些障碍，我们引入了 Zero2Text，这是一种基于递归在线对齐的新型免训练框架。与依赖静态数据集的方法不同，Zero2Text 将 LLM 先验与动态岭回归机制相结合，以迭代方式将生成过程与动态目标嵌入对齐。我们进一步证明，标准防御（例如差异隐私）无法有效减轻这种自适应威胁。跨不同基准的大量实验验证了 Zero2Text；值得注意的是，在针对 OpenAI 受害者模型的 MS MARCO 上，与基线相比，它的 ROUGE-L 分数提高了 1.8 倍，BLEU-2 分数提高了 6.4 倍，从未知域中恢复了句子，而没有单个泄漏的数据对。

Title: : One LLM Token for Explicit Graph Structural Understanding

Authors: Jingyao Wu, Bin Lu, Zijun Di, Xiaoying Gan, Meng Jin, Luoyi Fu, Xinbing Wang, Chenghu Zhou
Subjects: cs.CL, cs.AI, cs.NI
Abstract URL: https://arxiv.org/abs/2602.01771
Pdf URL: https://arxiv.org/pdf/2602.01771
Copy Paste: [[2602.01771]] : One LLM Token for Explicit Graph Structural Understanding(https://arxiv.org/abs/2602.01771)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models show great potential in unstructured data understanding, but still face significant challenges with graphs due to their structural hallucination. Existing approaches mainly either verbalize graphs into natural language, which leads to excessive token consumption and scattered attention, or transform graphs into trainable continuous embeddings (i.e., soft prompts), which exhibit severe misalignment with original text tokens. To solve this problem, we propose to incorporate one special token to fully represent the Structure Of Graph within a unified token space, facilitating explicit topology input and structural information sharing. Specifically, we propose a topology-aware structural tokenizer that maps each graph topology into a highly selective single token. Afterwards, we construct a set of hybrid structure Question-Answering corpora to align new structural tokens with existing text tokens. This approach empowers LLMs to understand, generate, and reason in a concise and accurate manner. Extensive experiments on five graph-level benchmarks demonstrate the superiority of our method, achieving a performance improvement of 9.9% to 41.4% compared to the baselines while exhibiting interpretability and consistency. Furthermore, our method provides a flexible extension to node-level tasks, enabling both global and local structural understanding. The codebase is publicly available at this https URL.
摘要：大型语言模型在非结构化数据理解方面显示出巨大潜力，但由于其结构幻觉，仍然面临着图的重大挑战。现有的方法主要要么将图语言化为自然语言，这会导致过度的标记消耗和注意力分散，要么将图转换为可训练的连续嵌入（即软提示），但与原始文本标记严重不一致。为了解决这个问题，我们建议合并一个特殊的令牌来在统一的令牌空间内完整地表示图的结构，从而促进显式的拓扑输入和结构信息共享。具体来说，我们提出了一种拓扑感知的结构标记器，它将每个图拓扑映射为高度选择性的单个标记。然后，我们构建了一组混合结构问答语料库，将新的结构标记与现有的文本标记对齐。通过这种方法，使法学硕士能够以简洁准确的方式理解、生成和推理。对五个图形级基准的广泛实验证明了我们方法的优越性，与基线相比，性能提高了 9.9% 至 41.4%，同时表现出可解释性和一致性。此外，我们的方法提供了对节点级任务的灵活扩展，从而实现了全局和局部结构理解。代码库可通过此 https URL 公开获取。

Title: Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Model

Authors: Kangtao Lv, Jiwei Tang, Langming Liu, Haibin Chen, Weidong Zhang, Shilei Liu, Yongwei Wang, Yujin Yuan, Wenbo Su, Bo Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01778
Pdf URL: https://arxiv.org/pdf/2602.01778
Copy Paste: [[2602.01778]] Data Distribution Matters: A Data-Centric Perspective on Context Compression for Large Language Model(https://arxiv.org/abs/2602.01778)
Keywords: language model, llm
Abstract: The deployment of Large Language Models (LLMs) in long-context scenarios is hindered by computational inefficiency and significant information redundancy. Although recent work has widely adopted context compression to address these challenges, existing research focuses only on model-side improvements; the impact of the data distribution itself on context compression remains largely unexplored. To bridge this gap, we are the first to adopt a data-centric perspective to systematically investigate how data distribution impacts compression quality along two dimensions: input data and intrinsic data (i.e., the model's internal pretrained knowledge). We evaluate the semantic integrity of compressed representations using an autoencoder-based framework. Our experimental results reveal that: (1) encoder-measured input entropy negatively correlates with compression quality, while decoder-measured entropy shows no significant relationship under a frozen-decoder setting; and (2) the gap between the intrinsic data of the encoder and decoder significantly diminishes compression gains and is hard to mitigate. Based on these findings, we further present practical guidelines to optimize compression gains.
摘要：计算效率低下和大量信息冗余阻碍了大型语言模型（LLM）在长上下文场景中的部署。尽管最近的进展已广泛采用上下文压缩来应对这些挑战，但现有研究仅关注模型端的改进，数据分布本身对上下文压缩的影响在很大程度上仍未被探索。为了弥补这一差距，我们率先采用以数据为中心的视角，系统地研究数据分布如何影响压缩质量，包括两个维度：输入数据和内在数据（即模型内部的预训练知识）。我们使用基于自动编码器的框架来评估压缩表示的语义完整性，以系统地研究它。我们的实验结果表明：（1）编码器测量的输入熵与压缩质量负相关，而解码器测量的熵在冻结解码器设置下没有显示出显着关系； (2)编码器和解码器的固有数据之间的差距显着降低了压缩增益，这是很难缓解的。基于这些发现，我们进一步提出了优化压缩增益的实用指南。

Title: CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding

Authors: Yuling Shi, Chaoxiang Xie, Zhensu Sun, Yeheng Chen, Chenxu Zhang, Longfei Yun, Chengcheng Wan, Hongyu Zhang, David Lo, Xiaodong Gu
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2602.01785
Pdf URL: https://arxiv.org/pdf/2602.01785
Copy Paste: [[2602.01785]] CodeOCR: On the Effectiveness of Vision Language Models in Code Understanding(https://arxiv.org/abs/2602.01785)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have achieved remarkable success in source code understanding, yet as software systems grow in scale, computational efficiency has become a critical bottleneck. Currently, these models rely on a text-based paradigm that treats source code as a linear sequence of tokens, which leads to a linear increase in context length and associated computational costs. The rapid advancement of Multimodal LLMs (MLLMs) introduces an opportunity to optimize efficiency by representing source code as rendered images. Unlike text, which is difficult to compress without losing semantic meaning, the image modality is inherently suitable for compression. By adjusting resolution, images can be scaled to a fraction of their original token cost while remaining recognizable to vision-capable models. To explore the feasibility of this approach, we conduct the first systematic study on the effectiveness of MLLMs for code understanding. Our experiments reveal that: (1) MLLMs can effectively understand code with substantial token reduction, achieving up to 8x compression; (2) MLLMs can effectively leverage visual cues such as syntax highlighting, improving code completion performance under 4x compression; and (3) code-understanding tasks like clone detection exhibit exceptional resilience to visual compression, with some compression ratios even slightly outperforming raw text inputs. Our findings highlight both the potential and current limitations of MLLMs in code understanding, and point to image-modality code representation as a pathway to more efficient inference.
摘要：大型语言模型（LLM）在源代码理解方面取得了显着的成功，但随着软件系统规模的增长，计算效率已成为关键瓶颈。目前，这些模型依赖于基于文本的范例，将源代码视为令牌的线性序列，这导致上下文长度和相关计算成本线性增加。多模态 LLM (MLLM) 的快速发展带来了通过将源代码表示为渲染图像来优化效率的机会。与难以在不丢失语义的情况下压缩的文本不同，图像模态本质上适合压缩。通过调整分辨率，图像可以缩放到原始代币成本的一小部分，同时保持视觉模型的可识别性。为了探索这种方法的可行性，我们对 MLLM 在代码理解方面的有效性进行了首次系统研究。我们的实验表明：（1）MLLM 可以有效地理解代码，并大幅减少标记，实现高达 8 倍的压缩； (2) MLLM 可以有效利用语法突出显示等视觉提示，提高 4 倍压缩下的代码完成性能； (3) 克隆检测等代码理解任务对视觉压缩表现出卓越的弹性，某些压缩率甚至略优于原始文本输入。我们的研究结果强调了 MLLM 在代码理解方面的潜在和当前局限性，这指出了向图像模态代码表示的转变作为更有效推理的途径。

Title: Sentence Curve Language Models

Authors: DongNyeong Heo, Heelyoul Choi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.01807
Pdf URL: https://arxiv.org/pdf/2602.01807
Copy Paste: [[2602.01807]] Sentence Curve Language Models(https://arxiv.org/abs/2602.01807)
Keywords: language model
Abstract: Language models (LMs) are a central component of modern AI systems, and diffusion-based language models (DLMs) have recently emerged as a competitive alternative. Both paradigms rely on word embeddings not only to represent the input sentence, but also to represent the target sentence that backbone models are trained to predict. We argue that such static embedding of the target word is insensitive to neighboring words, encouraging locally accurate word prediction while neglecting global structure across the target sentence. To address this limitation, we propose a continuous sentence representation, termed the sentence curve, defined as a spline curve whose control points affect multiple words in the sentence. Based on this representation, we introduce the sentence curve language model (SCLM), which extends DLMs to predict sentence curves instead of static word embeddings. We theoretically show that sentence curve prediction induces a regularization effect that promotes global structure modeling, and characterize how different sentence curve types affect this behavior. Empirically, SCLM achieves SOTA performance among DLMs on IWSLT14 and WMT14, shows stable training without burdensome knowledge distillation, and demonstrates promising potential compared to discrete DLMs on LM1B.
摘要：语言模型 (LM) 是现代人工智能系统的核心组成部分，基于扩散的语言模型 (DLM) 最近已成为一种有竞争力的替代方案。这两种范式都依赖词嵌入来表示输入句子，而且还表示骨干模型训练预测的目标句子。我们认为，目标单词的这种静态嵌入对相邻单词不敏感，鼓励局部准确的单词预测，同时忽略目标句子的全局结构。为了解决这个限制，我们提出了一种连续的句子表示，称为句子曲线，定义为样条曲线，其控制点影响句子中的多个单词。基于这种表示，我们引入了句子曲线语言模型（SCLM），它扩展了 DLM 来预测句子曲线而不是静态词嵌入。我们从理论上证明，句子曲线预测会产生正则化效应，从而促进全局结构建模，并描述不同句子曲线类型如何影响这种行为。根据经验，SCLM 在 IWSLT14 和 WMT14 上的 DLM 中实现了 SOTA 性能，显示出稳定的训练，无需繁琐的知识蒸馏，并且与 LM1B 上的离散 DLM 相比，展现出广阔的潜力。
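Illustrative note on the sentence-curve idea: the abstract does not say which spline family SCLM uses, so the sketch below assumes a Bezier curve purely for illustration. It shows how a handful of control points can define one vector per word position, so that moving a single control point shifts several word vectors at once. The function names `bezier_point` and `sentence_curve` are hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def bezier_point(control_points: List[Vector], t: float) -> Vector:
    """Evaluate a Bezier curve at t in [0, 1] using de Casteljau's algorithm.

    Assumption: a Bezier curve stands in for the paper's unspecified spline.
    """
    pts = [list(p) for p in control_points]
    while len(pts) > 1:
        # Linearly interpolate between each pair of neighbouring points.
        pts = [
            [(1 - t) * a + t * b for a, b in zip(p, q)]
            for p, q in zip(pts, pts[1:])
        ]
    return pts[0]


def sentence_curve(control_points: List[Vector], num_words: int) -> List[Vector]:
    """Sample one vector per word position, evenly spaced along the curve."""
    if num_words == 1:
        return [bezier_point(control_points, 0.0)]
    return [
        bezier_point(control_points, i / (num_words - 1))
        for i in range(num_words)
    ]


# Three 2-D control points describing a three-word "sentence".
word_vectors = sentence_curve([[0.0, 0.0], [1.0, 2.0], [2.0, 0.0]], 3)
```

Because every sampled position is a blend of all control points, the middle control point here influences the interior word vector without being equal to it, which is the sense in which a control point "affects multiple words".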

Title: AXE: Low-Cost Cross-Domain Web Structured Information Extraction

Authors: Abdelrahman Mansour, Khaled W. Alshaer, Moataz Elsaban
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01838
Pdf URL: https://arxiv.org/pdf/2602.01838
Copy Paste: [[2602.01838]] AXE: Low-Cost Cross-Domain Web Structured Information Extraction(https://arxiv.org/abs/2602.01838)
Keywords: language model, llm
Abstract: Extracting structured data from the web is often a trade-off between the brittle nature of manual heuristics and the prohibitive cost of Large Language Models. We introduce AXE (Adaptive X-Path Extractor), a pipeline that rethinks this process by treating the HTML DOM as a tree that needs pruning rather than just a wall of text to be read. AXE uses a specialized "pruning" mechanism to strip away boilerplate and irrelevant nodes, leaving behind a distilled, high-density context that allows a tiny 0.6B LLM to generate precise, structured outputs. To keep the model honest, we implement Grounded XPath Resolution (GXR), ensuring every extraction is physically traceable to a source node. Despite its low footprint, AXE achieves state-of-the-art zero-shot performance, outperforming several much larger, fully-trained alternatives with an F1 score of 88.1% on the SWDE dataset. By releasing our specialized adaptors, we aim to provide a practical, cost-effective path for large-scale web information extraction.
摘要：从网络中提取结构化数据通常是手动启发式方法的脆弱性和大型语言模型的高昂成本之间的权衡。我们引入了 AX（自适应 X 路径提取器），这是一种管道，它通过将 HTML DOM 视为需要修剪的树而不仅仅是要读取的文本墙来重新思考此过程。 AX 使用专门的“修剪”机制来去除样板文件和不相关的节点，留下经过提炼的高密度上下文，允许微小的 0.6B LLM 生成精确的结构化输出。为了保持模型的诚实性，我们实施了接地 XPath 解析 (GXR)，确保每次提取都可以在物理上追溯到源节点。尽管占地面积较小，但 AX 仍实现了最先进的零样本性能，优于几个更大的、经过充分训练的替代方案，在 SWDE 数据集上的 F1 分数为 88.1%。通过发布我们的专用适配器，我们的目标是为大规模网络信息提取提供实用、经济高效的路径。

Title: Read As Human: Compressing Context via Parallelizable Close Reading and Skimming

Authors: Jiwei Tang, Shilei Liu, Zhicheng Zhang, Qingsong Lv, Runsong Zhao, Tingwei Lu, Langming Liu, Haibin Chen, Yujin Yuan, Hai-Tao Zheng, Wenbo Su, Bo Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01840
Pdf URL: https://arxiv.org/pdf/2602.01840
Copy Paste: [[2602.01840]] Read As Human: Compressing Context via Parallelizable Close Reading and Skimming(https://arxiv.org/abs/2602.01840)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) demonstrate exceptional capability across diverse tasks. However, their deployment in long-context scenarios is hindered by two challenges: computational inefficiency and redundant information. We propose RAM (Read As HuMan), a context compression framework that adopts an adaptive hybrid reading strategy, to address these challenges. Inspired by human reading behavior (i.e., close reading important content while skimming less relevant content), RAM partitions the context into segments and encodes them with the input query in parallel. High-relevance segments are fully retained (close reading), while low-relevance ones are compressed under query guidance into compact summary vectors (skimming). Both explicit textual segments and implicit summary vectors are concatenated and fed into the decoder to achieve both superior performance and natural language format interpretability. To refine the decision boundary between close reading and skimming, we further introduce a contrastive learning objective based on positive and negative query-segment pairs. Experiments demonstrate that RAM outperforms existing baselines on multiple question answering and summarization benchmarks across two backbones, while delivering up to a 12x end-to-end speedup on long inputs (average length 16K; maximum length 32K).
摘要：大型语言模型 (LLM) 在不同的任务中展示了卓越的能力。然而，它们在长上下文场景中的部署受到两个挑战的阻碍：计算效率低下和冗余信息。我们提出 RAM（Read As HuMan），一种采用自适应混合读取策略的上下文压缩框架来应对这些挑战。受人类阅读行为（即仔细阅读重要内容，同时略读不太相关的内容）的启发，RAM 将上下文划分为多个片段，并使用输入查询并行对它们进行编码。高相关性片段被完全保留（仔细阅读），而低相关性片段则被查询引导压缩为紧凑的摘要向量（略读）。显式文本片段和隐式摘要向量都被连接并输入解码器，以实现卓越的性能和自然语言格式的可解释性。为了细化精读和略读之间的决策边界，我们进一步引入了基于正负查询段对的对比学习目标。实验表明，RAM 在跨两个骨干网的多个问答和摘要基准测试中优于现有基准，同时在长输入（平均长度 16K；最大长度 32K）上提供高达 12 倍的端到端加速。
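Illustrative note on the close-reading/skimming split: the sketch below mirrors only the control flow described in the abstract (partition, score against the query, keep or compress). The scorer and the "summary vector" are toy stand-ins and are assumptions: RAM uses learned encoders, whereas here relevance is plain word overlap and the summary is a small count vector. All names (`split_segments`, `relevance`, `toy_summary`, `compress_context`) are hypothetical.

```python
from typing import List, Set, Tuple, Union

Item = Tuple[str, Union[str, List[int]]]


def split_segments(words: List[str], size: int) -> List[List[str]]:
    """Partition the context into fixed-size word segments."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def relevance(segment: List[str], query_words: Set[str]) -> float:
    """Toy scorer (assumption): fraction of query words found in the segment."""
    if not query_words:
        return 0.0
    present = set(segment)
    return sum(1 for w in query_words if w in present) / len(query_words)


def toy_summary(segment: List[str], dim: int = 4) -> List[int]:
    """Placeholder for a learned summary vector: counts of word lengths mod dim."""
    vec = [0] * dim
    for w in segment:
        vec[len(w) % dim] += 1
    return vec


def compress_context(context: str, query: str, size: int = 7,
                     threshold: float = 0.5) -> List[Item]:
    """Keep high-relevance segments as text; replace the rest by summaries."""
    query_words = set(query.lower().split())
    output: List[Item] = []
    # Each segment is handled independently, so this loop could run in parallel.
    for seg in split_segments(context.lower().split(), size):
        if relevance(seg, query_words) >= threshold:
            output.append(("text", " ".join(seg)))        # close reading
        else:
            output.append(("summary", toy_summary(seg)))  # skimming
    return output
```

The mixed list of text items and summary items corresponds to the concatenated input that the abstract says is passed to the decoder.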

Title: PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning

Authors: Langming Liu, Kangtao Lv, Haibin Chen, Weidong Zhang, Yejing Wang, Shilei Liu, Xin Tong, Yujin Yuan, Yongwei Wang, Wenbo Su, Bo Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01875
Pdf URL: https://arxiv.org/pdf/2602.01875
Copy Paste: [[2602.01875]] PretrainRL: Alleviating Factuality Hallucination of Large Language Models at the Beginning(https://arxiv.org/abs/2602.01875)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs), despite their powerful capabilities, suffer from factual hallucinations where they generate verifiable falsehoods. We identify a root of this issue: the imbalanced data distribution in the pretraining corpus, which leads to a state of "low-probability truth" and "high-probability falsehood". Recent approaches, such as teaching models to say "I don't know" or post-hoc knowledge editing, either evade the problem or face catastrophic forgetting. To address this issue from its root, we propose PretrainRL, a novel framework that integrates reinforcement learning into the pretraining phase to consolidate factual knowledge. The core principle of PretrainRL is "debiasing then learning." It actively reshapes the model's probability distribution by down-weighting high-probability falsehoods, thereby making "room" for low-probability truths to be learned effectively. To enable this, we design an efficient negative sampling strategy to discover these high-probability falsehoods and introduce novel metrics to evaluate the model's probabilistic state concerning factual knowledge. Extensive experiments on three public benchmarks demonstrate that PretrainRL significantly alleviates factual hallucinations and outperforms state-of-the-art methods.
摘要：大型语言模型 (LLM) 尽管功能强大，但仍会产生事实幻觉，产生可验证的谎言。我们找到了这个问题的根源：预训练语料库中的数据分布不平衡，导致出现“低概率真实”和“高概率虚假”的状态。最近的方法，例如教学模型说“我不知道”或事后知识编辑，要么逃避问题，要么面临灾难性遗忘。为了从根本上解决这个问题，我们提出了 \textbf{PretrainRL}，这是一种将强化学习集成到预训练阶段以巩固事实知识的新颖框架。 PretrainRL 的核心原则是“\textbf{去偏然后学习}”。它通过降低高概率错误的权重来主动重塑模型的概率分布，从而为有效学习低概率事实腾出“空间”。为了实现这一点，我们设计了一种有效的负采样策略来发现这些高概率的谎言，并引入新的指标来评估模型有关事实知识的概率状态。对三个公共基准的广泛实验表明，PretrainRL 显着减轻了事实幻觉，并且优于最先进的方法。

Title: ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support

Authors: Tiantian Chen, Jiaqi Lu, Ying Shen, Lin Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01885
Pdf URL: https://arxiv.org/pdf/2602.01885
Copy Paste: [[2602.01885]] ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support(https://arxiv.org/abs/2602.01885)
Keywords: language model, llm, hallucination, agent
Abstract: Large Language Models (LLMs) have shown strong potential as conversational agents. Yet, their effectiveness remains limited by deficiencies in robust long-term memory, particularly in complex, long-term web-based services such as online emotional support. However, existing long-term dialogue benchmarks primarily focus on static and explicit fact retrieval, failing to evaluate agents in critical scenarios where user information is dispersed, implicit, and continuously evolving. To address this gap, we introduce ES-MemEval, a comprehensive benchmark that systematically evaluates five core memory capabilities: information extraction, temporal reasoning, conflict detection, abstention, and user modeling, in long-term emotional support settings, covering question answering, summarization, and dialogue generation tasks. To support the benchmark, we also propose EvoEmo, a multi-session dataset for personalized long-term emotional support that captures fragmented, implicit user disclosures and evolving user states. Extensive experiments on open-source long-context, commercial, and retrieval-augmented (RAG) LLMs show that explicit long-term memory is essential for reducing hallucinations and enabling effective personalization. At the same time, RAG improves factual consistency but struggles with temporal dynamics and evolving user states. These findings highlight both the potential and limitations of current paradigms and motivate more robust integration of memory and retrieval for long-term personalized dialogue systems.
摘要：大型语言模型（LLM）已显示出作为对话代理的强大潜力。然而，它们的有效性仍然受到强大的长期记忆缺陷的限制，特别是在复杂的、长期的基于网络的服务（例如在线情感支持）中。然而，现有的长期对话基准主要侧重于静态和显式的事实检索，无法在用户信息分散、隐式和不断演变的关键场景中评估代理。为了解决这一差距，我们引入了 ES-MemEval，这是一个综合基准测试，它在长期情感支持环境中系统地评估五种核心记忆能力：信息提取、时间推理、冲突检测、弃权和用户建模，涵盖问答、摘要和对话生成任务。为了支持该基准，我们还提出了 EvoEmo，这是一个用于个性化长期情感支持的多会话数据集，可捕获分散的、隐式的用户披露和不断变化的用户状态。对开源长上下文、商业和检索增强 (RAG) 法学硕士的大量实验表明，外显长期记忆对于减少幻觉和实现有效的个性化至关重要。与此同时，RAG 提高了事实一致性，但在时间动态和不断变化的用户状态方面遇到了困难。这些发现强调了当前范式的潜力和局限性，并激发了长期个性化对话系统的记忆和检索的更强大的整合。

Title: GuideWeb: A Benchmark for Automatic In-App Guide Generation on Real-World Web UIs

Authors: Chengguang Gan, Yoshihiro Tsujii, Yunhao Liang, Tatsunori Mori, Shiwen Ni, Hiroki Itoh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01917
Pdf URL: https://arxiv.org/pdf/2602.01917
Copy Paste: [[2602.01917]] GuideWeb: A Benchmark for Automatic In-App Guide Generation on Real-World Web UIs(https://arxiv.org/abs/2602.01917)
Keywords: agent
Abstract: Digital Adoption Platforms (DAPs) provide web-based overlays that deliver operation guidance and contextual hints to help users navigate complex websites. Although modern DAP tools enable non-experts to author such guidance, maintaining these guides remains labor-intensive because website layouts and functionalities evolve continuously, which requires repeated manual updates and re-annotation. In this work, we introduce GuideWeb, a new benchmark for automatic in-app guide generation on real-world web UIs. GuideWeb formulates the task as producing page-level guidance by selecting guide target elements grounded in the webpage and generating concise guide text aligned with user intent. We also propose a comprehensive evaluation suite that jointly measures the accuracy of guide target element selection and the quality of generated intents and guide texts. Experiments show that our proposed GuideWeb Agent achieves 30.79% accuracy in guide target element prediction, while obtaining BLEU scores of 44.94 for intent generation and 21.34 for guide-text generation. Existing baselines perform substantially worse, which highlights that automatic guide generation remains challenging and that further advances are necessary before such systems can be reliably deployed in real-world settings.
摘要：数字采用平台 (DAP) 提供基于 Web 的覆盖层，提供操作指导和上下文提示，帮助用户浏览复杂的网站。尽管现代 DAP 工具使非专家能够编写此类指南，但维护这些指南仍然是劳动密集型的，因为网站布局和功能不断发展，这需要重复的手动更新和重新注释。在这项工作中，我们引入了 \textbf{GuideWeb}，这是在真实 Web UI 上自动生成应用内指南的新基准。 GuideWeb 通过选择基于网页的 \textbf{guide target elements} 并生成与用户意图一致的简洁指南文本，将任务制定为生成页面级指南。我们还提出了一个综合评估套件，联合衡量指南目标元素选择的准确性以及生成的意图和指南文本的质量。实验表明，我们提出的 \textbf{GuideWeb Agent} 在引导目标元素预测方面达到了 \textbf{30.79\%} 准确率，同时在意图生成方面获得了 \textbf{44.94} 的 BLEU 分数，在引导文本生成方面获得了 \textbf{21.34} 的 BLEU 分数。现有的基线性能要差得多，这凸显了自动指南生成仍然具有挑战性，并且在此类系统能够可靠地部署在现实世界中之前，需要进一步的进步。

Title: From Code-Centric to Concept-Centric: Teaching NLP with LLM-Assisted "Vibe Coding"

Authors: Hend Al-Khalifa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01919
Pdf URL: https://arxiv.org/pdf/2602.01919
Copy Paste: [[2602.01919]] From Code-Centric to Concept-Centric: Teaching NLP with LLM-Assisted "Vibe Coding"(https://arxiv.org/abs/2602.01919)
Keywords: language model, llm, prompt
Abstract: The rapid advancement of Large Language Models (LLMs) presents both challenges and opportunities for Natural Language Processing (NLP) education. This paper introduces "Vibe Coding," a pedagogical approach that leverages LLMs as coding assistants while maintaining focus on conceptual understanding and critical thinking. We describe the implementation of this approach in a senior-level undergraduate NLP course, where students completed seven labs using LLMs for code generation while being assessed primarily on conceptual understanding through critical reflection questions. Analysis of end-of-course feedback from 19 students reveals high satisfaction (mean scores 4.4-4.6/5.0) across engagement, conceptual learning, and assessment fairness. Students particularly valued the reduced cognitive load from debugging, enabling deeper focus on NLP concepts. However, challenges emerged around time constraints, LLM output verification, and the need for clearer task specifications. Our findings suggest that when properly structured with mandatory prompt logging and reflection-based assessment, LLM-assisted learning can shift focus from syntactic fluency to conceptual mastery, preparing students for an AI-augmented professional landscape.
摘要：大型语言模型（LLM）的快速发展为自然语言处理（NLP）教育带来了挑战和机遇。本文介绍了“Vibe Coding”，这是一种利用法学硕士作为编码助手，同时保持对概念理解和批判性思维的关注的教学方法。我们描述了这种方法在高级本科 NLP 课程中的实施，其中学生使用法学硕士完成了七个实验，用于代码生成，同时主要通过批判性反思问题对概念理解进行评估。对 19 名学生的课程结束反馈的分析表明，他们在参与度、概念学习和评估公平性方面都非常满意（平均分 4.4-4.6/5.0）。学生们特别重视调试所减少的认知负荷，从而能够更深入地关注 NLP 概念。然而，围绕时间限制、法学硕士输出验证以及更清晰的任务规范的需求出现了挑战。我们的研究结果表明，当通过强制性即时记录和基于反思的评估进行适当的结构时，法学硕士辅助学习可以将重点从句法流畅性转移到概念掌握上，为学生进入人工智能增强的专业环境做好准备。

Title: Breaking the Static Graph: Context-Aware Traversal for Robust Retrieval-Augmented Generation

Authors: Kwun Hang Lau, Fangyuan Zhang, Boyu Ruan, Yingli Zhou, Qintian Guo, Ruiyuan Zhang, Xiaofang Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.01965
Pdf URL: https://arxiv.org/pdf/2602.01965
Copy Paste: [[2602.01965]] Breaking the Static Graph: Context-Aware Traversal for Robust Retrieval-Augmented Generation(https://arxiv.org/abs/2602.01965)
Keywords: retrieval-augmented generation
Abstract: Recent advances in Retrieval-Augmented Generation (RAG) have shifted from simple vector similarity to structure-aware approaches like HippoRAG, which leverage Knowledge Graphs (KGs) and Personalized PageRank (PPR) to capture multi-hop dependencies. However, these methods suffer from a "Static Graph Fallacy": they rely on fixed transition probabilities determined during indexing. This rigidity ignores the query-dependent nature of edge relevance, causing semantic drift where random walks are diverted into high-degree "hub" nodes before reaching critical downstream evidence. Consequently, models often achieve high partial recall but fail to retrieve the complete evidence chain required for multi-hop queries. To address this, we propose CatRAG, Context-Aware Traversal for robust RAG, a framework that builds on the HippoRAG 2 architecture and transforms the static KG into a query-adaptive navigation structure. We introduce a multi-faceted framework to steer the random walk: (1) Symbolic Anchoring, which injects weak entity constraints to regularize the random walk; (2) Query-Aware Dynamic Edge Weighting, which dynamically modulates graph structure to prune irrelevant paths while amplifying those aligned with the query's intent; and (3) Key-Fact Passage Weight Enhancement, a cost-efficient bias that structurally anchors the random walk to likely evidence. Experiments across four multi-hop benchmarks demonstrate that CatRAG consistently outperforms state-of-the-art baselines. Our analysis reveals that while standard Recall metrics show modest gains, CatRAG achieves substantial improvements in reasoning completeness, that is, the capacity to recover the entire evidence path without gaps. These results reveal that our approach effectively bridges the gap between retrieving partial context and enabling fully grounded reasoning. Resources are available at this https URL.
摘要：检索增强生成 (RAG) 的最新进展已从简单的向量相似性转向 HippoRAG 等结构感知方法，后者利用知识图 (KG) 和个性化 PageRank (PPR) 来捕获多跳依赖性。然而，这些方法存在“静态图谬误”：它们依赖于索引期间确定的固定转移概率。这种僵化忽略了边缘相关性的查询依赖性质，导致语义漂移，其中随机游走在到达关键下游证据之前被转移到高度“中心”节点。因此，模型通常会实现较高的部分召回率，但无法检索多跳查询所需的完整证据链。为了解决这个问题，我们提出了 CatRAG（用于稳健 RAG 的上下文感知遍历），这是一个基于 HippoRAG 2 架构构建的框架，并将静态 KG 转换为查询自适应导航结构。我们引入了一个多方面的框架来引导随机游走：（1）符号锚定，注入弱实体约束来规范随机游走； (2) 查询感知动态边缘加权，动态调整图结构，修剪不相关的路径，同时放大与查询意图一致的路径； (3) 关键事实段落权重增强，这是一种具有成本效益的偏差，可以在结构上将随机游走锚定到可能的证据。四个多跳基准测试的实验表明，CatRAG 始终优于最先进的基准。我们的分析表明，虽然标准召回指标显示出适度的收益，但 CatRAG 在推理完整性、无间隙恢复整个证据路径的能力方面实现了实质性改进。这些结果表明，我们的方法有效地弥合了检索部分上下文和实现完全有根据的推理之间的差距。此 https URL 提供资源。
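Illustrative note on query-aware edge weighting: the sketch below is a generic Personalized PageRank power iteration with an optional per-edge multiplier, not CatRAG's actual implementation. The multiplier table `edge_scale` is a hypothetical stand-in for whatever query-relevance scores the method computes, and the toy graph is invented for the example. It shows how rescaling edges at query time can move random-walk mass away from a hub and toward downstream evidence.

```python
from typing import Dict, List, Optional, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # node -> [(neighbour, static weight)]
EdgeScale = Dict[Tuple[str, str], float]    # (src, dst) -> query-time multiplier


def personalized_pagerank(graph: Graph, seeds: Dict[str, float],
                          edge_scale: Optional[EdgeScale] = None,
                          damping: float = 0.85, iters: int = 100) -> Dict[str, float]:
    """Power iteration for PPR; every neighbour must also be a key of `graph`."""
    scale = edge_scale or {}
    nodes = sorted(graph)
    total = sum(seeds.values())
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for src in nodes:
            # Apply the (assumed) query-dependent multiplier to each static weight.
            edges = [(dst, w * scale.get((src, dst), 1.0)) for dst, w in graph[src]]
            out = sum(w for _, w in edges)
            if out == 0.0:
                # Dangling node: return its mass to the seed distribution.
                for n in nodes:
                    nxt[n] += damping * rank[src] * restart[n]
                continue
            for dst, w in edges:
                nxt[dst] += damping * rank[src] * w / out
        rank = nxt
    return rank


# Toy graph (invented): a high-degree hub competes with the path to the evidence.
toy_graph: Graph = {
    "seed": [("hub", 1.0), ("bridge", 1.0)],
    "hub": [("a", 1.0), ("b", 1.0), ("c", 1.0)],
    "bridge": [("evidence", 1.0)],
    "a": [], "b": [], "c": [], "evidence": [],
}
static_rank = personalized_pagerank(toy_graph, {"seed": 1.0})
adaptive_rank = personalized_pagerank(
    toy_graph, {"seed": 1.0},
    edge_scale={("seed", "hub"): 0.1, ("seed", "bridge"): 2.0},
)
```

With the multipliers applied, the "evidence" node receives a larger share of the walk than under the static weights, which is the effect the abstract attributes to dynamic edge weighting.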

Title: Orthogonal Hierarchical Decomposition for Structure-Aware Table Understanding with Large Language Models

Authors: Bin Cao, Huixian Lu, Chenwen Ma, Ting Wang, Ruizhe Li, Jing Fan
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.01969
Pdf URL: https://arxiv.org/pdf/2602.01969
Copy Paste: [[2602.01969]] Orthogonal Hierarchical Decomposition for Structure-Aware Table Understanding with Large Language Models(https://arxiv.org/abs/2602.01969)
Keywords: language model, llm
Abstract: Complex tables with multi-level headers, merged cells and heterogeneous layouts pose persistent challenges for LLMs in both understanding and reasoning. Existing approaches typically rely on table linearization or normalized grid modeling. However, these representations struggle to explicitly capture hierarchical structures and cross-dimensional dependencies, which can lead to misalignment between structural semantics and textual representations for non-standard tables. To address this issue, we propose an Orthogonal Hierarchical Decomposition (OHD) framework that constructs structure-preserving input representations of complex tables for LLMs. OHD introduces an Orthogonal Tree Induction (OTI) method based on spatial-semantic co-constraints, which decomposes irregular tables into a column tree and a row tree to capture vertical and horizontal hierarchical dependencies, respectively. Building on this representation, we design a dual-pathway association protocol to symmetrically reconstruct the semantic lineage of each cell, and incorporate an LLM as a semantic arbitrator to align multi-level semantic information. We evaluate the OHD framework on two complex table question answering benchmarks, AITQA and HiTab. Experimental results show that OHD consistently outperforms existing representation paradigms across multiple evaluation metrics.
摘要：具有多级标题、合并单元格和异构布局的复杂表格对法学硕士在理解和推理方面提出了持续的挑战。现有方法通常依赖于表线性化或标准化网格建模。然而，这些表示很难显式地捕获层次结构和跨维度依赖关系，这可能导致非标准表的结构语义和文本表示之间的不一致。为了解决这个问题，我们提出了一个正交层次分解（OHD）框架，为法学硕士构建复杂表的结构保留输入表示。 OHD 引入了基于空间语义协同约束的正交树归纳（OTI）方法，该方法将不规则表分解为列树和行树，以分别捕获垂直和水平层次依赖性。在此表示的基础上，我们设计了一个双路径关联协议来对称地重建每个细胞的语义谱系，并将LLM作为语义仲裁器来对齐多级语义信息。我们在两个复杂的表格问答基准 AITQA 和 HiTab 上评估 OHD 框架。实验结果表明，OHD 在多个评估指标上始终优于现有的表示范式。

Title: Beyond Local Edits: Embedding-Virtualized Knowledge for Broader Evaluation and Preservation of Model Editing

Authors: Shuainan Liu, Xuanang Chen, Ben He, Le Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01977
Pdf URL: https://arxiv.org/pdf/2602.01977
Copy Paste: [[2602.01977]] Beyond Local Edits: Embedding-Virtualized Knowledge for Broader Evaluation and Preservation of Model Editing(https://arxiv.org/abs/2602.01977)
Keywords: language model
Abstract: Knowledge editing methods for large language models are commonly evaluated using predefined benchmarks that assess edited facts together with a limited set of related or neighboring knowledge. While effective, such evaluations remain confined to finite, dataset-bounded samples, leaving the broader impact of editing on the model's knowledge system insufficiently understood. To address this gap, we introduce Embedding-Virtualized Knowledge (EVK) that characterizes model knowledge through controlled perturbations in embedding space, enabling the exploration of a substantially broader and virtualized knowledge region beyond explicit data annotations. Based on EVK, we construct an embedding-level evaluation benchmark EVK-Bench that quantifies potential knowledge drift induced by editing, revealing effects that are not captured by conventional sample-based metrics. Furthermore, we propose a plug-and-play EVK-Align module that constrains embedding-level knowledge drift during editing and can be seamlessly integrated into existing editing methods. Experiments demonstrate that our approach enables more comprehensive evaluation while significantly improving knowledge preservation without sacrificing editing accuracy.
摘要：大型语言模型的知识编辑方法通常使用预定义的基准进行评估，这些基准评估编辑的事实以及有限的相关或邻近知识集。虽然有效，但此类评估仍然仅限于有限的、受数据集限制的样本，导致编辑对模型知识系统的更广泛影响尚未得到充分理解。为了解决这一差距，我们引入了嵌入虚拟知识（EVK），它通过嵌入空间中的受控扰动来表征模型知识，从而能够探索超出显式数据注释的更广泛的虚拟化知识区域。基于 EVK，我们构建了一个嵌入级评估基准 EVK-Bench，该基准可以量化编辑引起的潜在知识漂移，揭示传统基于样本的指标无法捕获的影响。此外，我们提出了一种即插即用的 EVK-Align 模块，该模块可以在编辑过程中限制嵌入级知识漂移，并且可以无缝集成到现有的编辑方法中。实验表明，我们的方法可以进行更全面的评估，同时显着改善知识保存，而不会牺牲编辑准确性。

Title: S3-CoT: Self-Sampled Succinct Reasoning Enables Efficient Chain-of-Thought LLMs

Authors: Yanrui Du, Sendong Zhao, Yibo Gao, Danyang Zhao, Qika Lin, Ming Ma, Jiayun Li, Yi Jiang, Kai He, Qianyi Xu, Bing Qin, Mengling Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01982
Pdf URL: https://arxiv.org/pdf/2602.01982
Copy Paste: [[2602.01982]] S3-CoT: Self-Sampled Succinct Reasoning Enables Efficient Chain-of-Thought LLMs(https://arxiv.org/abs/2602.01982)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) equipped with chain-of-thought (CoT) achieve strong performance and offer a window into LLM behavior. However, recent evidence suggests that improvements in CoT capabilities often come with redundant reasoning processes, motivating a key question: Can LLMs acquire a fast-thinking mode analogous to human System 1 reasoning? To explore this, our study presents a self-sampling framework based on activation steering for efficient CoT learning. Our method can induce style-aligned and variable-length reasoning traces from target LLMs themselves without any teacher guidance, thereby alleviating a central bottleneck of SFT-based methods: the scarcity of high-quality supervision data. Using data filtered by gold answers, we perform SFT for efficient CoT learning with (i) a human-like dual-cognitive system, and (ii) a progressive compression curriculum. Furthermore, we explore a self-evolution regime in which SFT is driven solely by prediction-consistent data of variable-length variants, eliminating the need for gold answers. Extensive experiments on math benchmarks, together with cross-domain generalization tests in medicine, show that our method yields stable improvements for both general and R1-style LLMs. Our data and model checkpoints can be found at this https URL.
摘要：配备思想链 (CoT) 的大型语言模型 (LLM) 实现了强大的性能，并提供了了解 LLM 行为的窗口。然而，最近的证据表明，CoT 能力的提高往往伴随着冗余的推理过程，这就引发了一个关键问题：法学硕士能否获得类似于人类系统 1 推理的快速思维模式？为了探索这一点，我们的研究提出了一个基于激活引导的自采样框架，以实现高效的 CoT 学习。我们的方法可以在没有任何教师指导的情况下从目标法学硕士本身诱导风格一致和可变长度的推理轨迹，从而缓解基于 SFT 的方法的中心瓶颈——高质量监督数据的稀缺。使用按黄金答案过滤的数据，我们通过 (i) 类人双认知系统和 (ii) 渐进式压缩课程执行 SFT 来实现高效的 CoT 学习。此外，我们探索了一种自我进化机制，其中 SFT 仅由可变长度变体的预测一致数据驱动，从而消除了对黄金答案的需要。数学基准的大量实验以及医学领域的跨领域泛化测试表明，我们的方法对普通法学硕士和 R1 型法学硕士都产生了稳定的改进。我们的数据和模型检查点可以在此 https URL 中找到。

Title: From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs

Authors: Yanrui Du, Yibo Gao, Sendong Zhao, Jiayun Li, Haochun Wang, Qika Lin, Kai He, Bing Qin, Mengling Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.01999
Pdf URL: https://arxiv.org/pdf/2602.01999
Copy Paste: [[2602.01999]] From Latent Signals to Reflection Behavior: Tracing Meta-Cognitive Activation Trajectory in R1-Style LLMs(https://arxiv.org/abs/2602.01999)
Keywords: llm, prompt
Abstract: R1-style LLMs have attracted growing attention for their capacity for self-reflection, yet the internal mechanisms underlying such behavior remain unclear. To bridge this gap, we anchor on the onset of reflection behavior and trace its layer-wise activation trajectory. Using the logit lens to read out token-level semantics, we uncover a structured progression: (i) Latent-control layers, where an approximate linear direction encodes the semantics of thinking budget; (ii) Semantic-pivot layers, where discourse-level cues, including turning-point and summarization cues, surface and dominate the probability mass; and (iii) Behavior-overt layers, where the likelihood of reflection-behavior tokens begins to rise until they become highly likely to be sampled. Moreover, our targeted interventions uncover a causal chain across these stages: prompt-level semantics modulate the projection of activations along latent-control directions, thereby inducing competition between turning-point and summarization cues in semantic-pivot layers, which in turn regulates the sampling likelihood of reflection-behavior tokens in behavior-overt layers. Collectively, our findings suggest a human-like meta-cognitive process, progressing from latent monitoring to discourse-level regulation and finally to overt self-reflection. Our analysis code can be found at this https URL.
摘要：R1型法学硕士因其自我反思的能力而受到越来越多的关注，但这种行为背后的内部机制仍不清楚。为了弥补这一差距，我们锚定反射行为的开始并追踪其分层激活轨迹。使用逻辑透镜读出标记级语义，我们发现了一个结构化的进程：（i）潜在控制层，其中近似线性方向编码思维预算的语义； (ii) 语义枢轴层，其中话语级线索，包括转折点和总结线索，浮现并主导概率质量； (iii) 行为公开层，其中反射行为标记的可能性开始上升，直到它们变得很可能被采样。此外，我们的有针对性的干预揭示了跨越这些阶段的因果链：提示级语义沿着潜在控制方向调节激活的投影，从而引起语义枢轴层中转折点和总结线索之间的竞争，这反过来又调节了行为显性层中反射行为标记的采样可能性。总的来说，我们的研究结果表明了一种类似人类的元认知过程——从潜在的监控，到话语层面的调节，最后到公开的自我反思。我们的分析代码可以在这个 https URL 中找到。
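The logit-lens readout mentioned in the abstract above can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the three-token vocabulary, the 2-d unembedding rows, and the per-layer hidden states are invented, and the final layer normalization that a real logit lens usually applies before unembedding is omitted. It is not the authors' analysis code.

```python
import math

# Hypothetical 2-d unembedding rows for a three-token toy vocabulary.
UNEMBED = {"wait": [1.0, 0.0], "so": [0.0, 1.0], "the": [0.5, 0.5]}

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logit_lens(hidden):
    """Project one intermediate hidden state onto the vocabulary."""
    tokens = list(UNEMBED)
    logits = [sum(h * w for h, w in zip(hidden, UNEMBED[t])) for t in tokens]
    return dict(zip(tokens, softmax(logits)))

# Made-up residual-stream states at three depths for a single position.
layer_states = {"early": [0.1, 0.1], "middle": [0.0, 2.0], "late": [3.0, 0.0]}

readout = {name: logit_lens(h) for name, h in layer_states.items()}
top_token = {name: max(probs, key=probs.get) for name, probs in readout.items()}
```

Reading `top_token` layer by layer shows which token dominates the probability mass at each depth, which is the kind of trajectory the abstract describes tracing.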

Title: Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation

Authors: Zhanghao Hu, Qinglin Zhu, Hanqi Yan, Yulan He, Lin Gui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02007
Pdf URL: https://arxiv.org/pdf/2602.02007
Copy Paste: [[2602.02007]] Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation(https://arxiv.org/abs/2602.02007)
Keywords: llm, retrieval-augmented generation, agent
Abstract: Agent memory systems often adopt the standard Retrieval-Augmented Generation (RAG) pipeline, yet its underlying assumptions differ in this setting. RAG targets large, heterogeneous corpora where retrieved passages are diverse, whereas agent memory is a bounded, coherent dialogue stream with highly correlated spans that are often duplicates. Under this shift, fixed top-$k$ similarity retrieval tends to return redundant context, and post-hoc pruning can delete temporally linked prerequisites needed for correct reasoning. We argue retrieval should move beyond similarity matching and instead operate over latent components, following decoupling to aggregation: disentangle memories into semantic components, organise them into a hierarchy, and use this structure to drive retrieval. We propose xMemory, which builds a hierarchy of intact units and maintains a searchable yet faithful high-level node organisation via a sparsity-semantics objective that guides memory split and merge. At inference, xMemory retrieves top-down, selecting a compact, diverse set of themes and semantics for multi-fact queries, and expanding to episodes and raw messages only when it reduces the reader's uncertainty. Experiments on LoCoMo and PerLTQA across the three latest LLMs show consistent gains in answer quality and token efficiency.
摘要：代理内存系统通常采用标准检索增强生成（RAG）管道，但其基本假设在此设置中有所不同。 RAG 的目标是大型、异构的语料库，其中检索到的段落是多种多样的，而代理记忆是有界的、连贯的对话流，具有高度相关的跨度，而且通常是重复的。在这种转变下，固定的 top-$k$ 相似性检索往往会返回冗余上下文，而事后修剪可以删除正确推理所需的时间相关的先决条件。我们认为检索应该超越相似性匹配，而是在解耦到聚合之后对潜在组件进行操作：将记忆分解为语义组件，将它们组织成层次结构，并使用此结构来驱动检索。我们提出 xMemory，它构建完整单元的层次结构，并通过指导内存分割和合并的稀疏语义目标维护可搜索但忠实的高级节点组织。在推理时，xMemory 自上而下检索，为多事实查询选择一组紧凑、多样化的主题和语义，并仅在减少读者的不确定性时扩展到情节和原始消息。在三个最新的法学硕士上对 LoCoMo 和 PerLTQA 进行的实验表明，答案质量和令牌效率都有所提高。

Title: WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora

Authors: Pengyu Wang, Benfeng Xu, Licheng Zhang, Shaohan Wang, Mingxuan Du, Chiwei Zhu, Zhendong Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02053
Pdf URL: https://arxiv.org/pdf/2602.02053
Copy Paste: [[2602.02053]] WildGraphBench: Benchmarking GraphRAG with Wild-Source Corpora(https://arxiv.org/abs/2602.02053)
Keywords: long context, retrieval-augmented generation
Abstract: Graph-based Retrieval-Augmented Generation (GraphRAG) organizes external knowledge as a hierarchical graph, enabling efficient retrieval and aggregation of scattered evidence across multiple documents. However, many existing benchmarks for GraphRAG rely on short, curated passages as external knowledge, failing to adequately evaluate systems in realistic settings involving long contexts and large-scale heterogeneous documents. To bridge this gap, we introduce WildGraphBench, a benchmark designed to assess GraphRAG performance in the wild. We leverage Wikipedia's unique structure, where cohesive narratives are grounded in long and heterogeneous external reference documents, to construct a benchmark reflecting real-world scenarios. Specifically, we sample articles across 12 top-level topics, using their external references as the retrieval corpus and citation-linked statements as ground truth, resulting in 1,100 questions spanning three levels of complexity: single-fact QA, multi-fact QA, and section-level summarization. Experiments across multiple baselines reveal that current GraphRAG pipelines help on multi-fact aggregation when evidence comes from a moderate number of sources, but this aggregation paradigm may overemphasize high-level statements at the expense of fine-grained details, leading to weaker performance on summarization tasks. Project page: this https URL.
摘要：基于图的检索增强生成（GraphRAG）将外部知识组织为分层图，从而能够高效检索和聚合多个文档中的分散证据。然而，许多现有的 GraphRAG 基准依赖于简短的、精心策划的段落作为外部知识，无法在涉及长上下文和大规模异构文档的现实环境中充分评估系统。为了弥补这一差距，我们引入了 WildGraphBench，这是一个旨在评估 GraphRAG 实际性能的基准测试。我们利用维基百科的独特结构（其中连贯的叙述基于冗长且异构的外部参考文档）来构建反映真实场景的基准。具体来说，我们对 12 个顶级主题的文章进行了抽样，使用它们的外部参考文献作为检索语料库，使用引用链接语句作为基本事实，产生了跨越三个复杂程度的 1,100 个问题：单事实 QA、多事实 QA 和章节级摘要。跨多个基线的实验表明，当证据来自中等数量的来源时，当前的 GraphRAG 管道有助于多事实聚合，但这种聚合范式可能会过分强调高层陈述，而牺牲细粒度的细节，导致摘要任务的性能较差。项目页面：此 https URL。

Title: Closing the Loop: Universal Repository Representation with RPG-Encoder

Authors: Jane Luo, Chengyu Yin, Xin Zhang, Qingtao Li, Steven Liu, Yiming Huang, Jie Wu, Hao Liu, Yangyu Huang, Yu Kang, Fangkai Yang, Ying Xin, Scarlett Li
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2602.02084
Pdf URL: https://arxiv.org/pdf/2602.02084
Copy Paste: [[2602.02084]] Closing the Loop: Universal Repository Representation with RPG-Encoder(https://arxiv.org/abs/2602.02084)
Keywords: agent
Abstract: Current repository agents encounter a reasoning disconnect due to fragmented representations, as existing methods rely on isolated API documentation or dependency graphs that lack semantic depth. We consider repository comprehension and generation to be inverse processes within a unified cycle: generation expands intent into implementation, while comprehension compresses implementation back into intent. To address this, we propose RPG-Encoder, a framework that generalizes the Repository Planning Graph (RPG) from a static generative blueprint into a unified, high-fidelity representation. RPG-Encoder closes the reasoning loop through three mechanisms: (1) Encoding raw code into the RPG that combines lifted semantic features with code dependencies; (2) Evolving the topology incrementally to decouple maintenance costs from repository scale, reducing overhead by 95.7%; and (3) Operating as a unified interface for structure-aware navigation. In evaluations, RPG-Encoder establishes state-of-the-art repository understanding on SWE-bench Verified with 93.7% Acc@5 and exceeds the best baseline by over 10% on SWE-bench Live Lite. These results highlight our superior fine-grained localization accuracy in complex codebases. Furthermore, it achieves 98.5% reconstruction coverage on RepoCraft, confirming RPG's high-fidelity capacity to mirror the original codebase and closing the loop between intent and implementation.
摘要：当前的存储库代理由于碎片化的表示而遇到推理脱节，因为现有方法依赖于孤立的 API 文档或缺乏语义深度的依赖关系图。我们认为存储库理解和生成是统一循环内的逆过程：生成将意图扩展为实现，而理解将实现压缩回意图。为了解决这个问题，我们提出了 RPG-Encoder，这是一个将存储库规划图（RPG）从静态生成蓝图概括为统一的高保真表示的框架。 RPG-Encoder 通过三种机制关闭推理循环：（1）将原始代码编码到 RPG 中，将提升的语义特征与代码依赖关系相结合； (2) 逐步改进拓扑，将维护成本与存储库规模解耦，开销降低 95.7%； (3) 作为结构感知导航的统一界面进行操作。在评估中，RPG-Encoder 在 SWE-bench Verified 上以 93.7% Acc@5 建立了最先进的存储库理解，并在 SWE-bench Live Lite 上超出了最佳基线 10% 以上。这些结果凸显了我们在复杂代码库中卓越的细粒度定位准确性。此外，它在 RepoCraft 上实现了 98.5% 的重建覆盖率，证实了 RPG 镜像原始代码库的高保真能力，并关闭了意图和实现之间的循环。

Title: LEC-KG: An LLM-Embedding Collaborative Framework for Domain-Specific Knowledge Graph Construction -- A Case Study on SDGs

Authors: Yikai Zeng, Yingchao Piao, Jianhui Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02090
Pdf URL: https://arxiv.org/pdf/2602.02090
Copy Paste: [[2602.02090]] LEC-KG: An LLM-Embedding Collaborative Framework for Domain-Specific Knowledge Graph Construction -- A Case Study on SDGs(https://arxiv.org/abs/2602.02090)
Keywords: language model, llm, chain-of-thought
Abstract: Constructing domain-specific knowledge graphs from unstructured text remains challenging due to heterogeneous entity mentions, long-tail relation distributions, and the absence of standardized schemas. We present LEC-KG, a bidirectional collaborative framework that integrates the semantic understanding of Large Language Models (LLMs) with the structural reasoning of Knowledge Graph Embeddings (KGE). Our approach features three key components: (1) hierarchical coarse-to-fine relation extraction that mitigates long-tail bias, (2) evidence-guided Chain-of-Thought feedback that grounds structural suggestions in source text, and (3) semantic initialization that enables structural validation for unseen entities. The two modules enhance each other iteratively: KGE provides structure-aware feedback to refine LLM extractions, while validated triples progressively improve KGE representations. We evaluate LEC-KG on Chinese Sustainable Development Goal (SDG) reports, demonstrating substantial improvements over LLM baselines, particularly on low-frequency relations. Through iterative refinement, our framework reliably transforms unstructured policy text into validated knowledge graph triples.
摘要：由于异构实体提及、长尾关系分布以及标准化模式的缺乏，从非结构化文本构建特定领域的知识图仍然具有挑战性。我们提出了 LEC-KG，这是一个双向协作框架，它将大型语言模型（LLM）的语义理解与知识图嵌入（KGE）的结构推理相结合。我们的方法具有三个关键组成部分：（1）分层从粗到细的关系提取，可以减轻长尾偏差；（2）证据引导的思想链反馈，以源文本中的结构建议为基础；（3）语义初始化，可以对未见过的实体进行结构验证。这两个模块迭代地相互增强 - KGE 提供结构感知反馈来完善 LLM 提取，同时经过验证的三元组逐步改进 KGE 表示。我们根据中国可持续发展目标 (SDG) 报告评估了 LEC-KG，显示其相对于 LLM 基线的显着改进，特别是在低频关系方面。通过迭代细化，我们的框架可靠地将非结构化策略文本转换为经过验证的知识图三元组。

Title: Dicta-LM 3.0: Advancing The Frontier of Hebrew Sovereign LLMs

Authors: Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02104
Pdf URL: https://arxiv.org/pdf/2602.02104
Copy Paste: [[2602.02104]] Dicta-LM 3.0: Advancing The Frontier of Hebrew Sovereign LLMs(https://arxiv.org/abs/2602.02104)
Keywords: language model, llm, chat
Abstract: Open-weight LLMs have been released by frontier labs; however, sovereign Large Language Models (for languages other than English) remain low in supply yet high in demand. Training large language models (LLMs) for low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce Dicta-LM 3.0: an open-weight collection of LLMs trained on substantially-sized corpora of Hebrew and English texts. The model is released in three sizes: 24B - adapted from the Mistral-Small-3.1 base model, 12B - adapted from the NVIDIA Nemotron Nano V2 model, and 1.7B - adapted from the Qwen3-1.7B base model. We are releasing multiple variants of each model, a base model and a chat model with tool-calling support, each with a native context length of 65k tokens. To rigorously evaluate our models, we introduce a new benchmark suite for evaluation of Hebrew chat-LLMs, covering a diverse set of tasks including Translation, Summarization, Winograd, Israeli Trivia, and Diacritization (nikud). Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.
摘要：前沿实验室已发布开放式法学硕士；然而，主权大型语言模型（针对英语以外的语言）的供应量仍然较低，但需求量却很高。为希伯来语等资源匮乏的语言训练大型语言模型 (LLM) 带来了独特的挑战。在本文中，我们介绍了 Dicta-LM 3.0：一个开放权重的法学硕士集合，经过大规模希伯来语和英语文本语料库的训练。该模型发布了三种尺寸：24B - 改编自 Mistral-Small-3.1 基础模型，12B - 改编自 NVIDIA Nemotron Nano V2 模型，以及 1.7B - 改编自 Qwen3-1.7B 基础模型。我们正在发布每个模型的多个变体，每个变体的本机上下文长度为 65k 个令牌；基本模型和带有工具调用支持的聊天模型。为了严格评估我们的模型，我们引入了一个新的基准套件来评估希伯来语聊天法学硕士，涵盖一系列不同的任务，包括翻译、摘要、Winograd、以色列琐事和变音符号 (nikud)。我们的工作不仅解决了用资源匮乏的语言培训法学硕士的复杂问题，还提出了一个框架，可用于使其他法学硕士适应各种非英语语言，从而为更广泛的多语言 NLP 领域做出贡献。

Title: Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts

Authors: Wenhao Li, Daohai Yu, Gen Luo, Yuxin Zhang, Fei Chao, Rongrong Ji, Yifan Wu, Jiaxin Liu, Ziyang Gong, Zimu Liao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02108
Pdf URL: https://arxiv.org/pdf/2602.02108
Copy Paste: [[2602.02108]] Out of the Memory Barrier: A Highly Memory Efficient Training System for LLMs with Million-Token Contexts(https://arxiv.org/abs/2602.02108)
Keywords: language model, llm, long context
Abstract: Training Large Language Models (LLMs) on long contexts is severely constrained by prohibitive GPU memory overhead, not training time. The primary culprits are the activations, whose memory footprints scale linearly with sequence length. We introduce OOMB, a highly memory-efficient training system that directly confronts this barrier. Our approach employs a chunk-recurrent training framework with on-the-fly activation recomputation, which maintains a constant activation memory footprint (O(1)) and shifts the primary bottleneck to the growing KV cache. To manage the KV cache, OOMB integrates a suite of synergistic optimizations: a paged memory manager for both the KV cache and its gradients to eliminate fragmentation, asynchronous CPU offloading to hide data transfer latency, and page-level sparse attention to reduce both computational complexity and communication overhead. The synergy of these techniques yields exceptional efficiency. Our empirical results show that for every additional 10K tokens of context, the end-to-end training memory overhead increases by a mere 10MB for Qwen2.5-7B. This allows training Qwen2.5-7B with a 4M-token context on a single H200 GPU, a feat that would otherwise require a large cluster using context parallelism. This work represents a substantial advance in resource efficiency for long-context LLM training. The source code is available at this https URL.
摘要：在长上下文中训练大型语言模型 (LLM) 受到令人望而却步的 GPU 内存开销的严重限制，而不是训练时间。罪魁祸首是激活，其内存占用随序列长度线性扩展。我们引入了 OOMB，一种高内存效率的训练系统，可以直接应对这一障碍。我们的方法采用具有即时激活重新计算功能的块循环训练框架，该框架保持恒定的激活内存占用（O(1)）并将主要瓶颈转移到不断增长的 KV 缓存。为了管理 KV 缓存，OOMB 集成了一套协同优化：用于 KV 缓存及其梯度的分页内存管理器以消除碎片、异步 CPU 卸载以隐藏数据传输延迟，以及页级稀疏注意力以降低计算复杂性和通信开销。这些技术的协同作用产生了非凡的效率。我们的实证结果表明，对于 Qwen2.5-7B，每增加 10K 个上下文标记，端到端训练内存开销仅增加 10MB。这允许在单个 H200 GPU 上使用 4M 令牌上下文训练 Qwen2.5-7B，否则需要使用上下文并行性的大型集群。这项工作代表了长背景法学硕士培训资源效率的重大进步。源代码可从此 https URL 获取。
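As a rough sanity check on the reported growth rate of 10MB per additional 10K tokens for Qwen2.5-7B, the context-dependent memory at a 4M-token context can be estimated with simple arithmetic. The sketch below assumes the growth is exactly linear and that "4M" means 4,000,000 tokens; it leaves out the fixed cost of weights, optimizer state, and other buffers, which the abstract does not quantify.

```python
MB_PER_10K_TOKENS = 10  # growth rate reported in the abstract for Qwen2.5-7B

def extra_memory_mb(context_tokens):
    """Context-dependent memory overhead in MB, assuming linear growth."""
    return context_tokens / 10_000 * MB_PER_10K_TOKENS

# Assumption: "4M-token context" is read as 4,000,000 tokens.
extra_4m_mb = extra_memory_mb(4_000_000)  # 400 steps of 10K tokens, 10 MB each
extra_4m_gb = extra_4m_mb / 1000
```

Under these assumptions the context-dependent overhead comes to roughly 4 GB, which is consistent with the claim that a 4M-token context fits on a single GPU.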

Title: There Is More to Refusal in Large Language Models than a Single Direction

Authors: Faaiz Joad, Majd Hawasly, Sabri Boughorbel, Nadir Durrani, Husrev Taha Sencar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02132
Pdf URL: https://arxiv.org/pdf/2602.02132
Copy Paste: [[2602.02132]] There Is More to Refusal in Large Language Models than a Single Direction(https://arxiv.org/abs/2602.02132)
Keywords: language model
Abstract: Prior work argues that refusal in large language models is mediated by a single activation-space direction, enabling effective steering and ablation. We show that this account is incomplete. Across eleven categories of refusal and non-compliance, including safety, incomplete or unsupported requests, anthropomorphization, and over-refusal, we find that these refusal behaviors correspond to geometrically distinct directions in activation space. Yet despite this diversity, linear steering along any refusal-related direction produces nearly identical refusal to over-refusal trade-offs, acting as a shared one-dimensional control knob. The primary effect of different directions is not whether the model refuses, but how it refuses.
摘要：先前的工作认为，大型语言模型中的拒绝是由单一激活空间方向介导的，从而实现有效的引导和消融。我们证明这个账户是不完整的。在十一个拒绝和不遵守类别中，包括安全、不完整或不受支持的请求、拟人化和过度拒绝，我们发现这些拒绝行为对应于激活空间中几何上不同的方向。然而，尽管存在这种多样性，沿着任何与拒绝相关的方向的线性转向都会产生几乎相同的拒绝与过度拒绝权衡，充当共享的一维控制旋钮。不同方向的主要影响不在于模型是否拒绝，而在于如何拒绝。
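To make the notions of a refusal direction, linear steering, and ablation concrete, here is a minimal sketch on toy 3-d activations. The difference-of-means construction is a common way to extract such a direction and is assumed here; the abstract does not say how the authors compute theirs. All activation values and category names in the code are invented.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def refusal_direction(refused, complied):
    """Unit difference of means between refused and complied activations."""
    diff = [r - c for r, c in zip(mean_vector(refused), mean_vector(complied))]
    return unit(diff)

def steer(hidden, direction, alpha):
    """Linear steering: shift the activation by alpha along the direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def ablate(hidden, direction):
    """Remove the component of the activation along the direction."""
    return steer(hidden, direction, -dot(hidden, direction))

# Toy activations for two hypothetical refusal categories and a compliant set.
safety_refused = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
unsupported_refused = [[0.0, 3.0, 1.0], [0.0, 5.0, 1.0]]
complied = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

d_safety = refusal_direction(safety_refused, complied)
d_unsupported = refusal_direction(unsupported_refused, complied)
cosine = dot(d_safety, d_unsupported)  # both are unit vectors

hidden = [0.5, 0.5, 1.0]
steered = steer(hidden, d_safety, 2.0)
ablated = ablate(hidden, d_safety)
```

In this toy setup the two category directions are orthogonal, the simplest case of "geometrically distinct", while steering along either one is the same one-parameter operation controlled by `alpha`.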

Title: Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing

Authors: Lingkun Long, Yushi Huang, Shihao Bai, Ruihao Gong, Jun Zhang, Ao Zhou, Jianlei Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02159
Pdf URL: https://arxiv.org/pdf/2602.02159
Copy Paste: [[2602.02159]] Focus-dLLM: Accelerating Long-Context Diffusion LLM Inference via Confidence-Guided Context Focusing(https://arxiv.org/abs/2602.02159)
Keywords: language model, llm
Abstract: Diffusion Large Language Models (dLLMs) deliver strong long-context processing capability in a non-autoregressive decoding paradigm. However, the considerable computational cost of bidirectional full attention limits the inference efficiency. Although sparse attention is promising, existing methods remain ineffective. This stems from the need to estimate attention importance for tokens yet to be decoded, while the unmasked token positions are unknown during diffusion. In this paper, we present Focus-dLLM, a novel training-free attention sparsification framework tailored for accurate and efficient long-context dLLM inference. Based on the finding that token confidence strongly correlates across adjacent steps, we first design a past confidence-guided indicator to predict unmasked regions. Built upon this, we propose a sink-aware pruning strategy to accurately estimate and remove redundant attention computation, while preserving highly influential attention sinks. To further reduce overhead, this strategy reuses identified sink locations across layers, leveraging the observed cross-layer consistency. Experimental results show that our method offers more than $29\times$ lossless speedup under $32K$ context length. The code is publicly available at: this https URL
摘要：扩散大语言模型 (dLLM) 在非自回归解码范例中提供强大的长上下文处理能力。然而，双向全注意力的巨大计算成本限制了推理效率。尽管稀疏注意力很有希望，但现有方法仍然无效。这是因为需要估计尚未解码的令牌的注意力重要性，而未屏蔽的令牌位置在扩散过程中是未知的。在本文中，我们提出了 Focus-dLLM，这是一种新颖的免训练注意力稀疏框架，专为准确高效的长上下文 dLLM 推理而定制。基于令牌置信度与相邻步骤之间强烈相关的发现，我们首先设计一个过去的置信度引导指标来预测未屏蔽的区域。在此基础上，我们提出了一种水槽感知修剪策略，以准确估计和消除冗余的注意力计算，同时保留具有高度影响力的注意力水槽。为了进一步减少开销，该策略跨层重用已识别的接收器位置，利用观察到的跨层一致性。实验结果表明，我们的方法在 $32K$ 上下文长度下提供了超过 $29\times$ 无损加速。该代码可在以下位置公开获取：此 https URL

Title: AR-MAP: Are Autoregressive Large Language Models Implicit Teachers for Diffusion Large Language Models?

Authors: Liang Lin, Feng Xiong, Zengbin Wang, Kun Wang, Junhao Dong, Xuecai Hu, Yong Wang, Xiangxiang Chu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02178
Pdf URL: https://arxiv.org/pdf/2602.02178
Copy Paste: [[2602.02178]] AR-MAP: Are Autoregressive Large Language Models Implicit Teachers for Diffusion Large Language Models?(https://arxiv.org/abs/2602.02178)
Keywords: language model, llm
Abstract: Diffusion Large Language Models (DLLMs) have emerged as a powerful alternative to autoregressive models, enabling parallel token generation across multiple positions. However, preference alignment of DLLMs remains challenging due to high variance introduced by Evidence Lower Bound (ELBO)-based likelihood estimation. In this work, we propose AR-MAP, a novel transfer learning framework that leverages preference-aligned autoregressive LLMs (AR-LLMs) as implicit teachers for DLLM alignment. We reveal that DLLMs can effectively absorb alignment knowledge from AR-LLMs through simple weight scaling, exploiting the shared architectural structure between these divergent generation paradigms. Crucially, our approach circumvents the high variance and computational overhead of direct DLLM alignment. Comprehensive experiments across diverse preference alignment tasks demonstrate that AR-MAP achieves competitive or superior performance compared to existing DLLM-specific alignment methods, achieving a 69.08% average score across all tasks and models. Our code is available at this https URL.
摘要：扩散大型语言模型 (DLLM) 已成为自回归模型的强大替代方案，可实现跨多个位置的并行令牌生成。然而，由于基于证据下界 (ELBO) 的似然估计引入了高方差，DLLM 的偏好对齐仍然具有挑战性。在这项工作中，我们提出了 AR-MAP，这是一种新颖的迁移学习框架，它利用偏好对齐的自回归 LLM（AR-LLM）作为 DLLM 对齐的隐式教师。我们发现 DLLM 可以通过简单的权重缩放有效地吸收来自 AR-LLM 的对齐知识，利用这些不同的生成范式之间的共享架构结构。至关重要的是，我们的方法规避了直接 DLLM 对齐的高方差和计算开销，并且跨不同偏好对齐任务的综合实验表明，与现有的 DLLM 特定对齐方法相比，AR-MAP 实现了有竞争力或优越的性能，在所有任务和模型中实现了 69.08% 的平均得分。我们的代码可通过此 https URL 获取。

Title: Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages

Authors: Tjaša Arčon (1), Matej Klemen (1), Marko Robnik-Šikonja (1), Kaja Dobrovoljc (1, 2, 3) ((1) University of Ljubljana, Faculty of Computer and Information Science, Slovenia (2) University of Ljubljana, Faculty of Arts, Slovenia, (3) Jožef Stefan Institute, Ljubljana, Slovenia)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02182
Pdf URL: https://arxiv.org/pdf/2602.02182
Copy Paste: [[2602.02182]] Evaluating Metalinguistic Knowledge in Large Language Models across the World's Languages(https://arxiv.org/abs/2602.02182)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are routinely evaluated on language use tasks, yet their knowledge of linguistic structure remains poorly understood. Existing linguistic benchmarks typically focus on narrow phenomena, emphasize high-resource languages, and rarely evaluate metalinguistic knowledge, that is, explicit reasoning about language structure rather than language use. Using accuracy and macro F1, together with majority-class and chance baselines, we analyse overall performance and examine variation by linguistic domains and language-related factors. Our results show that metalinguistic knowledge in current LLMs is limited: GPT-4o performs best but achieves only moderate accuracy (0.367), while open-source models lag behind. All models perform above chance but fail to outperform the majority-class baseline, suggesting they capture cross-linguistic patterns but lack fine-grained grammatical distinctions. Performance varies across linguistic domains, with lexical features showing the highest accuracy and phonological features among the lowest, partially reflecting differences in online visibility. At the language level, accuracy shows a strong association with digital language status: languages with higher digital presence and resource availability are evaluated more accurately, while low-resource languages show substantially lower performance. Analyses of predictive factors confirm that resource-related indicators (Wikipedia size, corpus availability) are more informative predictors of accuracy than geographical, genealogical, or sociolinguistic factors. Together, these results suggest that LLMs' metalinguistic knowledge is fragmented and shaped by data availability rather than generalizable grammatical competence across the world's languages. We release our benchmark as an open-source dataset to support systematic evaluation and encourage greater global linguistic diversity in future LLMs.
摘要：大型语言模型（LLM）通常在语言使用任务上进行评估，但人们对它们的语言结构知识仍然知之甚少。现有的语言基准通常关注狭隘现象，强调高资源语言，很少评估关于语言结构而不是语言使用的元语言知识显式推理。使用准确性和宏 F1，以及多数类和机会基线，我们分析整体性能并检查语言领域和语言相关因素的变化。我们的结果表明，当前法学硕士中的元语言知识是有限的：GPT-4o 表现最好，但仅达到中等准确度（0.367），而开源模型则落后。所有模型的表现都高于偶然性，但未能超越多数类基线，这表明它们捕获了跨语言模式，但缺乏细粒度的语法区别。不同语言领域的表现各不相同，词汇特征的准确性最高，而语音特征的准确性最低，部分反映了在线可见性的差异。在语言层面，准确性与数字语言状态密切相关：具有较高数字存在和资源可用性的语言可以更准确地评估，而资源较少的语言则表现出明显较低的性能。对预测因素的分析证实，与地理、家谱或社会语言因素相比，与资源相关的指标（维基百科大小、语料库可用性）是更能提供信息的准确性预测因素。总之，这些结果表明，法学硕士的元语言知识是碎片化的，并且是由数据可用性而不是跨世界语言的通用语法能力决定的。我们将基准作为开源数据集发布，以支持系统评估并鼓励未来法学硕士实现更大的全球语言多样性。
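The metrics named in the abstract (accuracy, macro F1, and a majority-class baseline) can be sketched in a few lines. The labels below are invented word-order values used only for illustration; they are not taken from the benchmark. Averaging F1 over every label seen in either gold or predictions is one common convention and is assumed here.

```python
from collections import Counter

def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels in gold or pred."""
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

def majority_baseline(gold):
    """Predict the most frequent gold label for every item."""
    label = Counter(gold).most_common(1)[0][0]
    return [label] * len(gold)

# Hypothetical gold labels for five languages.
gold = ["SVO", "SVO", "SVO", "SOV", "VSO"]
majority_pred = majority_baseline(gold)

majority_accuracy = accuracy(gold, majority_pred)
majority_macro_f1 = macro_f1(gold, majority_pred)
```

On skewed label distributions the majority baseline can score well on accuracy while scoring poorly on macro F1, which is why reporting both, as the abstract does, gives a fuller picture.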

Title: Sinhala Physical Common Sense Reasoning Dataset for Global PIQA

Authors: Nisansa de Silva, Surangika Ranathunga
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02207
Pdf URL: https://arxiv.org/pdf/2602.02207
Copy Paste: [[2602.02207]] Sinhala Physical Common Sense Reasoning Dataset for Global PIQA(https://arxiv.org/abs/2602.02207)
Keywords: prompt
Abstract: This paper presents the first-ever Sinhala physical common sense reasoning dataset created as part of Global PIQA. It contains 110 human-created and verified data samples, where each sample consists of a prompt, the corresponding correct answer, and a wrong answer. Most of the questions refer to the Sri Lankan context, where Sinhala is an official language.
摘要：本文介绍了作为 Global PIQA 一部分创建的第一个僧伽罗物理常识推理数据集。它包含110个人工创建并经过验证的数据样本，每个样本由一个提示、相应的正确答案和一个错误答案组成。大多数问题涉及斯里兰卡的情况，僧伽罗语是斯里兰卡的官方语言。

Title: Towards AI Evaluation in Domain-Specific RAG Systems: The AgriHubi Case Study

Authors: Md. Toufique Hasan, Ayman Asad Khan, Mika Saari, Vaishnavi Bankhele, Pekka Abrahamsson
Subjects: cs.CL, cs.AI, cs.IR, cs.SE
Abstract URL: https://arxiv.org/abs/2602.02208
Pdf URL: https://arxiv.org/pdf/2602.02208
Copy Paste: [[2602.02208]] Towards AI Evaluation in Domain-Specific RAG Systems: The AgriHubi Case Study(https://arxiv.org/abs/2602.02208)
Keywords: language model, retrieval-augmented generation
Abstract: Large language models show promise for knowledge-intensive domains, yet their use in agriculture is constrained by weak grounding, English-centric training data, and limited real-world evaluation. These issues are amplified for low-resource languages, where high-quality domain documentation exists but remains difficult to access through general-purpose models. This paper presents AgriHubi, a domain-adapted retrieval-augmented generation (RAG) system for Finnish-language agricultural decision support. AgriHubi integrates Finnish agricultural documents with open PORO family models and combines explicit source grounding with user feedback to support iterative refinement. Developed over eight iterations and evaluated through two user studies, the system shows clear gains in answer completeness, linguistic accuracy, and perceived reliability. The results also reveal practical trade-offs between response quality and latency when deploying larger models. This study provides empirical guidance for designing and evaluating domain-specific RAG systems in low-resource language settings.
摘要：大型语言模型显示出在知识密集型领域的应用前景，但其在农业中的使用受到基础薄弱、以英语为中心的训练数据和有限的现实世界评估的限制。对于资源匮乏的语言来说，这些问题会被放大，因为这些语言存在高质量的领域文档，但仍然很难通过通用模型访问。本文介绍了 AgriHubi，这是一种用于芬兰语农业决策支持的领域自适应检索增强生成 (RAG) 系统。 AgriHubi 将芬兰农业文献与开放式 PORO 系列模型相结合，并将显式源头基础与用户反馈相结合，以支持迭代细化。该系统经过八次迭代开发并通过两次用户研究进行评估，在答案完整性、语言准确性和感知可靠性方面显示出明显的进步。结果还揭示了部署较大模型时响应质量和延迟之间的实际权衡。这项研究为在资源匮乏的语言环境中设计和评估特定领域的 RAG 系统提供了经验指导。

Title: Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge

Authors: Yuzheng Xu, Tosho Hirasawa, Tadashi Kozuno, Yoshitaka Ushiku
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02219
Pdf URL: https://arxiv.org/pdf/2602.02219
Copy Paste: [[2602.02219]] Am I More Pointwise or Pairwise? Revealing Position Bias in Rubric-Based LLM-as-a-Judge(https://arxiv.org/abs/2602.02219)
Keywords: language model, llm
Abstract: Large language models (LLMs) are now widely used to evaluate the quality of text, a field commonly referred to as LLM-as-a-judge. While prior work mainly focuses on point-wise and pair-wise evaluation paradigms, rubric-based evaluation, where LLMs select a score from multiple rubrics, has received less analysis. In this work, we show that rubric-based evaluation implicitly resembles a multi-choice setting and therefore has position bias: LLMs prefer score options appearing at specific positions in the rubric list. Through controlled experiments across multiple models and datasets, we demonstrate consistent position bias. To mitigate this bias, we propose a balanced permutation strategy that evenly distributes each score option across positions. We show that aggregating scores across balanced permutations not only reveals latent position bias, but also improves correlation between the LLM-as-a-Judge and human judgments. Our results suggest that rubric-based LLM-as-a-Judge is not inherently point-wise and that simple permutation-based calibration can substantially improve its reliability.
摘要：大型语言模型 (LLM) 现在广泛用于评估文本质量，这一领域通常被称为 LLM 作为法官。而之前的工作主要集中在逐点和成对的评估范式上。基于评分标准的评估（法学硕士从多个评分标准中选择一个分数）得到的分析较少。在这项工作中，我们表明基于评分标准的评估隐含地类似于多项选择设置，因此存在位置偏差：法学硕士更喜欢出现在评分标准列表中特定位置的分数选项。通过跨多个模型和数据集的受控实验，我们证明了一致的位置偏差。为了减轻这种偏差，我们提出了一种平衡排列策略，将每个分数选项均匀分布在各个位置上。我们表明，平衡排列的总分不仅揭示了潜在的位置偏差，而且还提高了法学硕士法官与人类之间的相关性。我们的结果表明，基于标准的法学硕士法官本质上并不是逐点的，并且简单的基于排列的校准可以大大提高其可靠性。
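One simple way to realize a balanced permutation of rubric options is a set of cyclic rotations, so that each score option appears at each list position exactly once. This construction is an assumption for illustration; the abstract does not specify how the authors build their permutations. The `first_position_judge` function is a hypothetical stand-in for an LLM judge with an extreme position bias.

```python
from collections import Counter

def balanced_permutations(options):
    """Cyclic rotations: every option occupies every position exactly once."""
    n = len(options)
    return [options[shift:] + options[:shift] for shift in range(n)]

def aggregate_score(judge, options):
    """Average the score the judge selects across all balanced orderings."""
    picks = [judge(ordering) for ordering in balanced_permutations(options)]
    return sum(picks) / len(picks)

def first_position_judge(ordering):
    """Hypothetical biased judge: always picks the first listed option."""
    return ordering[0]

scores = [1, 2, 3, 4, 5]
orderings = balanced_permutations(scores)

# For each list position, count how often each score option appears there.
position_counts = [
    Counter(ordering[pos] for ordering in orderings)
    for pos in range(len(scores))
]

mean_pick = aggregate_score(first_position_judge, scores)
```

With a fully position-driven judge, the aggregated score collapses to the midpoint of the scale regardless of the text being judged, so comparing aggregated scores against single-ordering scores is one way to surface position bias.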

Title: OpenSeal: Good, Fast, and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data

Authors: Tan Sang Nguyen, Muhammad Reza Qorib, Hwee Tou Ng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02266
Pdf URL: https://arxiv.org/pdf/2602.02266
Copy Paste: [[2602.02266]] OpenSeal: Good, Fast, and Cheap Construction of an Open-Source Southeast Asian LLM via Parallel Data(https://arxiv.org/abs/2602.02266)
Keywords: language model, llm
Abstract: Large language models (LLMs) have proven to be effective tools for a wide range of natural language processing (NLP) applications. Although many LLMs are multilingual, most remain English-centric and perform poorly on low-resource languages. Recently, several Southeast Asia-focused LLMs have been developed, but none are truly open source, as they do not publicly disclose their training data. Truly open-source models are important for transparency and for enabling a deeper and more precise understanding of LLM internals and development, including biases, generalization, and multilinguality. Motivated by recent advances demonstrating the effectiveness of parallel data in improving multilingual performance, we conduct controlled and comprehensive experiments to study the effectiveness of parallel data in continual pretraining of LLMs. Our findings show that using only parallel data is the most effective way to extend an LLM to new languages. Using just 34.7B tokens of parallel data and 180 hours on 8x NVIDIA H200 GPUs, we built OpenSeal, the first truly open Southeast Asian LLM that rivals the performance of existing models of similar size.
摘要：大型语言模型 (LLM) 已被证明是适用于各种自然语言处理 (NLP) 应用的有效工具。尽管许多法学硕士都是多语言的，但大多数仍然以英语为中心，并且在资源匮乏的语言上表现不佳。最近，已经开发了一些专注于东南亚的法学硕士，但没有一个是真正开源的，因为它们没有公开披露他们的培训数据。真正的开源模型对于透明度以及使人们能够更深入、更准确地理解 LLM 内部结构和发展（包括偏见、泛化和多语言性）非常重要。最近的进展证明了并行数据在提高多语言性能方面的有效性，我们进行了受控和综合实验来研究并行数据在法学硕士持续预训练中的有效性。我们的研究结果表明，仅使用并行数据是将法学硕士扩展到新语言的最有效方法。仅使用 34.7B 并行数据令牌和 8 个 NVIDIA H200 GPU 上的 180 小时，我们构建了 OpenSeal，这是第一个真正开放的东南亚 LLM，其性能可与类似规模的现有模型相媲美。

Title: dziribot: rag based intelligent conversational agent for algerian arabic dialect

Authors: El Batoul Bechiri, Dihia Lanasri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02270
Pdf URL: https://arxiv.org/pdf/2602.02270
Copy Paste: [[2602.02270]] dziribot: rag based intelligent conversational agent for algerian arabic dialect(https://arxiv.org/abs/2602.02270)
Keywords: language model, retrieval-augmented generation, agent
Abstract: The rapid digitalization of customer service has intensified the demand for conversational agents capable of providing accurate and natural interactions. In the Algerian context, this is complicated by the linguistic complexity of Darja, a dialect characterized by non-standardized orthography, extensive code-switching with French, and the simultaneous use of Arabic and Latin (Arabizi) scripts. This paper introduces DziriBOT, a hybrid intelligent conversational agent specifically engineered to overcome these challenges. We propose a multi-layered architecture that integrates specialized Natural Language Understanding (NLU) with Retrieval-Augmented Generation (RAG), allowing for both structured service flows and dynamic, knowledge-intensive responses grounded in curated enterprise documentation. To address the low-resource nature of Darja, we systematically evaluate three distinct approaches: a sparse-feature Rasa pipeline, classical machine learning baselines, and transformer-based fine-tuning. Our experimental results demonstrate that the fine-tuned DziriBERT model achieves state-of-the-art performance. These results significantly outperform traditional baselines, particularly in handling orthographic noise and rare intents. Ultimately, DziriBOT provides a robust, scalable solution that bridges the gap between formal language models and the linguistic realities of Algerian users, offering a blueprint for dialect-aware automation in the regional market.
摘要：客户服务的快速数字化加剧了对能够提供准确、自然交互的对话代理的需求。在阿尔及利亚语境中，由于 Darja 语言的复杂性，这一问题变得更加复杂，Darja 是一种方言，其特点是非标准化正字法、与法语的广泛语码转换以及同时使用阿拉伯语和拉丁语 (Arabizi) 脚本。本文介绍了 DziriBOT，这是一种专门为克服这些挑战而设计的混合智能对话代理。我们提出了一种多层架构，将专业的自然语言理解 (NLU) 与检索增强生成 (RAG) 集成在一起，允许结构化服务流和基于精选企业文档的动态、知识密集型响应。为了解决 Darja 的低资源性质，我们系统地评估了三种不同的方法：稀疏特征 Rasa 管道、经典机器学习基线和基于 Transformer 的微调。我们的实验结果表明，经过微调的 DziriBERT 模型实现了最先进的性能。这些结果明显优于传统基线，特别是在处理正交噪声和罕见意图方面。最终，DziriBOT 提供了一个强大的、可扩展的解决方案，弥合了正式语言模型与阿尔及利亚用户的语言现实之间的差距，为区域市场中的方言感知自动化提供了蓝图。

Title: Kimi K2.5: Visual Agentic Intelligence

Authors: Kimi Team: Tongtong Bai, Yifan Bai, Yiping Bao, S.H. Cai, Yuan Cao, Y. Charles, H.S. Che, Cheng Chen, Guanduo Chen, Huarong Chen, Jia Chen, Jiahao Chen, Jianlong Chen, Jun Chen, Kefan Chen, Liang Chen, Ruijue Chen, Xinhao Chen, Yanru Chen, Yanxu Chen, Yicun Chen, Yimin Chen, Yingjiang Chen, Yuankun Chen, Yujie Chen, Yutian Chen, Zhirong Chen, Ziwei Chen, Dazhi Cheng, Minghan Chu, Jialei Cui, Jiaqi Deng, Muxi Diao, Hao Ding, Mengfan Dong, Mengnan Dong, Yuxin Dong, Yuhao Dong, Angang Du, Chenzhuang Du, Dikang Du, Lingxiao Du, Yulun Du, Yu Fan, Shengjun Fang, Qiulin Feng, Yichen Feng, Garimugai Fu, Kelin Fu, Hongcheng Gao, Tong Gao, Yuyao Ge, Shangyi Geng, Chengyang Gong, Xiaochen Gong, Zhuoma Gongque, Qizheng Gu, Xinran Gu, Yicheng Gu, Longyu Guan, Yuanying Guo, Xiaoru Hao, Weiran He, Wenyang He, Yunjia He, Chao Hong, Hao Hu, Jiaxi Hu, Yangyang Hu, Zhenxing Hu, Ke Huang, Ruiyuan Huang, Weixiao Huang, Zhiqi Huang, Tao Jiang, Zhejun Jiang, Xinyi Jin, Yu Jing, Guokun Lai, Aidi Li, C. Li, Cheng Li, Fang Li, Guanghe Li, Guanyu Li, Haitao Li, Haoyang Li, Jia Li, Jingwei Li, Junxiong Li, Lincan Li, Mo Li, Weihong Li, Wentao Li, Xinhang Li, Xinhao Li, Yang Li, Yanhao Li, Yiwei Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02276
Pdf URL: https://arxiv.org/pdf/2602.02276
Copy Paste: [[2602.02276]] Kimi K2.5: Visual Agentic Intelligence(https://arxiv.org/abs/2602.02276)
Keywords: agent
Abstract: We introduce Kimi K2.5, an open-source multimodal agentic model designed to advance general agentic intelligence. K2.5 emphasizes the joint optimization of text and vision so that the two modalities enhance each other. This includes a series of techniques such as joint text-vision pre-training, zero-vision SFT, and joint text-vision reinforcement learning. Building on this multimodal foundation, K2.5 introduces Agent Swarm, a self-directed parallel agent orchestration framework that dynamically decomposes complex tasks into heterogeneous sub-problems and executes them concurrently. Extensive evaluations show that Kimi K2.5 achieves state-of-the-art results across various domains including coding, vision, reasoning, and agentic tasks. Agent Swarm also reduces latency by up to $4.5\times$ over single-agent baselines. We release the post-trained Kimi K2.5 model checkpoint to facilitate future research and real-world applications of agentic intelligence.
摘要：我们推出 Kimi K2.5，这是一种开源多模态代理模型，旨在推进通用代理智能。K2.5 强调文本和视觉的联合优化，使两种模态相互促进。这包括联合文本视觉预训练、零视觉 SFT、联合文本视觉强化学习等一系列技术。在此多模态基础上，K2.5 引入了 Agent Swarm，这是一种自主并行代理编排框架，可动态地将复杂任务分解为异构子问题并同时执行它们。广泛的评估表明，Kimi K2.5 在编码、视觉、推理和代理任务等各个领域都取得了最先进的结果。与单代理基线相比，Agent Swarm 还可以将延迟降低至多 $4.5\times$。我们发布了训练后的 Kimi K2.5 模型检查点，以促进代理智能的未来研究和实际应用。

Title: Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages

Authors: Isaac Chung, Linda Freienthal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02287
Pdf URL: https://arxiv.org/pdf/2602.02287
Copy Paste: [[2602.02287]] Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages(https://arxiv.org/abs/2602.02287)
Keywords: language model, llm
Abstract: Cross-lingual evaluation of large language models (LLMs) typically conflates two sources of variance: genuine model performance differences and measurement instability. We investigate evaluation reliability by holding generation conditions constant while varying target language. Using synthetic customer-support dialogues generated with identical parameters across Estonian, Finnish, and Hungarian, we test whether automatic metrics and LLM-as-a-judge scoring produce stable model rankings across these morphologically rich, related Finno-Ugric languages. With a small set of Estonian native speaker annotations as a reference point, we find systematic ranking instabilities: surface-level metrics (lexical diversity, surface and semantic similarity) maintain cross-language stability, but pragmatic judgments (coherence, instruction-following) exhibit rank inversions and near-zero correlations. Because generation is controlled, these inconsistencies reflect how judge scoring behaves differently across languages rather than true model differences. This controlled design provides a diagnostic probe: evaluation methods that fail to maintain stability under identical generation conditions signal transfer failure before deployment. Our findings suggest that zero-shot judge transfer is unreliable for discourse-level assessment in morphologically rich languages, motivating language-specific calibration against targeted human baselines. We release our controlled generation protocol, synthetic data, and evaluation framework to enable replication across language families at this https URL.
摘要：大型语言模型 (LLM) 的跨语言评估通常会混淆两个方差来源：真正的模型性能差异和测量不稳定。我们通过在改变目标语言的同时保持生成条件不变来研究评估可靠性。使用爱沙尼亚语、芬兰语和匈牙利语中以相同参数生成的合成客户支持对话，我们测试自动指标和 LLM-as-a-judge 评分是否能在这些形态丰富、相关的芬兰-乌戈尔语中产生稳定的模型排名。以一小部分爱沙尼亚语母语者的标注为参考点，我们发现系统性的排名不稳定：表面级指标（词汇多样性、表面和语义相似性）保持跨语言稳定性，但语用判断（连贯性、指令遵循）表现出排名倒置和接近零的相关性。由于生成是受控的，这些不一致反映了评判模型评分在不同语言之间的不同表现，而不是真正的模型差异。这种受控设计提供了一种诊断探针：在相同生成条件下无法保持稳定性的评估方法，在部署之前就预示着迁移失败。我们的研究结果表明，零样本评判迁移对于形态丰富的语言中的话语级别评估来说是不可靠的，这促使针对目标人类基线进行特定于语言的校准。我们在此 https URL 发布了受控生成协议、合成数据和评估框架，以实现跨语言家族的复制。
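
The ranking-stability check described in this abstract can be illustrated with a rank correlation between judge scores for the same models in two languages. The sketch below is a minimal illustration, not the authors' evaluation framework: the choice of Spearman correlation, the language codes, and the scores in `judge_scores` are all hypothetical assumptions.

```python
from typing import Dict, List


def average_ranks(scores: List[float]) -> List[float]:
    """Rank scores from 1 (lowest) upward, giving tied scores their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(a: List[float], b: List[float]) -> float:
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return float("nan")  # undefined when one side has no rank variation
    return cov / (var_a * var_b) ** 0.5


# Hypothetical judge scores for four models (same model order in every language).
judge_scores: Dict[str, List[float]] = {
    "et": [4.1, 3.8, 3.2, 2.9],
    "fi": [4.0, 3.9, 3.1, 3.0],
    "hu": [3.0, 3.9, 3.1, 4.2],
}

rho_et_fi = spearman(judge_scores["et"], judge_scores["fi"])
rho_et_hu = spearman(judge_scores["et"], judge_scores["hu"])
```

With these made-up numbers, Estonian and Finnish rank the models identically, while Estonian and Hungarian show the kind of rank inversion the abstract describes.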

Title: Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?

Authors: Alex Argese, Pasquale Lisena, Raphaël Troncy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02290
Pdf URL: https://arxiv.org/pdf/2602.02290
Copy Paste: [[2602.02290]] Hallucination or Creativity: How to Evaluate AI-Generated Scientific Stories?(https://arxiv.org/abs/2602.02290)
Keywords: hallucination
Abstract: Generative AI can turn scientific articles into narratives for diverse audiences, but evaluating these stories remains challenging. Storytelling demands abstraction, simplification, and pedagogical creativity, qualities that standard summarization metrics often fail to capture. Meanwhile, factual hallucinations are critical in scientific contexts, yet detectors often misclassify legitimate narrative reformulations or prove unstable when creativity is involved. In this work, we propose StoryScore, a composite metric for evaluating AI-generated scientific stories. StoryScore integrates semantic alignment, lexical grounding, narrative control, structural fidelity, redundancy avoidance, and entity-level hallucination detection into a unified framework. Our analysis also reveals why many hallucination detection methods fail to distinguish pedagogical creativity from factual errors, highlighting a key limitation: while automatic metrics can effectively assess semantic similarity with original content, they struggle to evaluate how it is narrated and controlled.
摘要：生成式人工智能可以将科学文章转化为面向不同受众的叙述，但评估这些故事仍然具有挑战性。讲故事需要抽象、简化和教学创造力，而标准摘要指标通常无法很好地体现这些品质。与此同时，事实幻觉在科学背景下至关重要，然而，当涉及创造力时，探测器经常对合理的叙述重新表述进行错误分类或证明不稳定。在这项工作中，我们提出了 StoryScore，这是一种用于评估人工智能生成的科学故事的综合指标。 StoryScore 将语义对齐、词汇基础、叙事控制、结构保真度、冗余避免和实体级幻觉检测集成到一个统一的框架中。我们的分析还揭示了为什么许多幻觉检测方法无法区分教学创造力和事实错误，突出了一个关键的局限性：虽然自动度量可以有效地评估与原始内容的语义相似性，但它们很难评估它是如何叙述和控制的。

Title: Advancing General-Purpose Reasoning Models with Modular Gradient Surgery

Authors: Min Cai, Yu Liang, Longzheng Wang, Yan Wang, Yueyang Zhang, Long Xia, Zhiyuan Sun, Xi Ye, Daiting Shi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.02301
Pdf URL: https://arxiv.org/pdf/2602.02301
Copy Paste: [[2602.02301]] Advancing General-Purpose Reasoning Models with Modular Gradient Surgery(https://arxiv.org/abs/2602.02301)
Keywords: chat
Abstract: Reinforcement learning (RL) has played a central role in recent advances in large reasoning models (LRMs), yielding strong gains in verifiable and open-ended reasoning. However, training a single general-purpose LRM across diverse domains remains challenging due to pronounced domain heterogeneity. Through a systematic study of two widely used strategies, Sequential RL and Mixed RL, we find that both incur substantial cross-domain interference at the behavioral and gradient levels, resulting in limited overall gains. To address these challenges, we introduce **M**odular **G**radient **S**urgery (**MGS**), which resolves gradient conflicts at the module level within the transformer. When applied to Llama and Qwen models, MGS achieves average improvements of 4.3 (16.6\%) and 4.5 (11.1\%) points, respectively, over standard multi-task RL across three representative domains (math, general chat, and instruction following). Further analysis demonstrates that MGS remains effective under prolonged training. Overall, our study clarifies the sources of interference in multi-domain RL and presents an effective solution for training general-purpose LRMs.
摘要：强化学习 (RL) 在大型推理模型 (LRM) 的最新进展中发挥了核心作用，在可验证和开放式推理方面取得了巨大进展。然而，由于明显的领域异质性，跨不同领域训练单一通用 LRM 仍然具有挑战性。通过对两种广泛使用的策略（顺序强化学习和混合强化学习）的系统研究，我们发现这两种策略都会在行为和梯度层面产生大量的跨域干扰，导致整体收益有限。为了解决这些挑战，我们引入了 **M**odular **G**radient **S**urgery (**MGS**)，它解决了变压器内模块级别的梯度冲突。当应用于 Llama 和 Qwen 模型时，MGS 在三个代表性领域（数学、一般聊天和指令跟踪）上比标准多任务 RL 分别实现了 4.3 (16.6\%) 和 4.5 (11.1\%) 点的平均改进。进一步分析表明，MGS 在长时间训练下仍然有效。总的来说，我们的研究阐明了多域强化学习中的干扰源，并为训练通用 LRM 提供了有效的解决方案。
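
The abstract does not say how MGS resolves conflicts, only that it does so at the module level. As a hedged illustration of the general idea, the sketch below applies a PCGrad-style projection (a swapped-in, well-known technique, not necessarily the authors' method) separately to each named module. The module names, the two-domain setup, and the gradient values are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(x * y for x, y in zip(u, v))


def project_out_conflict(g: Vector, other: Vector) -> Vector:
    """If g conflicts with other (negative dot product), remove g's component along other."""
    d = dot(g, other)
    if d >= 0:
        return list(g)  # no conflict: leave the gradient unchanged
    scale = d / dot(other, other)  # other is non-zero whenever d < 0
    return [x - scale * y for x, y in zip(g, other)]


def per_module_surgery(
    grads_a: Dict[str, Vector], grads_b: Dict[str, Vector]
) -> Dict[str, Vector]:
    """Resolve conflicts independently within each module, then sum the two domains."""
    merged: Dict[str, Vector] = {}
    for module, ga in grads_a.items():
        gb = grads_b[module]
        pa = project_out_conflict(ga, gb)
        pb = project_out_conflict(gb, ga)
        merged[module] = [x + y for x, y in zip(pa, pb)]
    return merged


# Hypothetical flattened gradients from two domains (e.g. math and chat).
math_grads = {"attn": [1.0, 0.0], "mlp": [1.0, 1.0]}
chat_grads = {"attn": [-1.0, 1.0], "mlp": [0.0, 2.0]}

merged = per_module_surgery(math_grads, chat_grads)
```

In this toy case only the `attn` gradients conflict, so only that module is altered; the `mlp` gradients are simply summed. That per-module behavior is what distinguishes this from a single projection over the whole flattened parameter vector.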

Title: The Shape of Beliefs: Geometry, Dynamics, and Interventions along Representation Manifolds of Language Models' Posteriors

Authors: Raphaël Sarfati, Eric Bigelow, Daniel Wurgaft, Jack Merullo, Atticus Geiger, Owen Lewis, Tom McGrath, Ekdeep Singh Lubana
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02315
Pdf URL: https://arxiv.org/pdf/2602.02315
Copy Paste: [[2602.02315]] The Shape of Beliefs: Geometry, Dynamics, and Interventions along Representation Manifolds of Language Models' Posteriors(https://arxiv.org/abs/2602.02315)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) represent prompt-conditioned beliefs (posteriors over answers and claims), but we lack a mechanistic account of how these beliefs are encoded in representation space, how they update with new evidence, and how interventions reshape them. We study a controlled setting in which Llama-3.2 generates samples from a normal distribution by implicitly inferring its parameters (mean and standard deviation) given only samples from the distribution in context. We find that representations of curved "belief manifolds" for these parameters form with sufficient in-context learning, and we study how the model adapts when the distribution suddenly changes. While standard linear steering often pushes the model off-manifold and induces coupled, out-of-distribution shifts, geometry- and field-aware steering better preserves the intended belief family. Our work demonstrates an example of linear field probing (LFP) as a simple approach to tile the data manifold and make interventions that respect the underlying geometry. We conclude that rich structure emerges naturally in LLMs and that purely linear concept representations are often an inadequate abstraction.
摘要：大语言模型（LLM）代表即时条件信念（答案和主张的后验），但我们缺乏对这些信念如何在表示空间中编码、它们如何用新证据更新以及干预如何重塑它们的机械解释。我们研究了一种受控设置，其中 Llama-3.2 通过隐式推断其参数（均值和标准差）来从正态分布生成样本，仅给出来自上下文分布的样本。我们通过充分的上下文学习找到了这些参数形成的弯曲“置信流形”的表示，并研究了当分布突然变化时模型如何适应。虽然标准线性转向通常会将模型推离流形并引起耦合的、不符合分布的变化，但几何形状和场感知转向可以更好地保留预期的信念族。我们的工作展示了线性场探测（LFP）的一个例子，它是一种平铺数据流形并进行尊重底层几何形状的干预的简单方法。我们的结论是，丰富的结构在法学硕士中自然出现，而纯粹的线性概念表示通常是不充分的抽象。

Title: A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method

Authors: Feiyang Cai, Guijuan He, Yi Hu, Jingjing Wang, Joshua Luo, Tianyu Zhu, Srikanth Pilla, Gang Li, Ling Liu, Feng Luo
Subjects: cs.CL, cs.AI, q-bio.BM
Abstract URL: https://arxiv.org/abs/2602.02320
Pdf URL: https://arxiv.org/pdf/2602.02320
Copy Paste: [[2602.02320]] A Large-Scale Dataset for Molecular Structure-Language Description via a Rule-Regularized Method(https://arxiv.org/abs/2602.02320)
Keywords: language model, llm
Abstract: Molecular function is largely determined by structure. Accurately aligning molecular structure with natural language is therefore essential for enabling large language models (LLMs) to reason about downstream chemical tasks. However, the substantial cost of human annotation makes it infeasible to construct large-scale, high-quality datasets of structure-grounded descriptions. In this work, we propose a fully automated annotation framework for generating precise molecular structure descriptions at scale. Our approach builds upon and extends a rule-based chemical nomenclature parser to interpret IUPAC names and construct enriched, structured XML metadata that explicitly encodes molecular structure. This metadata is then used to guide LLMs in producing accurate natural-language descriptions. Using this framework, we curate a large-scale dataset of approximately $163$k molecule-description pairs. A rigorous validation protocol combining LLM-based and expert human evaluation on a subset of $2,000$ molecules demonstrates a high description precision of $98.6\%$. The resulting dataset provides a reliable foundation for future molecule-language alignment, and the proposed annotation method is readily extensible to larger datasets and broader chemical tasks that rely on structural descriptions.
摘要：分子功能很大程度上由结构决定。因此，准确地将分子结构与自然语言对齐对于大型语言模型 (LLM) 推理下游化学任务至关重要。然而，人工注释的巨大成本使得构建大规模、高质量的基于结构的描述数据集变得不可行。在这项工作中，我们提出了一个全自动注释框架，用于大规模生成精确的分子结构描述。我们的方法建立并扩展了基于规则的化学命名解析器，以解释 IUPAC 名称并构建丰富的结构化 XML 元数据，这些元数据显式编码分子结构。然后，该元数据用于指导 LLM 生成准确的自然语言描述。使用这个框架，我们构建了约 $163$k 个分子-描述对的大规模数据集。在 $2,000$ 个分子的子集上，结合基于 LLM 的评估与人类专家评估的严格验证协议表明，描述精度高达 $98.6\%$。所得的数据集为未来的分子语言对齐提供了可靠的基础，并且所提出的注释方法很容易扩展到更大的数据集和依赖于结构描述的更广泛的化学任务。

Title: Language Steering for Multilingual In-Context Learning

Authors: Neeraja Kirtane, Kuan-Hao Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02326
Pdf URL: https://arxiv.org/pdf/2602.02326
Copy Paste: [[2602.02326]] Language Steering for Multilingual In-Context Learning(https://arxiv.org/abs/2602.02326)
Keywords: language model, llm
Abstract: While multilingual large language models have gained widespread adoption, their performance on non-English languages remains substantially inferior to that on English. This disparity is particularly evident in in-context learning scenarios, where providing demonstrations in English but testing on non-English inputs leads to significant performance degradation. In this paper, we hypothesize that LLMs develop a universal semantic space for understanding languages, where different languages are encoded as distinct directions within this space. Based on this hypothesis, we propose language vectors -- a training-free language steering approach that leverages activation differences between source and target languages to guide model behavior. We steer the model's generations by adding the vector to the intermediate model activations during inference. This shifts the model's internal representations towards the target language space without any parameter updates. We evaluate our method across three datasets and test on a total of 19 languages on three different models. Our results show consistent improvements on multilingual in-context learning over baselines across all tasks and languages tested. Beyond performance gains, hierarchical clustering of steering vectors reveals meaningful linguistic structure aligned with language families. These vectors also successfully transfer across tasks, demonstrating that these representations are task-agnostic.
摘要：虽然多语言大语言模型已得到广泛采用，但它们在非英语语言上的性能仍然大大低于英语。这种差异在情境学习场景中尤其明显，在这种场景中，用英语提供演示，但对非英语输入进行测试会导致性能显着下降。在本文中，我们假设法学硕士开发了一个用于理解语言的通用语义空间，其中不同的语言在该空间内被编码为不同的方向。基于这个假设，我们提出了语言向量——一种免训练的语言引导方法，利用源语言和目标语言之间的激活差异来指导模型行为。我们通过在推理过程中将向量添加到中间模型激活来引导模型生成。这样做是为了使模型的内部表示向目标语言空间转移，而无需任何参数更新。我们在三个数据集上评估我们的方法，并在三个不同模型上测试总共 19 种语言。我们的结果表明，在所有测试的任务和语言中，多语言情境学习相对于基线的持续改进。除了性能提升之外，引导向量的层次聚类还揭示了与语系一致的有意义的语言结构。这些向量还成功地跨任务转移，证明这些表示与任务无关。
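
The steering step in this abstract (add a vector built from source/target activation differences to intermediate activations) can be sketched in a few lines. This is a toy illustration under stated assumptions: the abstract does not specify how the difference is computed, so the difference of mean activations used here, the scaling factor `alpha`, and the two-dimensional activations are all hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(rows: List[Vector]) -> Vector:
    """Component-wise mean over a list of activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def language_vector(source_acts: List[Vector], target_acts: List[Vector]) -> Vector:
    """Assumed construction: mean target activation minus mean source activation."""
    src = mean_vector(source_acts)
    tgt = mean_vector(target_acts)
    return [t - s for s, t in zip(src, tgt)]


def steer(hidden: Vector, vec: Vector, alpha: float = 1.0) -> Vector:
    """Add the scaled language vector to one intermediate activation (no weight updates)."""
    return [h + alpha * v for h, v in zip(hidden, vec)]


# Hypothetical layer activations collected on English and target-language prompts.
english_acts = [[1.0, 0.0], [3.0, 2.0]]
target_acts = [[2.0, 3.0], [4.0, 5.0]]

vec = language_vector(english_acts, target_acts)
steered = steer([0.5, 0.5], vec, alpha=0.5)
```

In a real model the same addition would be applied to a chosen layer's hidden states during the forward pass, for example through a forward hook. Which layer and what value of `alpha` to use are tuning choices the abstract does not specify.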

Title: Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

Authors: Ziwen Xu, Chenyan Wu, Hengyu Sun, Haiwen Hong, Mengru Wang, Yunzhi Yao, Longtao Huang, Hui Xue, Shumin Deng, Zhixuan Chu, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.CV, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2602.02343
Pdf URL: https://arxiv.org/pdf/2602.02343
Copy Paste: [[2602.02343]] Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics(https://arxiv.org/abs/2602.02343)
Keywords: language model, llm
Abstract: Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, guided by this analysis, we introduce a new steering approach, SPLIT, that improves preference while better preserving utility. Code is available at this https URL.
摘要：控制大语言模型 (LLM) 的方法，包括局部权重微调、基于 LoRA 的适应和基于激活的干预，通常是孤立研究的，模糊了它们之间的联系并使比较变得困难。在这项工作中，我们提出了一个统一的观点，将这些干预措施描述为由控制信号引起的动态权重更新，并将它们置于单个概念框架内。基于这一观点，我们提出了一种统一的偏好-效用分析，将控制效果分为偏好（定义为目标概念的倾向）和效用（定义为连贯且任务有效的生成），并使用极性配对对比示例在共享对数优势尺度上衡量两者。在各种方法中，我们观察到偏好和效用之间存在一致的权衡：更强的控制会增加偏好，同时可预见地降低效用。我们通过激活流形视角进一步解释这种行为，其中控制沿着目标概念方向转移表征以增强偏好，而当干预将表征推离模型的有效生成流形时，效用主要下降。最后，我们引入了一种以此分析为指导的新的转向方法 SPLIT，该方法可以提高偏好，同时更好地保留效用。代码可从此 https URL 获取。

Title: Automated Multiple Mini Interview (MMI) Scoring

Authors: Ryan Huynh, Frank Guerin, Alison Callwood
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02360
Pdf URL: https://arxiv.org/pdf/2602.02360
Copy Paste: [[2602.02360]] Automated Multiple Mini Interview (MMI) Scoring(https://arxiv.org/abs/2602.02360)
Keywords: language model, llm, prompt, agent
Abstract: Assessing soft skills such as empathy, ethical judgment, and communication is essential in competitive selection processes, yet human scoring is often inconsistent and biased. While Large Language Models (LLMs) have improved Automated Essay Scoring (AES), we show that state-of-the-art rationale-based fine-tuning methods struggle with the abstract, context-dependent nature of Multiple Mini-Interviews (MMIs), missing the implicit signals embedded in candidate narratives. We introduce a multi-agent prompting framework that breaks down the evaluation process into transcript refinement and criterion-specific scoring. Using 3-shot in-context learning with a large instruct-tuned model, our approach outperforms specialised fine-tuned baselines (Avg QWK 0.62 vs 0.32) and achieves reliability comparable to human experts. We further demonstrate the generalisability of our framework on the ASAP benchmark, where it rivals domain-specific state-of-the-art models without additional training. These findings suggest that for complex, subjective reasoning tasks, structured prompt engineering may offer a scalable alternative to data-intensive fine-tuning, altering how LLMs can be applied to automated assessment.
摘要：评估同理心、道德判断和沟通等软技能在竞争性选拔过程中至关重要，但人工评分往往不一致且存在偏见。虽然大型语言模型 (LLM) 改进了自动作文评分 (AES)，但我们表明，最先进的基于理由的微调方法与多重迷你面试 (MMI) 的抽象、上下文依赖性质相矛盾，错过了候选人叙述中嵌入的隐式信号。我们引入了一个多代理提示框架，将评估过程分解为成绩单细化和特定标准评分。我们的方法使用 3-shot 上下文学习和大型指令调整模型，其性能优于专门的微调基线（平均 QWK 0.62 与 0.32），并实现了与人类专家相当的可靠性。我们进一步证明了我们的框架在 ASAP 基准上的通用性，无需额外训练即可与特定领域的最先进模型相媲美。这些发现表明，对于复杂的主观推理任务，结构化提示工程可以为数据密集型微调提供可扩展的替代方案，从而改变法学硕士应用于自动化评估的方式。
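
The QWK figures quoted in this abstract refer to quadratic weighted kappa, a standard agreement measure for ordinal scores. The sketch below computes it from first principles. The rating scale and the example ratings are hypothetical and are not taken from the paper's data.

```python
from typing import List


def quadratic_weighted_kappa(
    rater_a: List[int], rater_b: List[int], min_rating: int, max_rating: int
) -> float:
    """Quadratic weighted kappa between two lists of integer ratings."""
    k = max_rating - min_rating + 1
    if k == 1:
        return 1.0  # a single category leaves no room for disagreement
    n = len(rater_a)

    # Observed confusion matrix between the two raters.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_rating][b - min_rating] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2  # quadratic disagreement penalty
            expected = hist_a[i] * hist_b[j] / n  # count expected under independence
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        return 1.0  # both raters used the same single category throughout
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical scores on a 1-4 scale.
human = [1, 2, 3, 4]
perfect = quadratic_weighted_kappa(human, [1, 2, 3, 4], 1, 4)
reversed_scores = quadratic_weighted_kappa(human, [4, 3, 2, 1], 1, 4)
```

A value of 1 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement. The quadratic weights penalise large score gaps more heavily than near misses.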

Title: Proof-RM: A Scalable and Generalizable Reward Model for Math Proof

Authors: Haotong Yang, Zitong Wang, Shijia Kang, Siqi Yang, Wenkai Yu, Xu Niu, Yike Sun, Yi Hu, Zhouchen Lin, Muhan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02377
Pdf URL: https://arxiv.org/pdf/2602.02377
Copy Paste: [[2602.02377]] Proof-RM: A Scalable and Generalizable Reward Model for Math Proof(https://arxiv.org/abs/2602.02377)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) have demonstrated strong math reasoning abilities through Reinforcement Learning with *Verifiable Rewards* (RLVR), many advanced mathematical problems are proof-based, with no guaranteed way to determine the authenticity of a proof by simple answer matching. To enable automatic verification, a Reward Model (RM) capable of reliably evaluating full proof processes is required. In this work, we design a *scalable* data-construction pipeline that, with minimal human effort, leverages LLMs to generate a large quantity of high-quality "**question-proof-check**" triplet data. By systematically varying problem sources, generation methods, and model configurations, we create diverse problem-proof pairs spanning multiple difficulty levels, linguistic styles, and error types, subsequently filtered through hierarchical human review for label alignment. Utilizing these data, we train a proof-checking RM, incorporating additional process reward and token weight balance to stabilize the RL process. Our experiments validate the model's scalability and strong performance from multiple perspectives, including reward accuracy, generalization ability and test-time guidance, providing important practical recipes and tools for strengthening LLM mathematical capabilities.
摘要：虽然大型语言模型 (LLM) 通过带有“可验证奖励”(RLVR) 的强化学习展示了强大的数学推理能力，但许多高级数学问题都是基于证明的，无法保证通过简单的答案匹配来确定证明的真实性。为了实现自动验证，需要一个能够可靠地评估完整证明过程的奖励模型（RM）。在这项工作中，我们设计了一个*可扩展的*数据构建管道，以最少的人力，利用法学硕士生成大量高质量的“**问题-证明-检查**”三元组数据。通过系统地改变问题来源、生成方法和模型配置，我们创建了跨越多个难度级别、语言风格和错误类型的多样化问题证明对，随后通过分层人工审查进行过滤以进行标签对齐。利用这些数据，我们训练一个验证 RM，结合额外的过程奖励和代币权重平衡来稳定 RL 过程。我们的实验从奖励准确性、泛化能力和测试时指导等多个角度验证了模型的可扩展性和强大性能，为加强LLM数学能力提供了重要的实用秘诀和工具。

Title: From Sycophancy to Sensemaking: Premise Governance for Human-AI Decision Making

Authors: Raunak Jain, Mudita Khurana, John Stephens, Srinivas Dharmasanam, Shankar Venkataraman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02378
Pdf URL: https://arxiv.org/pdf/2602.02378
Copy Paste: [[2602.02378]] From Sycophancy to Sensemaking: Premise Governance for Human-AI Decision Making(https://arxiv.org/abs/2602.02378)
Keywords: llm
Abstract: As LLMs expand from assistance to decision support, a dangerous pattern emerges: fluent agreement without calibrated judgment. Low-friction assistants can become sycophantic, baking in implicit assumptions and pushing verification costs onto experts, while outcomes arrive too late to serve as reward signals. In deep-uncertainty decisions (where objectives are contested and reversals are costly), scaling fluent agreement amplifies poor commitments faster than it builds expertise. We argue reliable human-AI partnership requires a shift from answer generation to collaborative premise governance over a knowledge substrate, negotiating only what is decision-critical. A discrepancy-driven control loop operates over this substrate: detecting conflicts, localizing misalignment via typed discrepancies (teleological, epistemic, procedural), and triggering bounded negotiation through decision slices. Commitment gating blocks action on uncommitted load-bearing premises unless overridden under logged risk; value-gated challenge allocates probing under interaction cost. Trust then attaches to auditable premises and evidence standards, not conversational fluency. We illustrate with tutoring and propose falsifiable evaluation criteria.
摘要：随着法学硕士从协助扩展到决策支持，一种危险的模式出现了：没有经过校准的判断就达成一致。低摩擦的助手可能会变得阿谀奉承，做出隐含的假设，并将验证成本推给专家，而结果来得太晚而无法作为奖励信号。在高度不确定的决策中（目标存在争议且逆转成本高昂），扩大流畅的协议会比建立专业知识更快地放大不良承诺。我们认为，可靠的人类与人工智能伙伴关系需要从答案生成转变为基于知识基础的协作前提治理，仅协商决策关键的内容。差异驱动的控制循环在此基础上运行：检测冲突，通过类型差异（目的论、认知、程序）定位偏差，并通过决策切片触发有界谈判。承诺门控会阻止对未承诺的承重场所采取行动，除非在记录的风险下被覆盖；价值门控挑战在交互成本下分配探测。信任依赖于可审计的前提和证据标准，而不是对话的流畅性。我们通过辅导进行说明并提出可证伪的评估标准。

Title: ROG: Retrieval-Augmented LLM Reasoning for Complex First-Order Queries over Knowledge Graphs

Authors: Ziyan Zhang, Chao Wang, Zhuo Chen, Chiyi Li, Kai Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02382
Pdf URL: https://arxiv.org/pdf/2602.02382
Copy Paste: [[2602.02382]] ROG: Retrieval-Augmented LLM Reasoning for Complex First-Order Queries over Knowledge Graphs(https://arxiv.org/abs/2602.02382)
Keywords: language model, llm, chain-of-thought
Abstract: Answering first-order logic (FOL) queries over incomplete knowledge graphs (KGs) is difficult, especially for complex query structures that compose projection, intersection, union, and negation. We propose ROG, a retrieval-augmented framework that combines query-aware neighborhood retrieval with large language model (LLM) chain-of-thought reasoning. ROG decomposes a multi-operator query into a sequence of single-operator sub-queries and grounds each step in compact, query-relevant neighborhood evidence. Intermediate answer sets are cached and reused across steps, improving consistency on deep reasoning chains. This design reduces compounding errors and yields more robust inference on complex and negation-heavy queries. Overall, ROG provides a practical alternative to embedding-based logical reasoning by replacing learned operators with retrieval-grounded, step-wise inference. Experiments on standard KG reasoning benchmarks show consistent gains over strong embedding-based baselines, with the largest improvements on high-complexity and negation-heavy query types.
摘要：回答对不完整知识图 (KG) 的一阶逻辑 (FOL) 查询很困难，特别是对于由投影、交集、并集和否定组成的复杂查询结构。我们提出了 ROG，一种检索增强框架，它将查询感知邻域检索与大语言模型 (LLM) 思想链推理相结合。 ROG 将多运算符查询分解为一系列单运算符子查询，并将每个步骤基于紧凑的、与查询相关的邻域证据。中间答案集被缓存并跨步骤重用，从而提高深度推理链的一致性。这种设计减少了复合错误，并对复杂和否定较多的查询产生更稳健的推理。总体而言，ROG 通过用基于检索的逐步推理替换学习运算符，为基于嵌入的逻辑推理提供了一种实用的替代方案。标准 KG 推理基准的实验表明，与基于强大嵌入的基线相比，具有一致的增益，其中在高复杂性和否定重查询类型上的改进最大。
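
The decomposition described in this abstract (a multi-operator query split into single-operator steps, with intermediate answer sets cached and reused) can be sketched on a toy graph. This is only an illustration of the control flow: plain set operations stand in for the retrieval and LLM reasoning that ROG performs, and the graph, the step format, and all names are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]
Step = Tuple[str, str, tuple]  # (result name, operator, arguments)

# Toy knowledge graph as (head, relation, tail) triples.
KG: Set[Triple] = {
    ("acme", "employs", "alice"),
    ("acme", "employs", "bob"),
    ("globex", "employs", "carol"),
    ("paris", "has_resident", "alice"),
    ("paris", "has_resident", "carol"),
    ("berlin", "has_resident", "bob"),
}


def all_entities(kg: Set[Triple]) -> Set[str]:
    return {h for h, _, _ in kg} | {t for _, _, t in kg}


def run_steps(kg: Set[Triple], steps: List[Step]) -> Dict[str, Set[str]]:
    """Execute single-operator steps in order, caching every intermediate answer set."""
    cache: Dict[str, Set[str]] = {}
    universe = all_entities(kg)
    for name, op, args in steps:
        if op == "anchor":
            cache[name] = set(args)
        elif op == "project":
            source, relation = args
            cache[name] = {t for h, r, t in kg if r == relation and h in cache[source]}
        elif op == "intersect":
            cache[name] = set.intersection(*(cache[a] for a in args))
        elif op == "union":
            cache[name] = set.union(*(cache[a] for a in args))
        elif op == "negate":
            cache[name] = universe - cache[args[0]]
        else:
            raise ValueError(f"unknown operator: {op}")
    return cache


# "Who is employed by acme and does not live in paris?" as single-operator steps.
steps: List[Step] = [
    ("a1", "anchor", ("acme",)),
    ("a2", "anchor", ("paris",)),
    ("employees", "project", ("a1", "employs")),
    ("parisians", "project", ("a2", "has_resident")),
    ("not_parisians", "negate", ("parisians",)),
    ("answer", "intersect", ("employees", "not_parisians")),
]

cache = run_steps(KG, steps)
```

Because each step reads its inputs from `cache` by name, a later step can reuse any earlier answer set without recomputing it, which mirrors the caching behavior the abstract attributes to ROG.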

Title: Misconception Diagnosis From Student-Tutor Dialogue: Generate, Retrieve, Rerank

Authors: Joshua Mitton, Prarthana Bhattacharyya, Digory Smith, Thomas Christie, Ralph Abboud, Simon Woodhead
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.02414
Pdf URL: https://arxiv.org/pdf/2602.02414
Copy Paste: [[2602.02414]] Misconception Diagnosis From Student-Tutor Dialogue: Generate, Retrieve, Rerank(https://arxiv.org/abs/2602.02414)
Keywords: language model, llm
Abstract: Timely and accurate identification of student misconceptions is key to improving learning outcomes and pre-empting the compounding of student errors. However, this task is highly dependent on the effort and intuition of the teacher. In this work, we present a novel approach for detecting misconceptions from student-tutor dialogues using large language models (LLMs). First, we use a fine-tuned LLM to generate plausible misconceptions, and then retrieve the most promising candidates among these using embedding similarity with the input dialogue. These candidates are then assessed and re-ranked by another fine-tuned LLM to improve misconception relevance. Empirically, we evaluate our system on real dialogues from an educational tutoring platform. We consider multiple base LLMs, including LLaMA, Qwen, and Claude, in zero-shot and fine-tuned settings. We find that our approach improves predictive performance over baseline models and that fine-tuning both improves generated misconception quality and can outperform larger closed-source models. Finally, we conduct ablation studies to validate the importance of both our generation and reranking steps for misconception generation quality.
摘要：及时准确地识别学生的误解是提高学习成果和防止学生错误复合的关键。然而，这项任务高度依赖于老师的努力和直觉。在这项工作中，我们提出了一种使用大型语言模型（LLM）检测学生与导师对话中的误解的新颖方法。首先，我们使用经过微调的 LLM 来生成看似合理的误解，然后使用与输入对话的嵌入相似性来检索其中最有希望的候选者。然后，另一位经过微调的法学硕士会对这些候选人进行评估和重新排名，以提高误解的相关性。根据经验，我们根据教育辅导平台的真实对话来评估我们的系统。我们在零样本和微调设置上考虑了多个基本 LLM 模型，包括 LLaMA、Qwen 和 Claude。我们发现我们的方法提高了基线模型的预测性能，并且微调提高了生成的误解质量，并且可以超越更大的闭源模型。最后，我们进行消融研究，以验证我们这一代的重要性，并对误解生成质量的步骤进行重新排序。

Title: Large Language Models for Mental Health: A Multilingual Evaluation

Authors: Nishat Raihan, Sadiya Sayara Chowdhury Puspo, Ana-Maria Bucur, Stevie Chancellor, Marcos Zampieri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02440
Pdf URL: https://arxiv.org/pdf/2602.02440
Copy Paste: [[2602.02440]] Large Language Models for Mental Health: A Multilingual Evaluation(https://arxiv.org/abs/2602.02440)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have remarkable capabilities across NLP tasks. However, their performance in multilingual contexts, especially within the mental health domain, has not been thoroughly explored. In this paper, we evaluate proprietary and open-source LLMs on eight mental health datasets in various languages, as well as their machine-translated (MT) counterparts. We compare LLM performance in zero-shot, few-shot, and fine-tuned settings against conventional NLP baselines that do not employ LLMs. In addition, we assess translation quality across language families and typologies to understand its influence on LLM performance. Proprietary LLMs and fine-tuned open-source LLMs achieve competitive F1 scores on several datasets, often surpassing state-of-the-art results. However, performance on MT data is generally lower, and the extent of this decline varies by language and typology. This variation highlights both the strengths of LLMs in handling mental health tasks in languages other than English and their limitations when translation quality introduces structural or lexical mismatches.
摘要：大型语言模型 (LLM) 在 NLP 任务中具有卓越的能力。然而，它们在多语言环境中的表现，特别是在心理健康领域，尚未得到彻底探讨。在本文中，我们在八种不同语言的心理健康数据集及其机器翻译（MT）对应数据集上评估了专有和开源法学硕士。我们将零样本、少样本和微调设置中的 LLM 性能与不采用 LLM 的传统 NLP 基线进行比较。此外，我们还评估跨语言家族和类型的翻译质量，以了解其对法学硕士表现的影响。专有法学硕士和经过微调的开源法学硕士在多个数据集上取得了具有竞争力的 F1 分数，通常超过了最先进的结果。然而，机器翻译数据的性能普遍较低，而且下降的程度因语言和类型而异。这种变化凸显了法学硕士在处理英语以外语言的心理健康任务方面的优势，以及当翻译质量引入结构或词汇不匹配时的局限性。

Title: Abstract Activation Spaces for Content-Invariant Reasoning in Large Language Models

Authors: Gabriele Maraia, Marco Valentino, Fabio Massimo Zanzotto, Leonardo Ranaldi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02462
Pdf URL: https://arxiv.org/pdf/2602.02462
Copy Paste: [[2602.02462]] Abstract Activation Spaces for Content-Invariant Reasoning in Large Language Models(https://arxiv.org/abs/2602.02462)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) often struggle with deductive judgment in syllogistic reasoning, systematically conflating semantic plausibility with formal validity, a phenomenon known as the content effect. This bias persists even when models generate step-wise explanations, indicating that intermediate rationales may inherit the same semantic shortcuts that affect answers. Recent approaches propose mitigating this issue by increasing inference-time structural constraints, either by encouraging abstract intermediate representations or by intervening directly in the model's internal computations; however, reliably suppressing semantic interference remains an open challenge. To make formal deduction less sensitive to semantic content, we introduce a framework for abstraction-guided reasoning that explicitly separates structural inference from lexical semantics. We construct paired content-laden and abstract syllogisms and use the model's activations on abstract inputs to define an abstract reasoning space. We then learn lightweight Abstractors that, from content-conditioned residual-stream states, predict representations aligned with this space and integrate these predictions via multi-layer interventions during the forward pass. Using cross-lingual transfer as a test bed, we show that abstraction-aligned steering reduces content-driven errors and improves validity-sensitive performance. Our results position activation-level abstraction as a scalable mechanism for enhancing the robustness of formal reasoning in LLMs against semantic interference.
摘要：大型语言模型（LLM）经常在三段论推理中与演绎判断作斗争，系统地将语义合理性与形式有效性混为一谈，这种现象称为内容效应。即使模型生成逐步解释，这种偏见仍然存在，这表明中间原理可能继承影响答案的相同语义快捷方式。最近的方法建议通过增加推理时间结构约束来缓解这个问题，或者通过鼓励抽象中间表示，或者直接干预模型的内部计算；然而，可靠地抑制语义干扰仍然是一个开放的挑战。为了使形式演绎对语义内容不那么敏感，我们引入了一个抽象引导推理的框架，该框架明确地将结构推理与词汇语义分开。我们构建配对的内容负载和抽象三段论，并使用模型对抽象输入的激活来定义抽象推理空间。然后，我们学习轻量级抽象器，根据内容条件的残差流状态，预测与该空间对齐的表示，并在前向传递过程中通过多层干预整合这些预测。使用跨语言迁移作为测试平台，我们表明抽象对齐的转向减少了内容驱动的错误并提高了有效性敏感的性能。我们的结果将激活级抽象定位为一种可扩展的机制，用于增强法学硕士形式推理对抗语义干扰的鲁棒性。
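
The Abstractor idea above (predict abstract-space representations from content-conditioned states, then intervene) can be illustrated with a minimal sketch. The abstract does not say what form the Abstractors take or how predictions are mixed back in, so the ridge-regression linear map, the function names, and the interpolation coefficient `alpha` below are all assumptions made purely for illustration.

```python
import numpy as np

def fit_linear_abstractor(content_acts, abstract_acts, ridge=1e-3):
    """Fit W so that content_acts @ W approximates abstract_acts (ridge regression).

    content_acts, abstract_acts: arrays of shape (n_pairs, d), one row per
    paired content-laden / abstract syllogism (hypothetical data layout).
    """
    d = content_acts.shape[1]
    gram = content_acts.T @ content_acts + ridge * np.eye(d)
    return np.linalg.solve(gram, content_acts.T @ abstract_acts)

def steer(hidden, weights, alpha=1.0):
    """Move a residual-stream state toward its predicted abstract counterpart.

    alpha=0 leaves the state untouched; alpha=1 replaces it with the prediction.
    """
    predicted = hidden @ weights
    return (1.0 - alpha) * hidden + alpha * predicted
```

In a real model the `steer` step would run inside forward hooks at several layers; that wiring is omitted here because it depends on the model implementation.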

Title: From Directions to Regions: Decomposing Activations in Language Models via Local Geometry

Authors: Or Shafran, Shaked Ronen, Omri Fahn, Shauli Ravfogel, Atticus Geiger, Mor Geva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02464
Pdf URL: https://arxiv.org/pdf/2602.02464
Copy Paste: [[2602.02464]] From Directions to Regions: Decomposing Activations in Language Models via Local Geometry(https://arxiv.org/abs/2602.02464)
Keywords: language model
Abstract: Activation decomposition methods in language models are tightly coupled to geometric assumptions on how concepts are realized in activation space. Existing approaches search for individual global directions, implicitly assuming linear separability, which overlooks concepts with nonlinear or multi-dimensional structure. In this work, we leverage Mixture of Factor Analyzers (MFA) as a scalable, unsupervised alternative that models the activation space as a collection of Gaussian regions with their local covariance structure. MFA decomposes activations into two compositional geometric objects: the region's centroid in activation space, and the local variation from the centroid. We train large-scale MFAs for Llama-3.1-8B and Gemma-2-2B, and show they capture complex, nonlinear structures in activation space. Moreover, evaluations on localization and steering benchmarks show that MFA outperforms unsupervised baselines, is competitive with supervised localization methods, and often achieves stronger steering performance than sparse autoencoders. Together, our findings position local geometry, expressed through subspaces, as a promising unit of analysis for scalable concept discovery and model control, accounting for complex structures that isolated directions fail to capture.
摘要：语言模型中的激活分解方法与关于如何在激活空间中实现概念的几何假设紧密耦合。现有的方法搜索单个全局方向，隐含地假设线性可分离性，这忽略了具有非线性或多维结构的概念。在这项工作中，我们利用混合因子分析器 (MFA) 作为可扩展的无监督替代方案，将激活空间建模为具有局部协方差结构的高斯区域的集合。 MFA 将激活分解为两个组合几何对象：激活空间中区域的质心，以及相对于质心的局部变化。我们为 Llama-3.1-8B 和 Gemma-2-2B 训练大规模 MFA，并表明它们捕获激活空间中的复杂非线性结构。此外，对定位和转向基准的评估表明，MFA 优于无监督基线，与有监督定位方法具有竞争力，并且通常比稀疏自动编码器具有更强的转向性能。总之，我们的发现将通过子空间表达的局部几何定位为可扩展概念发现和模型控制的有前景的分析单元，解释了孤立方向无法捕获的复杂结构。
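
The two-part decomposition described above (a region centroid plus local variation around it) can be sketched with the standard Mixture of Factor Analyzers equations. This is an illustrative sketch, not the authors' implementation: it assumes equal mixing weights and a diagonal noise variance shared across regions, and the function name is hypothetical.

```python
import numpy as np

def mfa_decompose(x, means, loadings, noise_var):
    """Assign x to its most likely Gaussian region, then split it into
    centroid + low-rank local variation + residual.

    means: list of (d,) centroids; loadings: list of (d, q) factor matrices;
    noise_var: (d,) diagonal noise variances (assumed shared across regions).
    """
    log_lik = []
    for mu, W in zip(means, loadings):
        cov = W @ W.T + np.diag(noise_var)      # region covariance W W^T + Psi
        diff = x - mu
        _, logdet = np.linalg.slogdet(cov)
        log_lik.append(-0.5 * (logdet + diff @ np.linalg.solve(cov, diff)))
    k = int(np.argmax(log_lik))                 # equal mixing weights assumed
    mu, W = means[k], loadings[k]
    # Posterior mean of the latent factors:
    # z = (I + W^T Psi^-1 W)^-1 W^T Psi^-1 (x - mu)
    Wt_Pinv = W.T / noise_var
    z = np.linalg.solve(np.eye(W.shape[1]) + Wt_Pinv @ W, Wt_Pinv @ (x - mu))
    local = W @ z                               # variation within the region's subspace
    residual = x - mu - local                   # part not explained by the factors
    return k, mu, local, residual
```

By construction, `mu + local + residual` reproduces `x` exactly, so the centroid and the local term can be inspected or edited separately.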

Title: Indications of Belief-Guided Agency and Meta-Cognitive Monitoring in Large Language Models

Authors: Noam Steinmetz Yalon, Ariel Goldstein, Liad Mudrik, Mor Geva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02467
Pdf URL: https://arxiv.org/pdf/2602.02467
Copy Paste: [[2602.02467]] Indications of Belief-Guided Agency and Meta-Cognitive Monitoring in Large Language Models(https://arxiv.org/abs/2602.02467)
Keywords: language model, llm
Abstract: Rapid advancements in large language models (LLMs) have raised the question of whether these models possess some form of consciousness. To tackle this challenge, Butlin et al. (2023) introduced a list of indicators for consciousness in artificial systems based on neuroscientific theories. In this work, we evaluate a key indicator from this list, called HOT-3, which tests for agency guided by a general belief-formation and action selection system that updates beliefs based on meta-cognitive monitoring. We view beliefs as representations in the model's latent space that emerge in response to a given input, and introduce a metric to quantify their dominance during generation. Analyzing the dynamics between competing beliefs across models and tasks reveals three key findings: (1) external manipulations systematically modulate internal belief formation, (2) belief formation causally drives the model's action selection, and (3) models can monitor and report their own belief states. Together, these results provide empirical support for the existence of belief-guided agency and meta-cognitive monitoring in LLMs. More broadly, our work lays methodological groundwork for investigating the emergence of agency, beliefs, and meta-cognition in LLMs.
摘要：大型语言模型（LLM）的快速进步引发了这些模型是否拥有某种形式的意识的问题。为了应对这一挑战，Butlin 等人。（2023）介绍了基于神经科学理论的人工系统中的意识指标列表。在这项工作中，我们评估了该列表中的一个关键指标，称为 HOT-3，该指标测试由一般信念形成和行动选择系统指导的机构，该系统根据元认知监控更新信念。我们将信念视为模型潜在空间中响应给定输入而出现的表示，并引入一个度量来量化它们在生成过程中的主导地位。分析跨模型和任务的竞争信念之间的动态揭示了三个关键发现：（1）外部操纵系统地调节内部信念形成，（2）信念形成因果驱动模型的行动选择，以及（3）模型可以监控和报告自己的信念状态。总之，这些结果为法学硕士中信仰引导机构和元认知监控的存在提供了实证支持。更广泛地说，我们的工作为调查法学硕士中代理、信念和元认知的出现奠定了方法论基础。

Title: MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

Authors: Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, Wenya Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.02474
Pdf URL: https://arxiv.org/pdf/2602.02474
Copy Paste: [[2602.02474]] MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents(https://arxiv.org/abs/2602.02474)
Keywords: language model, llm, agent
Abstract: Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills: structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a controller that learns to select a small set of relevant skills, paired with an LLM-based executor that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a designer that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, these components form a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.
摘要：大多数大型语言模型 (LLM) 代理内存系统依赖于一小组静态的、手工设计的操作来提取内存。这些固定的程序硬编码了人类关于存储什么以及如何修改记忆的先验知识，使得它们在不同的交互模式下变得僵化，并且在长期历史中效率低下。为此，我们提出了 \textbf{MemSkill}，它将这些操作重新构建为可学习和可进化的记忆技能、结构化和可重用的例程，用于从交互痕迹中提取、合并和修剪信息。受代理技能设计理念的启发，MemSkill 采用了一个 \emph{controller} 来学习选择一小组相关技能，并与一个基于 LLM 的 \emph{executor} 配对，以产生技能引导的记忆。除了学习技能选择之外，MemSkill 还引入了 \emph{designer}，它会定期审查所选技能产生不正确或不完整记忆的困难情况，并通过提出改进和新技能来发展技能集。 MemSkill 共同形成了一个闭环程序，可以改进技能选择策略和技能集本身。 LoCoMo、LongMemEval、HotpotQA 和 ALFWorld 上的实验表明，MemSkill 在强基线上提高了任务性能，并且在不同设置中具有良好的泛化能力。进一步的分析揭示了技能如何演变，为法学硕士代理人的更具适应性、自我进化的记忆管理提供了见解。

Title: Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

Authors: Xiao Liang, Zhong-Zhi Li, Zhenghao Lin, Eric Hancheng Jiang, Hengyuan Zhang, Yelong Shen, Kai-Wei Chang, Ying Nian Wu, Yeyun Gong, Weizhu Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.02477
Pdf URL: https://arxiv.org/pdf/2602.02477
Copy Paste: [[2602.02477]] Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability(https://arxiv.org/abs/2602.02477)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model's capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs' reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.
摘要：大型语言模型（LLM）通过逐步的思想链（CoT）推理展示了强大的推理能力。然而，在模型能力的限制下，CoT 通常被证明是不够的，并且其严格的顺序性质限制了测试时间的可扩展性。一种潜在的替代方案是分而治之 (DAC) 推理，它将复杂问题分解为子问题，以便更有效地探索解决方案。尽管前景广阔，但我们的分析揭示了通用后训练和 DAC 式推理之间存在根本性的不一致，这限制了模型充分利用这一潜力的能力。为了弥补这一差距并充分释放法学硕士在最具挑战性的任务上的推理能力，我们提出了一个端到端强化学习（RL）框架来增强他们的 DAC 式推理能力。在每一步中，该策略将问题分解为一组子问题，依次解决它们，并根据子问题的解决方案解决原始问题，并将分解和解决方案集成到 RL 训练中。在可比训练下，我们的 DAC 式框架赋予模型更高的性能上限和更强的测试时间可扩展性，在竞赛级别的基准测试中，Pass@1 中的 CoT 超过 8.6%，Pass@32 中的 CoT 超过 6.3%。
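
For readers unfamiliar with the Pass@1 and Pass@32 figures quoted above, the sketch below shows the combinatorial pass@k estimator that is commonly used in code-generation evaluation. The abstract does not state how the authors compute Pass@k, so treating it as this estimator is an assumption.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given n generated samples of which c were correct (requires k <= n)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `k = 1` the estimate reduces to the plain fraction of correct samples, `c / n`.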

Title: RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents

Authors: Jialiang Zhu, Gongrui Zhang, Xiaolong Ma, Lin Xu, Miaosen Zhang, Ruiqi Yang, Song Wang, Kai Qiu, Zhirong Wu, Qi Dai, Ruichun Ma, Bei Liu, Yifan Yang, Chong Luo, Zhengyuan Yang, Linjie Li, Lijuan Wang, Weizhu Chen, Xin Geng, Baining Guo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.02486
Pdf URL: https://arxiv.org/pdf/2602.02486
Copy Paste: [[2602.02486]] RE-TRAC: REcursive TRAjectory Compression for Deep Search Agents(https://arxiv.org/abs/2602.02486)
Keywords: llm, long context, agent
Abstract: LLM-based deep research agents are largely built on the ReAct framework. This linear design makes it difficult to revisit earlier states, branch into alternative search directions, or maintain global awareness under long contexts, often leading to local optima, redundant exploration, and inefficient search. We propose Re-TRAC, an agentic framework that performs cross-trajectory exploration by generating a structured state representation after each trajectory to summarize evidence, uncertainties, failures, and future plans, and conditioning subsequent trajectories on this state representation. This enables iterative reflection and globally informed planning, reframing research as a progressive process. Empirical results show that Re-TRAC consistently outperforms ReAct by 15-20% on BrowseComp with frontier LLMs. For smaller models, we introduce Re-TRAC-aware supervised fine-tuning, achieving state-of-the-art performance at comparable scales. Notably, Re-TRAC shows a monotonic reduction in tool calls and token usage across rounds, indicating progressively targeted exploration driven by cross-trajectory reflection rather than redundant search.
摘要：基于 LLM 的深度研究代理主要建立在 ReAct 框架之上。这种线性设计使得很难重新访问早期状态、分支到替代搜索方向或在长上下文下保持全局意识，通常会导致局部最优、冗余探索和低效搜索。我们提出了 Re-TRAC，这是一种代理框架，它通过在每个轨迹之后生成结构化状态表示来总结证据、不确定性、失败和未来计划，并根据该状态表示调节后续轨迹来执行跨轨迹探索。这使得迭代反思和全球知情规划成为可能，将研究重新定义为一个渐进的过程。实证结果表明，在拥有前沿法学硕士的 BrowseComp 上，Re-TRAC 的表现始终优于 ReAct 15-20%。对于较小的模型，我们引入了 Re-TRAC 感知的监督微调，在可比规模上实现了最先进的性能。值得注意的是，Re-TRAC 显示了各轮中工具调用和代币使用的单调减少，这表明由跨轨迹反射而不是冗余搜索驱动的逐步有针对性的探索。
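
The cross-trajectory loop described above can be outlined as follows. This is only a structural sketch: `run_trajectory` and `summarize` are hypothetical callables standing in for the agent rollout and the state-compression step, and the stopping rule (an answer with no open uncertainties) is an assumption that the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class SearchState:
    """Structured summary carried from one trajectory to the next."""
    evidence: List[str] = field(default_factory=list)
    uncertainties: List[str] = field(default_factory=list)
    failures: List[str] = field(default_factory=list)
    plans: List[str] = field(default_factory=list)

def cross_trajectory_search(
    question: str,
    run_trajectory: Callable[[str, SearchState], Tuple[list, Optional[str]]],
    summarize: Callable[[SearchState, list], SearchState],
    max_rounds: int = 3,
) -> Tuple[Optional[str], SearchState]:
    state = SearchState()
    answer: Optional[str] = None
    for _ in range(max_rounds):
        # Each new trajectory is conditioned on the compressed state so far.
        trajectory, answer = run_trajectory(question, state)
        # Compress the finished trajectory into an updated structured state.
        state = summarize(state, trajectory)
        # Assumed stopping rule: stop once an answer exists and nothing is unresolved.
        if answer is not None and not state.uncertainties:
            break
    return answer, state
```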

Title: Reward-free Alignment for Conflicting Objectives

Authors: Peter Chen, Xiaopeng Li, Xi Chen, Tianyi Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.02495
Pdf URL: https://arxiv.org/pdf/2602.02495
Copy Paste: [[2602.02495]] Reward-free Alignment for Conflicting Objectives(https://arxiv.org/abs/2602.02495)
Keywords: language model, llm
Abstract: Direct alignment methods are increasingly used to align large language models (LLMs) with human preferences. However, many real-world alignment problems involve multiple conflicting objectives, where naive aggregation of preferences can lead to unstable training and poor trade-offs. In particular, weighted loss methods may fail to identify update directions that simultaneously improve all objectives, and existing multi-objective approaches often rely on explicit reward models, introducing additional complexity and distorting user-specified preferences. The contributions of this paper are two-fold. First, we propose a Reward-free Alignment framework for Conflicted Objectives (RACO) that directly leverages pairwise preference data and resolves gradient conflicts via a novel clipped variant of conflict-averse gradient descent. We provide convergence guarantees to Pareto-critical points that respect user-specified objective weights, and further show that clipping can strictly improve convergence rate in the two-objective setting. Second, we improve our method using some heuristics and conduct experiments to demonstrate the compatibility of the proposed framework for LLM alignment. Both qualitative and quantitative evaluations on multi-objective summarization and safety alignment tasks across multiple LLM families (Qwen 3, Llama 3, Gemma 3) show that our method consistently achieves better Pareto trade-offs compared to existing multi-objective alignment baselines.
摘要：直接对齐方法越来越多地用于使大型语言模型 (LLM) 与人类偏好保持一致。然而，许多现实世界的对齐问题涉及多个相互冲突的目标，其中天真的偏好聚合可能会导致不稳定的训练和糟糕的权衡。特别是，加权损失方法可能无法识别同时改进所有目标的更新方向，并且现有的多目标方法通常依赖于显式奖励模型，引入额外的复杂性并扭曲用户指定的偏好。本文的贡献有两个方面。首先，我们提出了一种针对冲突目标的无奖励对齐框架（RACO），该框架直接利用成对偏好数据，并通过一种新型的冲突厌恶梯度下降的剪裁变体来解决梯度冲突。我们为尊重用户指定的目标权重的帕累托临界点提供收敛保证，并进一步表明裁剪可以严格提高双目标设置中的收敛速度。其次，我们使用一些启发式方法改进我们的方法，并进行实验来证明所提出的 LLM 对齐框架的兼容性。对多个 LLM 系列（Qwen 3、Llama 3、Gemma 3）的多目标总结和安全对齐任务的定性和定量评估表明，与现有的多目标对齐基线相比，我们的方法始终实现更好的帕累托权衡。
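
To make the notion of a gradient conflict concrete, the sketch below computes the minimum-norm convex combination of two objective gradients, which is the closed-form two-objective case of the multiple-gradient descent algorithm (MGDA). This is not RACO's clipped conflict-averse update, whose details the abstract does not give; it is a simpler, well-known technique that only illustrates how a common improvement direction can exist when plain averaging fails.

```python
import numpy as np

def min_norm_direction(g1, g2):
    """Return the minimum-norm point d in the convex hull of g1 and g2.

    If d is nonzero, then d @ g1 > 0 and d @ g2 > 0, so a small step along -d
    decreases both objectives to first order.
    """
    diff = g1 - g2
    denom = diff @ diff
    if denom == 0.0:
        return g1.copy()                      # identical gradients, nothing to balance
    # Minimizer of ||gamma * g1 + (1 - gamma) * g2||^2, clipped to [0, 1].
    gamma = np.clip((g2 - g1) @ g2 / denom, 0.0, 1.0)
    return gamma * g1 + (1.0 - gamma) * g2
```

For example, with `g1 = [1, 0]` and `g2 = [-3, 1]` the plain average has a negative inner product with `g1`, meaning an averaged step would worsen the first objective, while the minimum-norm direction has a positive inner product with both gradients.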