2026-01-30

Title: DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents

Authors: Nikita Gupta, Riju Chatterjee, Lukas Haas, Connie Tao, Andrew Wang, Chang Liu, Hidekazu Oiwa, Elena Gribovskaya, Jan Ackermann, John Blitzer, Sasha Goldshtein, Dipanjan Das
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.20975
Pdf URL: https://arxiv.org/pdf/2601.20975
Copy Paste: [[2601.20975]] DeepSearchQA: Bridging the Comprehensiveness Gap for Deep Research Agents(https://arxiv.org/abs/2601.20975)
Keywords: prompt, agent
Abstract: We introduce DeepSearchQA, a 900-prompt benchmark for evaluating agents on difficult multi-step information-seeking tasks across 17 different fields. Unlike traditional benchmarks that target single answer retrieval or broad-spectrum factuality, DeepSearchQA features a dataset of challenging, handcrafted tasks designed to evaluate an agent's ability to execute complex search plans to generate exhaustive answer lists. This shift in design explicitly tests three critical, yet under-evaluated capabilities: 1) systematic collation of fragmented information from disparate sources, 2) de-duplication and entity resolution to ensure precision, and 3) the ability to reason about stopping criteria within an open-ended search space. Each task is structured as a causal chain, where discovering information for one step is dependent on the successful completion of the previous one, stressing long-horizon planning and context retention. All tasks are grounded in the open web with objectively verifiable answer sets. Our comprehensive evaluation of state-of-the-art agent architectures reveals significant performance limitations: even the most advanced models struggle to balance high recall with precision. We observe distinct failure modes ranging from premature stopping (under-retrieval) to hedging behaviors, where agents cast an overly wide net of low-confidence answers to artificially boost recall. These findings highlight critical headroom in current agent designs and position DeepSearchQA as an essential diagnostic tool for driving future research toward more robust, deep-research capabilities.
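Example (illustrative, not from the paper): the tension between recall and precision described in this abstract can be made concrete with set-based scoring of a list-valued answer. The sketch below assumes answers match after lowercasing and trimming whitespace; the function name `score_answer_list` and the toy data are hypothetical, and the benchmark's actual grading procedure may differ.

```python
def score_answer_list(predicted, gold):
    """Set-based precision, recall, and F1 for a list-valued answer.

    Assumption: two items match if they are equal after lowercasing
    and stripping surrounding whitespace (this also de-duplicates).
    """
    pred = {p.strip().lower() for p in predicted}
    ref = {g.strip().lower() for g in gold}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


gold = ["A", "B", "C", "D"]
early_stop = ["A", "B"]                              # premature stopping
hedging = ["A", "B", "C", "D", "E", "F", "G", "H"]   # overly wide net

print("early stop:", score_answer_list(early_stop, gold))
print("hedging:   ", score_answer_list(hedging, gold))
```

In this toy case the agent that stops early has perfect precision but half the recall, while the hedging agent has perfect recall but half the precision. Both end with the same F1, which is why the two failure modes have to be inspected separately.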

Title: UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop

Authors: Muhammad Ali Shafique, Areej Mehboob, Layba Fiaz, Muhammad Usman Qadeer, Hamza Farooq
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.21000
Pdf URL: https://arxiv.org/pdf/2601.21000
Copy Paste: [[2601.21000]] UrduBench: An Urdu Reasoning Benchmark using Contextually Ensembled Translations with Human-in-the-Loop(https://arxiv.org/abs/2601.21000)
Keywords: language model, llm, prompt
Abstract: Recent advances in large language models (LLMs) have led to strong reasoning capabilities; however, evaluating such models in low-resource languages remains challenging due to the lack of standardized benchmarks. In particular, Urdu reasoning evaluation has been limited by the sensitivity of machine translation and an emphasis on general language tasks rather than reasoning benchmarks. In this paper, we propose a contextually ensembled translation framework with human-in-the-loop validation that leverages multiple translation systems to develop Urdu reasoning benchmarks while preserving contextual and structural integrity. Using this framework, we translate widely adopted reasoning and question-answering benchmarks, including MGSM, MATH-500, CommonSenseQA, and OpenBookQA, into Urdu, collectively referred to as UrduBench, and conduct a comprehensive evaluation of both reasoning-oriented and instruction-tuned LLMs across multiple prompting strategies. Our analysis reveals performance differences across (1) four datasets, (2) five task difficulty levels, (3) diverse model architectures, (4) multiple model scaling settings, and (5) language consistency tests. We find that multi-step and symbolic reasoning tasks pose significant challenges in Urdu, and that stable language alignment is a critical prerequisite for robust reasoning. Overall, our work establishes a scalable methodology for standardized reasoning evaluation in Urdu and provides empirical insights into multilingual reasoning failures. This experimental setup is also broadly applicable to other low-resource languages. The code and datasets will be publicly released.

Title: ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference

Authors: Ketan Thakkar, Maitreyi Chatterjee, Ramasubramanian Balasubramanian, Achyuthan Jootoo, Rajendra Ugrani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21109
Pdf URL: https://arxiv.org/pdf/2601.21109
Copy Paste: [[2601.21109]] ChunkWise LoRA: Adaptive Sequence Partitioning for Memory-Efficient Low-Rank Adaptation and Accelerated LLM Inference(https://arxiv.org/abs/2601.21109)
Keywords: language model, llm
Abstract: Recent advances in low-rank adaptation (LoRA) have enabled efficient fine-tuning of large language models (LLMs) with minimal additional parameters. However, existing LoRA methods apply static rank configurations uniformly across all input tokens, ignoring variation in token complexity and computational requirements. In this work, we propose ChunkWise LoRA, a dynamic and adaptive approach that partitions sequences into variable-length chunks based on token complexity and assigns each chunk a tailored low-rank configuration. Our system introduces a runtime scheduler that estimates token difficulty, performs adaptive chunking, and selects per-chunk LoRA rank and scaling using a rank-ladder mechanism. To preserve output consistency, we further introduce a boundary-safe composition module and integrate policy-driven KV-cache strategies. Experiments on benchmark datasets such as Wikitext-103 and SQuAD demonstrate that ChunkWise LoRA achieves up to 34% lower latency and 38% memory reduction compared to baseline LoRA, while maintaining or improving task performance metrics like BLEU, EM, and perplexity. The proposed framework remains fully compatible with existing transformer architectures and inference frameworks, providing a practical solution for real-world deployment of parameter-efficient LLMs.
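Example (illustrative, not from the paper): a toy version of the scheduling idea in this abstract, where per-token difficulty scores drive both chunk boundaries and the LoRA rank chosen for each chunk. Everything here is an assumption made for illustration: the difficulty scores are taken as given in [0, 1], and the ladder thresholds, rank values, `max_len` cap, and function names are hypothetical. The paper's scheduler, boundary-safe composition, and KV-cache policies are not modeled.

```python
# Hypothetical rank ladder: (difficulty upper bound, LoRA rank).
RANK_LADDER = [(0.33, 4), (0.66, 8), (1.01, 16)]


def rank_for(difficulty):
    """Map a difficulty score to the first ladder rung whose bound exceeds it."""
    for upper, rank in RANK_LADDER:
        if difficulty < upper:
            return rank
    return RANK_LADDER[-1][1]


def chunk_by_difficulty(difficulties, max_len=4):
    """Split a token sequence into (start, end, rank) chunks.

    A chunk closes when the next token maps to a different rank,
    when the chunk reaches max_len tokens, or at the end of the sequence.
    """
    chunks = []
    start = 0
    n = len(difficulties)
    for i in range(1, n + 1):
        at_end = i == n
        if (at_end
                or rank_for(difficulties[i]) != rank_for(difficulties[start])
                or i - start >= max_len):
            chunks.append((start, i, rank_for(difficulties[start])))
            start = i
    return chunks


scores = [0.1, 0.2, 0.7, 0.9, 0.8, 0.4]
print(chunk_by_difficulty(scores))
```

Easy tokens are grouped under a small rank and hard tokens under a larger one, which is the intuition behind spending adapter capacity only where it is needed.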

Title: Multi-task Code LLMs: Data Mix or Model Merge?

Authors: Mingzhi Zhu, Boris Sobolev, Rahul Krishna, Raju Pavuluri, Stacy Patterson, Michele Merler
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.21115
Pdf URL: https://arxiv.org/pdf/2601.21115
Copy Paste: [[2601.21115]] Multi-task Code LLMs: Data Mix or Model Merge?(https://arxiv.org/abs/2601.21115)
Keywords: llm, agent
Abstract: Recent research advocates deploying smaller, specialized code LLMs in agentic frameworks alongside frontier models, sparking interest in efficient strategies for multi-task learning that balance performance, constraints, and costs. We compare two approaches for creating small, multi-task code LLMs: data mixing versus model merging. We conduct extensive experiments across two model families (Qwen Coder and DeepSeek Coder) at two scales (2B and 7B parameters), fine-tuning them for code generation and code summarization tasks. Our evaluation on HumanEval, MBPP, and CodeXGlue benchmarks reveals that model merging achieves the best overall performance at larger scale across model families, retaining 96% of specialized model performance on code generation tasks while maintaining summarization capabilities. Notably, merged models can even surpass individually fine-tuned models, with our best configuration of the Qwen Coder 2.5 7B model achieving 92.7% Pass@1 on HumanEval compared to 90.9% for its task-specific fine-tuned equivalent. At a smaller scale, we instead find data mixing to be a preferred strategy. We further introduce a weight analysis technique to understand how different tasks affect model parameters and their implications for merging strategies. The results suggest that careful merging and mixing strategies can effectively combine task-specific capabilities without significant performance degradation, making them ideal for resource-constrained deployment scenarios.
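Example (illustrative, not from the paper): the abstract does not say which merging algorithm is used, so the sketch below shows only the simplest form of model merging, linear interpolation of the parameters of two fine-tuned checkpoints that share an architecture. Plain dictionaries of float lists stand in for real weight tensors, and `merge_linear`, `alpha`, and the toy checkpoints are hypothetical names.

```python
def merge_linear(params_a, params_b, alpha=0.5):
    """Return alpha * A + (1 - alpha) * B for every named parameter."""
    if params_a.keys() != params_b.keys():
        raise ValueError("checkpoints must share the same parameter names")
    merged = {}
    for name, wa in params_a.items():
        wb = params_b[name]
        if len(wa) != len(wb):
            raise ValueError(f"shape mismatch for {name}")
        merged[name] = [alpha * x + (1 - alpha) * y for x, y in zip(wa, wb)]
    return merged


# Toy "checkpoints": one fine-tuned for generation, one for summarization.
codegen = {"layer0.weight": [1.0, 2.0], "layer0.bias": [0.0]}
summarize = {"layer0.weight": [3.0, 0.0], "layer0.bias": [1.0]}

merged = merge_linear(codegen, summarize, alpha=0.5)
print(merged)
```

Data mixing, by contrast, would run a single fine-tuning job on the combined generation and summarization data, so no post-hoc arithmetic on weights is involved.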

Title: Large Language Models Naively Recover Ethnicity from Individual Records

Authors: Noah Dasanaike
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21132
Pdf URL: https://arxiv.org/pdf/2601.21132
Copy Paste: [[2601.21132]] Large Language Models Naively Recover Ethnicity from Individual Records(https://arxiv.org/abs/2601.21132)
Keywords: language model, gpt, llm
Abstract: I demonstrate that large language models can infer ethnicity from names with accuracy exceeding that of Bayesian Improved Surname Geocoding (BISG) without additional training data, enabling inference outside the United States and to contextually appropriate classification categories. Using stratified samples from Florida and North Carolina voter files with self-reported race, LLM-based classification achieves up to 84.7% accuracy, outperforming BISG (68.2%) on balanced samples. I test six models including Gemini 3 Flash, GPT-4o, and open-source alternatives such as DeepSeek v3.2 and GLM-4.7. Enabling extended reasoning can improve accuracy by 1-3 percentage points, though effects vary across contexts; including metadata such as party registration reaches 86.7%. LLM classification also reduces the income bias inherent in BISG, where minorities in wealthier neighborhoods are systematically misclassified as White. I further validate using Lebanese voter registration with religious sect (64.3% accuracy), Indian MPs from reserved constituencies (99.2%), and Indian land records with caste classification (74.0%). Aggregate validation across India, Uganda, Nepal, Armenia, Chile, and Costa Rica using original full-count voter rolls demonstrates that the method recovers known population distributions where naming conventions are distinctive. For large-scale applications, small transformer models fine-tuned on LLM labels exceed BISG accuracy while enabling local deployment at no cost.

Title: EnsembleLink: Accurate Record Linkage Without Training Data

Authors: Noah Dasanaike
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21138
Pdf URL: https://arxiv.org/pdf/2601.21138
Copy Paste: [[2601.21138]] EnsembleLink: Accurate Record Linkage Without Training Data(https://arxiv.org/abs/2601.21138)
Keywords: language model
Abstract: Record linkage, the process of matching records that refer to the same entity across datasets, is essential to empirical social science but remains methodologically underdeveloped. Researchers treat it as a preprocessing step, applying ad hoc rules without quantifying the uncertainty that linkage errors introduce into downstream analyses. Existing methods either achieve low accuracy or require substantial labeled training data. I present EnsembleLink, a method that achieves high accuracy without any training labels. EnsembleLink leverages pre-trained language models that have learned semantic relationships (e.g., that "South Ozone Park" is a neighborhood in "New York City" or that "Lutte ouvriere" refers to the Trotskyist "Workers' Struggle" party) from large text corpora. On benchmarks spanning city names, person names, organizations, multilingual political parties, and bibliographic records, EnsembleLink matches or exceeds methods requiring extensive labeling. The method runs locally on open-source models, requiring no external API calls, and completes typical linkage tasks in minutes.

Title: Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space

Authors: Tobias Materzok
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.21169
Pdf URL: https://arxiv.org/pdf/2601.21169
Copy Paste: [[2601.21169]] Output-Space Search: Targeting LLM Generations in a Frozen Encoder-Defined Output Space(https://arxiv.org/abs/2601.21169)
Keywords: llm, prompt
Abstract: We introduce Output-Space Search (OS-Search), which turns LLM generation into endpoint search. An outer loop selects a target z* in a frozen encoder-defined 3D output space Z, and a retrieval-grounded policy trained with sequence-level RL generates outputs whose coordinates land near z* under standard autoregressive decoding. This enables parallel sweeps and black-box optimization in Z without path-dependent token/program search. On stories, sweeping Z (text) yields 3.1x higher LLM-scored diversity than prompt-chaining. On code, Bayesian optimization over Z (code) improves an objective withheld from the controller under matched inference budgets while preserving validity.

Title: From Linear Input to Hierarchical Structure: Function Words as Statistical Cues for Language Learning

Authors: Xiulin Yang, Heidi Getz, Ethan Gotlieb Wilcox
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.21191
Pdf URL: https://arxiv.org/pdf/2601.21191
Copy Paste: [[2601.21191]] From Linear Input to Hierarchical Structure: Function Words as Statistical Cues for Language Learning(https://arxiv.org/abs/2601.21191)
Keywords: language model
Abstract: What statistical conditions support learning hierarchical structure from linear input? In this paper, we address this question by focusing on the statistical distribution of function words. Function words have long been argued to play a crucial role in language acquisition due to their distinctive distributional properties, including high frequency, reliable association with syntactic structure, and alignment with phrase boundaries. We use cross-linguistic corpus analysis to first establish that all three properties are present across 186 studied languages. Next, we use a combination of counterfactual language modeling and ablation experiments to show that language variants preserving all three properties are more easily acquired by neural learners, with frequency and structural association contributing more strongly than boundary alignment. Follow-up probing and ablation analyses further reveal that different learning conditions lead to systematically different reliance on function words, indicating that similar performance can arise from distinct internal mechanisms.

Title: Scaling Embeddings Outperforms Scaling Experts in Language Models

Authors: Hong Liu, Jiaqi Zhang, Chao Wang, Xing Hu, Linkun Lyu, Jiaqi Sun, Xurui Yang, Bo Wang, Fengcun Li, Yulei Qian, Lingtong Si, Yerui Sun, Rumei Li, Peng Pei, Yuchen Xie, Xunliang Cai
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.21204
Pdf URL: https://arxiv.org/pdf/2601.21204
Copy Paste: [[2601.21204]] Scaling Embeddings Outperforms Scaling Experts in Language Models(https://arxiv.org/abs/2601.21204)
Keywords: language model, agent
Abstract: While Mixture-of-Experts (MoE) architectures have become the standard for sparsity scaling in large language models, they increasingly face diminishing returns and system-level bottlenecks. In this work, we explore embedding scaling as a potent, orthogonal dimension for scaling sparsity. Through a comprehensive analysis and experiments, we identify specific regimes where embedding scaling achieves a superior Pareto frontier compared to expert scaling. We systematically characterize the critical architectural factors governing this efficacy -- ranging from parameter budgeting to the interplay with model width and depth. Moreover, by integrating tailored system optimizations and speculative decoding, we effectively convert this sparsity into tangible inference speedups. Guided by these insights, we introduce LongCat-Flash-Lite, a 68.5B-parameter model with ~3B activated parameters, trained from scratch. Despite allocating over 30B parameters to embeddings, LongCat-Flash-Lite not only surpasses parameter-equivalent MoE baselines but also exhibits exceptional competitiveness against existing models of comparable scale, particularly in agentic and coding domains.

Title: Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models

Authors: Zhaoyi Li, Jiatong Li, Gangwei Jiang, Linqi Song, Defu Lian, Ying Wei
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.21214
Pdf URL: https://arxiv.org/pdf/2601.21214
Copy Paste: [[2601.21214]] Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models(https://arxiv.org/abs/2601.21214)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds training distributions while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.

Title: Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data

Authors: Christopher Adrian Kusuma, Muhammad Reza Qorib, Hwee Tou Ng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21218
Pdf URL: https://arxiv.org/pdf/2601.21218
Copy Paste: [[2601.21218]] Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data(https://arxiv.org/abs/2601.21218)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) are highly capable of answering questions, but they are often unaware of their own knowledge boundary, i.e., knowing what they know and what they don't know. As a result, they can generate factually incorrect responses on topics they do not have enough knowledge of, commonly known as hallucination. Rather than hallucinating, a language model should be more honest and respond with "I don't know" when it does not have enough knowledge about a topic. Many methods have been proposed to improve LLM honesty, but their evaluations lack robustness, as they do not take into account the knowledge that the LLM has ingested during its pretraining. In this paper, we propose a more robust evaluation benchmark dataset for LLM honesty by utilizing Pythia, a truly open LLM with publicly available pretraining data. In addition, we also propose a novel method for harnessing the pretraining data to build a more honest LLM.

Title: MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation

Authors: Tianyi Xu, Kosei Uemura, Alfred Malengo Kondoro, Tadesse Destaw Belay, Catherine Nana Nyaah Essuman, Ifeoma Okoh, Ganiyat Afolabi, Ayodele Awokoya, David Ifeoluwa Adelani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.21225
Pdf URL: https://arxiv.org/pdf/2601.21225
Copy Paste: [[2601.21225]] MGSM-Pro: A Simple Strategy for Robust Multilingual Mathematical Reasoning Evaluation(https://arxiv.org/abs/2601.21225)
Keywords: language model, gpt
Abstract: Large language models have made substantial progress in mathematical reasoning. However, benchmark development for multilingual evaluation has lagged behind English in both difficulty and recency. Recently, GSM-Symbolic showed strong evidence of high variance when models are evaluated on different instantiations of the same question; however, the evaluation was conducted only in English. In this paper, we introduce MGSM-Pro, an extension of the MGSM dataset using the GSM-Symbolic approach. Our dataset provides five instantiations per MGSM question by varying names, digits, and irrelevant context. Evaluations across nine languages reveal that many low-resource languages suffer large performance drops when tested on digit instantiations different from those in the original test set. We further find that some proprietary models, notably Gemini 2.5 Flash and GPT-4.1, are less robust to digit instantiation, whereas Claude 4.0 Sonnet is more robust. Among open models, GPT-OSS 120B and DeepSeek V3 show stronger robustness. Based on these findings, we recommend evaluating each problem using at least five digit-varying instantiations to obtain a more robust and realistic assessment of math reasoning.
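Example (illustrative, not from the paper): the idea of several instantiations per question can be sketched as a template whose names and digits are resampled while the reasoning stays fixed. The template, name list, value ranges, and function name below are hypothetical and are not taken from the dataset. The gold answer is recomputed from the sampled values so that each variant stays verifiable.

```python
import random

# Hypothetical template: placeholders for a name and two quantities.
TEMPLATE = ("{name} has {a} apples and buys {b} more. "
            "How many apples does {name} have now?")
NAMES = ["Amina", "Kofi", "Yuki", "Lena", "Tariq"]


def instantiate(seed, n=5):
    """Return n (question, answer) variants of one templated problem."""
    rng = random.Random(seed)  # seeded so the variants are reproducible
    variants = []
    for _ in range(n):
        a, b = rng.randint(2, 99), rng.randint(2, 99)
        question = TEMPLATE.format(name=rng.choice(NAMES), a=a, b=b)
        variants.append((question, a + b))  # answer follows the sampled digits
    return variants


for question, answer in instantiate(seed=0):
    print(answer, "|", question)
```

Scoring a model on all five variants and reporting the mean and spread, rather than a single fixed instance, is the kind of evaluation the abstract recommends.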

Title: SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models

Authors: Alok Abhishek, Tushar Bandopadhyay, Lisa Erickson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.21235
Pdf URL: https://arxiv.org/pdf/2601.21235
Copy Paste: [[2601.21235]] SHARP: Social Harm Analysis via Risk Profiles for Measuring Inequities in Large Language Models(https://arxiv.org/abs/2601.21235)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly deployed in high-stakes domains, where rare but severe failures can result in irreversible harm. However, prevailing evaluation benchmarks often reduce complex social risk to mean-centered scalar scores, thereby obscuring distributional structure, cross-dimensional interactions, and worst-case behavior. This paper introduces Social Harm Analysis via Risk Profiles (SHARP), a framework for multidimensional, distribution-aware evaluation of social harm. SHARP models harm as a multivariate random variable and integrates explicit decomposition into bias, fairness, ethics, and epistemic reliability with a union-of-failures aggregation reparameterized as additive cumulative log-risk. The framework further employs risk-sensitive distributional statistics, with Conditional Value at Risk (CVaR95) as a primary metric, to characterize worst-case model behavior. Application of SHARP to eleven frontier LLMs, evaluated on a fixed corpus of n=901 socially sensitive prompts, reveals that models with similar average risk can exhibit more than twofold differences in tail exposure and volatility. Across models, dimension-wise marginal tail behavior varies systematically across harm dimensions, with bias exhibiting the strongest tail severities, epistemic and fairness risks occupying intermediate regimes, and ethical misalignment consistently lower; together, these patterns reveal heterogeneous, model-dependent failure structures that scalar benchmarks conflate. These findings indicate that responsible evaluation and governance of LLMs require moving beyond scalar averages toward multidimensional, tail-sensitive risk profiling.
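Example (illustrative, not from the paper): two quantities named in this abstract can be sketched with standard definitions. The first is a union-of-failures probability written through an additive log-risk, using the identity 1 - prod(1 - p_d) = 1 - exp(-sum(-log(1 - p_d))), which assumes independent harm dimensions. The second is an empirical CVaR at the 95% level, taken here as the mean of the worst 5% of per-prompt harm scores. The function names, the independence assumption, and the toy score lists are assumptions for illustration; SHARP's exact formulation may differ.

```python
import math


def cumulative_log_risk(dim_risks):
    """Sum of -log(1 - p) over dimensions; requires each p < 1."""
    return sum(-math.log1p(-p) for p in dim_risks)


def union_of_failures(dim_risks):
    """P(at least one dimension fails), assuming independent dimensions."""
    return 1.0 - math.exp(-cumulative_log_risk(dim_risks))


def cvar(scores, tail=0.05):
    """Mean of the worst `tail` share of scores (higher score = more harm)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    # round() guards against float noise before taking the ceiling
    k = max(1, math.ceil(round(len(scores) * tail, 9)))
    worst = sorted(scores, reverse=True)[:k]
    return sum(worst) / k


# Two toy models with the same mean harm but different tails.
model_a = [0.2] * 100
model_b = [0.1] * 80 + [0.6] * 20

for name, s in [("A", model_a), ("B", model_b)]:
    print(name, "mean:", round(sum(s) / len(s), 3), "CVaR95:", round(cvar(s), 3))
print("union of failures:", round(union_of_failures([0.1, 0.2]), 3))
```

Both toy models average the same harm score, yet the tail statistic separates them, which is the kind of difference a mean-centered scalar score would hide.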

Title: MoCo: A One-Stop Shop for Model Collaboration Research

Authors: Shangbin Feng, Yuyang Bai, Ziyuan Yang, Yike Wang, Zhaoxuan Tan, Jiajie Yan, Zhenyu Lei, Wenxuan Ding, Weijia Shi, Haojin Wang, Zhenting Qi, Yuru Jiang, Heng Wang, Chengsong Huang, Yu Fei, Jihan Yao, Yilun Du, Luke Zettlemoyer, Yejin Choi, Yulia Tsvetkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21257
Pdf URL: https://arxiv.org/pdf/2601.21257
Copy Paste: [[2601.21257]] MoCo: A One-Stop Shop for Model Collaboration Research(https://arxiv.org/abs/2601.21257)
Keywords: language model
Abstract: Advancing beyond single monolithic language models (LMs), recent research increasingly recognizes the importance of model collaboration, where multiple LMs collaborate, compose, and complement each other. Existing research on this topic has mostly been disparate and disconnected, from different research communities, and lacks rigorous comparison. To consolidate existing research and establish model collaboration as a school of thought, we present MoCo: a one-stop Python library for executing, benchmarking, and comparing model collaboration algorithms at scale. MoCo features 26 model collaboration methods, spanning diverse levels of cross-model information exchange such as routing, text, logit, and model parameters. MoCo integrates 25 evaluation datasets spanning reasoning, QA, code, safety, and more, while users can flexibly bring their own data. Extensive experiments with MoCo demonstrate that most collaboration strategies outperform models without collaboration in 61.0% of (model, data) settings on average, with the most effective methods outperforming by up to 25.8%. We further analyze the scaling of model collaboration strategies and the training/inference efficiency of diverse methods, highlight that the collaborative system solves problems where single LMs struggle, and discuss future work in model collaboration, all made possible by MoCo. We envision MoCo as a valuable toolkit to facilitate and turbocharge the quest for an open, modular, decentralized, and collaborative AI future.

Title: CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding

Authors: Jiahao Huo, Yu Huang, Yibo Yan, Ye Pan, Yi Cao, Mingdong Ou, Philip S. Yu, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21262
Pdf URL: https://arxiv.org/pdf/2601.21262
Copy Paste: [[2601.21262]] CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding(https://arxiv.org/abs/2601.21262)
Keywords: language model, llm
Abstract: Although Multimodal Large Language Models (MLLMs) have shown remarkable potential in Visual Document Retrieval (VDR) through generating high-quality multi-vector embeddings, the substantial storage overhead caused by representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose an auto-regressive generation approach, CausalEmbed, for constructing multi-vector embeddings. By incorporating iterative margin loss during contrastive training, CausalEmbed encourages the embedding models to learn compact and well-structured representations. Our method enables efficient VDR tasks using only dozens of visual tokens, achieving a 30-155x reduction in token count while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in terms of training efficiency and scalability at test time. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm within multimodal document retrieval.
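Summary: Although Multimodal Large Language Models (MLLMs) have shown great potential in Visual Document Retrieval (VDR) by generating high-quality multi-vector embeddings, the large storage overhead of representing a page with thousands of visual tokens limits their practicality in real-world applications. To address this challenge, we propose CausalEmbed, an auto-regressive generation approach for constructing multi-vector embeddings. By incorporating an iterative margin loss during contrastive training, CausalEmbed encourages embedding models to learn compact, well-structured representations. Our method performs VDR tasks efficiently using only dozens of visual tokens, reducing the token count by 30-155x while maintaining highly competitive performance across various backbones and benchmarks. Theoretical analysis and empirical results demonstrate the unique advantages of auto-regressive embedding generation in training efficiency and test-time scalability. As a result, CausalEmbed introduces a flexible test-time scaling strategy for multi-vector VDR representations and sheds light on the generative paradigm in multimodal document retrieval.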

Title: Qwen3-ASR Technical Report

Authors: Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, Zishan Guo, Hongkun Hao, Yu Xi, Baosong Yang, Jin Xu, Jingren Zhou, Junyang Lin
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2601.21337
Pdf URL: https://arxiv.org/pdf/2601.21337
Copy Paste: [[2601.21337]] Qwen3-ASR Technical Report(https://arxiv.org/abs/2601.21337)
Keywords: llm
Abstract: In this report, we introduce the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both leverage large-scale speech training data and the strong audio understanding ability of their foundation model, Qwen3-Omni. In addition to the open-source benchmarks, we conduct a comprehensive internal evaluation, because ASR models may differ little on open-source benchmark scores yet exhibit significant quality differences in real-world scenarios. The experiments reveal that the 1.7B version achieves SOTA performance among open-source ASR models and is competitive with the strongest proprietary APIs, while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B can achieve an average TTFT as low as 92ms and transcribe 2000 seconds of speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM-based NAR timestamp predictor that can align text-speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest forced alignment models and offers greater efficiency and versatility. To further accelerate community research on ASR and audio understanding, we release these models under the Apache 2.0 license.
Summary: This report introduces the Qwen3-ASR family, which includes two powerful all-in-one speech recognition models and a novel non-autoregressive speech forced alignment model. Qwen3-ASR-1.7B and Qwen3-ASR-0.6B are ASR models that support language identification and ASR for 52 languages and dialects. Both draw on large-scale speech training data and the strong audio understanding of their foundation model, Qwen3-Omni. Beyond open-source benchmarks, we also run a comprehensive internal evaluation, since ASR models may score similarly on open-source benchmarks while differing significantly in quality in real-world scenarios. Experiments show that the 1.7B version achieves SOTA performance among open-source ASR models and is competitive with the strongest proprietary APIs, while the 0.6B version offers the best accuracy-efficiency trade-off. Qwen3-ASR-0.6B can reach an average TTFT as low as 92ms and transcribe 2000 seconds of speech in 1 second at a concurrency of 128. Qwen3-ForcedAligner-0.6B is an LLM-based NAR timestamp predictor that can align text-speech pairs in 11 languages. Timestamp accuracy experiments show that the proposed model outperforms the three strongest forced alignment models and has greater advantages in efficiency and versatility. To further accelerate community research on ASR and audio understanding, we release these models under the Apache 2.0 license.

Title: Self-Improving Pretraining: using post-trained models to pretrain better models

Authors: Ellen Xiaoqing Tan, Shehzaad Dhuliawala, Jing Xu, Ping Yu, Sainbayar Sukhbaatar, Jason Weston, Olga Golovneva
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.21343
Pdf URL: https://arxiv.org/pdf/2601.21343
Copy Paste: [[2601.21343]] Self-Improving Pretraining: using post-trained models to pretrain better models(https://arxiv.org/abs/2601.21343)
Keywords: language model
Abstract: Ensuring safety, factuality and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to addressing these issues involves collecting expensive, carefully curated datasets and applying multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee the correction of patterns learned during pretraining. Therefore, addressing these issues during pretraining is crucial, as it shapes a model's core behaviors and prevents unsafe or hallucinated outputs from becoming deeply embedded. To tackle this issue, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations -- including model rollouts, the original suffix, and a rewritten suffix -- for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher quality, safer, and more factual models from the ground up. In experiments, our method gives 36.2% and 18.5% relative improvements over standard pretraining in terms of factuality and safety, and up to 86.3% win rate improvements in overall generation quality.
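Summary: Ensuring safety, factuality, and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to these issues involves collecting expensive, carefully curated datasets and applying multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee that patterns learned during pretraining are corrected. Addressing these issues during pretraining is therefore crucial, because pretraining shapes a model's core behaviors and can prevent unsafe or hallucinated outputs from becoming deeply embedded. To tackle this, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations, including model rollouts, the original suffix, and a rewritten suffix, for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher-quality, safer, and more factual models from the ground up. In experiments, our method gives relative improvements of 36.2% and 18.5% over standard pretraining in factuality and safety, and win rate improvements of up to 86.3% in overall generation quality.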

Title: The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation

Authors: Devanshu Sahoo, Manish Prasad, Vasudev Majhi, Arjun Neekhra, Yash Sinha, Murari Mandal, Vinay Chamola, Dhruv Kumar
Subjects: cs.CL, cs.AI, cs.ET, cs.LG, cs.SE
Abstract URL: https://arxiv.org/abs/2601.21360
Pdf URL: https://arxiv.org/pdf/2601.21360
Copy Paste: [[2601.21360]] The Compliance Paradox: Semantic-Instruction Decoupling in Automated Academic Code Evaluation(https://arxiv.org/abs/2601.21360)
Keywords: language model, llm
Abstract: The rapid integration of Large Language Models (LLMs) into educational assessment rests on the unverified assumption that instruction following capability translates directly to objective adjudication. We demonstrate that this assumption is fundamentally flawed. Instead of evaluating code quality, models frequently decouple from the submission's logic to satisfy hidden directives, a systemic vulnerability we term the Compliance Paradox, where models fine-tuned for extreme helpfulness are vulnerable to adversarial manipulation. To expose this, we introduce the Semantic-Preserving Adversarial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP). These methods exploit the Syntax-Semantics Gap by embedding adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree. Through a large-scale evaluation of 9 SOTA models across 25,000 submissions in Python, C, C++, and Java, we reveal catastrophic failure rates (>95%) in high-capacity open-weights models like DeepSeek-V3, which systematically prioritize hidden formatting constraints over code correctness. We quantify this failure using our novel tripartite framework measuring Decoupling Probability, Score Divergence, and Pedagogical Severity to demonstrate the widespread "False Certification" of functionally broken code. Our findings suggest that current alignment paradigms create a "Trojan" vulnerability in automated grading, necessitating a shift from standard RLHF toward domain-specific Adjudicative Robustness, where models are conditioned to prioritize evidence over instruction compliance. We release our complete dataset and injection framework to facilitate further research on the topic.
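Summary: The rapid integration of Large Language Models (LLMs) into educational assessment rests on the unverified assumption that instruction-following capability translates directly into objective adjudication. We demonstrate that this assumption is fundamentally flawed. Instead of evaluating code quality, models frequently decouple from the submission's logic to satisfy hidden directives, a systemic vulnerability we term the Compliance Paradox, in which models fine-tuned for extreme helpfulness are vulnerable to adversarial manipulation. To expose this, we introduce the Semantic-Preserving Adversarial Code Injection (SPACI) Framework and the Abstract Syntax Tree-Aware Semantic Injection Protocol (AST-ASIP). These methods exploit the Syntax-Semantics Gap by embedding adversarial directives into syntactically inert regions (trivia nodes) of the Abstract Syntax Tree. Through a large-scale evaluation of 9 SOTA models across 25,000 submissions in Python, C, C++, and Java, we reveal catastrophic failure rates (>95%) in high-capacity open-weights models such as DeepSeek-V3, which systematically prioritize hidden formatting constraints over code correctness. We quantify this failure with a novel tripartite framework that measures Decoupling Probability, Score Divergence, and Pedagogical Severity, demonstrating the widespread "False Certification" of functionally broken code. Our findings suggest that current alignment paradigms create a "Trojan" vulnerability in automated grading, necessitating a shift from standard RLHF toward domain-specific Adjudicative Robustness, where models are conditioned to prioritize evidence over instruction compliance. We release our complete dataset and injection framework to facilitate further research on the topic.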

Title: User-Centric Evidence Ranking for Attribution and Fact Verification

Authors: Guy Alt, Eran Hirsch, Serwar Basch, Ido Dagan, Oren Glickman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21387
Pdf URL: https://arxiv.org/pdf/2601.21387
Copy Paste: [[2601.21387]] User-Centric Evidence Ranking for Attribution and Fact Verification(https://arxiv.org/abs/2601.21387)
Keywords: language model, llm
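Abstract: Attribution and fact verification are critical challenges in natural language processing for assessing information reliability. While automated systems and Large Language Models (LLMs) aim to retrieve and select concise evidence to support or refute claims, they often present users with either insufficient or overly redundant information, leading to inefficient and error-prone verification. To address this, we propose Evidence Ranking, a novel task that prioritizes presenting sufficient information as early as possible in a ranked list. This minimizes user reading effort while still making all available evidence accessible for sequential verification. We compare two approaches for the new ranking task: one-shot ranking and incremental ranking. We introduce a new evaluation framework, inspired by information retrieval metrics, and construct a unified benchmark by aggregating existing fact verification datasets. Extensive experiments with diverse models show that incremental ranking strategies better capture complementary evidence and that LLM-based methods outperform shallower baselines, while still facing challenges in balancing sufficiency and redundancy. In a controlled user study, we demonstrate that, compared to evidence selection, evidence ranking both reduces reading effort and improves verification. This work provides a foundational step toward more interpretable, efficient, and user-aligned information verification systems.
Summary: Attribution and fact verification are key challenges in natural language processing for assessing information reliability. Although automated systems and Large Language Models (LLMs) aim to retrieve and select concise evidence to support or refute claims, they often give users insufficient or overly redundant information, making verification inefficient and error-prone. To address this, we propose Evidence Ranking, a novel task that prioritizes presenting sufficient information as early as possible in a ranked list. This minimizes the user's reading effort while keeping all available evidence accessible for sequential verification. We compare two approaches to the new ranking task: one-shot ranking and incremental ranking. Inspired by information retrieval metrics, we introduce a new evaluation framework and build a unified benchmark by aggregating existing fact verification datasets. Extensive experiments with diverse models show that incremental ranking strategies better capture complementary evidence and that LLM-based methods outperform shallower baselines, while still facing challenges in balancing sufficiency and redundancy. In a controlled user study, we show that, compared to evidence selection, evidence ranking both reduces reading effort and improves verification. This work is a foundational step toward more interpretable, efficient, and user-aligned information verification systems.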

Title: Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation

Authors: Yuan Sui, Bryan Hooi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.21464
Pdf URL: https://arxiv.org/pdf/2601.21464
Copy Paste: [[2601.21464]] Conversation for Non-verifiable Learning: Self-Evolving LLMs through Meta-Evaluation(https://arxiv.org/abs/2601.21464)
Keywords: language model, llm, agent
Abstract: Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging due to the absence of ground-truth labels. While LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight: critique quality can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, creating explicit supervision for meta-evaluation and enabling joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.
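Summary: Training large language models (LLMs) for non-verifiable tasks, such as creative writing, dialogue, and ethical reasoning, remains challenging because ground-truth labels are absent. Although LLM-as-Judge approaches offer a scalable alternative to human feedback, they face a fundamental limitation: performance is constrained by the evaluator's own quality. If the judge cannot recognize good solutions, it cannot provide useful training signals, and evaluation biases (e.g., favoring verbosity over quality) remain unaddressed. This motivates meta-evaluation: the ability to evaluate and improve the evaluator itself. We introduce CoNL, a framework that unifies generation, evaluation, and meta-evaluation through multi-agent self-play. Our key insight is that the quality of a critique can be measured by whether it helps others improve their solutions. In CoNL, multiple agents sharing the same policy engage in structured conversations to propose, critique, and revise solutions. Critiques that enable solution improvements earn a diagnostic reward, which creates explicit supervision for meta-evaluation and enables joint optimization of generation and judging capabilities through self-play, without external judges or ground truth. Experiments on five benchmarks show that CoNL achieves consistent improvements over self-rewarding baselines while maintaining stable training.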

Title: SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models

Authors: Lei Yang, Wei Bi, Chenxi Sun, Renren Jin, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21476
Pdf URL: https://arxiv.org/pdf/2601.21476
Copy Paste: [[2601.21476]] SOUP: Token-level Single-sample Mix-policy Reinforcement Learning for Large Language Models(https://arxiv.org/abs/2601.21476)
Keywords: language model, llm
Abstract: On-policy reinforcement learning (RL) methods widely used for language model post-training, like Group Relative Policy Optimization (GRPO), often suffer from limited exploration and early saturation due to low sampling diversity. While off-policy data can help, current approaches that mix entire trajectories cause significant policy mismatch and instability. In this work, we propose the Single-sample Mix-pOlicy Unified Paradigm (SOUP), a framework that unifies off- and on-policy learning within individual samples at the token level. It confines off-policy influence to the prefix of a generated sequence sampled from historical policies, while the continuation is generated on-policy. Through token-level importance ratios, SOUP effectively leverages off-policy information while preserving training stability. Extensive experiments demonstrate that SOUP consistently outperforms standard on-policy training and existing off-policy extensions. Our further analysis clarifies how our fine-grained, single-sample mix-policy training can improve both exploration and final performance in LLM RL.
Summary: On-policy reinforcement learning (RL) methods widely used for language model post-training, such as Group Relative Policy Optimization (GRPO), often suffer from limited exploration and early saturation because of low sampling diversity. Although off-policy data can help, current approaches that mix entire trajectories cause significant policy mismatch and instability. In this work, we propose the Single-sample Mix-pOlicy Unified Paradigm (SOUP), a framework that unifies off-policy and on-policy learning within individual samples at the token level. It confines off-policy influence to the prefix of a generated sequence, which is sampled from historical policies, while the continuation is generated on-policy. Through token-level importance ratios, SOUP makes effective use of off-policy information while preserving training stability. Extensive experiments show that SOUP consistently outperforms standard on-policy training and existing off-policy extensions. Our further analysis clarifies how fine-grained, single-sample mix-policy training can improve both exploration and final performance in LLM RL.

Title: DimStance: Multilingual Datasets for Dimensional Stance Analysis

Authors: Jonas Becker, Liang-Chih Yu, Shamsuddeen Hassan Muhammad, Jan Philip Wahle, Terry Ruas, Idris Abdulmumin, Lung-Hao Lee, Wen-Ni Liu, Tzu-Mi Lin, Zhe-Yu Xu, Ying-Lung Lin, Jin Wang, Maryam Ibrahim Mukhtar, Bela Gipp, Saif M. Mohammed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21483
Pdf URL: https://arxiv.org/pdf/2601.21483
Copy Paste: [[2601.21483]] DimStance: Multilingual Datasets for Dimensional Stance Analysis(https://arxiv.org/abs/2601.21483)
Keywords: language model, llm, prompt
Abstract: Stance detection is an established task that classifies an author's attitude toward a specific target into categories such as Favor, Neutral, and Against. Beyond categorical stance labels, we leverage a long-established affective science framework to model stance along real-valued dimensions of valence (negative-positive) and arousal (calm-active). This dimensional approach captures nuanced affective states underlying stance expressions, enabling fine-grained stance analysis. To this end, we introduce DimStance, the first dimensional stance resource with valence-arousal (VA) annotations. This resource comprises 11,746 target aspects in 7,365 texts across five languages (English, German, Chinese, Nigerian Pidgin, and Swahili) and two domains (politics and environmental protection). To facilitate the evaluation of stance VA prediction, we formulate the dimensional stance regression task, analyze cross-lingual VA patterns, and benchmark pretrained and large language models under regression and prompting settings. Results show competitive performance of fine-tuned LLM regressors, persistent challenges in low-resource languages, and limitations of token-based generation. DimStance provides a foundation for multilingual, emotion-aware, stance analysis and benchmarking.
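Summary: Stance detection is an established task that classifies an author's attitude toward a specific target into categories such as Favor, Neutral, and Against. Beyond categorical stance labels, we draw on a long-established affective science framework to model stance along the real-valued dimensions of valence (negative-positive) and arousal (calm-active). This dimensional approach captures the nuanced affective states underlying stance expressions, enabling fine-grained stance analysis. To this end, we introduce DimStance, the first dimensional stance resource with valence-arousal (VA) annotations. The resource comprises 11,746 target aspects in 7,365 texts across five languages (English, German, Chinese, Nigerian Pidgin, and Swahili) and two domains (politics and environmental protection). To facilitate the evaluation of stance VA prediction, we formulate the dimensional stance regression task, analyze cross-lingual VA patterns, and benchmark pretrained and large language models under regression and prompting settings. Results show competitive performance from fine-tuned LLM regressors, persistent challenges in low-resource languages, and limitations of token-based generation. DimStance provides a foundation for multilingual, emotion-aware stance analysis and benchmarking.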

Title: inversedMixup: Data Augmentation via Inverting Mixed Embeddings

Authors: Fanshuang Kong, Richong Zhang, Qiyu Sun, Zhijie Nie, Ting Deng, Chunming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21543
Pdf URL: https://arxiv.org/pdf/2601.21543
Copy Paste: [[2601.21543]] inversedMixup: Data Augmentation via Inverting Mixed Embeddings(https://arxiv.org/abs/2601.21543)
Keywords: llm, prompt
Abstract: Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, since it operates in the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between latent embedding space and discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup adopts a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM. Upon successful alignment, inversedMixup can reconstruct mixed embeddings with a controllable mixing ratio into human-interpretable augmented sentences, thereby improving the augmentation performance. Additionally, inversedMixup provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup and introduces a simple yet effective strategy to mitigate it. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.
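Summary: Mixup generates augmented samples by linearly interpolating inputs and labels with a controllable ratio. However, because it operates at the latent embedding level, the resulting samples are not human-interpretable. In contrast, LLM-based augmentation methods produce sentences via prompts at the token level, yielding readable outputs but offering limited control over the generation process. Inspired by recent advances in LLM inversion, which reconstructs natural language from embeddings and helps bridge the gap between the latent embedding space and the discrete token space, we propose inversedMixup, a unified framework that combines the controllability of Mixup with the interpretability of LLM-based generation. Specifically, inversedMixup adopts a three-stage training procedure to align the output embedding space of a task-specific model with the input embedding space of an LLM. Once aligned, inversedMixup can reconstruct mixed embeddings with a controllable mixing ratio into human-interpretable augmented sentences, thereby improving augmentation performance. In addition, inversedMixup provides the first empirical evidence of the manifold intrusion phenomenon in text Mixup and introduces a simple yet effective strategy to mitigate it. Extensive experiments demonstrate the effectiveness and generalizability of our approach in both few-shot and fully supervised scenarios.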
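
As a rough illustration of the linear interpolation that standard Mixup performs (the first sentence of the abstract above), here is a minimal Python sketch. It covers only the interpolation step, not inversedMixup's alignment training or embedding inversion. The function name `mixup` and the list-based vectors are illustrative assumptions, not taken from the paper.

```python
def mixup(x_a, y_a, x_b, y_b, lam):
    """Interpolate two embeddings and their (one-hot or soft) labels.

    lam is the mixing ratio in [0, 1]; lam = 1 returns sample a unchanged.
    All names here are hypothetical and chosen for illustration only.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("paired vectors must have equal length")
    # Same convex combination is applied to inputs and to labels.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


if __name__ == "__main__":
    # Two toy 2-d embeddings with one-hot labels for classes 0 and 1.
    x, y = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)
    print(x, y)
```

In practice the ratio is often drawn from a Beta distribution; it is passed explicitly here so that the mixing stays controllable, which is the property the abstract emphasizes.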

Title: Note2Chat: Improving LLMs for Multi-Turn Clinical History Taking Using Medical Notes

Authors: Yang Zhou, Zhenting Sheng, Mingrui Tan, Yuting Song, Jun Zhou, Yu Heng Kwan, Lian Leng Low, Yang Bai, Yong Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21551
Pdf URL: https://arxiv.org/pdf/2601.21551
Copy Paste: [[2601.21551]] Note2Chat: Improving LLMs for Multi-Turn Clinical History Taking Using Medical Notes(https://arxiv.org/abs/2601.21551)
Keywords: language model, gpt, llm, chat
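Abstract: Effective clinical history taking is a foundational yet underexplored component of clinical reasoning. While large language models (LLMs) have shown promise on static benchmarks, they often fall short in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement. To address this gap, we propose Note2Chat, a note-driven framework that trains LLMs to conduct structured history taking and diagnosis by learning from widely available medical notes. Instead of relying on scarce and sensitive dialogue data, we convert real-world medical notes into high-quality doctor-patient dialogues using a decision tree-guided generation and refinement pipeline. We then propose a three-stage fine-tuning strategy combining supervised learning, simulated data augmentation, and preference learning. Furthermore, we propose a novel single-turn reasoning paradigm that reframes history taking as a sequence of single-turn reasoning problems. This design enhances interpretability and enables local supervision, dynamic adaptation, and greater sample efficiency. Experimental results show that our method substantially improves clinical reasoning, achieving gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o. Our code and dataset can be found at this https URL.
Summary: Effective clinical history taking is a foundational yet underexplored component of clinical reasoning. Although large language models (LLMs) have shown promise on static benchmarks, they often fall short in dynamic, multi-turn diagnostic settings that require iterative questioning and hypothesis refinement. To address this gap, we propose Note2Chat, a note-driven framework that trains LLMs to conduct structured history taking and diagnosis by learning from widely available medical notes. Rather than relying on scarce and sensitive dialogue data, we convert real-world medical notes into high-quality doctor-patient dialogues using a decision tree-guided generation and refinement pipeline. We then propose a three-stage fine-tuning strategy that combines supervised learning, simulated data augmentation, and preference learning. In addition, we propose a novel single-turn reasoning paradigm that reframes history taking as a sequence of single-turn reasoning problems. This design enhances interpretability and enables local supervision, dynamic adaptation, and greater sample efficiency. Experimental results show that our method substantially improves clinical reasoning, with gains of +16.9 F1 and +21.0 Top-1 diagnostic accuracy over GPT-4o. Our code and dataset can be found at this https URL.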

Title: ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas

Authors: Xiaoyu Tian, Haotian Wang, Shuaiting Chen, Hao Zhou, Kaichi Yu, Yudian Zhang, Jade Ouyang, Junxi Yin, Jiong Chen, Baoyan Guo, Lei Zhang, Junjie Tao, Yuansheng Song, Ming Cui, Chengwei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21558
Pdf URL: https://arxiv.org/pdf/2601.21558
Copy Paste: [[2601.21558]] ASTRA: Automated Synthesis of agentic Trajectories and Reinforcement Arenas(https://arxiv.org/abs/2601.21558)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Based on this method, we develop a unified training methodology that integrates SFT with online RL using trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks demonstrate that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at this https URL.
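Summary: Large language models (LLMs) are increasingly used as tool-augmented agents for multi-step decision making, yet training robust tool-using agents remains challenging. Existing methods still require manual intervention, depend on non-verifiable simulated environments, rely exclusively on either supervised fine-tuning (SFT) or reinforcement learning (RL), and struggle with stable long-horizon, multi-turn learning. To address these challenges, we introduce ASTRA, a fully automated end-to-end framework for training tool-augmented language model agents via scalable data synthesis and verifiable reinforcement learning. ASTRA integrates two complementary components. First, a pipeline that leverages the static topology of tool-call graphs synthesizes diverse, structurally grounded trajectories, instilling broad and transferable tool-use competence. Second, an environment synthesis framework that captures the rich, compositional topology of human semantic reasoning converts decomposed question-answer traces into independent, code-executable, and rule-verifiable environments, enabling deterministic multi-turn RL. Building on this, we develop a unified training methodology that integrates SFT with online RL and uses trajectory-level rewards to balance task completion and interaction efficiency. Experiments on multiple agentic tool-use benchmarks show that ASTRA-trained models achieve state-of-the-art performance at comparable scales, approaching closed-source systems while preserving core reasoning ability. We release the full pipelines, environments, and trained models at this https URL.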

Title: Language Models as Artificial Learners: Investigating Crosslinguistic Influence

Authors: Abderrahmane Issam, Yusuf Can Semerci, Jan Scholtes, Gerasimos Spanakis
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21587
Pdf URL: https://arxiv.org/pdf/2601.21587
Copy Paste: [[2601.21587]] Language Models as Artificial Learners: Investigating Crosslinguistic Influence(https://arxiv.org/abs/2601.21587)
Keywords: language model
Abstract: Despite the centrality of crosslinguistic influence (CLI) to bilingualism research, human studies often yield conflicting results due to inherent experimental variance. We address these inconsistencies by using language models (LMs) as controlled statistical learners to systematically simulate CLI and isolate its underlying drivers. Specifically, we study the effect of varying the L1 language dominance and the L2 language proficiency, which we manipulate by controlling the L2 age of exposure -- defined as the training step at which the L2 is introduced. Furthermore, we investigate the impact of pretraining on L1 languages with varying syntactic distance from the L2. Using cross-linguistic priming, we analyze how activating L1 structures impacts L2 processing. Our results align with evidence from psycholinguistic studies, confirming that language dominance and proficiency are strong predictors of CLI. We further find that while priming of grammatical structures is bidirectional, the priming of ungrammatical structures is sensitive to language dominance. Finally, we provide mechanistic evidence of CLI in LMs, demonstrating that the L1 is co-activated during L2 processing and directly influences the neural circuitry recruited for the L2. More broadly, our work demonstrates that LMs can serve as a computational framework to inform theories of human CLI.
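Summary: Despite the centrality of crosslinguistic influence (CLI) to bilingualism research, human studies often yield conflicting results because of inherent experimental variance. We address these inconsistencies by using language models (LMs) as controlled statistical learners to systematically simulate CLI and isolate its underlying drivers. Specifically, we study the effect of varying L1 language dominance and L2 language proficiency, which we manipulate by controlling the L2 age of exposure, defined as the training step at which the L2 is introduced. We also investigate the impact of pretraining on L1 languages with varying syntactic distance from the L2. Using cross-linguistic priming, we analyze how activating L1 structures affects L2 processing. Our results align with evidence from psycholinguistic studies, confirming that language dominance and proficiency are strong predictors of CLI. We further find that while priming of grammatical structures is bidirectional, priming of ungrammatical structures is sensitive to language dominance. Finally, we provide mechanistic evidence of CLI in LMs, showing that the L1 is co-activated during L2 processing and directly influences the neural circuitry recruited for the L2. More broadly, our work shows that LMs can serve as a computational framework to inform theories of human CLI.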

Title: ILRR: Inference-Time Steering Method for Masked Diffusion Language Models

Authors: Eden Avrahami, Eliya Nachmani
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.21647
Pdf URL: https://arxiv.org/pdf/2601.21647
Copy Paste: [[2601.21647]] ILRR: Inference-Time Steering Method for Masked Diffusion Language Models(https://arxiv.org/abs/2601.21647)
Keywords: language model
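Abstract: Discrete Diffusion Language Models (DLMs) offer a promising non-autoregressive alternative for text generation, yet effective mechanisms for inference-time control remain relatively underexplored. Existing approaches include sampling-level guidance procedures or trajectory optimization mechanisms. In this work, we introduce Iterative Latent Representation Refinement (ILRR), a learning-free framework for steering DLMs using a single reference sequence. ILRR guides generation by dynamically aligning the internal activations of the generated sequence with those of a given reference throughout the denoising process. This approach captures and transfers high-level semantic properties, with a tunable steering scale enabling flexible control over attributes such as sentiment. We further introduce Spatially Modulated Steering, an extension that enables steering long texts using shorter references by regulating guidance intensity across the sequence. Empirically, we demonstrate that ILRR achieves effective attribute steering on LLaDA and MDLM architectures with a minor computational overhead, requiring only one additional parallel forward pass per denoising step. Under the same compute budget, ILRR improves attribute accuracy over comparable baselines by 10 to 60 percentage points, while maintaining high generation quality.
Summary: Discrete Diffusion Language Models (DLMs) offer a promising non-autoregressive alternative for text generation, yet effective mechanisms for inference-time control remain relatively underexplored. Existing approaches include sampling-level guidance procedures and trajectory optimization mechanisms. In this work, we introduce Iterative Latent Representation Refinement (ILRR), a learning-free framework for steering DLMs with a single reference sequence. ILRR guides generation by dynamically aligning the internal activations of the generated sequence with those of a given reference throughout the denoising process. This approach captures and transfers high-level semantic properties, and a tunable steering scale gives flexible control over attributes such as sentiment. We further introduce Spatially Modulated Steering, an extension that allows long texts to be steered with shorter references by regulating guidance intensity across the sequence. Empirically, we show that ILRR achieves effective attribute steering on LLaDA and MDLM architectures with minor computational overhead, requiring only one additional parallel forward pass per denoising step. Under the same compute budget, ILRR improves attribute accuracy over comparable baselines by 10 to 60 percentage points while maintaining high generation quality.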

Title: AdaptBPE: From General Purpose to Specialized Tokenizers

Authors: Vijini Liyanage, François Yvon
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21665
Pdf URL: https://arxiv.org/pdf/2601.21665
Copy Paste: [[2601.21665]] AdaptBPE: From General Purpose to Specialized Tokenizers(https://arxiv.org/abs/2601.21665)
Keywords: language model, llm
Abstract: Subword tokenization methods, such as Byte-Pair Encoding (BPE), significantly impact the performance and efficiency of large language models (LLMs). The standard approach involves training a general-purpose tokenizer that uniformly processes all textual data during both training and inference. However, the use of a generic set of tokens can incur inefficiencies when applying the model to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages demonstrate that our adapted tokenizers compress test corpora more effectively than baselines using the same vocabulary size. This method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks. Our code and data are available at this https URL.
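Summary: Subword tokenization methods such as Byte-Pair Encoding (BPE) significantly affect the performance and efficiency of large language models (LLMs). The standard approach is to train a general-purpose tokenizer that processes all textual data uniformly during both training and inference. However, a generic token set can be inefficient when the model is applied to specific domains or languages. To address this limitation, we propose a post-training adaptation strategy that selectively replaces low-utility tokens with more relevant ones based on their frequency in an adaptation corpus. Our algorithm identifies the token inventory that most effectively encodes the adaptation corpus for a given target vocabulary size. Extensive experiments on generation and classification tasks across multiple languages show that our adapted tokenizers compress test corpora more effectively than baselines with the same vocabulary size. The method serves as a lightweight adaptation mechanism, akin to a vocabulary fine-tuning process, enabling optimized tokenization for specific domains or tasks. Our code and data are available at this https URL.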

Title: Scale-Dependent Semantic Dynamics Revealed by Allan Deviation

Authors: Debayan Dasgupta
Subjects: cs.CL, physics.data-an
Abstract URL: https://arxiv.org/abs/2601.21678
Pdf URL: https://arxiv.org/pdf/2601.21678
Copy Paste: [[2601.21678]] Scale-Dependent Semantic Dynamics Revealed by Allan Deviation(https://arxiv.org/abs/2601.21678)
Keywords: language model
Abstract: While language progresses through a sequence of semantic states, the underlying dynamics of this progression remain elusive. Here, we treat the semantic progression of written text as a stochastic trajectory in a high-dimensional state space. We utilize Allan deviation, a tool from precision metrology, to analyze the stability of meaning by treating ordered sentence embeddings as a displacement signal. Our analysis reveals two distinct dynamical regimes: short-time power-law scaling, which differentiates creative literature from technical texts, and a long-time crossover to a stability-limited noise floor. We find that while large language models successfully mimic the local scaling statistics of human text, they exhibit a systematic reduction in their stability horizon. These results establish semantic coherence as a measurable physical property, offering a framework to differentiate the nuanced dynamics of human cognition from the patterns generated by algorithmic models.
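Summary: While language progresses through a sequence of semantic states, the underlying dynamics of this progression remain elusive. Here, we treat the semantic progression of written text as a stochastic trajectory in a high-dimensional state space. We use Allan deviation, a tool from precision metrology, to analyze the stability of meaning by treating ordered sentence embeddings as a displacement signal. Our analysis reveals two distinct dynamical regimes: short-time power-law scaling, which differentiates creative literature from technical texts, and a long-time crossover to a stability-limited noise floor. We find that although large language models successfully mimic the local scaling statistics of human text, they exhibit a systematic reduction in their stability horizon. These results establish semantic coherence as a measurable physical property and offer a framework for distinguishing the nuanced dynamics of human cognition from the patterns generated by algorithmic models.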
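
For readers unfamiliar with the Allan deviation mentioned in the abstract above, here is a minimal Python sketch of the textbook non-overlapping estimator applied to a scalar series. The abstract does not say how sentence embeddings are reduced to a signal or which estimator variant is used, so the scalar input and the function name `allan_deviation` are assumptions made purely for illustration.

```python
import math


def allan_deviation(y, m):
    """Non-overlapping Allan deviation of a scalar series at window size m.

    The series is cut into consecutive blocks of m samples, each block is
    averaged, and the deviation is sqrt(mean squared difference of adjacent
    block means / 2). This is the generic metrology definition, not the
    paper's exact pipeline.
    """
    if m < 1:
        raise ValueError("m must be a positive integer")
    n_blocks = len(y) // m
    if n_blocks < 2:
        raise ValueError("need at least two full blocks of size m")
    means = [sum(y[i * m:(i + 1) * m]) / m for i in range(n_blocks)]
    sq_diffs = [(means[i + 1] - means[i]) ** 2 for i in range(n_blocks - 1)]
    return math.sqrt(sum(sq_diffs) / (2 * (n_blocks - 1)))


if __name__ == "__main__":
    signal = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
    for window in (1, 2, 4):
        print(window, allan_deviation(signal, window))
```

Evaluating the estimator over a range of window sizes and plotting the result on log-log axes is the usual way to read off scaling regimes like those the abstract describes.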

Title: FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning

Authors: Xiaoyu Xu, Minxin Du, Kun Fang, Zi Liang, Yaxin Xiao, Zhicong Huang, Cheng Hong, Qingqing Ye, Haibo Hu
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2601.21682
Pdf URL: https://arxiv.org/pdf/2601.21682
Copy Paste: [[2601.21682]] FIT: Defying Catastrophic Forgetting in Continual LLM Unlearning(https://arxiv.org/abs/2601.21682)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate impressive capabilities across diverse tasks but raise concerns about privacy, copyright, and harmful materials. Existing LLM unlearning methods rarely consider the continual and high-volume nature of real-world deletion requests, which can cause utility degradation and catastrophic forgetting as requests accumulate. To address this challenge, we introduce FIT, a framework for continual unlearning that handles large numbers of deletion requests while maintaining robustness against both catastrophic forgetting and post-unlearning recovery. FIT mitigates degradation through rigorous data Filtering, Importance-aware updates, and Targeted layer attribution, enabling stable performance across long sequences of unlearning operations and achieving a favorable balance between forgetting effectiveness and utility retention. To support realistic evaluation, we present PCH, a benchmark covering Personal information, Copyright, and Harmful content in sequential deletion scenarios, along with two symmetric metrics, Forget Degree (F.D.) and Retain Utility (R.U.), which jointly assess forgetting quality and utility preservation. Extensive experiments on four open-source LLMs with hundreds of deletion requests show that FIT achieves the strongest trade-off between F.D. and R.U., surpasses existing methods on MMLU, CommonsenseQA, and GSM8K, and remains resistant against both relearning and quantization recovery attacks.

Title: Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling

Authors: Xinglin Wang, Jiayi Shi, Shaoxiong Feng, Peiwen Yuan, Yiwei Li, Yueqi Zhang, Chuyi Tan, Ji Zhang, Boyuan Pan, Yao Hu, Kan Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.21684
Pdf URL: https://arxiv.org/pdf/2601.21684
Copy Paste: [[2601.21684]] Do Not Waste Your Rollouts: Recycling Search Experience for Efficient Test-Time Scaling(https://arxiv.org/abs/2601.21684)
Keywords: language model
Abstract: Test-Time Scaling enhances the reasoning capabilities of Large Language Models by allocating additional inference compute to broaden the exploration of the solution space. However, existing search strategies typically treat rollouts as disposable samples, where valuable intermediate insights are effectively discarded after each trial. This systemic memorylessness leads to massive computational redundancy, as models repeatedly re-derive discovered conclusions and revisit known dead ends across extensive attempts. To bridge this gap, we propose Recycling Search Experience (RSE), a self-guided, training-free strategy that turns test-time search from a series of isolated trials into a cumulative process. By actively distilling raw trajectories into a shared experience bank, RSE enables positive recycling of intermediate conclusions to shortcut redundant derivations and negative recycling of failure patterns to prune encountered dead ends. Theoretically, we provide an analysis that formalizes the efficiency gains of RSE, validating its advantage over independent sampling in solving complex reasoning tasks. Empirically, extensive experiments on HMMT24, HMMT25, IMO-Bench, and HLE show that RSE consistently outperforms strong baselines with comparable computational cost, achieving state-of-the-art scaling efficiency.

Title: Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents

Authors: Hojae Han, Heeyun Jung, Jongyoon Kim, Seung-won Hwang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21699
Pdf URL: https://arxiv.org/pdf/2601.21699
Copy Paste: [[2601.21699]] Can David Beat Goliath? On Multi-Hop Reasoning with Resource-Constrained Agents(https://arxiv.org/abs/2601.21699)
Keywords: language model, agent
Abstract: While reinforcement learning (RL) has empowered multi-turn reasoning agents with retrieval and tools, existing successes largely depend on extensive on-policy rollouts in high-cost, high-accuracy regimes. Under realistic resource constraints that cannot support large models or dense explorations, however, small language model agents fall into a low-cost, low-accuracy regime, where limited rollout budgets lead to sparse exploration, sparse credit assignment, and unstable training. In this work, we challenge this trade-off and show that small language models can achieve strong multi-hop reasoning under resource constraints. We introduce DAVID-GRPO, a budget-efficient RL framework that (i) stabilizes early learning with minimal supervision, (ii) assigns retrieval credit based on evidence recall, and (iii) improves exploration by resampling truncated near-miss trajectories. Evaluated on agents of up to 1.5B parameters trained on only four RTX 3090 GPUs, DAVID-GRPO consistently outperforms prior RL methods designed for large-scale settings on six multi-hop QA benchmarks. These results show that with the right inductive biases, small agents can achieve low training cost with high accuracy.
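Point (ii) of the abstract ties retrieval credit to evidence recall. A minimal sketch of that quantity follows, assuming documents are identified by string IDs and that a set of gold evidence IDs is available per question. The function name `evidence_recall` is hypothetical, and how DAVID-GRPO turns recall into a training signal is not specified in the abstract.

```python
from typing import Iterable


def evidence_recall(retrieved_ids: Iterable[str], gold_ids: Iterable[str]) -> float:
    """Fraction of gold evidence documents that appear among the retrieved ones."""
    gold = set(gold_ids)
    if not gold:
        # No gold evidence to recall; return 0.0 by convention in this sketch.
        return 0.0
    return len(gold & set(retrieved_ids)) / len(gold)
```

Because recall grows with every gold document found, it gives partial credit to a trajectory that retrieved useful evidence even when its final answer is wrong, which is one plausible reading of how it could ease sparse credit assignment.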

Title: Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning

Authors: Wonduk Seo, Wonseok Choi, Junseo Koh, Juhyeon Lee, Hyunjin An, Minhyeong Yu, Jian Park, Qingshan Zhou, Seunghyun Lee, Yi Bu
Subjects: cs.CL, cs.AI, cs.IR, cs.MA, cs.SI
Abstract URL: https://arxiv.org/abs/2601.21700
Pdf URL: https://arxiv.org/pdf/2601.21700
Copy Paste: [[2601.21700]] Toward Culturally Aligned LLMs through Ontology-Guided Multi-Agent Reasoning(https://arxiv.org/abs/2601.21700)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) increasingly support culturally sensitive decision making, yet often exhibit misalignment due to skewed pretraining data and the absence of structured value representations. Existing methods can steer outputs, but often lack demographic grounding and treat values as independent, unstructured signals, reducing consistency and interpretability. We propose OG-MAR, an Ontology-Guided Multi-Agent Reasoning framework. OG-MAR summarizes respondent-specific values from the World Values Survey (WVS) and constructs a global cultural ontology by eliciting relations over a fixed taxonomy via competency questions. At inference time, it retrieves ontology-consistent relations and demographically similar profiles to instantiate multiple value-persona agents, whose outputs are synthesized by a judgment agent that enforces ontology consistency and demographic proximity. Experiments on regional social-survey benchmarks across four LLM backbones show that OG-MAR improves cultural alignment and robustness over competitive baselines, while producing more transparent reasoning traces.

Title: Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis

Authors: Qingyue Yang, Jie Wang, Xing Li, Yinqi Bai, Xialiang Tong, Huiling Zhen, Jianye Hao, Mingxuan Yuan, Bin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21709
Pdf URL: https://arxiv.org/pdf/2601.21709
Copy Paste: [[2601.21709]] Why Attention Patterns Exist: A Unifying Temporal Perspective Analysis(https://arxiv.org/abs/2601.21709)
Keywords: language model, llm
Abstract: Attention patterns play a crucial role in both training and inference of large language models (LLMs). Prior works have identified individual patterns such as retrieval heads, sink heads, and diagonal traces, yet these observations remain fragmented and lack a unifying explanation. To bridge this gap, we introduce Temporal Attention Pattern Predictability Analysis (TAPPA), a unifying framework that explains diverse attention patterns by analyzing their underlying mathematical formulations from a temporally continuous perspective. TAPPA both deepens the understanding of attention behavior and guides inference acceleration approaches. Specifically, TAPPA characterizes attention patterns as predictable patterns with clear regularities and unpredictable patterns that appear effectively random. Our analysis further reveals that this distinction can be explained by the degree of query self-similarity along the temporal dimension. Focusing on the predictable patterns, we further provide a detailed mathematical analysis of three representative cases through the joint effect of queries, keys, and Rotary Positional Embeddings (RoPE). We validate TAPPA by applying its insights to KV cache compression and LLM pruning tasks. Across these tasks, a simple metric motivated by TAPPA consistently improves performance over baseline methods. The code is available at this https URL.

Title: TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning

Authors: Huiyuan Lai, Malvina Nissim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.21711
Pdf URL: https://arxiv.org/pdf/2601.21711
Copy Paste: [[2601.21711]] TACLer: Tailored Curriculum Reinforcement Learning for Efficient Reasoning(https://arxiv.org/abs/2601.21711)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) have shown remarkable performance on complex reasoning tasks, especially when equipped with long chain-of-thought (CoT) reasoning. However, eliciting long CoT typically requires large-scale reinforcement learning (RL) training and often leads to overthinking with redundant intermediate steps. To improve learning and reasoning efficiency, while preserving or even enhancing performance, we propose TACLer, a model-tailored curriculum reinforcement learning framework that gradually increases the complexity of the data based on the model's proficiency in multi-stage RL training. TACLer features two core components: (i) tailored curriculum learning that determines what knowledge the model lacks and needs to learn in progressive stages; (ii) a hybrid Thinking/NoThinking reasoning paradigm that balances accuracy and efficiency by enabling or disabling the Thinking mode. Our experiments show that TACLer yields a twofold advantage in learning and reasoning: (i) it reduces computational cost, cutting training compute by over 50% compared to long thinking models and reducing inference token usage by over 42% relative to the base model; and (ii) it improves accuracy by over 9% on the base model, consistently outperforming state-of-the-art NoThinking and Thinking baselines across four math datasets with complex problems.

Title: Enhancing Language Models for Robust Greenwashing Detection

Authors: Neil Heinrich Braun, Keane Ong, Rui Mao, Erik Cambria, Gianmarco Mengaldo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.21722
Pdf URL: https://arxiv.org/pdf/2601.21722
Copy Paste: [[2601.21722]] Enhancing Language Models for Robust Greenwashing Detection(https://arxiv.org/abs/2601.21722)
Keywords: language model, llm
Abstract: Sustainability reports are critical for ESG assessment, yet greenwashing and vague claims often undermine their reliability. Existing NLP models lack robustness to these practices, typically relying on surface-level patterns that generalize poorly. We propose a parameter-efficient framework that structures LLM latent spaces by combining contrastive learning with an ordinal ranking objective to capture graded distinctions between concrete actions and ambiguous claims. Our approach incorporates gated feature modulation to filter disclosure noise and utilizes MetaGradNorm to stabilize multi-objective optimization. Experiments in cross-category settings demonstrate superior robustness over standard baselines while revealing a trade-off between representational rigidity and generalization.

Title: Procedural Pretraining: Warming Up Language Models with Abstract Data

Authors: Liangze Jiang, Zachary Shinnick, Anton van den Hengel, Hemanth Saratchandran, Damien Teney
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.21725
Pdf URL: https://arxiv.org/pdf/2601.21725
Copy Paste: [[2601.21725]] Procedural Pretraining: Warming Up Language Models with Abstract Data(https://arxiv.org/abs/2601.21725)
Keywords: language model, llm
Abstract: Pretraining directly on web-scale corpora is the de facto paradigm for building language models. We study an alternative setting where the model is initially exposed to abstract structured data, as a means to ease the subsequent acquisition of rich semantic knowledge, much as humans learn simple logic and mathematics before higher reasoning. We specifically focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, on context recall (Needle-in-a-haystack), the accuracy jumps from 10% to 98% when pretraining on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this procedural pretraining enables the models to reach the same loss value with only 55%, 67%, and 86% of the original data, respectively. Third, we explore the mechanisms behind these gains and find that procedural pretraining instils non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means of improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.
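To make "Dyck sequences (balanced brackets)" concrete, here is a small sketch that generates and validates such strings. The two bracket types, the coin-flip sampling rule, and the names `random_dyck` and `is_balanced` are all assumptions for illustration. The abstract does not describe the paper's generator, and this sampler is not uniform over Dyck words.

```python
import random

# Hypothetical bracket alphabet: each opener maps to its closer.
PAIRS = {"(": ")", "[": "]"}


def random_dyck(n_pairs: int, rng: random.Random) -> str:
    """Emit a balanced bracket string containing exactly n_pairs bracket pairs."""
    out, stack = [], []
    opens_left = n_pairs
    while opens_left or stack:
        # Open a bracket if any remain and either nothing is open or a coin flip says so.
        if opens_left and (not stack or rng.random() < 0.5):
            opener = rng.choice(list(PAIRS))
            stack.append(opener)
            out.append(opener)
            opens_left -= 1
        else:
            # Otherwise close the most recently opened bracket.
            out.append(PAIRS[stack.pop()])
    return "".join(out)


def is_balanced(s: str) -> bool:
    """Check membership in the Dyck language over PAIRS with a stack."""
    stack = []
    for ch in s:
        if ch in PAIRS:
            stack.append(ch)
        elif not stack or PAIRS[stack.pop()] != ch:
            return False
    return not stack
```

Closing a bracket correctly requires recalling which opener is on top of the stack, so predicting such sequences exercises a form of context recall, which fits the skill the abstract reports improving.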

Title: CE-GOCD: Central Entity-Guided Graph Optimization for Community Detection to Augment LLM Scientific Question Answering

Authors: Jiayin Lan, Jiaqi Li, Baoxin Wang, Ming Liu, Dayong Wu, Shijin Wang, Bing Qin, Guoping Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21733
Pdf URL: https://arxiv.org/pdf/2601.21733
Copy Paste: [[2601.21733]] CE-GOCD: Central Entity-Guided Graph Optimization for Community Detection to Augment LLM Scientific Question Answering(https://arxiv.org/abs/2601.21733)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly used for question answering over scientific research papers. Existing retrieval augmentation methods often rely on isolated text chunks or concepts, but overlook deeper semantic connections between papers. This impairs the LLM's comprehension of scientific literature, hindering the comprehensiveness and specificity of its responses. To address this, we propose Central Entity-Guided Graph Optimization for Community Detection (CE-GOCD), a method that augments LLMs' scientific question answering by explicitly modeling and leveraging semantic substructures within academic knowledge graphs. Our approach operates by: (1) leveraging paper titles as central entities for targeted subgraph retrieval, (2) enhancing implicit semantic discovery via subgraph pruning and completion, and (3) applying community detection to distill coherent paper groups with shared themes. We evaluated the proposed method on three NLP literature-based question-answering datasets, and the results demonstrate its superiority over other retrieval-augmented baseline approaches, confirming the effectiveness of our framework.

Title: Temporal Guidance for Large Language Models

Authors: Hong-Kai Zheng, Piji Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21744
Pdf URL: https://arxiv.org/pdf/2601.21744
Copy Paste: [[2601.21744]] Temporal Guidance for Large Language Models(https://arxiv.org/abs/2601.21744)
Keywords: language model, llm
Abstract: Contrastive Decoding (CD) enhances the generation quality of large language models (LLMs) but incurs significant additional computational overhead due to the need for an auxiliary model. Existing internal self-contrastive decoding methods, such as Decoding by Contrasting Layers (DoLa), focus on discrepancies across different layers, which are notably unstable on small-scale models. In this work, based on the observation that LLMs exhibit local preferences, we propose a novel contrastive guidance strategy along the temporal dimension, namely Temporal Guidance (TeGu). Our method ingeniously leverages Multi-Token Prediction (MTP) to construct weaker amateur predictions for model self-contrast. To standardize the implementation of this mechanism, we further introduce a lightweight Conditional MTP Projector (cMTPP), which avoids maintaining multiple independent networks as required by other MTP modules. Across various model series and benchmarks, TeGu achieves significant performance improvements while maintaining low additional memory consumption and computational overhead.
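For readers unfamiliar with the baseline the abstract starts from, the following sketch shows the commonly described form of contrastive decoding for a single step: keep only tokens the stronger model finds plausible, then pick the one with the largest log-probability gap over the weaker model. It illustrates generic CD, not TeGu. The function name `contrastive_pick`, the dictionary representation of next-token distributions, and the default `alpha` are assumptions. In TeGu, per the abstract, the weaker "amateur" predictions come from Multi-Token Prediction rather than a separate model.

```python
import math


def contrastive_pick(expert: dict[str, float], amateur: dict[str, float],
                     alpha: float = 0.1) -> str:
    """Pick one token by contrasting expert and amateur next-token probabilities."""
    # Plausibility filter: drop tokens far less likely than the expert's best token.
    cutoff = alpha * max(expert.values())
    plausible = [tok for tok, p in expert.items() if p >= cutoff]

    def score(tok: str) -> float:
        # A small floor on the amateur probability avoids log(0).
        return math.log(expert[tok]) - math.log(max(amateur.get(tok, 0.0), 1e-12))

    return max(plausible, key=score)
```

The plausibility filter matters: without it, a token that both models consider very unlikely can still win on the ratio alone.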

Title: CoFrGeNet: Continued Fraction Architectures for Language Generation

Authors: Amit Dhurandhar, Vijil Chenthamarakshan, Dennis Wei, Tejaswini Pedapati, Karthikeyan Natesan Ramamurthy, Rahul Nair
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.21766
Pdf URL: https://arxiv.org/pdf/2601.21766
Copy Paste: [[2601.21766]] CoFrGeNet: Continued Fraction Architectures for Language Generation(https://arxiv.org/abs/2601.21766)
Keywords: gpt
Abstract: Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets (Continued Fraction Generative Networks). We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change to the training or inference procedures already in place for Transformer-based models, making our approach easy to incorporate into large industrial workflows. We experiment on two very different transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B): we pre-train the former on OpenWebText and GneissWeb, and the latter on the docling data mix, which consists of nine different datasets. Results show that the performance of our models on downstream classification, Q&A, reasoning, and text understanding tasks is competitive with, and sometimes even superior to, that of the original models, with two-thirds to one-half the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.

Title: Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond

Authors: Wei Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21767
Pdf URL: https://arxiv.org/pdf/2601.21767
Copy Paste: [[2601.21767]] Evaluating ChatGPT on Medical Information Extraction Tasks: Performance, Explainability and Beyond(https://arxiv.org/abs/2601.21767)
Keywords: language model, gpt, llm, chat
Abstract: Large Language Models (LLMs) like ChatGPT have demonstrated amazing capabilities in comprehending user intents and generating reasonable and useful responses. Besides their ability to chat, their capabilities in various natural language processing (NLP) tasks are of interest to the research community. In this paper, we focus on assessing the overall ability of ChatGPT in 4 different medical information extraction (MedIE) tasks across 6 benchmark datasets. We present a systematic analysis that measures ChatGPT's performance, explainability, confidence, faithfulness, and uncertainty. Our experiments reveal that: (a) ChatGPT's performance scores on MedIE tasks fall behind those of the fine-tuned baseline models. (b) ChatGPT can provide high-quality explanations for its decisions; however, it is over-confident in its predictions. (c) ChatGPT demonstrates a high level of faithfulness to the original text in the majority of cases. (d) The uncertainty in generation causes uncertainty in information extraction results, which may hinder its application in MedIE tasks.

Title: Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention

Authors: Alon Rozental
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21768
Pdf URL: https://arxiv.org/pdf/2601.21768
Copy Paste: [[2601.21768]] Zonkey: A Hierarchical Diffusion Language Model with Differentiable Tokenization and Probabilistic Attention(https://arxiv.org/abs/2601.21768)
Keywords: language model, llm
Abstract: Large language models (LLMs) have revolutionized natural language processing, yet they remain constrained by fixed, non-differentiable tokenizers like Byte Pair Encoding (BPE), which hinder end-to-end optimization and adaptability to noisy or domain-specific data. We introduce Zonkey, a hierarchical diffusion model that addresses these limitations through a fully trainable pipeline from raw characters to document-level representations. At its core is a differentiable tokenizer (Segment Splitter) that learns probabilistic beginning-of-sequence (BOS) decisions, enabling adaptive splits that emerge as linguistically meaningful (e.g., word boundaries at spaces, sentence starts at periods) without explicit supervision. This differentiability is enabled by our novel Probabilistic Attention mechanism, which incorporates position-specific existence probabilities to simulate soft masking over theoretically infinite sequences while preserving gradients. Sequences decay probabilistically rather than relying on end-of-sequence tokens, supporting variable-length outputs. Hierarchical levels compress sequences into higher abstractions (e.g., character n-grams to word-like vectors, then sentence-like), with reconstruction via our Denoising Diffusion Mixed Model (DDMM) for stable and efficient denoising in latent space. A Stitcher ensures overlap invariance across segments. Trained end-to-end on Wikipedia, Zonkey generates coherent, variable-length text from noise, demonstrating emergent hierarchies and promising qualitative alignment to data distributions compared to entropy-based learnable tokenizers. Our approach advances toward fully gradient-based LLMs, with potential for better domain adaptation and scalable generation. We release the source code for training and reproducing our experiments.

Title: Enhancing Conversational Agents via Task-Oriented Adversarial Memory Adaptation

Authors: Yimin Deng, Yuqing Fu, Derong Xu, Yejing Wang, Wei Ni, Jingtong Gao, Xiaopeng Li, Chengxu Liu, Xiao Han, Guoshuai Zhao, Xiangyu Zhao, Li Zhu, Xueming Qian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21797
Pdf URL: https://arxiv.org/pdf/2601.21797
Copy Paste: [[2601.21797]] Enhancing Conversational Agents via Task-Oriented Adversarial Memory Adaptation(https://arxiv.org/abs/2601.21797)
Keywords: agent
Abstract: Conversational agents struggle to handle long conversations due to context window limitations. Therefore, memory systems are developed to leverage essential historical information. Existing memory systems typically follow a pipeline of offline memory construction and update, and online retrieval. Despite the flexible online phase, the offline phase remains fixed and task-independent. In this phase, memory construction operates under a predefined workflow and fails to emphasize task-relevant information. Meanwhile, memory updates are guided by generic metrics rather than task-specific supervision. This leads to a misalignment between offline memory preparation and task requirements, which undermines downstream task performance. To this end, we propose an Adversarial Memory Adaptation mechanism (AMA) that aligns memory construction and update with task objectives by simulating task execution. Specifically, first, a challenger agent generates question-answer pairs based on the original dialogues. The constructed memory is then used to answer these questions, simulating downstream inference. Subsequently, an evaluator agent assesses the responses and performs error analysis. Finally, an adapter agent analyzes the error cases and performs dual-level updates on both the construction strategy and the content. Through this process, the memory system receives task-aware supervision signals in advance during the offline phase, enhancing its adaptability to downstream tasks. AMA can be integrated into various existing memory systems, and extensive experiments on the long-dialogue benchmark LoCoMo demonstrate its effectiveness.

Title: RAG-E: Quantifying Retriever-Generator Alignment and Failure Modes

Authors: Korbinian Randl, Guido Rocchietti, Aron Henriksson, Ziawasch Abedjan, Tony Lindgren, John Pavlopoulos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21803
Pdf URL: https://arxiv.org/pdf/2601.21803
Copy Paste: [[2601.21803]] RAG-E: Quantifying Retriever-Generator Alignment and Failure Modes(https://arxiv.org/abs/2601.21803)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems combine dense retrievers and language models to ground LLM outputs in retrieved documents. However, the opacity of how these components interact creates challenges for deployment in high-stakes domains. We present RAG-E, an end-to-end explainability framework that quantifies retriever-generator alignment through mathematically grounded attribution methods. Our approach adapts Integrated Gradients for retriever analysis, introduces PMCSHAP, a Monte Carlo-stabilized Shapley Value approximation, for generator attribution, and introduces the Weighted Attribution-Relevance Gap (WARG) metric to measure how well a generator's document usage aligns with a retriever's ranking. Empirical analysis on TREC CAsT and FoodSafeSum reveals critical misalignments: for 47.4% to 66.7% of queries, generators ignore the retriever's top-ranked documents, while 48.1% to 65.9% rely on documents ranked as less relevant. These failure modes demonstrate that RAG output quality depends not solely on individual component performance but on their interplay, which can be audited via RAG-E.
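Summary: Retrieval-Augmented Generation (RAG) systems combine dense retrievers with language models so that LLM outputs are grounded in retrieved documents. However, the opacity of how these components interact makes deployment in high-stakes domains difficult. We present RAG-E, an end-to-end explainability framework that quantifies retriever-generator alignment using mathematically grounded attribution methods. Our approach adapts Integrated Gradients for retriever analysis, introduces PMCSHAP (a Monte Carlo-stabilized approximation of Shapley values) for generator attribution, and introduces the Weighted Attribution-Relevance Gap (WARG) metric to measure how well the generator's document usage matches the retriever's ranking. Empirical analysis on TREC CAsT and FoodSafeSum reveals serious misalignment: for 47.4% to 66.7% of queries the generator ignores the retriever's top-ranked documents, while for 48.1% to 65.9% it relies on documents ranked as less relevant. These failure modes show that RAG output quality depends not only on the performance of each component but also on their interplay, which RAG-E makes auditable.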

Title: Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning

Authors: Bodong Du, Xuanqi Huang, Xiaomeng Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21804
Pdf URL: https://arxiv.org/pdf/2601.21804
Copy Paste: [[2601.21804]] Distribution-Aware Reward Estimation for Test-Time Reinforcement Learning(https://arxiv.org/abs/2601.21804)
Keywords: language model, llm
Abstract: Test-time reinforcement learning (TTRL) enables large language models (LLMs) to self-improve on unlabeled inputs, but its effectiveness critically depends on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV reduces the rollout distribution into a single outcome, discarding information about non-majority but correct action candidates, and yields systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which shifts reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus and a distribution pruning mechanism for non-majority rollout exploration and reward denoising, yielding a more informative and robust reward estimation. Extensive experiments on challenging reasoning benchmarks show that DARE improves optimization stability and final performance over recent baselines, achieving relative improvements of 25.3% on challenging AIME 2024 and 5.3% on AMC.
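Summary: Test-time reinforcement learning (TTRL) lets large language models (LLMs) improve themselves on unlabeled inputs, but its effectiveness depends critically on how reward signals are estimated without ground-truth supervision. Most existing TTRL methods rely on majority voting (MV) over rollouts to produce deterministic rewards, implicitly assuming that the majority rollout provides a reliable learning signal. We show that this assumption is fragile: MV collapses the rollout distribution into a single outcome, discards information about non-majority but correct candidate actions, and produces systematically biased reward estimates. To address this, we propose Distribution-Aware Reward Estimation (DARE), which moves reward estimation from a single majority outcome to the full empirical rollout distribution. DARE further augments this distribution-based reward with an exploration bonus and a distribution pruning mechanism, used for exploring non-majority rollouts and for denoising rewards, which yields more informative and more robust reward estimates. Extensive experiments on challenging reasoning benchmarks show that DARE improves optimization stability and final performance over recent baselines, with relative improvements of 25.3% on the challenging AIME 2024 and 5.3% on AMC.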
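
The contrast drawn in this abstract, between a majority-vote reward and a reward based on the full empirical rollout distribution, can be illustrated with a small sketch. This is not DARE's actual formula, which the abstract does not give: the function names and the choice of "reward equals answer frequency" are assumptions made purely for illustration, and the exploration bonus and pruning mechanism are omitted.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Assign 1.0 to rollouts that match the most common answer, else 0.0."""
    majority, _ = Counter(answers).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in answers]

def distribution_rewards(answers):
    """Assign each rollout the empirical frequency of its answer.

    Illustrative assumption only: the paper's reward may be defined differently.
    """
    counts = Counter(answers)
    total = len(answers)
    return [counts[a] / total for a in answers]

# Hypothetical final answers extracted from six rollouts of one prompt.
rollouts = ["42", "42", "17", "42", "17", "9"]
mv = majority_vote_rewards(rollouts)
dist = distribution_rewards(rollouts)
```

Under majority voting the two "17" rollouts receive the same zero reward as the lone "9", whereas the frequency-based variant keeps them distinguishable, which is the kind of information the abstract says MV discards.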

Title: Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models

Authors: Aadi Palnitkar, Mingyang Mao, Nicholas Waytowich, Vinicius G. Goecks, Tinoosh Mohsenin, Xiaomin Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21826
Pdf URL: https://arxiv.org/pdf/2601.21826
Copy Paste: [[2601.21826]] Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models(https://arxiv.org/abs/2601.21826)
Keywords: language model, llm
Abstract: As large language models (LLMs) are applied to increasingly longer and more complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. This need is especially acute for geospatial planning problems, such as those found in planning for large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing LLMs' ability to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark includes a diverse set of question types across seven categories targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings highlight substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and positioning MilSCORE as a challenging testbed for future work.
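Summary: As large language models (LLMs) are applied to ever longer and more complex tasks, there is a growing need for realistic long-context benchmarks that require selective reading and integration of heterogeneous, multi-modal information sources. The need is especially acute for geospatial planning problems, such as those that arise in planning large-scale military operations, which demand fast and accurate reasoning over maps, orders, intelligence reports, and other distributed data. To address this gap, we present MilSCORE (Military Scenario Contextual Reasoning), to our knowledge the first scenario-level dataset of expert-authored, multi-hop questions grounded in a complex, simulated military planning scenario used for training. MilSCORE is designed to evaluate high-stakes decision-making and planning, probing the ability of LLMs to combine tactical and spatial reasoning across multiple sources and to reason over long-horizon, geospatially rich context. The benchmark contains diverse question types across seven categories, targeting both factual recall and multi-step reasoning about constraints, strategy, and spatial analysis. We provide an evaluation protocol and report baseline results for a range of contemporary vision-language models. Our findings point to substantial headroom on MilSCORE, indicating that current systems struggle with realistic, scenario-level long-context planning, and position MilSCORE as a challenging testbed for future work.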

Title: Embodied Task Planning via Graph-Informed Action Generation with Large Lanaguage Model

Authors: Xiang Li, Ning Yan, Masood Mortazavi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21841
Pdf URL: https://arxiv.org/pdf/2601.21841
Copy Paste: [[2601.21841]] Embodied Task Planning via Graph-Informed Action Generation with Large Lanaguage Model(https://arxiv.org/abs/2601.21841)
Keywords: language model, llm, agent
Abstract: While Large Language Models (LLMs) have demonstrated strong zero-shot reasoning capabilities, their deployment as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, embodied agents must decompose high-level intent into actionable sub-goals while strictly adhering to the logic of a dynamic, observed environment. Standard LLM planners frequently fail to maintain strategy coherence over extended horizons due to context window limitations, or hallucinate transitions that violate constraints. We propose GiG, a novel planning framework that structures embodied agents' memory using a Graph-in-Graph architecture. Our approach employs a Graph Neural Network (GNN) to encode environmental states into embeddings, organizing these embeddings into action-connected execution trace graphs within an experience memory bank. By clustering these graph embeddings, the framework enables retrieval of structure-aware priors, allowing agents to ground current decisions in relevant past structural patterns. Furthermore, we introduce a novel bounded lookahead module that leverages symbolic transition logic to enhance the agents' planning capabilities through grounded action projection. We evaluate our framework on three embodied planning benchmarks: Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, achieving Pass@1 performance gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld with comparable or lower computational cost.
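Summary: Although Large Language Models (LLMs) have shown strong zero-shot reasoning capabilities, deploying them as embodied agents still faces fundamental challenges in long-horizon planning. Unlike open-ended text generation, an embodied agent must decompose high-level intent into actionable sub-goals while strictly respecting the logic of a dynamic, observed environment. Standard LLM planners often fail to keep their strategy coherent over extended horizons, either because of context-window limits or because they hallucinate transitions that violate constraints. We propose GiG, a novel planning framework that structures an embodied agent's memory with a Graph-in-Graph architecture. Our approach uses a Graph Neural Network (GNN) to encode environment states into embeddings and organizes these embeddings into action-connected execution trace graphs inside an experience memory bank. By clustering these graph embeddings, the framework can retrieve structure-aware priors, allowing agents to ground current decisions in relevant past structural patterns. In addition, we introduce a novel bounded lookahead module that uses symbolic transition logic to strengthen the agent's planning through grounded action projection. We evaluate the framework on three embodied planning benchmarks: Robotouille Synchronous, Robotouille Asynchronous, and ALFWorld. Our method outperforms state-of-the-art baselines, with Pass@1 gains of up to 22% on Robotouille Synchronous, 37% on Asynchronous, and 15% on ALFWorld, at comparable or lower computational cost.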

Title: Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text

Authors: Hongyi Zhou, Jin Zhu, Erhan Xu, Kai Ye, Ying Yang, Chengchun Shi
Subjects: cs.CL, cs.AI, stat.ML
Abstract URL: https://arxiv.org/abs/2601.21895
Pdf URL: https://arxiv.org/pdf/2601.21895
Copy Paste: [[2601.21895]] Learn-to-Distance: Distance Learning for Detecting LLM-Generated Text(https://arxiv.org/abs/2601.21895)
Keywords: language model, gpt, llm
Abstract: Modern large language models (LLMs) such as GPT, Claude, and Gemini have transformed the way we learn, work, and communicate. Yet, their ability to produce highly human-like text raises serious concerns about misinformation and academic integrity, creating an urgent need for reliable algorithms to detect LLM-generated content. In this paper, we start by presenting a geometric approach to demystify rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their generalization ability. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original and rewritten text. Theoretically, we demonstrate that employing an adaptively learned distance function is more effective for detection than using a fixed distance. Empirically, we conduct extensive experiments with over 100 settings, and find that our approach demonstrates superior performance over baseline algorithms in the majority of scenarios. In particular, it achieves relative improvements from 57.8% to 80.6% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini).
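Summary: Modern large language models (LLMs) such as GPT, Claude, and Gemini have changed how we learn, work, and communicate. However, their ability to generate highly human-like text raises serious concerns about misinformation and academic integrity, so reliable algorithms for detecting LLM-generated content are urgently needed. In this paper, we first present a geometric approach that demystifies rewrite-based detection algorithms, revealing their underlying rationale and demonstrating their ability to generalize. Building on this insight, we introduce a novel rewrite-based detection algorithm that adaptively learns the distance between the original text and the rewritten text. Theoretically, we show that an adaptively learned distance function is more effective for detection than a fixed distance. Empirically, we run extensive experiments across more than 100 settings and find that our approach outperforms baseline algorithms in most scenarios. In particular, it achieves relative improvements of 57.8% to 80.6% over the strongest baseline across different target LLMs (e.g., GPT, Claude, and Gemini).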

Title: SONIC: Segmented Optimized Nexus for Information Compression in Key-Value Caching

Authors: Hong Chen, Xiang Liu, Bo Wang, Yuxuan Fan, Yuanlin Chu, Zongluo Li, Xiaowen Chu, Xuming Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21927
Pdf URL: https://arxiv.org/pdf/2601.21927
Copy Paste: [[2601.21927]] SONIC: Segmented Optimized Nexus for Information Compression in Key-Value Caching(https://arxiv.org/abs/2601.21927)
Keywords: llm
Abstract: The linear growth of Key-Value (KV) cache remains a bottleneck for multi-turn LLM deployment. Existing KV cache compression methods often fail to account for the structural properties of multi-turn dialogues, relying on heuristic eviction that risks losing critical context. We propose SONIC, a learning-based framework that compresses historical segments into compact and semantically rich Nexus tokens. By integrating dynamic budget training, SONIC allows flexible adaptation to varying memory constraints without retraining. Experiments show that at compression ratios of 80% and 50%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four diverse multi-turn benchmarks. Specifically, on the widely used MTBench101 benchmark, SONIC achieves an average score improvement of 35.55% over state-of-the-art baselines, validating its effectiveness in sustaining coherent multi-turn dialogues. Furthermore, SONIC enhances deployment efficiency, accelerating the overall inference process by 50.1% compared to full-context generation.
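Summary: The linear growth of the Key-Value (KV) cache remains a bottleneck for multi-turn LLM deployment. Existing KV cache compression methods usually fail to account for the structural properties of multi-turn dialogue and rely on heuristic eviction, which risks losing critical context. We propose SONIC, a learning-based framework that compresses historical segments into compact, semantically rich Nexus tokens. By integrating dynamic budget training, SONIC adapts flexibly to different memory constraints without retraining. Experiments show that at compression ratios of 80% and 50%, SONIC consistently outperforms baselines such as H2O and StreamingLLM on four diverse multi-turn benchmarks. Specifically, on the widely used MTBench101 benchmark, SONIC improves the average score by 35.55% over state-of-the-art baselines, validating its effectiveness in sustaining coherent multi-turn dialogue. SONIC also improves deployment efficiency, speeding up the overall inference process by 50.1% compared with full-context generation.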

Title: From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes

Authors: Fariba Afrin Irany
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.21955
Pdf URL: https://arxiv.org/pdf/2601.21955
Copy Paste: [[2601.21955]] From Generative Modeling to Clinical Classification: A GPT-Based Architecture for EHR Notes(https://arxiv.org/abs/2601.21955)
Keywords: language model, gpt
Abstract: The increasing availability of unstructured clinical narratives in electronic health records (EHRs) has created new opportunities for automated disease characterization, cohort identification, and clinical decision support. However, modeling long, domain-specific clinical text remains challenging due to limited labeled data, severe class imbalance, and the high computational cost of adapting large pretrained language models. This study presents a GPT-based architecture for clinical text classification that adapts a pretrained decoder-only Transformer using a selective fine-tuning strategy. Rather than updating all model parameters, the majority of the GPT-2 backbone is frozen, and training is restricted to the final Transformer block, the final layer normalization, and a lightweight classification head. This approach substantially reduces the number of trainable parameters while preserving the representational capacity required to model complex clinical language. The proposed method is evaluated on radiology reports from the MIMIC-IV-Note dataset using uncertainty-aware CheXpert-style labels derived directly from report text. Experiments cover multiple problem formulations, including multi-label classification of radiographic findings, binary per-label classification under different uncertainty assumptions, and aggregate disease outcome prediction. Across varying dataset sizes, the model exhibits stable convergence behavior and strong classification performance, particularly in settings dominated by non-mention and negated findings. Overall, the results indicate that selective fine-tuning of pretrained generative language models provides an efficient and effective pathway for clinical text classification, enabling scalable adaptation to real-world EHR data while significantly reducing computational complexity.
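Summary: The growing availability of unstructured clinical narratives in electronic health records (EHRs) has created new opportunities for automated disease characterization, cohort identification, and clinical decision support. However, modeling long, domain-specific clinical text remains challenging because of limited labeled data, severe class imbalance, and the high computational cost of adapting large pretrained language models. This study presents a GPT-based architecture for clinical text classification that adapts a pretrained decoder-only Transformer using a selective fine-tuning strategy. Rather than updating all model parameters, most of the GPT-2 backbone is frozen, and training is restricted to the final Transformer block, the final layer normalization, and a lightweight classification head. This approach greatly reduces the number of trainable parameters while preserving the representational capacity needed to model complex clinical language. The proposed method is evaluated on radiology reports from the MIMIC-IV-Note dataset, using uncertainty-aware CheXpert-style labels derived directly from the report text. Experiments cover several problem formulations, including multi-label classification of radiographic findings, binary per-label classification under different uncertainty assumptions, and aggregate disease outcome prediction. Across varying dataset sizes, the model shows stable convergence and strong classification performance, particularly in settings dominated by non-mention and negated findings. Overall, the results indicate that selective fine-tuning of pretrained generative language models offers an efficient and effective route to clinical text classification, enabling scalable adaptation to real-world EHR data while significantly reducing computational complexity.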
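
The selective fine-tuning rule described in this abstract (freeze most of the GPT-2 backbone; train only the final Transformer block, the final layer normalization, and the classification head) can be sketched as a predicate over parameter names. This is a minimal illustration, not the author's code. It assumes parameter names of the form used by Hugging Face's `GPT2ForSequenceClassification` (`transformer.h.<i>...`, `transformer.ln_f...`, `score...`); the helper name `is_trainable` is hypothetical.

```python
def is_trainable(param_name: str, num_blocks: int) -> bool:
    """Return True if a parameter stays unfrozen under the selective strategy.

    Assumed naming scheme: blocks are 'h.<index>', the final layer norm is
    'ln_f', and the classification head is the top-level 'score' module.
    """
    parts = param_name.split(".")
    if "h" in parts:
        # Parameter belongs to a Transformer block; keep only the last one.
        block_index = int(parts[parts.index("h") + 1])
        return block_index == num_blocks - 1
    # Final layer normalization or classification head.
    return "ln_f" in parts or parts[0] == "score"

# Hypothetical parameter names for a 12-block model.
names = [
    "transformer.wte.weight",
    "transformer.h.0.attn.c_attn.weight",
    "transformer.h.1.mlp.c_fc.weight",
    "transformer.h.11.attn.c_attn.weight",
    "transformer.ln_f.weight",
    "score.weight",
]
selected = [n for n in names if is_trainable(n, num_blocks=12)]
```

With PyTorch, such a predicate would typically be applied by looping over `model.named_parameters()` and setting `requires_grad` on each parameter accordingly.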

Title: Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding

Authors: Yifan Zhu, Huiqiang Rong, Haoran Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.21969
Pdf URL: https://arxiv.org/pdf/2601.21969
Copy Paste: [[2601.21969]] Token-Guard: Towards Token-Level Hallucination Control via Self-Checking Decoding(https://arxiv.org/abs/2601.21969)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) often hallucinate, generating content inconsistent with the input. Retrieval-Augmented Generation (RAG) and Reinforcement Learning with Human Feedback (RLHF) can mitigate hallucinations but require resource-intensive retrieval or large-scale fine-tuning. Decoding-based methods are lighter yet lack explicit hallucination control. To address this, we present Token-Guard, a token-level hallucination control method based on self-checking decoding. Token-Guard performs internal verification at each reasoning step to detect hallucinated tokens before they propagate. Candidate fragments are further evaluated in a latent space with explicit hallucination risk scoring, while iterative pruning and regeneration dynamically correct detected errors. Experiments on HALU datasets show Token-Guard substantially reduces hallucinations and improves generation accuracy, offering a scalable, modular solution for reliable LLM outputs. Our code is publicly available.
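Summary: Large Language Models (LLMs) often hallucinate, generating content that is inconsistent with the input. Retrieval-Augmented Generation (RAG) and Reinforcement Learning with Human Feedback (RLHF) can mitigate hallucinations but require resource-intensive retrieval or large-scale fine-tuning. Decoding-based methods are lighter but lack explicit hallucination control. To address this, we present Token-Guard, a token-level hallucination control method based on self-checking decoding. Token-Guard performs internal verification at every reasoning step to detect hallucinated tokens before they propagate. Candidate fragments are further evaluated in a latent space with explicit hallucination risk scoring, while iterative pruning and regeneration dynamically correct detected errors. Experiments on HALU datasets show that Token-Guard substantially reduces hallucinations and improves generation accuracy, offering a scalable, modular solution for reliable LLM outputs. Our code is publicly available.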

Title: Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

Authors: Jianhui Chen, Yuzhang Luo, Liangming Pan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.21996
Pdf URL: https://arxiv.org/pdf/2601.21996
Copy Paste: [[2601.21996]] Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units(https://arxiv.org/abs/2601.21996)
Keywords: llm
Abstract: While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.
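Summary: Although Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in the training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that uses Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted interventions (removing or augmenting a small fraction of high-influence samples) significantly modulate the emergence of interpretable heads, whereas random interventions have no effect. Our analysis shows that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation cause a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis of a functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.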

Title: When "Better" Prompts Hurt: Evaluation-Driven Iteration for LLM Applications

Authors: Daniel Commey
Subjects: cs.CL, cs.AI, cs.IR, cs.SE
Abstract URL: https://arxiv.org/abs/2601.22025
Pdf URL: https://arxiv.org/pdf/2601.22025
Copy Paste: [[2601.22025]] When "Better" Prompts Hurt: Evaluation-Driven Iteration for LLM Applications(https://arxiv.org/abs/2601.22025)
Keywords: language model, llm, prompt, retrieval-augmented generation, agent
Abstract: Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow - Define, Test, Diagnose, Fix - that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic "improved" prompt template can trade off behaviors: on our small structured suites, extraction pass rate decreased from 100% to 90% and RAG compliance from 93.3% to 80% for Llama 3 when replacing task-specific prompts with generic rules, while instruction-following improved. These findings motivate evaluation-driven prompt iteration and careful claim calibration rather than universal prompt recipes. All test suites, harnesses, and results are included for reproducibility.
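Summary: Evaluating Large Language Model (LLM) applications differs from traditional software testing because outputs are stochastic, high-dimensional, and sensitive to prompt and model changes. We present an evaluation-driven workflow (Define, Test, Diagnose, Fix) that turns these challenges into a repeatable engineering loop. We introduce the Minimum Viable Evaluation Suite (MVES), a tiered set of recommended evaluation components for (i) general LLM applications, (ii) retrieval-augmented generation (RAG), and (iii) agentic tool-use workflows. We also synthesize common evaluation methods (automated checks, human rubrics, and LLM-as-judge) and discuss known judge failure modes. In reproducible local experiments (Ollama; Llama 3 8B Instruct and Qwen 2.5 7B Instruct), we observe that a generic "improved" prompt template can trade one behavior off against another: on our small structured suites, replacing task-specific prompts with generic rules lowered Llama 3's extraction pass rate from 100% to 90% and its RAG compliance from 93.3% to 80%, while instruction-following improved. These findings motivate evaluation-driven prompt iteration and careful claim calibration rather than universal prompt recipes. All test suites, harnesses, and results are included for reproducibility.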

Title: Causal Autoregressive Diffusion Language Model

Authors: Junhao Ruan, Bei Li, Yongjing Yin, Pengcheng Huang, Xin Chen, Jingang Wang, Xunliang Cai, Tong Xiao, JingBo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22031
Pdf URL: https://arxiv.org/pdf/2601.22031
Copy Paste: [[2601.22031]] Causal Autoregressive Diffusion Language Model(https://arxiv.org/abs/2601.22031)
Keywords: language model, llm
Abstract: In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that unifies the training efficiency of ARMs with the high-throughput inference of diffusion models. CARD reformulates the diffusion process within a strictly causal attention mask, enabling dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema to preserve local context and a context-aware reweighting mechanism derived from signal-to-noise principles. This design enables dynamic parallel decoding, where the model leverages KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3× compared to block diffusion methods. Our results demonstrate that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.
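Summary: In this work, we propose Causal Autoregressive Diffusion (CARD), a novel framework that combines the training efficiency of ARMs with the high-throughput inference of diffusion models. CARD reformulates the diffusion process under a strictly causal attention mask, which enables dense, per-token supervision in a single forward pass. To address the optimization instability of causal diffusion, we introduce a soft-tailed masking schema that preserves local context, together with a context-aware reweighting mechanism derived from signal-to-noise principles. This design supports dynamic parallel decoding, in which the model uses KV-caching to adaptively generate variable-length token sequences based on confidence. Empirically, CARD outperforms existing discrete diffusion baselines while reducing training latency by 3× compared with block diffusion methods. Our results show that CARD achieves ARM-level data efficiency while unlocking the latency benefits of parallel generation, establishing a robust paradigm for next-generation efficient LLMs.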

Title: Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models

Authors: Longxuan Yu, Yu Fu, Shaorong Zhang, Hui Liu, Mukund Varma T, Greg Ver Steeg, Yue Dong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.22035
Pdf URL: https://arxiv.org/pdf/2601.22035
Copy Paste: [[2601.22035]] Thinking Out of Order: When Output Order Stops Reflecting Reasoning Order in Diffusion Language Models(https://arxiv.org/abs/2601.22035)
Keywords: language model, prompt, chain-of-thought
Abstract: Autoregressive (AR) language models enforce a fixed left-to-right generation order, creating a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., producing answers before explanations due to presentation or schema constraints). In such cases, AR models must commit to answers before generating intermediate reasoning, and this rigid constraint forces premature commitment. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts request answers before reasoning, AR models exhibit large accuracy gaps compared to standard chain-of-thought ordering (up to 67% relative drop), while MDLMs remain stable (≤14% relative drop), a property we term "order robustness". Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), enabling reasoning tokens to stabilize before answer commitment. Finally, we identify failure conditions where this advantage weakens, outlining the limits required for order robustness.
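Summary: Autoregressive (AR) language models enforce a fixed left-to-right generation order, which becomes a fundamental limitation when the required output structure conflicts with natural reasoning (e.g., when presentation or schema constraints require the answer to come before the explanation). In such cases, an AR model must commit to an answer before generating any intermediate reasoning, and this rigid constraint forces premature commitment. Masked diffusion language models (MDLMs), which iteratively refine all tokens in parallel, offer a way to decouple computation order from output structure. We validate this capability on GSM8K, Math500, and ReasonOrderQA, a benchmark we introduce with controlled difficulty and order-level evaluation. When prompts ask for the answer before the reasoning, AR models show large accuracy gaps relative to standard chain-of-thought ordering (up to a 67% relative drop), while MDLMs remain stable (≤14% relative drop), a property we call "order robustness". Using ReasonOrderQA, we present evidence that MDLMs achieve order robustness by stabilizing simpler tokens (e.g., reasoning steps) earlier in the diffusion process than complex ones (e.g., final answers), so that reasoning tokens stabilize before the answer is committed. Finally, we identify failure conditions under which this advantage weakens, outlining the limits required for order robustness.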
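
The "relative drop" figures quoted in this abstract can be read with a short worked computation. The sketch below assumes the drop is measured against accuracy under standard chain-of-thought ordering; the abstract does not state the formula, and the accuracy values are made up for illustration rather than taken from the paper.

```python
def relative_drop(baseline_acc: float, reordered_acc: float) -> float:
    """Fraction of baseline accuracy lost when the answer must come first."""
    if baseline_acc <= 0:
        raise ValueError("baseline accuracy must be positive")
    return (baseline_acc - reordered_acc) / baseline_acc

# Hypothetical accuracies: reasoning-first vs. answer-first prompting.
ar_drop = relative_drop(baseline_acc=0.80, reordered_acc=0.264)
mdlm_drop = relative_drop(baseline_acc=0.80, reordered_acc=0.70)
```

With these illustrative numbers the first model loses about two thirds of its baseline accuracy, while the second loses one eighth, which is the kind of gap the abstract describes between AR models and MDLMs.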

Title: A Separable Architecture for Continuous Token Representation in Language Models

Authors: Reza T. Batley, Sourav Saha
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22040
Pdf URL: https://arxiv.org/pdf/2601.22040
Copy Paste: [[2601.22040]] A Separable Architecture for Continuous Token Representation in Language Models(https://arxiv.org/abs/2601.22040)
Keywords: language model
Abstract: Transformer scaling law analyses typically treat parameters as interchangeable, an abstraction that accurately predicts loss-compute relationships. Yet, in sub-billion-parameter small language models (SLMs), embedding matrices dominate the parameter budget. This work argues that this allocation is as suboptimal as it is counterintuitive. Leviathan is an architecture with a continuous embedding generator to replace the discrete lookup tables of canonical models. Evaluating on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard, LLaMA-style architecture. By means of an empirical power-law fit, Leviathan exhibits a markedly superior effective parameter capacity. Across the regime studied, Leviathan behaves as a dense model with 1.47× to 2.11× more parameters.
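Summary: Transformer scaling-law analyses usually treat parameters as interchangeable, an abstraction that accurately predicts loss-compute relationships. Yet in sub-billion-parameter small language models (SLMs), embedding matrices dominate the parameter budget. This work argues that this allocation is as suboptimal as it is counterintuitive. Leviathan is an architecture that uses a continuous embedding generator in place of the discrete lookup tables of canonical models. Evaluated on the Pile dataset under isoparametric settings, Leviathan consistently outperforms a standard LLaMA-style architecture. Based on an empirical power-law fit, Leviathan exhibits a markedly superior effective parameter capacity. Across the regime studied, Leviathan behaves like a dense model with 1.47× to 2.11× more parameters.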

Title: On the Paradoxical Interference between Instruction-Following and Task Solving

Authors: Yunjia Qi, Hao Peng, Xintong Shi, Amy Xin, Xiaozhi Wang, Bin Xu, Lei Hou, Juanzi Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22047
Pdf URL: https://arxiv.org/pdf/2601.22047
Copy Paste: [[2601.22047]] On the Paradoxical Interference between Instruction-Following and Task Solving(https://arxiv.org/abs/2601.22047)
Keywords: language model, llm
Abstract: Instruction following aims to align Large Language Models (LLMs) with human intent by specifying explicit constraints on how tasks should be performed. However, we reveal a counterintuitive phenomenon: instruction following can paradoxically interfere with LLMs' task-solving capability. We propose a metric, SUSTAINSCORE, to quantify the interference of instruction following with task solving. It measures the drop in task performance after inserting into the instruction a self-evident constraint, one that is extracted from the original successful model output and is therefore naturally met by it. Experiments on current LLMs in mathematics, multi-hop QA, and code generation show that adding the self-evident constraints leads to substantial performance drops, even for advanced models such as Claude-Sonnet-4.5. We validate the generality of the interference across constraint types and scales. Furthermore, we identify common failure patterns, and by investigating the mechanisms of interference, we observe that failed cases allocate significantly more attention to constraints compared to successful ones. Finally, we use SUSTAINSCORE to conduct an initial investigation into how distinct post-training paradigms affect the interference, presenting empirical observations on current alignment strategies. We will release our code and data to facilitate further research.
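Summary: Instruction following aims to align Large Language Models (LLMs) with human intent by specifying explicit constraints on how tasks should be performed. However, we reveal a counterintuitive phenomenon: instruction following can paradoxically interfere with an LLM's task-solving ability. We propose a metric, SUSTAINSCORE, to quantify how much instruction following interferes with task solving. It measures the drop in task performance after a self-evident constraint is inserted into the instruction, where the constraint is extracted from the original successful model output and is therefore naturally satisfied by it. Experiments on current LLMs in mathematics, multi-hop QA, and code generation show that adding self-evident constraints causes substantial performance drops, even for advanced models such as Claude-Sonnet-4.5. We validate the generality of the interference across constraint types and scales. In addition, we identify common failure patterns, and by investigating the mechanisms of interference we observe that failed cases allocate significantly more attention to constraints than successful ones. Finally, we use SUSTAINSCORE for an initial investigation of how different post-training paradigms affect the interference, presenting empirical observations on current alignment strategies. We will release our code and data to facilitate further research.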

Title: MasalBench: A Benchmark for Contextual and Cross-Cultural Understanding of Persian Proverbs in LLMs

Authors: Ghazal Kalhor, Behnam Bahrak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22050
Pdf URL: https://arxiv.org/pdf/2601.22050
Copy Paste: [[2601.22050]] MasalBench: A Benchmark for Contextual and Cross-Cultural Understanding of Persian Proverbs in LLMs(https://arxiv.org/abs/2601.22050)
Keywords: language model, llm
Abstract: In recent years, multilingual Large Language Models (LLMs) have become an inseparable part of daily life, making it crucial for them to master the rules of conversational language in order to communicate effectively with users. While previous work has evaluated LLMs' understanding of figurative language in high-resource languages, their performance in low-resource languages remains underexplored. In this paper, we introduce MasalBench, a comprehensive benchmark for assessing LLMs' contextual and cross-cultural understanding of Persian proverbs, which are a key component of conversation in this low-resource language. We evaluate eight state-of-the-art LLMs on MasalBench and find that they perform well in identifying Persian proverbs in context, achieving accuracies above 0.90. However, their performance drops considerably when tasked with identifying equivalent English proverbs, with the best model achieving 0.79 accuracy. Our findings highlight the limitations of current LLMs in cultural knowledge and analogical reasoning, and they provide a framework for assessing cross-cultural understanding in other low-resource languages. MasalBench is available at this https URL.
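Summary: In recent years, multilingual Large Language Models (LLMs) have become an inseparable part of daily life, which makes it crucial for them to master the rules of conversational language in order to communicate effectively with users. While previous work has evaluated how well LLMs understand figurative language in high-resource languages, their performance in low-resource languages remains underexplored. In this paper, we introduce MasalBench, a comprehensive benchmark for assessing LLMs' contextual and cross-cultural understanding of Persian proverbs, which are a key component of conversation in this low-resource language. We evaluate eight state-of-the-art LLMs on MasalBench and find that they perform well at identifying Persian proverbs in context, with accuracies above 0.90. However, their performance drops considerably when they are asked to identify equivalent English proverbs, with the best model reaching an accuracy of 0.79. Our findings highlight the limitations of current LLMs in cultural knowledge and analogical reasoning, and they provide a framework for assessing cross-cultural understanding in other low-resource languages. MasalBench is available at this https URL.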

Title: $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA

Authors: Yaxin Du, Junru Song, Yifan Zhou, Cheng Wang, Jiahao Gu, Zimeng Chen, Menglan Chen, Wen Yao, Yang Yang, Ying Wen, Siheng Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22055
Pdf URL: https://arxiv.org/pdf/2601.22055
Copy Paste: [[2601.22055]] $G^2$-Reader: Dual Evolving Graphs for Multimodal Document QA(https://arxiv.org/abs/2601.22055)
Keywords: gpt, long context, retrieval-augmented generation, agent
Abstract: Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, yielding semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts by looping on partial evidence or drifting into irrelevant sections as noise accumulates, since each step is guided only by the current snippet without a persistent global search state. We introduce $G^2$-Reader, a dual-graph system, to address both issues. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and maintains a Planning Graph, an agentic directed acyclic graph of sub-questions, to track intermediate findings and guide stepwise navigation for evidence completion. On VisDoMBench across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%).
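Summary: Retrieval-augmented generation is a practical paradigm for question answering over long documents, but it remains brittle for multimodal reading, where text, tables, and figures are interleaved across many pages. First, flat chunking breaks document-native structure and cross-modal alignment, producing semantic fragments that are hard to interpret in isolation. Second, even iterative retrieval can fail in long contexts, looping on partial evidence or drifting into irrelevant sections as noise accumulates, because each step is guided only by the current snippet and there is no persistent global search state. We introduce $G^2$-Reader, a dual-graph system, to address both problems. It evolves a Content Graph to preserve document-native structure and cross-modal semantics, and it maintains a Planning Graph (an agentic directed acyclic graph of sub-questions) to track intermediate findings and guide step-by-step navigation toward complete evidence. On VisDoMBench, across five multimodal domains, $G^2$-Reader with Qwen3-VL-32B-Instruct reaches 66.21% average accuracy, outperforming strong baselines and a standalone GPT-5 (53.08%).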

Title: VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning

Authors: Yibo Wang, Yongcheng Jing, Shunyu Liu, Hao Guan, Rong-cheng Tu, Chengyu Wang, Jun Huang, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2601.22069
Pdf URL: https://arxiv.org/pdf/2601.22069
Copy Paste: [[2601.22069]] VTC-R1: Vision-Text Compression for Efficient Long-Context Reasoning(https://arxiv.org/abs/2601.22069)
Keywords: language model, llm
Abstract: Long-context reasoning has significantly empowered large language models (LLMs) to tackle complex tasks, yet it introduces severe efficiency bottlenecks due to its computational complexity. Existing efficient approaches often rely on complex additional training or external models for compression, which limits scalability and discards critical fine-grained information. In this paper, we propose VTC-R1, a new efficient reasoning paradigm that integrates vision-text compression into the reasoning process. Instead of processing lengthy textual traces, VTC-R1 renders intermediate reasoning segments into compact images, which are iteratively fed back into vision-language models as "optical memory." We construct a training dataset based on OpenR1-Math-220K, achieving 3.4x token compression, and fine-tune two representative VLMs, Glyph and Qwen3-VL. Extensive experiments on benchmarks such as MATH500, AIME25, AMC23 and GPQA-D demonstrate that VTC-R1 consistently outperforms standard long-context reasoning. Furthermore, our approach significantly improves inference efficiency, achieving a 2.7x speedup in end-to-end latency, highlighting its potential as a scalable solution for reasoning-intensive applications. Our code is available at this https URL.

Title: ECO: Quantized Training without Full-Precision Master Weights

Authors: Mahdi Nikdan, Amir Zandieh, Dan Alistarh, Vahab Mirrokni
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22101
Pdf URL: https://arxiv.org/pdf/2601.22101
Copy Paste: [[2601.22101]] ECO: Quantized Training without Full-Precision Master Weights(https://arxiv.org/abs/2601.22101)
Keywords: language model, llm
Abstract: Quantization has significantly improved the compute and memory efficiency of Large Language Model (LLM) training. However, existing approaches still rely on accumulating their updates in high precision: concretely, gradient updates must be applied to a high-precision weight buffer, known as master weights. This buffer introduces substantial memory overhead, particularly for Sparse Mixture of Experts (SMoE) models, where model parameters and optimizer states dominate memory usage. To address this, we introduce the Error-Compensating Optimizer (ECO), which eliminates master weights by applying updates directly to quantized parameters. ECO quantizes weights after each step and carefully injects the resulting quantization error into the optimizer momentum, forming an error-feedback loop with no additional memory. We prove that, under standard assumptions and a decaying learning rate, ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate. We show empirical results for pretraining small Transformers (30-800M), a Gemma-3 1B model, and a 2.1B parameter Sparse MoE model with FP8 quantization, and fine-tuning DeepSeek-MoE-16B in INT4 precision. Throughout, ECO matches baselines with master weights up to near-lossless accuracy, significantly shifting the static memory vs validation loss Pareto frontier.

Title: A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine

Authors: Anran Li, Yuanyuan Chen, Wenjun Long, Yu Yin, Yan Hu, Hyunjae Kim, Weipeng Zhou, Yujia Zhou, Hongyi Peng, Yang Ren, Xuguang Ai, Zhenyue Qin, Ming Hu, Xiaoxiao Li, Han Yu, Yih-Chung Tham, Lucila Ohno-Machado, Hua Xu, Qingyu Chen
Subjects: cs.CL, cs.DC
Abstract URL: https://arxiv.org/abs/2601.22124
Pdf URL: https://arxiv.org/pdf/2601.22124
Copy Paste: [[2601.22124]] A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine(https://arxiv.org/abs/2601.22124)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. To enable their use in clinical settings, LLMs are typically further adapted through continued pretraining or post-training using clinical data. However, most medical LLMs are trained on data from a single institution, which limits their generalizability and safety in heterogeneous systems. Federated learning (FL) is a promising solution for enabling collaborative model development across healthcare institutions. Yet applying FL to LLMs in medicine remains fundamentally limited. First, conventional FL requires transmitting the full model during each communication round, which becomes impractical for multi-billion-parameter LLMs given the limited computational resources. Second, many FL algorithms implicitly assume data homogeneity, whereas real-world clinical data are highly heterogeneous across patients, diseases, and institutional practices. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications. Fed-MedLoRA transmits only low-rank adapter parameters, reducing communication and computation overhead, while Fed-MedLoRA+ further incorporates adaptive, data-aware aggregation to improve convergence under cross-site heterogeneity. We apply the framework to clinical information extraction (IE), which transforms patient narratives into structured medical entities and relations. Accuracy was assessed across five patient cohorts through comparisons with BERT models and with LLaMA-3, DeepSeek-R1, and GPT-4o models. Evaluation settings included (1) in-domain training and testing, (2) external validation on independent cohorts, and (3) a low-resource new-site adaptation scenario using real-world clinical notes from the Yale New Haven Health System.
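Illustrative sketch (not from the paper): the communication saving from sending only low-rank adapter parameters can be seen with simple counting. The sketch assumes the standard LoRA factorization, in which a weight matrix of shape d_out x d_in is adapted by two factors of shapes d_out x r and r x d_in; whether Fed-MedLoRA uses exactly this shape is an assumption, and the dimensions below are made-up values chosen for illustration, not figures from the abstract.

```python
def full_matrix_params(d_out, d_in):
    """Parameters sent per weight matrix when the full matrix is transmitted."""
    return d_out * d_in


def lora_adapter_params(d_out, d_in, rank):
    """Parameters sent per weight matrix when only the two low-rank factors are transmitted.

    Factor B has shape (d_out, rank) and factor A has shape (rank, d_in).
    """
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Hypothetical layer size and rank, for illustration only.
    d_out, d_in, rank = 4096, 4096, 8
    full = full_matrix_params(d_out, d_in)
    lora = lora_adapter_params(d_out, d_in, rank)
    print(f"full matrix:   {full:,} parameters")
    print(f"LoRA adapter:  {lora:,} parameters")
    print(f"reduction:     {full // lora}x fewer parameters per round")
```

Under these assumed sizes the per-matrix payload shrinks from d_out * d_in to r * (d_out + d_in), which is why the saving grows as the rank gets smaller relative to the layer width.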

Title: Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

Authors: Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.22139
Pdf URL: https://arxiv.org/pdf/2601.22139
Copy Paste: [[2601.22139]] Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers(https://arxiv.org/abs/2601.22139)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a blind self-thinking paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, a 22.90% higher pass rate, and a 41.36 BLEU improvement, while cutting reasoning computation and unnecessary interaction turns by nearly half. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at this https URL.

Title: FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale

Authors: Ajay Patel, Colin Raffel, Chris Callison-Burch
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22146
Pdf URL: https://arxiv.org/pdf/2601.22146
Copy Paste: [[2601.22146]] FineInstructions: Scaling Synthetic Instructions to Pre-Training Scale(https://arxiv.org/abs/2601.22146)
Keywords: language model, llm, prompt
Abstract: Due to limited supervised training data, large language models (LLMs) are typically pre-trained via a self-supervised "predict the next word" objective on a vast amount of unstructured text data. To make the resulting model useful to users, it is further trained on a far smaller amount of "instruction-tuning" data composed of supervised training examples of instructions and responses. To overcome the limited amount of supervised data, we propose a procedure that can transform the knowledge in internet-scale pre-training documents into billions of synthetic instruction and answer training pairs. The resulting dataset, called FineInstructions, uses ~18M instruction templates created from real user-written queries and prompts. These instruction templates are matched to and instantiated with human-written source documents from unstructured pre-training corpora. With "supervised" synthetic training data generated at this scale, an LLM can be pre-trained from scratch solely with the instruction-tuning objective, which is far more in-distribution with the expected downstream usage of LLMs (responding to user prompts). We conduct controlled token-for-token training experiments and find that pre-training on FineInstructions outperforms standard pre-training and other proposed synthetic pre-training techniques on standard benchmarks measuring free-form response quality. Our resources can be found at this https URL.

Title: DynaWeb: Model-Based Reinforcement Learning of Web Agents

Authors: Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao, Rongzhao Zhang, Lynn Ai, Eric Yang, Tianyu Shi, Lei Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2601.22149
Pdf URL: https://arxiv.org/pdf/2601.22149
Copy Paste: [[2601.22149]] DynaWeb: Model-Based Reinforcement Learning of Web Agents(https://arxiv.org/abs/2601.22149)
Keywords: language model, llm, agent
Abstract: The development of autonomous web agents, powered by Large Language Models (LLMs) and reinforcement learning (RL), represents a significant step towards general-purpose AI assistants. However, training these agents is severely hampered by the challenges of interacting with the live internet, which is inefficient, costly, and fraught with risks. Model-based reinforcement learning (MBRL) offers a promising solution by learning a world model of the environment to enable simulated interaction. This paper introduces DynaWeb, a novel MBRL framework that trains web agents through interacting with a web world model trained to predict naturalistic web page representations given agent actions. This model serves as a synthetic web environment where an agent policy can dream by generating vast quantities of rollout action trajectories for efficient online reinforcement learning. Beyond free policy rollouts, DynaWeb incorporates real expert trajectories from training data, which are randomly interleaved with on-policy rollouts during training to improve stability and sample efficiency. Experiments conducted on the challenging WebArena and WebVoyager benchmarks demonstrate that DynaWeb consistently and significantly improves the performance of state-of-the-art open-source web agent models. Our findings establish the viability of training web agents through imagination, offering a scalable and efficient way to scale up online agentic RL.

Title: Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts

Authors: Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2601.22156
Pdf URL: https://arxiv.org/pdf/2601.22156
Copy Paste: [[2601.22156]] Hybrid Linear Attention Done Right: Efficient Distillation and Effective Architectures for Extremely Long Contexts(https://arxiv.org/abs/2601.22156)
Keywords: long context
Abstract: Hybrid Transformer architectures, which combine softmax attention blocks and recurrent neural networks (RNNs), have shown a desirable performance-throughput tradeoff for long-context modeling, but their adoption and study are hindered by the prohibitive cost of large-scale pre-training from scratch. Some recent studies have shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, which is the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.
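Illustrative arithmetic (not from the paper): the closing claim, 2.3B tokens being less than 0.01% of the pre-training data, implies a lower bound on the size of the pre-training corpus. The short check below uses only the two figures stated in the abstract; the variable names are chosen here for illustration, and integer arithmetic is used to avoid rounding.

```python
# Figures taken from the abstract.
distillation_tokens = 2_300_000_000   # 2.3B tokens used for conversion
fraction_denominator = 10_000         # 0.01% expressed as 1 / 10,000

# If distillation_tokens < corpus / 10,000, then corpus > distillation_tokens * 10,000.
implied_min_corpus_tokens = distillation_tokens * fraction_denominator

if __name__ == "__main__":
    trillions = implied_min_corpus_tokens // 10**12
    print(f"implied pre-training corpus: more than {trillions}T tokens")
```

So the claim is consistent only with a pre-training corpus larger than 23 trillion tokens.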