2025-11-21

Title: What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning

Authors: Jeremias Ferrao, Ezgi Basar, Khondoker Ittehadul Islam, Mahrokh Hassani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.15886
Pdf URL: https://arxiv.org/pdf/2511.15886
Copy Paste: [[2511.15886]] What Really Counts? Examining Step and Token Level Attribution in Multilingual CoT Reasoning(https://arxiv.org/abs/2511.15886)
Keywords: llm, prompt, chain-of-thought
Abstract: This study investigates the attribution patterns underlying Chain-of-Thought (CoT) reasoning in multilingual LLMs. While prior works demonstrate the role of CoT prompting in improving task performance, there are concerns regarding the faithfulness and interpretability of the generated reasoning chains. To assess these properties across languages, we applied two complementary attribution methods (ContextCite for step-level attribution and Inseq for token-level attribution) to the Qwen2.5 1.5B-Instruct model using the MGSM benchmark. Our experiments yield three key findings: (1) attribution scores excessively emphasize the final reasoning step, particularly in incorrect generations; (2) structured CoT prompting significantly improves accuracy primarily for high-resource Latin-script languages; and (3) controlled perturbations via negation and distractor sentences reduce model accuracy and attribution coherence. These findings highlight the limitations of CoT prompting, particularly in terms of multilingual robustness and interpretive transparency.

Title: TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues

Authors: Sarik Ghazarian, Abhinav Gullapalli, Swair Shah, Anurag Beniwal, Nanyun Peng, Narayanan Sadagopan, Zhou Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.15976
Pdf URL: https://arxiv.org/pdf/2511.15976
Copy Paste: [[2511.15976]] TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues(https://arxiv.org/abs/2511.15976)
Keywords: llm, agent
Abstract: In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs' instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs' abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the high-quality ABCD dataset with corresponding conversations under human quality control. We formulate fine-grained constraints and action procedures as multi-level condition-action instruction statements. We design three tasks to comprehensively benchmark LLMs' complex instruction-following capabilities in multi-turn TODs. Task 1 evaluates how LLMs retrieve the most relevant statement from a complex instruction and predict the corresponding next action. In Task 2, we synthesize instruction-violating responses by injecting inconsistencies and manipulating the original instructions, and then we analyze how effectively LLMs can identify instruction-violating responses. Task 3 investigates LLMs' abilities in conditional generation of instruction-following responses based on the original complex instructions. Additionally, we conduct studies on the impact of multilingual settings and different instruction text formats on compliance performance. We release our benchmark under the Llama 3.3 Community License Agreement.

Title: Liars' Bench: Evaluating Lie Detectors for Language Models

Authors: Kieron Kretschmar (1), Walter Laurito (1 and 2), Sharan Maiya (1 and 3), Samuel Marks (4) ((1) Cadenza Labs, (2) FZI, (3) University of Cambridge, (4) Anthropic)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16035
Pdf URL: https://arxiv.org/pdf/2511.16035
Copy Paste: [[2511.16035]] Liars' Bench: Evaluating Lie Detectors for Language Models(https://arxiv.org/abs/2511.16035)
Keywords: language model, llm
Abstract: Prior work has introduced techniques for detecting when large language models (LLMs) lie, that is, generate statements they believe are false. However, these techniques are typically validated in narrow settings that do not capture the diverse lies LLMs can generate. We introduce LIARS' BENCH, a testbed consisting of 72,863 examples of lies and honest responses generated by four open-weight models across seven datasets. Our settings capture qualitatively different types of lies and vary along two dimensions: the model's reason for lying and the object of belief targeted by the lie. Evaluating three black- and white-box lie detection techniques on LIARS' BENCH, we find that existing techniques systematically fail to identify certain types of lies, especially in settings where it's not possible to determine whether the model lied from the transcript alone. Overall, LIARS' BENCH reveals limitations in prior techniques and provides a practical testbed for guiding progress in lie detection.

Title: Learning Tractable Distributions Of Language Model Continuations

Authors: Gwen Yidou-Weng, Ian Li, Anji Liu, Oliver Broadrick, Guy Van den Broeck, Benjie Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16054
Pdf URL: https://arxiv.org/pdf/2511.16054
Copy Paste: [[2511.16054]] Learning Tractable Distributions Of Language Model Continuations(https://arxiv.org/abs/2511.16054)
Keywords: language model
Abstract: Controlled language generation conditions text on sequence-level constraints (for example, syntax, style, or safety). These constraints may depend on future tokens, which makes directly conditioning an autoregressive language model (LM) generally intractable. Prior work uses tractable surrogates such as hidden Markov models (HMMs) to approximate the distribution over continuations and adjust the model's next-token logits at decoding time. However, we find that these surrogates are often weakly context aware, which reduces query quality. We propose Learning to Look Ahead (LTLA), a hybrid approach that pairs the same base language model for rich prefix encoding with a fixed tractable surrogate model that computes exact continuation probabilities. Two efficiency pitfalls arise when adding neural context: (i) naively rescoring the prefix with every candidate next token requires a sweep over the entire vocabulary at each step, and (ii) predicting fresh surrogate parameters for each prefix, although tractable at a single step, forces recomputation of future probabilities for every new prefix and eliminates reuse. LTLA avoids both by using a single batched HMM update to account for all next-token candidates at once, and by conditioning only the surrogate's latent state prior on the LM's hidden representations while keeping the surrogate decoder fixed, so computations can be reused across prefixes. Empirically, LTLA attains higher conditional likelihood than an unconditional HMM, approximates continuation distributions for vision-language models where a standalone HMM cannot encode visual context, and improves constraint satisfaction at comparable fluency on controlled-generation tasks, with minimal inference overhead.
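Note: the reuse argument in the abstract above can be illustrated with a plain HMM. The following is a minimal sketch, not the authors' LTLA implementation: the function names (`backward_vector`, `continuation_prob`) and the toy 2-state, 2-token parameters are assumptions made for illustration. For a fixed HMM, the backward vector of a target continuation depends only on the HMM itself, so it can be computed once and then combined with any prefix-dependent prior over latent states.

```python
# Toy HMM (hypothetical parameters): 2 latent states, vocabulary {0, 1}.
TRANS = [[0.9, 0.1],   # TRANS[s][s2] = P(next state = s2 | state = s)
         [0.2, 0.8]]
EMIT = [[0.7, 0.3],    # EMIT[s][tok] = P(token = tok | state = s)
        [0.1, 0.9]]


def backward_vector(trans, emit, tokens):
    """Return beta where beta[s] = P(tokens | current latent state = s).

    The result depends only on the fixed HMM and the target continuation,
    never on the prefix, so it can be cached and reused.
    """
    n_states = len(trans)
    beta = [1.0] * n_states
    for tok in reversed(tokens):
        # State s emits tok now, then transitions to s2 for the remaining tokens.
        beta = [
            emit[s][tok] * sum(trans[s][s2] * beta[s2] for s2 in range(n_states))
            for s in range(n_states)
        ]
    return beta


def continuation_prob(state_prior, beta):
    """Combine a prefix-dependent prior over latent states with a cached beta."""
    return sum(p * b for p, b in zip(state_prior, beta))


if __name__ == "__main__":
    beta = backward_vector(TRANS, EMIT, [1, 1])   # computed once
    # Two different prefixes only change the prior over latent states.
    print(continuation_prob([1.0, 0.0], beta))    # about 0.108
    print(continuation_prob([0.0, 1.0], beta))    # about 0.702
```

In this sketch the prior is passed in directly; in a neural hybrid of the kind the abstract describes, it would presumably be predicted from the language model's hidden state, which is outside the scope of this example.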

Title: Early science acceleration experiments with GPT-5

Authors: Sébastien Bubeck, Christian Coester, Ronen Eldan, Timothy Gowers, Yin Tat Lee, Alexandru Lupsasca, Mehtaab Sawhney, Robert Scherrer, Mark Sellke, Brian K. Spears, Derya Unutmaz, Kevin Weil, Steven Yin, Nikita Zhivotovskiy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16072
Pdf URL: https://arxiv.org/pdf/2511.16072
Copy Paste: [[2511.16072]] Early science acceleration experiments with GPT-5(https://arxiv.org/abs/2511.16072)
Keywords: gpt
Abstract: AI models like GPT-5 are an increasingly valuable tool for scientists, but many remain unaware of the capabilities of frontier AI. We present a collection of short case studies in which GPT-5 produced new, concrete steps in ongoing research across mathematics, physics, astronomy, computer science, biology, and materials science. In these examples, the authors highlight how AI accelerated their work, and where it fell short; where expert time was saved, and where human input was still key. We document the interactions of the human authors with GPT-5, as guiding examples of fruitful collaboration with AI. Of note, this paper includes four new results in mathematics (carefully verified by the human authors), underscoring how GPT-5 can help human mathematicians settle previously unsolved problems. These contributions are modest in scope but profound in implication, given the rate at which frontier AI is progressing.

Title: ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models

Authors: Qing Zhang, Bing Xu, Xudong Zhang, Yifan Shi, Yang Li, Chen Zhang, Yik Chung Wu, Ngai Wong, Yijie Chen, Hong Dai, Xiansen Chen, Mian Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16122
Pdf URL: https://arxiv.org/pdf/2511.16122
Copy Paste: [[2511.16122]] ELPO: Ensemble Learning Based Prompt Optimization for Large Language Models(https://arxiv.org/abs/2511.16122)
Keywords: language model, llm, prompt
Abstract: The remarkable performance of Large Language Models (LLMs) relies heavily on well-crafted prompts. However, manual prompt engineering is a laborious process, creating a core bottleneck for practical application of LLMs. This phenomenon has led to the emergence of a new research area known as Automatic Prompt Optimization (APO), which has developed rapidly in recent years. Existing APO methods, such as those based on evolutionary algorithms or trial-and-error approaches, achieve efficient and accurate prompt optimization to some extent. However, these studies focus on a single model or algorithm for the generation strategy and optimization process, which limits their performance when handling complex tasks. To address this, we propose a novel framework called Ensemble Learning based Prompt Optimization (ELPO) to achieve more accurate and robust results. Motivated by the idea of ensemble learning, ELPO employs a voting mechanism and introduces shared generation strategies along with different search methods for finding superior prompts. Moreover, ELPO presents more efficient algorithms for the prompt generation and search process. Experimental results demonstrate that ELPO outperforms state-of-the-art prompt optimization methods across different tasks, e.g., improving F1 score by 7.6 on the ArSarcasm dataset.

Title: SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning

Authors: Sebastian Haan
Subjects: cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2511.16198
Pdf URL: https://arxiv.org/pdf/2511.16198
Copy Paste: [[2511.16198]] SemanticCite: Citation Verification with AI-Powered Full-Text Analysis and Evidence-Based Reasoning(https://arxiv.org/abs/2511.16198)
Keywords: language model
Abstract: Effective scientific communication depends on accurate citations that validate sources and guide readers to supporting evidence. Yet academic literature faces mounting challenges: semantic citation errors that misrepresent sources, AI-generated hallucinated references, and traditional citation formats that point to entire papers without indicating which sections substantiate specific claims. We introduce SemanticCite, an AI-powered system that verifies citation accuracy through full-text source analysis while providing rich contextual information via detailed reasoning and relevant text snippets. Our approach combines multiple retrieval methods with a four-class classification system (Supported, Partially Supported, Unsupported, Uncertain) that captures nuanced claim-source relationships and enables appropriate remedial actions for different error types. Our experiments show that fine-tuned lightweight language models achieve performance comparable to large commercial systems with significantly lower computational requirements, making large-scale citation verification practically feasible. The system provides transparent, evidence-based explanations that support user understanding and trust. We contribute a comprehensive dataset of over 1,000 citations with detailed alignments, functional classifications, semantic annotations, and bibliometric metadata across eight disciplines, alongside fine-tuned models and the complete verification framework as open-source software. SemanticCite addresses critical challenges in research integrity through scalable citation verification, streamlined peer review, and quality control for AI-generated content, providing an open-source foundation for maintaining citation accuracy at scale.

Title: SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs

Authors: Xingtao Zhao, Hao Peng, Dingli Su, Xianghua Zeng, Chunyang Liu, Jinzhi Liao, Philip S. Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16275
Pdf URL: https://arxiv.org/pdf/2511.16275
Copy Paste: [[2511.16275]] SeSE: A Structural Information-Guided Uncertainty Quantification Framework for Hallucination Detection in LLMs(https://arxiv.org/abs/2511.16275)
Keywords: language model, llm, hallucination
Abstract: Reliable uncertainty quantification (UQ) is essential for deploying large language models (LLMs) in safety-critical scenarios, as it enables them to abstain from responding when uncertain, thereby avoiding hallucinating falsehoods. However, state-of-the-art UQ methods primarily rely on semantic probability distributions or pairwise distances, overlooking latent semantic structural information that could enable more precise uncertainty estimates. This paper presents Semantic Structural Entropy (SeSE), a principled UQ framework that quantifies the inherent semantic uncertainty of LLMs from a structural information perspective for hallucination detection. Specifically, to effectively model semantic spaces, we first develop an adaptively sparsified directed semantic graph construction algorithm that captures directional semantic dependencies while automatically pruning unnecessary connections that introduce negative interference. We then exploit latent semantic structural information through hierarchical abstraction: SeSE is defined as the structural entropy of the optimal semantic encoding tree, formalizing intrinsic uncertainty within semantic spaces after optimal compression. A higher SeSE value corresponds to greater uncertainty, indicating that LLMs are highly likely to generate hallucinations. In addition, to enhance fine-grained UQ in long-form generation (where existing methods often rely on heuristic sample-and-count techniques), we extend SeSE to quantify the uncertainty of individual claims by modeling their random semantic interactions, providing theoretically explicable hallucination detection. Extensive experiments across 29 model-dataset combinations show that SeSE significantly outperforms advanced UQ baselines, including strong supervised methods and the recently proposed KLE.
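Note: as a reference point for the term "structural entropy" used in the abstract above, the sketch below computes only the simplest variant, the one-dimensional structural entropy of an undirected weighted graph, H = -Σ (d_i / vol) · log2(d_i / vol), where d_i is the weighted degree of node i and vol is the sum of all degrees. This is an assumption-laden simplification: SeSE as described uses a directed, sparsified semantic graph and an optimal encoding tree, neither of which is reproduced here. The function name and the edge-list input format are hypothetical.

```python
import math
from collections import defaultdict


def one_dim_structural_entropy(edges):
    """One-dimensional structural entropy of an undirected weighted graph.

    `edges` is a list of (u, v, weight) tuples; each edge contributes its
    weight to the degree of both endpoints.
    """
    degree = defaultdict(float)
    for u, v, w in edges:
        degree[u] += w
        degree[v] += w
    volume = sum(degree.values())  # equals twice the total edge weight
    if volume == 0:
        return 0.0
    return -sum((d / volume) * math.log2(d / volume) for d in degree.values() if d > 0)


if __name__ == "__main__":
    # Complete graph on 4 nodes: all degrees are equal, so the entropy is log2(4).
    k4 = [(a, b, 1.0) for a in range(4) for b in range(a + 1, 4)]
    print(one_dim_structural_entropy(k4))
    # Star graph: degree mass concentrates on the hub, so the entropy is lower.
    star = [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0)]
    print(one_dim_structural_entropy(star))
```

The comparison between the two toy graphs is only meant to show that more evenly spread connectivity yields a higher value under this formula.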

Title: SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning

Authors: Wei Xia, Zhi-Hong Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16324
Pdf URL: https://arxiv.org/pdf/2511.16324
Copy Paste: [[2511.16324]] SDA: Steering-Driven Distribution Alignment for Open LLMs without Fine-Tuning(https://arxiv.org/abs/2511.16324)
Keywords: language model, llm
Abstract: With the rapid advancement of large language models (LLMs), their deployment in real-world applications has become increasingly widespread. LLMs are expected to deliver robust performance across diverse tasks, user preferences, and practical scenarios. However, as demands grow, ensuring that LLMs produce responses aligned with human intent remains a foundational challenge. In particular, aligning model behavior effectively and efficiently during inference, without costly retraining or extensive supervision, is both a critical requirement and a non-trivial technical endeavor. To address the challenge, we propose SDA (Steering-Driven Distribution Alignment), a training-free and model-agnostic alignment framework designed for open-source LLMs. SDA dynamically redistributes model output probabilities based on user-defined alignment instructions, enhancing alignment between model behavior and human intents without fine-tuning. The method is lightweight, resource-efficient, and compatible with a wide range of open-source LLMs. It can function independently during inference or be integrated with training-based alignment strategies. Moreover, SDA supports personalized preference alignment, enabling flexible control over the model response behavior. Empirical results demonstrate that SDA consistently improves alignment performance across 8 open-source LLMs with varying scales and diverse origins, evaluated on three key alignment dimensions, helpfulness, harmlessness, and honesty (3H). Specifically, SDA achieves average gains of 64.4% in helpfulness, 30% in honesty and 11.5% in harmlessness across the tested models, indicating its effectiveness and generalization across diverse models and application scenarios.

Title: Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement

Authors: Jiashu Yao, Heyan Huang, Shuang Zeng, Chuwei Luo, WangJie You, Jie Tang, Qingsong Liu, Yuhang Guo, Yangyang Kang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.16331
Pdf URL: https://arxiv.org/pdf/2511.16331
Copy Paste: [[2511.16331]] Incorporating Self-Rewriting into Large Language Model Reasoning Reinforcement(https://arxiv.org/abs/2511.16331)
Keywords: language model, llm, prompt
Abstract: Through reinforcement learning (RL) with outcome correctness rewards, large reasoning models (LRMs) with scaled inference computation have demonstrated substantial success on complex reasoning tasks. However, the one-sided reward, focused solely on final correctness, limits its ability to provide detailed supervision over the internal reasoning process. This deficiency leads to suboptimal internal reasoning quality, manifesting as issues like over-thinking, under-thinking, redundant-thinking, and disordered-thinking. Inspired by the recent progress in LRM self-rewarding, we introduce a self-rewriting framework, where a model rewrites its own reasoning texts, and subsequently learns from the rewritten reasoning to improve the internal thought process quality. For algorithm design, we propose a selective rewriting approach wherein only "simple" samples, defined by the model's consistent correctness, are rewritten, thereby preserving all original reward signals of GRPO. For practical implementation, we compile rewriting and vanilla generation within a single batch, maintaining the scalability of the RL algorithm and introducing only ~10% overhead. Extensive experiments on diverse tasks with different model sizes validate the effectiveness of self-rewriting. In terms of the accuracy-length tradeoff, the self-rewriting approach achieves improved accuracy (+0.6) with substantially shorter reasoning (-46%) even without explicit instructions in rewriting prompts to reduce reasoning length, outperforming existing strong baselines. In terms of internal reasoning quality, self-rewriting achieves significantly higher scores (+7.2) under the LLM-as-a-judge metric, successfully mitigating internal reasoning flaws.
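Note: one plausible reading of why rewriting only consistently correct samples leaves the GRPO signal intact is that group-normalized advantages vanish when every sample in a group gets the same reward. The sketch below illustrates that arithmetic only. It is not the authors' code: the function names, the binary 0/1 reward, the use of the population standard deviation, and the `eps` constant are all assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def is_simple(rewards):
    """Treat a prompt as 'simple' if every sampled answer was correct (reward 1)."""
    return all(r == 1.0 for r in rewards)


def select_for_rewriting(groups):
    """Return the prompt ids whose sample groups qualify for rewriting."""
    return [pid for pid, rewards in groups.items() if is_simple(rewards)]


if __name__ == "__main__":
    groups = {
        "easy_prompt": [1.0, 1.0, 1.0, 1.0],   # consistently correct
        "hard_prompt": [1.0, 0.0, 1.0, 0.0],   # mixed outcomes
    }
    for pid, rewards in groups.items():
        print(pid, group_advantages(rewards))
    print(select_for_rewriting(groups))
```

In the consistently correct group every advantage is zero, so those samples carry no outcome-reward gradient in this formulation, while the mixed group keeps its nonzero advantages untouched.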

Title: NLP Datasets for Idiom and Figurative Language Tasks

Authors: Blake Matheny, Phuong Minh Nguyen, Minh Le Nguyen, Stephanie Reynolds
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.16345
Pdf URL: https://arxiv.org/pdf/2511.16345
Copy Paste: [[2511.16345]] NLP Datasets for Idiom and Figurative Language Tasks(https://arxiv.org/abs/2511.16345)
Keywords: language model, llm
Abstract: Idiomatic and figurative language form a large portion of colloquial speech and writing. With social media, this informal language has become more easily observable to people and trainers of large language models (LLMs) alike. While the advantage of large corpora seems like the solution to all machine learning and Natural Language Processing (NLP) problems, idioms and figurative language continue to elude LLMs. Finetuning approaches are proving to be optimal, but better and larger datasets can help narrow this gap even further. The datasets presented in this paper provide one answer, while offering a diverse set of categories on which to build new models and develop new approaches. A selection of recent idiom and figurative language datasets was used to acquire a combined idiom list, which was used to retrieve context sequences from a large corpus. One large-scale dataset of potential idiomatic and figurative language expressions and two additional human-annotated datasets of definite idiomatic and figurative language expressions were created to evaluate the baseline ability of pre-trained language models in handling figurative meaning through idiom recognition (detection) tasks. The resulting datasets were post-processed for model-agnostic training compatibility, utilized in training, and evaluated on slot labeling and sequence tagging.

Title: AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser

Authors: Ren Ma, Jiantao Qiu, Chao Xu, Pei Chu, Kaiwen Liu, Pengli Ren, Yuan Qu, Jiahui Peng, Linfeng Hou, Mengjie Liu, Lindong Lu, Wenchang Ning, Jia Yu, Rui Min, Jin Shi, Haojiong Chen, Peng Zhang, Wenjian Zhang, Qian Jiang, Zengjie Hu, Guoqiang Yang, Zhenxiang Li, Fukai Shang, Zhongying Tu, Wentao Zhang, Dahua Lin, Conghui He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.16397
Pdf URL: https://arxiv.org/pdf/2511.16397
Copy Paste: [[2511.16397]] AICC: Parse HTML Finer, Make Models Better -- A 7.3T AI-Ready Corpus Built by a Model-Based HTML Parser(https://arxiv.org/abs/2511.16397)
Keywords: language model
Abstract: While web data quality is crucial for large language models, most curation efforts focus on filtering and deduplication, treating HTML-to-text extraction as a fixed pre-processing step. Existing web corpora rely on heuristic-based extractors like Trafilatura, which struggle to preserve document structure and frequently corrupt structured elements such as formulas, codes, and tables. We hypothesize that improving extraction quality can be as impactful as aggressive filtering strategies for downstream performance. We introduce MinerU-HTML, a novel extraction pipeline that reformulates content extraction as a sequence labeling problem solved by a 0.6B-parameter language model. Unlike text-density heuristics, MinerU-HTML leverages semantic understanding and employs a two-stage formatting pipeline that explicitly categorizes semantic elements before converting to Markdown. Crucially, its model-based approach is inherently scalable, whereas heuristic methods offer limited improvement pathways. On MainWebBench, our benchmark of 7,887 annotated web pages, MinerU-HTML achieves 81.8% ROUGE-N F1 compared to Trafilatura's 63.6%, with exceptional structured element preservation (90.9% for code blocks, 94.0% for formulas). Using MinerU-HTML, we construct AICC (AI-ready Common Crawl), a 7.3-trillion-token multilingual corpus from two Common Crawl snapshots. In controlled pretraining experiments where AICC and Trafilatura-extracted TfCC undergo identical filtering, models trained on AICC (62B tokens) achieve 50.8% average accuracy across 13 benchmarks, outperforming TfCC by 1.08pp, providing direct evidence that extraction quality significantly impacts model capabilities. AICC also surpasses RefinedWeb and FineWeb on key benchmarks. We publicly release MainWebBench, MinerU-HTML, and AICC, demonstrating that HTML extraction is a critical, often underestimated component of web corpus construction.
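Note: the abstract above reports extraction quality as ROUGE-N F1. For readers unfamiliar with the metric, here is a minimal sketch of how an n-gram overlap F1 between an extracted text and a reference can be computed. The whitespace tokenization, the default `n=2`, and the function names are assumptions; the abstract does not state which N or tokenizer MainWebBench uses.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n=2):
    """ROUGE-N F1: harmonic mean of n-gram precision and recall."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    # Multiset intersection clips each n-gram count to the smaller of the two.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    reference = "the table has three rows and two columns"
    extracted = "the table has three rows and columns"
    print(rouge_n_f1(extracted, reference, n=2))
```

A higher score means the extractor's output shares more n-grams with the annotated reference, which is why dropped or corrupted structured elements lower it.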

Title: ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports

Authors: Sherine George, Nithish Saji
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2511.16438
Pdf URL: https://arxiv.org/pdf/2511.16438
Copy Paste: [[2511.16438]] ESGBench: A Benchmark for Explainable ESG Question Answering in Corporate Sustainability Reports(https://arxiv.org/abs/2511.16438)
Keywords: llm
Abstract: We present ESGBench, a benchmark dataset and evaluation framework designed to assess explainable ESG question answering systems using corporate sustainability reports. The benchmark consists of domain-grounded questions across multiple ESG themes, paired with human-curated answers and supporting evidence to enable fine-grained evaluation of model reasoning. We analyze the performance of state-of-the-art LLMs on ESGBench, highlighting key challenges in factual consistency, traceability, and domain alignment. ESGBench aims to accelerate research in transparent and accountable ESG-focused AI systems.

Title: Anatomy of an Idiom: Tracing Non-Compositionality in Language Models

Authors: Andrew Gomes
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2511.16467
Pdf URL: https://arxiv.org/pdf/2511.16467
Copy Paste: [[2511.16467]] Anatomy of an Idiom: Tracing Non-Compositionality in Language Models(https://arxiv.org/abs/2511.16467)
Keywords: language model
Abstract: We investigate the processing of idiomatic expressions in transformer-based language models using a novel set of techniques for circuit discovery and analysis. First discovering circuits via a modified path patching algorithm, we find that idiom processing exhibits distinct computational patterns. We identify and investigate "Idiom Heads," attention heads that frequently activate across different idioms, as well as enhanced attention between idiom tokens due to earlier processing, which we term "augmented reception." We analyze these phenomena and the general features of the discovered circuits as mechanisms by which transformers balance computational efficiency and robustness. Finally, these findings provide insights into how transformers handle non-compositional language and suggest pathways for understanding the processing of more complex grammatical constructions.
摘要：我们使用一套新颖的电路发现和分析技术来研究基于变压器的语言模型中惯用表达的处理。首先通过修改的路径修补算法发现电路，我们发现习语处理表现出不同的计算模式。我们识别并研究“成语头”，即经常在不同成语之间激活的注意头，以及由于早期处理而增强的成语标记之间的注意力，我们将其称为“增强接收”。我们分析这些现象和所发现电路的一般特征，作为变压器平衡计算效率和鲁棒性的机制。最后，这些发现提供了关于 Transformer 如何处理非组合语言的见解，并为理解更复杂的语法结构的处理提供了建议。

Title: Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks

Authors: Éloïse Benito-Rodriguez, Einar Urdshals, Jasmina Nasufi, Nicky Pochinkov
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2511.16540
Pdf URL: https://arxiv.org/pdf/2511.16540
Copy Paste: [[2511.16540]] Beyond Tokens in Language Models: Interpreting Activations through Text Genre Chunks(https://arxiv.org/abs/2511.16540)
Keywords: language model, llm, prompt
Abstract: Understanding Large Language Models (LLMs) is key to ensuring their safe and beneficial deployment. This task is complicated by the difficulty of interpreting LLM structures and the inability to have all their outputs human-evaluated. In this paper, we present the first step towards a predictive framework in which the genre of a text used to prompt an LLM is predicted from the LLM's activations. Using Mistral-7B and two datasets, we show that genre can be extracted with F1-scores of up to 98% and 71% using scikit-learn classifiers. Across both datasets, results consistently outperform the control task, providing a proof of concept that text genres can be inferred from LLMs with shallow learning models.
摘要：了解大型语言模型 (LLM) 是确保其安全和有益部署的关键。由于法学硕士结构难以解释，并且无法对所有输出进行人工评估，因此这项任务变得复杂。在本文中，我们提出了迈向预测框架的第一步，其中用于提示法学硕士的文本类型是根据其激活来预测的。使用 Mistral-7B 和两个数据集，我们表明使用 scikit-learn 分类器可以以高达 98% 和 71% 的 F1 分数提取流派。在这两个数据集中，结果始终优于控制任务，这提供了一个概念证明，即文本类型可以通过浅层学习模型从法学硕士中推断出来。

Title: WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue

Authors: Zachary Ellis, Jared Joselowitz, Yash Deo, Yajie He, Anna Kalygina, Aisling Higham, Mana Rahimzadeh, Yan Jia, Ibrahim Habli, Ernest Lim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16544
Pdf URL: https://arxiv.org/pdf/2511.16544
Copy Paste: [[2511.16544]] WER is Unaware: Assessing How ASR Errors Distort Clinical Understanding in Patient Facing Dialogue(https://arxiv.org/abs/2511.16544)
Keywords: llm
Abstract: As Automatic Speech Recognition (ASR) is increasingly deployed in clinical dialogue, standard evaluations still rely heavily on Word Error Rate (WER). This paper challenges that standard, investigating whether WER or other common metrics correlate with the clinical impact of transcription errors. We establish a gold-standard benchmark by having expert clinicians compare ground-truth utterances to their ASR-generated counterparts, labeling the clinical impact of any discrepancies found in two distinct doctor-patient dialogue datasets. Our analysis reveals that WER and a comprehensive suite of existing metrics correlate poorly with the clinician-assigned risk labels (No, Minimal, or Significant Impact). To bridge this evaluation gap, we introduce an LLM-as-a-Judge, programmatically optimized using GEPA to replicate expert clinical assessment. The optimized judge (Gemini-2.5-Pro) achieves human-comparable performance, obtaining 90% accuracy and a strong Cohen's $\kappa$ of 0.816. This work provides a validated, automated framework for moving ASR evaluation beyond simple textual fidelity to a necessary, scalable assessment of safety in clinical dialogue.
摘要：随着自动语音识别 (ASR) 越来越多地应用于临床对话中，标准评估仍然严重依赖于单词错误率 (WER)。本文挑战了这一标准，调查了 WER 或其他常见指标是否与转录错误的临床影响相关。我们通过让专家临床医生将真实话语与 ASR 生成的对应话语进行比较，标记两个不同的医患对话数据集中发现的任何差异的临床影响来建立黄金标准基准。我们的分析表明，WER 和一整套现有指标与临床医生指定的风险标签（无、最小或重大影响）相关性较差。为了弥补这一评估差距，我们引入了法学硕士作为法官，使用 GEPA 进行编程优化，以复制专家的临床评估。优化后的判断器（Gemini-2.5-Pro）实现了与人类相当的性能，获得了 90% 的准确率和高达 0.816 的科恩 $\kappa$。这项工作提供了一个经过验证的自动化框架，用于将 ASR 评估从简单的文本保真度转移到临床对话中必要的、可扩展的安全性评估。
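
Note: the agreement statistic reported above, Cohen's $\kappa$, corrects raw accuracy for the agreement two raters would reach by chance. The sketch below is a minimal illustration of that formula only; the label lists are made-up toy data, not the paper's annotations, and the function name `cohens_kappa` is a hypothetical helper rather than anything from the authors' code.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (illustrative only), using the three impact classes named in the abstract.
clinician = ["No", "No", "Minimal", "Significant", "No", "Minimal"]
judge = ["No", "No", "Minimal", "Significant", "Minimal", "Minimal"]
print(round(cohens_kappa(clinician, judge), 3))
```

On this toy data the raters agree on 5 of 6 items, yet $\kappa$ is lower than the raw agreement rate because part of that agreement is expected by chance.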

Title: Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation

Authors: Kexin Zhao, Ken Forbus
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2511.16577
Pdf URL: https://arxiv.org/pdf/2511.16577
Copy Paste: [[2511.16577]] Integrating Symbolic Natural Language Understanding and Language Models for Word Sense Disambiguation(https://arxiv.org/abs/2511.16577)
Keywords: language model, llm
Abstract: Word sense disambiguation is a fundamental challenge in natural language understanding. Current methods are primarily aimed at coarse-grained representations (e.g. WordNet synsets or FrameNet frames) and require hand-annotated training data to construct. This makes it difficult to automatically disambiguate richer representations (e.g. built on OpenCyc) that are needed for sophisticated inference. We propose a method that uses statistical language models as oracles for disambiguation that does not require any hand-annotation of training data. Instead, the multiple candidate meanings generated by a symbolic NLU system are converted into distinguishable natural language alternatives, which are used to query an LLM to select appropriate interpretations given the linguistic context. The selected meanings are propagated back to the symbolic NLU system. We evaluate our method against human-annotated gold answers to demonstrate its effectiveness.
摘要：词义消歧是自然语言理解中的一个基本挑战。当前的方法主要针对粗粒度表示（例如 WordNet 同义词集或 FrameNet 框架），并且需要手动注释的训练数据来构建。这使得自动消除复杂推理所需的更丰富的表示（例如基于 OpenCyc 构建的）的歧义变得困难。我们提出了一种使用统计语言模型作为消歧预言的方法，不需要对训练数据进行任何手动注释。相反，符号 NLU 系统生成的多个候选含义被转换为可区分的自然语言替代方案，这些替代方案用于查询法学硕士以在给定语言上下文的情况下选择适当的解释。选定的含义被传播回符号 NLU 系统。我们根据人类注释的黄金答案评估我们的方法，以证明其有效性。

Title: Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems

Authors: Elias Lumer, Alex Cardenas, Matt Melich, Myles Mason, Sara Dieter, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, Roberto Hernandez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.16654
Pdf URL: https://arxiv.org/pdf/2511.16654
Copy Paste: [[2511.16654]] Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems(https://arxiv.org/abs/2511.16654)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing, storing only text representations in vector databases, which causes loss of contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems: text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate these approaches across 6 LLMs and two multimodal embedding models on a newly created financial earnings call benchmark comprising 40 question-answer pairs, each paired with 2 documents (1 image and 1 text chunk). Experimental results demonstrate that direct multimodal embedding retrieval significantly outperforms LLM-summary-based approaches, achieving absolute improvements of 13% in mean average precision (mAP@5) and 11% in normalized discounted cumulative gain (nDCG@5). These gains correspond to relative improvements of 32% in mAP@5 and 20% in nDCG@5, providing stronger evidence of their practical impact. We additionally find that direct multimodal retrieval produces more accurate and factually consistent answers as measured by LLM-as-a-judge pairwise comparisons. We demonstrate that LLM summarization introduces information loss during preprocessing, whereas direct multimodal embeddings preserve visual context for retrieval and inference.
摘要：检索增强生成 (RAG) 的最新进展使大型语言模型 (LLM) 能够访问包含文本和视觉信息（例如财务文档中的图表、图表和表格）的多模态知识库。然而，现有的多模态 RAG 系统在预处理过程中依赖于基于 LLM 的摘要将图像转换为文本，仅将文本表示存储在向量数据库中，这会导致对下游检索和问答至关重要的上下文信息和视觉细节丢失。为了解决这个限制，我们对多模态 RAG 系统的两种检索方法进行了全面的比较分析，包括基于文本的块检索（在嵌入之前将图像总结为文本）和直接多模态嵌入检索（其中图像本地存储在向量空间中）。我们在新创建的财务收益电话会议基准上评估了 6 个 LLM 模型和两个多模态嵌入模型的所有三种方法，该基准包括 40 个问答对，每个问答对与 2 个文档（1 个图像和 1 个文本块）配对。实验结果表明，直接多模态嵌入检索显着优于基于 LLM 摘要的方法，平均平均精度 (mAP@5) 绝对提高了 13%，归一化贴现累积增益提高了 11%。这些收益相当于 mAP@5 中 32% 的相对改进和 nDCG@5 中 20% 的相对改进，为它们的实际影响提供了更有力的证据。我们还发现，通过法学硕士作为法官的成对比较来衡量，直接多模态检索可以产生更准确且事实上一致的答案。我们证明了 LLM 总结在预处理过程中引入了信息丢失，而直接多模态嵌入则保留了用于检索和推理的视觉上下文。
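
Note: the abstract above quotes both absolute and relative gains, which together imply approximate baseline scores (relative gain = absolute gain / baseline). The sketch below is a back-of-the-envelope check only. It assumes the absolute gains are points on a 0-1 metric scale and that the quoted percentages are rounded, so the implied values are rough estimates and not figures reported by the paper. The helper name `implied_baseline` is hypothetical.

```python
def implied_baseline(absolute_gain, relative_gain):
    """Baseline score implied by an absolute gain and the matching relative gain."""
    if relative_gain <= 0:
        raise ValueError("relative_gain must be positive")
    return absolute_gain / relative_gain


# Gains quoted in the abstract (assumed to be rounded).
map_base = implied_baseline(0.13, 0.32)   # text-based retrieval, mAP@5
ndcg_base = implied_baseline(0.11, 0.20)  # text-based retrieval, nDCG@5

# Implied scores for direct multimodal retrieval: baseline plus absolute gain.
map_direct = map_base + 0.13
ndcg_direct = ndcg_base + 0.11

print(f"mAP@5:  ~{map_base:.2f} -> ~{map_direct:.2f}")
print(f"nDCG@5: ~{ndcg_base:.2f} -> ~{ndcg_direct:.2f}")
```

Because the inputs are rounded, the implied baselines should be read as consistent with the abstract, not as exact results.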

Title: Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs

Authors: Ali Taghibakhshi, Sharath Turuvekere Sreenivas, Saurav Muralidharan, Ruisi Cai, Marcin Chochowski, Ameya Sunil Mahabaleshwarkar, Yoshi Suhara, Oluwatobi Olabiyi, Daniel Korzekwa, Mostofa Patwary, Mohammad Shoeybi, Jan Kautz, Bryan Catanzaro, Ashwath Aithal, Nima Tajbakhsh, Pavlo Molchanov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2511.16664
Pdf URL: https://arxiv.org/pdf/2511.16664
Copy Paste: [[2511.16664]] Nemotron Elastic: Towards Efficient Many-in-One Reasoning LLMs(https://arxiv.org/abs/2511.16664)
Keywords: language model, llm
Abstract: Training a family of large language models targeting multiple scales and deployment objectives is prohibitively expensive, requiring separate training runs for each different size. Recent work on model compression through pruning and knowledge distillation has reduced this cost; however, this process still incurs hundreds of billions of tokens worth of training cost per compressed model. In this paper, we present Nemotron Elastic, a framework for building reasoning-oriented LLMs, including hybrid Mamba-Attention architectures, that embed multiple nested submodels within a single parent model, each optimized for different deployment configurations and budgets. Each of these submodels shares weights with the parent model and can be extracted zero-shot during deployment without additional training or fine-tuning. We enable this functionality through an end-to-end trained router, tightly coupled to a two-stage training curriculum designed specifically for reasoning models. We additionally introduce group-aware SSM elastification that preserves Mamba's structural constraints, heterogeneous MLP elastification, normalized MSE-based layer importance for improved depth selection, and knowledge distillation enabling simultaneous multi-budget optimization. We apply Nemotron Elastic to the Nemotron Nano V2 12B model, simultaneously producing a 9B and a 6B model using only 110B training tokens; this results in over 360x cost reduction compared to training model families from scratch, and around 7x compared to SoTA compression techniques. Each of the nested models performs on par with or better than the SoTA in accuracy. Moreover, unlike other compression methods, the nested capability of our approach allows a many-in-one reasoning model whose deployment memory stays constant with respect to the number of models in the family.
摘要：训练一系列针对多种规模和部署目标的大型语言模型的成本极高，需要针对每个不同的规模进行单独的训练。最近通过剪枝和知识蒸馏进行模型压缩的工作降低了这一成本；然而，这个过程仍然会产生每个压缩模型价值数千亿代币的训练成本。在本文中，我们提出了 Nemotron Elastic，这是一个用于构建面向推理的 LLM 的框架，包括混合 Mamba-Attention 架构，该架构在单个父模型中嵌入多个嵌套子模型，每个模型都针对不同的部署配置和预算进行了优化。这些子模型中的每一个都与父模型共享权重，并且可以在部署期间零样本提取，而无需额外的训练或微调。我们通过端到端训练的路由器来实现此功能，该路由器与专门为推理模型设计的两阶段培训课程紧密耦合。我们还引入了保留 Mamba 结构约束的群体感知 SSM 弹性化、异构 MLP 弹性化、用于改进深度选择的归一化基于 MSE 的层重要性以及支持同步多预算优化的知识蒸馏。我们将 Nemotron Elastic 应用于 Nemotron Nano V2 12B 模型，仅使用 110B 训练令牌同时生成 9B 和 6B 模型；与从头开始训练模型系列相比，这使得成本降低了 360 倍以上，与 SoTA 压缩技术相比，成本降低了约 7 倍。每个嵌套模型的准确度均与 SoTA 相当或更好。此外，与其他压缩方法不同，我们方法的嵌套功能允许拥有多合一推理模型，该模型具有针对系列中模型数量的恒定部署内存。