2026-02-05

Title: Automatic Classification of Pedagogical Materials against CS Curriculum Guidelines

Authors: Erik Saule, Kalpathi Subramanian, Razvan Bunescu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03962
Pdf URL: https://arxiv.org/pdf/2602.03962
Copy Paste: [[2602.03962]] Automatic Classification of Pedagogical Materials against CS Curriculum Guidelines(https://arxiv.org/abs/2602.03962)
Keywords: language model
Abstract: Professional societies often publish curriculum guidelines to help programs align their content to international standards. In Computer Science, the primary standard is published by ACM and IEEE and provides detailed guidelines for what should be and could be included in a Computer Science program. While very helpful, it remains difficult for program administrators to assess how much of the guidelines is being covered by a CS program. This is due in particular to the extensiveness of the guidelines, which contain thousands of individual items. As such, it is time-consuming and cognitively demanding to audit every course to confidently mark everything that is actually being covered. Our preliminary work indicated that it takes about a day of work per course. In this work, we propose using Natural Language Processing techniques to accelerate the process. We explore two kinds of techniques, the first relying on traditional tools for parsing, tagging, and embeddings, while the second leverages the power of Large Language Models. We evaluate the application of these techniques to classify a corpus of pedagogical materials and show that we can meaningfully classify documents automatically.
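The abstract does not say which embeddings or models are used, so the following is only a minimal sketch of the general idea of matching a course document against guideline items. It assumes a plain bag-of-words representation and cosine similarity; the function names and the two guideline strings are hypothetical and are not taken from the ACM/IEEE guidelines.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, whitespace-split token counts (a stand-in for real embeddings)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_guideline_items(document, items):
    """Return (score, item) pairs, best match first."""
    doc_vec = bag_of_words(document)
    scored = [(cosine_similarity(doc_vec, bag_of_words(item)), item) for item in items]
    return sorted(scored, reverse=True)


# Hypothetical course material and guideline items, for illustration only.
ranking = rank_guideline_items(
    "lecture notes on binary search trees and tree traversal",
    ["binary search trees", "network routing protocols"],
)
print(ranking[0][1])
```

A real pipeline would replace `bag_of_words` with learned embeddings or an LLM prompt, but the ranking step would look similar.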

Title: Likelihood-Based Reward Designs for General LLM Reasoning

Authors: Ariel Kwiatkowski, Natasha Butt, Ismail Labiad, Julia Kempe, Yann Ollivier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.03979
Pdf URL: https://arxiv.org/pdf/2602.03979
Copy Paste: [[2602.03979]] Likelihood-Based Reward Designs for General LLM Reasoning(https://arxiv.org/abs/2602.03979)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Fine-tuning large language models (LLMs) on reasoning benchmarks via reinforcement learning requires a specific reward function, often binary, for each benchmark. This comes with two potential limitations: the need to design the reward, and the potentially sparse nature of binary rewards. Here, we systematically investigate rewards derived from the probability or log-probability of emitting the reference answer (or any other prompt continuation present in the data), which have the advantage of not relying on specific verifiers and being available at scale. Several recent works have advocated for the use of similar rewards (e.g., VeriFree, JEPO, RLPR, NOVER). We systematically compare variants of likelihood-based rewards with standard baselines, testing performance both on standard mathematical reasoning benchmarks, and on long-form answers where no external verifier is available. We find that using the log-probability of the reference answer as the reward for chain-of-thought (CoT) learning is the only option that performs well in all setups. This reward is also consistent with the next-token log-likelihood loss used during pretraining. In verifiable settings, log-probability rewards bring comparable or better success rates than reinforcing with standard binary rewards, and yield much better perplexity. In non-verifiable settings, they perform on par with SFT. On the other hand, methods based on probability, such as VeriFree, flatline on non-verifiable settings due to vanishing probabilities of getting the correct answer. Overall, this establishes log-probability rewards as a viable method for CoT fine-tuning, bridging the short, verifiable and long, non-verifiable answer settings.
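To illustrate why probability-based rewards can flatline on long answers while log-probability rewards stay informative, here is a small sketch. It assumes the reward is computed from per-token log-probabilities of the reference answer; the function names and the per-token probability of 0.9 are illustrative assumptions, not values from the paper.

```python
import math


def sequence_log_prob(token_log_probs):
    """Log-probability of the whole reference answer: sum of per-token log-probs."""
    return sum(token_log_probs)


def probability_reward(token_log_probs):
    """Reward equal to the probability of emitting the reference answer."""
    return math.exp(sequence_log_prob(token_log_probs))


def log_probability_reward(token_log_probs):
    """Reward equal to the log-probability of emitting the reference answer."""
    return sequence_log_prob(token_log_probs)


# Assumed toy setting: every reference token has probability 0.9.
short_answer = [math.log(0.9)] * 5      # short, verifiable-style answer
long_answer = [math.log(0.9)] * 2000    # long-form answer

for name, answer in [("short", short_answer), ("long", long_answer)]:
    print(name, probability_reward(answer), log_probability_reward(answer))
```

For the long answer the probability is astronomically small, so differences between rollouts are nearly invisible to the learner, whereas the log-probability remains a moderate negative number that still varies with answer quality.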

Title: Transformers perform adaptive partial pooling

Authors: Vsevolod Kapatsinski
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.03980
Pdf URL: https://arxiv.org/pdf/2602.03980
Copy Paste: [[2602.03980]] Transformers perform adaptive partial pooling(https://arxiv.org/abs/2602.03980)
Keywords: language model, gpt
Abstract: Because language is creative, any reasonable language model must generalize, deciding what to say in novel contexts by using information from similar contexts. But what about contexts that are not novel but merely infrequent? In hierarchical regression, the model's predictions for behavior in a context are affected by observations from other similar contexts to the extent that 1) the current context is infrequent and 2) different contexts behave similarly. This is called adaptive partial pooling of evidence. This paper shows that next-word predictions of a transformer (GPT2) are increasingly unaffected by observations from outside the current context across epochs of training (the amount of pooling reduces with training), and that the extent of pooling is affected by context frequency, context number (type frequency) and context variability in a similar way to hierarchical regression. These characteristics of learning in transformers are argued to be realistic on both rational and empirical grounds.
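The two conditions for pooling named in the abstract can be made concrete with the textbook normal-normal hierarchical model. This is a sketch of that standard model, not of the paper's analysis; `within_var` (noise inside a context) and `between_var` (how much contexts differ from one another) are assumed to be known here.

```python
def partial_pooling_estimate(context_mean, n_obs, grand_mean, within_var, between_var):
    """Shrink a context's own mean toward the grand mean.

    The weight on the context's own data grows with the number of
    observations and with the variance between contexts.
    """
    weight = n_obs / (n_obs + within_var / between_var)
    return weight * context_mean + (1 - weight) * grand_mean


# Context mean 1.0, grand mean 0.0: the estimate equals the weight on own data.
rare = partial_pooling_estimate(1.0, 2, 0.0, within_var=1.0, between_var=1.0)
frequent = partial_pooling_estimate(1.0, 200, 0.0, within_var=1.0, between_var=1.0)
similar_contexts = partial_pooling_estimate(1.0, 2, 0.0, within_var=1.0, between_var=0.1)
print(rare, frequent, similar_contexts)
```

The infrequent context is pulled further toward the grand mean than the frequent one, and pulled further still when contexts behave similarly (small `between_var`).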

Title: On the Credibility of Evaluating LLMs using Survey Questions

Authors: Jindřich Libovický
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2602.04033
Pdf URL: https://arxiv.org/pdf/2602.04033
Copy Paste: [[2602.04033]] On the Credibility of Evaluating LLMs using Survey Questions(https://arxiv.org/abs/2602.04033)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientation. Using the World Values Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance. This metric measures whether LLMs maintain consistent relationships between answers across different questions, as humans do. This indicates that even a high average agreement with human data, when considering LLM responses independently, does not guarantee structural alignment in responses. Additionally, we reveal a weak correlation between two common evaluation metrics, mean-squared distance and KL divergence, which assume that survey answers are independent of each other. For future research, we recommend CoT prompting, sampling-based decoding with dozens of samples, and robust analysis using multiple metrics, including self-correlation distance.
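The abstract does not give formulas for its metrics, so the sketch below only shows one plausible way to compute a KL divergence and a squared distance between mean answers for a single question, and why greedy decoding and sampling can give very different pictures. The human distribution, the answer scale, and the smoothing constant are all assumptions made for illustration.

```python
import math
from collections import Counter


def answer_distribution(samples, options, smoothing=1e-6):
    """Empirical distribution over answer options, lightly smoothed to avoid zeros."""
    counts = Counter(samples)
    total = len(samples) + smoothing * len(options)
    return [(counts[o] + smoothing) / total for o in options]


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same options."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def squared_mean_distance(p, q, options):
    """Squared difference between the mean answers under p and q."""
    mean_p = sum(o * pi for o, pi in zip(options, p))
    mean_q = sum(o * qi for o, qi in zip(options, q))
    return (mean_p - mean_q) ** 2


options = [1, 2, 3, 4]                      # assumed 4-point answer scale
human = [0.1, 0.4, 0.4, 0.1]                # assumed human answer distribution
greedy = answer_distribution([2] * 20, options)                       # always the same answer
sampled = answer_distribution([1] * 2 + [2] * 8 + [3] * 8 + [4] * 2, options)

print(kl_divergence(human, greedy), squared_mean_distance(human, greedy, options))
print(kl_divergence(human, sampled), squared_mean_distance(human, sampled, options))
```

Both metrics treat each question in isolation, which is the independence assumption the abstract points out.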

Title: Abstraction Induces the Brain Alignment of Language and Speech Models

Authors: Emily Cheng, Aditya R. Vaidya, Richard Antonello
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04081
Pdf URL: https://arxiv.org/pdf/2602.04081
Copy Paste: [[2602.04081]] Abstraction Induces the Brain Alignment of Language and Speech Models(https://arxiv.org/abs/2602.04081)
Keywords: language model
Abstract: Research has repeatedly demonstrated that intermediate hidden states extracted from large language models and speech audio models predict measured brain response to natural language stimuli. Yet, very little is known about the representation properties that enable this high prediction performance. Why is it the intermediate layers, and not the output layers, that are most effective for this unique and highly general transfer task? We give evidence that the correspondence between speech and language models and the brain derives from shared meaning abstraction and not their next-word prediction properties. In particular, models construct higher-order linguistic features in their middle layers, cued by a peak in the layerwise intrinsic dimension, a measure of feature complexity. We show that a layer's intrinsic dimension strongly predicts how well it explains fMRI and ECoG signals; that the relation between intrinsic dimension and brain predictivity arises over model pre-training; and that finetuning models to better predict the brain causally increases both the representations' intrinsic dimension and their semantic content. Results suggest that semantic richness, high intrinsic dimension, and brain predictivity mirror each other, and that the key driver of model-brain similarity is rich meaning abstraction of the inputs, where language modeling is a task sufficiently complex (but perhaps not the only one) to require it.

Title: Expert Selections In MoE Models Reveal (Almost) As Much As Text

Authors: Amir Nuriyev, Gabriel Kulp
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2602.04105
Pdf URL: https://arxiv.org/pdf/2602.04105
Copy Paste: [[2602.04105]] Expert Selections In MoE Models Reveal (Almost) As Much As Text(https://arxiv.org/abs/2602.04105)
Keywords: language model
Abstract: We present a text-reconstruction attack on mixture-of-experts (MoE) language models that recovers tokens from expert selections alone. In MoE models, each token is routed to a subset of expert subnetworks; we show these routing decisions leak substantially more information than previously understood. Prior work using logistic regression achieves limited reconstruction; we show that a 3-layer MLP improves this to 63.1% top-1 accuracy, and that a transformer-based sequence decoder recovers 91.2% of tokens top-1 (94.8% top-10) on 32-token sequences from OpenWebText after training on 100M tokens. These results connect MoE routing to the broader literature on embedding inversion. We outline practical leakage scenarios (e.g., distributed inference and side channels) and show that adding noise reduces but does not eliminate reconstruction. Our findings suggest that expert selections in MoE deployments should be treated as being as sensitive as the underlying text.
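As background for why expert selections can carry information about tokens, here is a minimal sketch of top-k routing in a single MoE layer. The router weights and token vectors are tiny hand-picked assumptions; real models route per layer with learned weights, and the attack in the abstract trains a decoder on such selections rather than reading them off directly.

```python
def top_k_experts(hidden, router_weights, k=2):
    """Return the indices of the k experts with the highest router scores."""
    scores = [sum(h * w for h, w in zip(hidden, row)) for row in router_weights]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return tuple(sorted(ranked[:k]))


# Hypothetical router with 4 experts over a 2-dimensional hidden state.
router_weights = [[1, 0], [0, 1], [-1, 0], [0, -1]]

# Hypothetical hidden states for two different tokens.
token_a = [0.9, 0.2]
token_b = [-0.5, 0.8]

print(top_k_experts(token_a, router_weights))
print(top_k_experts(token_b, router_weights))
```

Because the selection is a deterministic function of the token's hidden state, different tokens tend to produce different selection patterns, which is the signal a reconstruction model can learn from.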

Title: DELTA: Deliberative Multi-Agent Reasoning with Reinforcement Learning for Multimodal Psychological Counseling

Authors: Jiangnan Yang, Junjie Chen, Fei Wang, Yiqi Nie, Yuxin Liu, Zhangling Duan, Jie Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04112
Pdf URL: https://arxiv.org/pdf/2602.04112
Copy Paste: [[2602.04112]] DELTA: Deliberative Multi-Agent Reasoning with Reinforcement Learning for Multimodal Psychological Counseling(https://arxiv.org/abs/2602.04112)
Keywords: agent
Abstract: Psychological counseling is a fundamentally multimodal cognitive process in which clinicians integrate verbal content with visual and vocal cues to infer clients' mental states and respond empathically. However, most existing language-model-based counseling systems operate on text alone and rely on implicit mental state inference. We introduce DELTA, a deliberative multi-agent framework that models counseling as a structured reasoning process over multimodal signals, separating evidence grounding, mental state abstraction, and response generation. DELTA further incorporates reinforcement learning guided by a distribution-level Emotion Attunement Score to encourage emotionally attuned responses. Experiments on a multimodal counseling benchmark show that DELTA improves both counseling quality and emotion attunement across models. Ablation and qualitative analyses suggest that explicit multimodal reasoning and structured mental state representations play complementary roles in supporting empathic human-AI interaction.

Title: The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment

Authors: Zhexin Zhang, Yida Lu, Junfeng Fang, Junxiao Yang, Shiyao Cui, Hao Zhou, Fandong Meng, Jie Zhou, Hongning Wang, Minlie Huang, Tat-Seng Chua
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.04196
Pdf URL: https://arxiv.org/pdf/2602.04196
Copy Paste: [[2602.04196]] The Missing Half: Unveiling Training-time Implicit Safety Risks Beyond Deployment(https://arxiv.org/abs/2602.04196)
Keywords: agent
Abstract: Safety risks of AI models have been widely studied at deployment time, such as jailbreak attacks that elicit harmful outputs. In contrast, safety risks emerging during training remain largely unexplored. Beyond explicit reward hacking that directly manipulates explicit reward functions in reinforcement learning, we study implicit training-time safety risks: harmful behaviors driven by a model's internal incentives and contextual background information. For example, during code-based reinforcement learning, a model may covertly manipulate logged accuracy for self-preservation. We present the first systematic study of this problem, introducing a taxonomy with five risk levels, ten fine-grained risk categories, and three incentive types. Extensive experiments reveal the prevalence and severity of these risks: notably, Llama-3.1-8B-Instruct exhibits risky behaviors in 74.4% of training runs when provided only with background information. We further analyze factors influencing these behaviors and demonstrate that implicit training-time risks also arise in multi-agent training settings. Our results identify an overlooked yet urgent safety challenge in training.

Title: From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents

Authors: Xinyue Wang, Yuanhe Zhang, Zhengshuo Gong, Haoran Gao, Fanyu Meng, Zhenhong Zhou, Li Sun, Yang Liu, Sen Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.04197
Pdf URL: https://arxiv.org/pdf/2602.04197
Copy Paste: [[2602.04197]] From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents(https://arxiv.org/abs/2602.04197)
Keywords: llm, agent
Abstract: The enhanced capabilities of LLM-based agents come with the emergence of model planning and tool-use abilities. Owing to the helpful-harmless trade-off in LLM alignment, agents typically also inherit the flaw of "over-refusal", which is a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade-off. We term this phenomenon "Toxic Proactivity": an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness" is maintained. Existing research pays little attention to identifying this behavior, as it often lacks the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma-driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi-step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.

Title: Enforcing Monotonic Progress in Legal Cross-Examination: Preventing Long-Horizon Stagnation in LLM-Based Inquiry

Authors: Hsien-Jyh Liao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.04206
Pdf URL: https://arxiv.org/pdf/2602.04206
Copy Paste: [[2602.04206]] Enforcing Monotonic Progress in Legal Cross-Examination: Preventing Long-Horizon Stagnation in LLM-Based Inquiry(https://arxiv.org/abs/2602.04206)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit impressive linguistic fluency but struggle to reliably complete long-horizon tasks under explicit procedural constraints. In legal cross-examination, purely probabilistic generation often maintains behavioral coherence while failing to ensure procedural advancement. We characterize this failure as procedural stagnation and propose Soft-FSM, a neuro-symbolic architecture that enforces monotonic progress over accumulated Key Information Units (KIUs) via an external deterministic state controller. Experiments on three real-world Taiwanese criminal homicide cases show that baseline methods collapse below 40% completeness, while Soft-FSM consistently achieves over 97% with near-zero redundancy. These results suggest that, in such domains, reliable task completion cannot be guaranteed by emergent LLM behavior alone, but can be reliably enforced through explicit and verifiable external state control.
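The abstract does not describe Soft-FSM's internals, so the following is only a sketch of the general idea of an external deterministic controller that enforces monotonic progress over a checklist of Key Information Units. The class, its methods, and the example KIU names are hypothetical.

```python
class ProgressController:
    """Tracks which required KIUs have been covered and rejects repeats."""

    def __init__(self, required_kius):
        self.required = list(required_kius)
        self.covered = set()

    def next_target(self):
        """Return the first KIU not yet covered, or None when finished."""
        for kiu in self.required:
            if kiu not in self.covered:
                return kiu
        return None

    def submit(self, kiu):
        """Accept a turn only if it covers a new required KIU."""
        if kiu not in self.required or kiu in self.covered:
            return False  # off-checklist or redundant: no progress is recorded
        self.covered.add(kiu)
        return True

    def completeness(self):
        """Fraction of required KIUs covered so far."""
        return len(self.covered) / len(self.required)


# Hypothetical checklist for an examination.
controller = ProgressController(["time", "place", "weapon"])
print(controller.submit("time"))    # new KIU
print(controller.submit("time"))    # repeated KIU
print(controller.next_target(), controller.completeness())
```

In a full system, the language model would propose each question and the controller would decide whether it advances the state, so coverage can only grow and never stall on repeated questions.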

Title: Language Models Struggle to Use Representations Learned In-Context

Authors: Michael A. Lepori, Tal Linzen, Ann Yuan, Katja Filippova
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.04212
Pdf URL: https://arxiv.org/pdf/2602.04212
Copy Paste: [[2602.04212]] Language Models Struggle to Use Representations Learned In-Context(https://arxiv.org/abs/2602.04212)
Keywords: language model, llm
Abstract: Though large language models (LLMs) have enabled great success across a wide variety of tasks, they still appear to fall short of one of the loftier goals of artificial intelligence research: creating an artificial system that can adapt its behavior to radically new contexts upon deployment. One important step towards this goal is to create systems that can induce rich representations of data that are seen in-context, and then flexibly deploy these representations to accomplish goals. Recently, Park et al. (2024) demonstrated that current LLMs are indeed capable of inducing such representation from context (i.e., in-context representation learning). The present study investigates whether LLMs can use these representations to complete simple downstream tasks. We first assess whether open-weights LLMs can use in-context representations for next-token prediction, and then probe models using a novel task, adaptive world modeling. In both tasks, we find evidence that open-weights LLMs struggle to deploy representations of novel semantics that are defined in-context, even if they encode these semantics in their latent representations. Furthermore, we assess closed-source, state-of-the-art reasoning models on the adaptive world modeling task, demonstrating that even the most performant LLMs cannot reliably leverage novel patterns presented in-context. Overall, this work seeks to inspire novel methods for encouraging models to not only encode information presented in-context, but to do so in a manner that supports flexible deployment of this information.

Title: Tokenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation

Authors: Nuo Xu, Ahrii Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04241
Pdf URL: https://arxiv.org/pdf/2602.04241
Copy Paste: [[2602.04241]] Tokenization and Morphological Fidelity in Uralic NLP: A Cross-Lingual Evaluation(https://arxiv.org/abs/2602.04241)
Keywords: language model
Abstract: Subword tokenization critically affects Natural Language Processing (NLP) performance, yet its behavior in morphologically rich and low-resource language families remains under-explored. This study systematically compares three subword paradigms -- Byte Pair Encoding (BPE), Overlap BPE (OBPE), and Unigram Language Model -- across six Uralic languages with varying resource availability and typological diversity. Using part-of-speech (POS) tagging as a controlled downstream task, we show that OBPE consistently achieves stronger morphological alignment and higher tagging accuracy than conventional methods, particularly within the Latin-script group. These gains arise from reduced fragmentation in open-class categories and a better balance across the frequency spectrum. Transfer efficacy further depends on the downstream tagging architecture, interacting with both training volume and genealogical proximity. Taken together, these findings highlight that morphology-sensitive tokenization is not merely a preprocessing choice but a decisive factor in enabling effective cross-lingual transfer for agglutinative, low-resource languages.
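For readers unfamiliar with the baseline paradigm, here is a sketch of a single training step of plain Byte Pair Encoding: count adjacent symbol pairs, then merge the most frequent one. It does not implement OBPE or the Unigram Language Model, and the toy word counts are assumptions chosen for illustration.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs, weighted by word frequency, and return the top one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Toy corpus: words split into characters, with assumed frequencies.
words = {
    ("t", "a", "l", "o"): 5,
    ("t", "a", "l", "o", "s", "s", "a"): 3,
    ("k", "a", "l", "a"): 2,
}
pair = most_frequent_pair(words)
print(pair, merge_pair(words, pair))
```

Repeating this step builds the merge table. Note that merges follow frequency rather than morpheme boundaries, which is why fragmentation of stems and suffixes is a concern for agglutinative languages.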

Title: CoLT: Reasoning with Chain of Latent Tool Calls

Authors: Fangwei Zhu, Zhifang Sui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04246
Pdf URL: https://arxiv.org/pdf/2602.04246
Copy Paste: [[2602.04246]] CoLT: Reasoning with Chain of Latent Tool Calls(https://arxiv.org/abs/2602.04246)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) is a critical technique in enhancing the reasoning ability of Large Language Models (LLMs), and latent reasoning methods have been proposed to accelerate the inefficient token-level reasoning chain. We notice that existing latent reasoning methods generally require model structure augmentation and exhaustive training, limiting their broader applicability. In this paper, we propose CoLT, a novel framework that implements latent reasoning as "tool calls". Instead of reasoning entirely in the latent space, CoLT generates seed tokens that contain information of a reasoning step. When a latent tool call is triggered, a smaller external model will take the hidden states of seed tokens as its input, and unpack the seed tokens back to a full reasoning step. In this way, we can ensure that the main model reasons in the explicit token space, preserving its ability while improving efficiency. Experimental results on four mathematical datasets demonstrate that CoLT achieves higher accuracy and shorter reasoning length than baseline latent models, and is compatible with reinforcement learning algorithms and different decoder structures.

Title: Scaling Agentic Verifier for Competitive Coding

Authors: Zeyao Ma, Jing Zhang, Xiaokang Zhang, Jiaxi Yang, Zongmeng Zhang, Jiajun Zhang, Yuheng Jing, Lei Zhang, Hao Zheng, Wenting Zhao, Junyang Lin, Binyuan Hui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04254
Pdf URL: https://arxiv.org/pdf/2602.04254
Copy Paste: [[2602.04254]] Scaling Agentic Verifier for Competitive Coding(https://arxiv.org/abs/2602.04254)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have demonstrated strong coding capabilities but still struggle to solve competitive programming problems correctly in a single attempt. Execution-based re-ranking offers a promising test-time scaling strategy, yet existing methods are constrained by either difficult test case generation or inefficient random input sampling. To address this limitation, we propose Agentic Verifier, an execution-based agent that actively reasons about program behaviors and searches for highly discriminative test inputs that expose behavioral discrepancies among candidate solutions. Through multi-turn interaction with code execution environments, the verifier iteratively refines the candidate input generator and produces targeted counterexamples rather than blindly sampling inputs. We train the verifier to acquire this discriminative input generation capability via a scalable pipeline combining large-scale data synthesis, rejection fine-tuning, and agentic reinforcement learning. Extensive experiments across five competitive programming benchmarks demonstrate consistent improvements over strong execution-based baselines, achieving up to +10-15% absolute gains in Best@K accuracy. Further analysis reveals clear test-time scaling behavior and highlights the verifier's broader potential beyond reranking.
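To make "discriminative test inputs" concrete, here is a sketch of execution-based re-ranking over a few candidate programs. It assumes candidates are plain Python callables and that ranking uses majority agreement on outputs, which is a common heuristic and not necessarily the paper's scoring rule. The three candidate functions are hypothetical.

```python
from collections import Counter


def is_discriminative(candidates, test_input):
    """An input is discriminative if at least two candidates disagree on it."""
    outputs = {repr(c(test_input)) for c in candidates}
    return len(outputs) > 1


def rerank_by_agreement(candidates, inputs):
    """Rank candidate indices by how often they match the majority output."""
    scores = [0] * len(candidates)
    for x in inputs:
        outputs = [repr(c(x)) for c in candidates]
        majority, _ = Counter(outputs).most_common(1)[0]
        for i, out in enumerate(outputs):
            scores[i] += out == majority
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Hypothetical candidate solutions for "absolute value"; the third is buggy.
candidates = [
    lambda x: abs(x),
    lambda x: x if x > 0 else -x,
    lambda x: x,
]

print(is_discriminative(candidates, 3))    # all candidates agree on this input
print(is_discriminative(candidates, -3))   # this input exposes the buggy candidate
print(rerank_by_agreement(candidates, [3, -3]))
```

Randomly sampled inputs may rarely hit the case that separates candidates; the verifier described in the abstract is trained to search for such inputs deliberately.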

Title: ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation

Authors: Jiarui Jin, Haoyu Wang, Xingliang Wu, Xiaocheng Fang, Xiang Lan, Zihan Wang, Deyun Zhang, Bo Liu, Yingying Zhang, Xian Wu, Hongyan Li, Shenda Hong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04279
Pdf URL: https://arxiv.org/pdf/2602.04279
Copy Paste: [[2602.04279]] ECG-R1: Protocol-Guided and Modality-Agnostic MLLM for Reliable ECG Interpretation(https://arxiv.org/abs/2602.04279)
Keywords: language model, llm, hallucination
Abstract: Electrocardiography (ECG) serves as an indispensable diagnostic tool in clinical practice, yet existing multimodal large language models (MLLMs) remain unreliable for ECG interpretation, often producing plausible but clinically incorrect analyses. To address this, we propose ECG-R1, the first reasoning MLLM designed for reliable ECG interpretation via three innovations. First, we construct the interpretation corpus using Protocol-Guided Instruction Data Generation, grounding interpretation in measurable ECG features and monograph-defined quantitative thresholds and diagnostic logic. Second, we present a modality-decoupled architecture with Interleaved Modality Dropout to improve robustness and cross-modal consistency when either the ECG signal or ECG image is missing. Third, we present Reinforcement Learning with ECG Diagnostic Evidence Rewards to strengthen evidence-grounded ECG interpretation. Additionally, we systematically evaluate the ECG interpretation capabilities of proprietary, open-source, and medical MLLMs, and provide the first quantitative evidence that severe hallucinations are widespread, suggesting that the public should not directly trust these outputs without independent verification. Code and data are publicly available, and an online platform is also accessible.

Title: Contextual Drag: How Errors in the Context Affect LLM Reasoning

Authors: Yun Cheng, Xingyu Zhu, Haoyu Zhao, Sanjeev Arora
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.04288
Pdf URL: https://arxiv.org/pdf/2602.04288
Copy Paste: [[2602.04288]] Contextual Drag: How Errors in the Context Affect LLM Reasoning(https://arxiv.org/abs/2602.04288)
Keywords: language model, llm
Abstract: Central to many self-improvement pipelines for large language models (LLMs) is the assumption that models can improve by reflecting on past mistakes. We study a phenomenon termed contextual drag: the presence of failed attempts in the context biases subsequent generations toward structurally similar errors. Across evaluations of 11 proprietary and open-weight models on 8 reasoning tasks, contextual drag induces 10-20% performance drops, and iterative self-refinement in models with severe contextual drag can collapse into self-deterioration. Structural analysis using tree edit distance reveals that subsequent reasoning trajectories inherit structurally similar error patterns from the context. We demonstrate that neither external feedback nor successful self-verification suffices to eliminate this effect. While mitigation strategies such as fallback-behavior fine-tuning and context denoising yield partial improvements, they fail to fully restore baseline performance, positioning contextual drag as a persistent failure mode in current reasoning architectures.
摘要：大型语言模型 (LLM) 的许多自我改进流程的核心是假设模型可以通过反思过去的错误来改进。我们研究了一种称为情境阻力的现象：情境中失败尝试的存在会使后代偏向于结构上相似的错误。在对 8 项推理任务的 11 个专有和开放权重模型进行评估时，上下文拖拽会导致 10-20% 的性能下降，并且具有严重上下文拖拽的模型中的迭代自我完善可能会陷入自我恶化。使用树编辑距离的结构分析表明，后续推理轨迹从上下文继承了结构上相似的错误模式。我们证明外部反馈和成功的自我验证都不足以消除这种影响。虽然回退行为微调和上下文去噪等缓解策略产生了部分改进，但它们无法完全恢复基线性能，将上下文拖累定位为当前推理架构中的持久故障模式。

Title: Proxy Compression for Language Modeling

Authors: Lin Zheng, Xinyu Li, Qian Liu, Xiachong Feng, Lingpeng Kong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.04289
Pdf URL: https://arxiv.org/pdf/2602.04289
Copy Paste: [[2602.04289]] Proxy Compression for Language Modeling(https://arxiv.org/abs/2602.04289)
Keywords: language model
Abstract: Modern language models are trained almost exclusively on token sequences produced by a fixed tokenizer, an external lossless compressor often over UTF-8 byte sequences, thereby coupling the model to that compressor. This work introduces proxy compression, an alternative training scheme that preserves the efficiency benefits of compressed inputs while providing an end-to-end, raw-byte interface at inference time. During training, one language model is jointly trained on raw byte sequences and compressed views generated by external compressors; through the process, the model learns to internally align compressed sequences and raw bytes. This alignment enables strong transfer between the two formats, even when training predominantly on compressed inputs which are discarded at inference. Extensive experiments on code language modeling demonstrate that proxy compression substantially improves training efficiency and significantly outperforms pure byte-level baselines given fixed compute budgets. As model scale increases, these gains become more pronounced, and proxy-trained models eventually match or rival tokenizer approaches, all while operating solely on raw bytes and retaining the inherent robustness of byte-level modeling.
摘要：现代语言模型几乎完全基于固定标记生成器（通常是 UTF-8 字节序列的外部无损压缩器）生成的标记序列进行训练，从而将模型与该压缩器耦合。这项工作引入了代理压缩，这是一种替代训练方案，可以保留压缩输入的效率优势，同时在推理时提供端到端的原始字节接口。在训练过程中，一种语言模型在原始字节序列和外部压缩器生成的压缩视图上进行联合训练；通过这个过程，模型学习在内部对齐压缩序列和原始字节。这种对齐可以实现两种格式之间的强大传输，即使主要在推理时丢弃的压缩输入上进行训练也是如此。代码语言建模的大量实验表明，在给定固定计算预算的情况下，代理压缩极大地提高了训练效率，并且显着优于纯字节级基线。随着模型规模的增加，这些收益变得更加明显，代理训练的模型最终会匹配或竞争分词器方法，同时仅对原始字节进行操作并保留字节级建模的固有鲁棒性。

Title: Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision

Authors: Lingzhuang Sun, Ruitong Liu, Yuxia Zhu, Xiaohan Xu, Jingxuan Wei, Xiangxiang Zhang, Bihui Yu, Wentao Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04290
Pdf URL: https://arxiv.org/pdf/2602.04290
Copy Paste: [[2602.04290]] Guided Verifier: Collaborative Multimodal Reasoning via Dynamic Process Supervision(https://arxiv.org/abs/2602.04290)
Keywords: language model, llm, hallucination
Abstract: Reinforcement Learning (RL) has emerged as a pivotal mechanism for enhancing the complex reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevailing paradigms typically rely on solitary rollout strategies where the model works alone. This lack of intermediate oversight renders the reasoning process susceptible to error propagation, where early logical deviations cascade into irreversible failures, resulting in noisy optimization signals. In this paper, we propose the Guided Verifier framework to address these structural limitations. Moving beyond passive terminal rewards, we introduce a dynamic verifier that actively co-solves tasks alongside the policy. During the rollout phase, this verifier interacts with the policy model in real time, detecting inconsistencies and providing directional signals to steer the model toward valid trajectories. To facilitate this, we develop a specialized data synthesis pipeline targeting multimodal hallucinations, constructing the CoRe dataset of process-level negatives and Correct-guide Reasoning trajectories to train the guided verifier. Extensive experiments on MathVista, MathVerse and MMMU indicate that by allocating compute to collaborative inference and dynamic verification, an 8B-parameter model can achieve strong performance.
摘要：强化学习（RL）已成为增强多模态大型语言模型（MLLM）复杂推理能力的关键机制。然而，流行的范式通常依赖于模型独自工作的单独 rollout 策略。缺乏中间监督使得推理过程容易受到错误传播的影响，早期的逻辑偏差会级联成不可逆转的失败，从而导致嘈杂的优化信号。在本文中，我们提出了 Guided Verifier 框架来解决这些结构性限制。不再局限于被动的终端奖励，我们引入了一个动态验证器，它可以与策略一起主动共同解决任务。在 rollout 阶段，该验证器与策略模型实时交互，检测不一致并提供方向信号以引导模型走向有效的轨迹。为此，我们开发了一个针对多模态幻觉的专门数据合成管道，构建由过程级负样本和 Correct-guide Reasoning 轨迹组成的 CoRe 数据集来训练引导验证器。MathVista、MathVerse 和 MMMU 上的大量实验表明，通过将计算分配给协作推理和动态验证，8B 参数模型可以获得强大的性能。

Title: How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks

Authors: Yanshu Wang, Shuaishuai Yang, Jingjing He, Tong Yang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2602.04294
Pdf URL: https://arxiv.org/pdf/2602.04294
Copy Paste: [[2602.04294]] How Few-shot Demonstrations Affect Prompt-based Defenses Against LLM Jailbreak Attacks(https://arxiv.org/abs/2602.04294)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) face increasing threats from jailbreak attacks that bypass safety alignment. While prompt-based defenses such as Role-Oriented Prompts (RoP) and Task-Oriented Prompts (ToP) have shown effectiveness, the role of few-shot demonstrations in these defense strategies remains unclear. Prior work suggests that few-shot examples may compromise safety, but lacks investigation into how few-shot interacts with different system prompt strategies. In this paper, we conduct a comprehensive evaluation on multiple mainstream LLMs across four safety benchmarks (AdvBench, HarmBench, SG-Bench, XSTest) using six jailbreak attack methods. Our key finding reveals that few-shot demonstrations produce opposite effects on RoP and ToP: few-shot enhances RoP's safety rate by up to 4.5% through reinforcing role identity, while it degrades ToP's effectiveness by up to 21.2% through distracting attention from task instructions. Based on these findings, we provide practical recommendations for deploying prompt-based defenses in real-world LLM applications.
摘要：大型语言模型 (LLM) 面临着越来越多的绕过安全对齐的越狱攻击的威胁。虽然基于提示的防御，例如面向角色的提示 (RoP) 和面向任务的提示 (ToP) 已显示出有效性，但少样本（few-shot）示例在这些防御策略中的作用仍不清楚。之前的工作表明，few-shot 示例可能会损害安全性，但缺乏对 few-shot 如何与不同系统提示策略交互的研究。在本文中，我们使用六种越狱攻击方法，在四个安全基准（AdvBench、HarmBench、SG-Bench、XSTest）上对多个主流 LLM 进行了综合评估。我们的主要发现表明，few-shot 示例对 RoP 和 ToP 产生相反的影响：few-shot 通过强化角色身份，将 RoP 的安全率最多提高 4.5%，而通过分散对任务指令的注意力，使 ToP 的有效性最多降低 21.2%。基于这些发现，我们为在现实世界的 LLM 应用中部署基于提示的防御提供了实用的建议。

Title: Revisiting Prompt Sensitivity in Large Language Models for Text Classification: The Role of Prompt Underspecification

Authors: Branislav Pecher, Michal Spiegel, Robert Belanec, Jan Cegin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.04297
Pdf URL: https://arxiv.org/pdf/2602.04297
Copy Paste: [[2602.04297]] Revisiting Prompt Sensitivity in Large Language Models for Text Classification: The Role of Prompt Underspecification(https://arxiv.org/abs/2602.04297)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are widely used as zero-shot and few-shot classifiers, where task behaviour is largely controlled through prompting. A growing number of works have observed that LLMs are sensitive to prompt variations, with small changes leading to large changes in performance. However, in many cases, the investigation of sensitivity is performed using underspecified prompts that provide minimal task instructions and weakly constrain the model's output space. In this work, we argue that a significant portion of the observed prompt sensitivity can be attributed to prompt underspecification. We systematically study and compare the sensitivity of underspecified prompts and prompts that provide specific instructions. Utilising performance analysis, logit analysis, and linear probing, we find that underspecified prompts exhibit higher performance variance and lower logit values for relevant tokens, while instruction-prompts suffer less from such problems. However, linear probing analysis suggests that the effects of prompt underspecification have only a marginal impact on the internal LLM representations, instead emerging in the final layers. Overall, our findings highlight the need for more rigour when investigating and mitigating prompt sensitivity.
摘要：大型语言模型（LLM）被广泛用作零样本和少样本分类器，其中任务行为主要通过提示来控制。越来越多的研究发现，LLM 对提示的变化很敏感，微小的变化会导致性能的巨大变化。然而，在许多情况下，对敏感性的研究是使用欠规范的提示进行的，这些提示只提供最少的任务指令，对模型输出空间的约束也很弱。在这项工作中，我们认为观察到的提示敏感性有很大一部分可归因于提示欠规范。我们系统地研究和比较欠规范提示与提供具体指令的提示的敏感性。利用性能分析、logit 分析和线性探测，我们发现欠规范提示表现出较高的性能方差和相关标记的较低 logit 值，而指令提示则较少出现此类问题。然而，线性探测分析表明，提示欠规范对 LLM 内部表示只有边际影响，其效应是在最后几层中才显现的。总体而言，我们的研究结果强调在研究和缓解提示敏感性时需要更加严谨。

Title: DeFrame: Debiasing Large Language Models Against Framing Effects

Authors: Kahee Lim, Soyeon Kim, Steven Euijong Whang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.04306
Pdf URL: https://arxiv.org/pdf/2602.04306
Copy Paste: [[2602.04306]] DeFrame: Debiasing Large Language Models Against Framing Effects(https://arxiv.org/abs/2602.04306)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.
摘要：随着大型语言模型 (LLM) 越来越多地部署在现实世界的应用中，确保其在不同人群中的公平响应变得至关重要。尽管做出了许多努力，但持续存在的挑战是隐藏的偏见：LLM 在标准评估下显得公平，但在这些评估设置之外可能会产生有偏见的响应。在本文中，我们将框架（framing），即语义上等效的提示在表达方式上的差异（例如，“A 比 B 更好”与“B 比 A 更差”），确定为造成这一差距的一个尚未得到充分探索的因素。我们首先引入“框架差异”的概念来量化框架对公平性评估的影响。通过使用替代框架增强公平性评估基准，我们发现（1）公平性分数随框架而显著变化，（2）现有的去偏方法提高了整体（即框架平均）公平性，但通常无法减少框架引起的差异。为了解决这个问题，我们提出了一种框架感知去偏方法，鼓励 LLM 在不同框架之间更加一致。实验表明，我们的方法减少了总体偏差，提高了针对框架差异的鲁棒性，使 LLM 能够产生更公平、更一致的响应。

Title: Can Vision Replace Text in Working Memory? Evidence from Spatial n-Back in Vision-Language Models

Authors: Sichu Liang, Hongyu Zhu, Wenwen Wang, Deyu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04355
Pdf URL: https://arxiv.org/pdf/2602.04355
Copy Paste: [[2602.04355]] Can Vision Replace Text in Working Memory? Evidence from Spatial n-Back in Vision-Language Models(https://arxiv.org/abs/2602.04355)
Keywords: language model
Abstract: Working memory is a central component of intelligent behavior, providing a dynamic workspace for maintaining and updating task-relevant information. Recent work has used n-back tasks to probe working-memory-like behavior in large language models, but it is unclear whether the same probe elicits comparable computations when information is carried in a visual rather than textual code in vision-language models. We evaluate Qwen2.5 and Qwen2.5-VL on a controlled spatial n-back task presented as matched text-rendered or image-rendered grids. Across conditions, models show reliably higher accuracy and d' with text than with vision. To interpret these differences at the process level, we use trial-wise log-probability evidence and find that nominal 2/3-back often fails to reflect the instructed lag and instead aligns with a recency-locked comparison. We further show that grid size alters recent-repeat structure in the stimulus stream, thereby changing interference and error patterns. These results motivate computation-sensitive interpretations of multimodal working memory.
摘要：工作记忆是智能行为的核心组成部分，为维护和更新任务相关信息提供动态工作空间。最近的工作使用 n-back 任务来探测大型语言模型中的类似工作记忆的行为，但尚不清楚当信息以视觉语言模型中的视觉代码而不是文本代码携带时，相同的探测是否会引发类似的计算。我们在受控空间 n-back 任务上评估 Qwen2.5 和 Qwen2.5-VL，该任务呈现为匹配的文本渲染或图像渲染网格。在各种条件下，模型在文本上表现出比视觉上更高的准确度和d'。为了在流程层面解释这些差异，我们使用试验性对数概率证据，发现名义 2/3-back 通常无法反映指示的滞后，而是与新近度锁定比较一致。我们进一步表明，网格大小改变了刺激流中的最近重复结构，从而改变了干扰和错误模式。这些结果激发了对多模式工作记忆的计算敏感的解释。
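
As a reading aid for the d' measure reported in the abstract above: in signal detection theory, d' is the difference between the z-transformed hit rate and the z-transformed false-alarm rate. The sketch below is not taken from the paper. The function name, the example counts, and the log-linear correction (adding 0.5 to each cell) are assumptions made for illustration.

```python
from statistics import NormalDist


def d_prime(hits, misses, false_alarms, correct_rejections):
    """Sensitivity index d' = z(hit rate) - z(false-alarm rate).

    Hypothetical helper: the paper does not state how it corrects
    extreme rates, so a log-linear correction is assumed here.
    """
    # Adding 0.5 per cell keeps both rates strictly inside (0, 1),
    # where the inverse normal CDF is defined.
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)


if __name__ == "__main__":
    # Made-up counts for one n-back condition (not data from the paper).
    print(round(d_prime(hits=18, misses=2, false_alarms=3, correct_rejections=17), 3))
```

A d' of zero means targets and non-targets are indistinguishable to the responder; larger values mean better discrimination independent of response bias.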

Title: Beyond Rejection Sampling: Trajectory Fusion for Scaling Mathematical Reasoning

Authors: Jie Deng, Hanshuang Tong, Jun Li, Shining Liang, Ning Wu, Hongzhi Li, Yutao Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04391
Pdf URL: https://arxiv.org/pdf/2602.04391
Copy Paste: [[2602.04391]] Beyond Rejection Sampling: Trajectory Fusion for Scaling Mathematical Reasoning(https://arxiv.org/abs/2602.04391)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have made impressive strides in mathematical reasoning, often fine-tuned using rejection sampling that retains only correct reasoning trajectories. While effective, this paradigm treats supervision as a binary filter that systematically excludes teacher-generated errors, leaving a gap in how reasoning failures are modeled during training. In this paper, we propose TrajFusion, a fine-tuning strategy that reframes rejection sampling as a structured supervision construction process. Specifically, TrajFusion forms fused trajectories that explicitly model trial-and-error reasoning by interleaving selected incorrect trajectories with reflection prompts and correct trajectories. The length of each fused sample is adaptively controlled based on the frequency and diversity of teacher errors, providing richer supervision for challenging problems while safely reducing to vanilla rejection sampling fine-tuning (RFT) when error signals are uninformative. TrajFusion requires no changes to the architecture or training objective. Extensive experiments across multiple math benchmarks demonstrate that TrajFusion consistently outperforms RFT, particularly on challenging and long-form reasoning problems.
摘要：大型语言模型 (LLM) 在数学推理方面取得了令人印象深刻的进步，通常使用仅保留正确推理轨迹的拒绝采样进行微调。虽然有效，但这种范式将监督视为二元过滤器，系统地排除教师产生的错误，从而在训练过程中如何对推理失败进行建模方面留下了空白。在本文中，我们提出了 TrajFusion，一种微调策略，将拒绝采样重新构建为结构化监督构建过程。具体来说，TrajFusion 形成融合轨迹，通过将选定的不正确轨迹与反射提示和正确轨迹交错来显式模拟试错推理。每个融合样本的长度根据教师错误的频率和多样性进行自适应控制，为具有挑战性的问题提供更丰富的监督，同时在错误信号信息不丰富时安全地减少到普通拒绝采样微调（RFT）。 TrajFusion 不需要更改架构或培训目标。跨多个数学基准的大量实验表明，TrajFusion 始终优于 RFT，特别是在具有挑战性的长形式推理问题上。
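
The fused-trajectory idea in the abstract above (incorrect attempts interleaved with reflection prompts, followed by a correct trajectory) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name, the de-duplication step, the `max_errors` cap standing in for the adaptive length control, and the newline-joined text format are all hypothetical.

```python
def fuse_trajectories(incorrect, correct, reflection_prompt, max_errors):
    """Build one training sample from teacher trajectories.

    incorrect: list of wrong reasoning traces (strings)
    correct: one correct reasoning trace (string)
    max_errors: assumed cap on how many distinct wrong traces to include
    """
    # Keep only distinct wrong attempts, preserving their original order.
    distinct_errors = list(dict.fromkeys(incorrect))
    parts = []
    for attempt in distinct_errors[:max_errors]:
        parts.append(attempt)            # the failed attempt
        parts.append(reflection_prompt)  # cue to reconsider before retrying
    parts.append(correct)                # always end on the correct trace
    return "\n".join(parts)


if __name__ == "__main__":
    sample = fuse_trajectories(
        incorrect=["2 + 2 * 3 = 12", "2 + 2 * 3 = 12"],
        correct="2 + 2 * 3 = 2 + 6 = 8",
        reflection_prompt="Wait, let me re-check the order of operations.",
        max_errors=2,
    )
    print(sample)
```

With no incorrect trajectories (or `max_errors=0`) the sample is just the correct trace, which mirrors the abstract's statement that the method reduces to plain rejection sampling fine-tuning when error signals are uninformative.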

Title: Evaluating the Presence of Sex Bias in Clinical Reasoning by Large Language Models

Authors: Isabel Tsintsiper, Sheng Wong, Beth Albert, Shaun P Brennecke, Gabriel Davis Jones
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04392
Pdf URL: https://arxiv.org/pdf/2602.04392
Copy Paste: [[2602.04392]] Evaluating the Presence of Sex Bias in Clinical Reasoning by Large Language Models(https://arxiv.org/abs/2602.04392)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) are increasingly embedded in healthcare workflows for documentation, education, and clinical decision support. However, these systems are trained on large text corpora that encode existing biases, including sex disparities in diagnosis and treatment, raising concerns that such patterns may be reproduced or amplified. We systematically examined whether contemporary LLMs exhibit sex-specific biases in clinical reasoning and how model configuration influences these behaviours. We conducted three experiments using 50 clinician-authored vignettes spanning 44 specialties in which sex was non-informative to the initial diagnostic pathway. We evaluated four general-purpose LLMs: ChatGPT (gpt-4o-mini), Claude 3.7 Sonnet, Gemini 2.0 Flash and DeepSeekchat. All models demonstrated significant sex-assignment skew, with predicted sex differing by model. At temperature 0.5, ChatGPT assigned female sex in 70% of cases (95% CI 0.66-0.75), DeepSeek in 61% (0.57-0.65) and Claude in 59% (0.55-0.63), whereas Gemini showed a male skew, assigning a female sex in 36% of cases (0.32-0.41). Contemporary LLMs exhibit stable, model-specific sex biases in clinical reasoning. Permitting abstention reduces explicit labelling but does not eliminate downstream diagnostic differences. Safe clinical integration requires conservative and documented configuration, specialty-level clinical data auditing, and continued human oversight when deploying general-purpose models in healthcare settings.
摘要：大型语言模型 (LLM) 越来越多地嵌入医疗保健工作流程中，用于文档、教育和临床决策支持。然而，这些系统是在大型文本语料库上进行训练的，这些语料库编码了现有的偏见，包括诊断和治疗中的性别差异，引发了人们对此类模式可能被复制或放大的担忧。我们系统地研究了当代 LLM 在临床推理中是否表现出性别特异性偏差，以及模型配置如何影响这些行为。我们使用涵盖 44 个专业的 50 个由临床医生撰写的病例情景进行了三项实验，其中性别对于最初的诊断途径来说并不提供信息。我们评估了四个通用 LLM：ChatGPT (gpt-4o-mini)、Claude 3.7 Sonnet、Gemini 2.0 Flash 和 DeepSeekchat。所有模型都表现出显著的性别分配偏差，预测的性别因模型而异。在温度 0.5 时，ChatGPT 在 70% 的病例中指定了女性性别（95% CI 0.66-0.75），DeepSeek 为 61%（0.57-0.65），Claude 为 59%（0.55-0.63），而 Gemini 显示出男性倾向，在 36% 的病例中指定了女性性别（0.32-0.41）。当代 LLM 在临床推理中表现出稳定的、模型特定的性别偏见。允许弃权可以减少明确的标签，但不能消除下游诊断差异。在医疗保健环境中部署通用模型时，安全的临床集成需要保守且记录在案的配置、专业级临床数据审核以及持续的人工监督。

Title: Bi-directional Bias Attribution: Debiasing Large Language Models without Modifying Prompts

Authors: Yujie Lin, Kunquan Li, Yixuan Liao, Xiaoxin Chen, Jinsong Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.04398
Pdf URL: https://arxiv.org/pdf/2602.04398
Copy Paste: [[2602.04398]] Bi-directional Bias Attribution: Debiasing Large Language Models without Modifying Prompts(https://arxiv.org/abs/2602.04398)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated impressive capabilities across a wide range of natural language processing tasks. However, their outputs often exhibit social biases, raising fairness concerns. Existing debiasing methods, such as fine-tuning on additional datasets or prompt engineering, face scalability issues or compromise user experience in multi-turn interactions. To address these challenges, we propose a framework for detecting stereotype-inducing words and attributing neuron-level bias in LLMs, without the need for fine-tuning or prompt modification. Our framework first identifies stereotype-inducing adjectives and nouns via comparative analysis across demographic groups. We then attribute biased behavior to specific neurons using two attribution strategies based on integrated gradients. Finally, we mitigate bias by directly intervening on their activations at the projection layer. Experiments on three widely used LLMs demonstrate that our method effectively reduces bias while preserving overall model performance. Code is available at the github link: this https URL.
摘要：大型语言模型 (LLM) 在广泛的自然语言处理任务中展示了令人印象深刻的能力。然而，其输出往往表现出社会偏见，引发公平性问题。现有的去偏方法，例如在附加数据集上进行微调或提示工程，面临可扩展性问题或损害多轮交互中的用户体验。为了应对这些挑战，我们提出了一个框架，用于检测诱发刻板印象的词语并对 LLM 中的神经元级偏差进行归因，而无需微调或修改提示。我们的框架首先通过跨人口群体的比较分析来识别诱发刻板印象的形容词和名词。然后，我们使用两种基于积分梯度的归因策略将偏见行为归因于特定神经元。最后，我们通过直接干预这些神经元在投影层的激活来减轻偏差。在三个广泛使用的 LLM 上的实验表明，我们的方法有效地减少了偏差，同时保持了模型的整体性能。代码可在 github 链接中找到：此 https URL。
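
The abstract above attributes biased behavior to neurons using integrated gradients. The paper's neuron-level procedure is not described in this listing, so the following is only a generic sketch of integrated gradients on a toy scalar function, using a central-difference gradient and a midpoint Riemann sum. The toy function, the zero baseline, and the step count are assumptions chosen for illustration.

```python
def integrated_gradients(f, x, baseline, steps=200, eps=1e-5):
    """Approximate integrated gradients of scalar function f at input x.

    attribution_i = (x_i - baseline_i) * average of df/dx_i along the
    straight path from baseline to x.
    """
    n = len(x)
    attributions = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th path segment
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i in range(n):
            up = list(point)
            down = list(point)
            up[i] += eps
            down[i] -= eps
            grad_i = (f(up) - f(down)) / (2.0 * eps)  # central difference
            attributions[i] += grad_i * (x[i] - baseline[i]) / steps
    return attributions


if __name__ == "__main__":
    # Toy model (hypothetical): quadratic in the first input, linear in the second.
    toy = lambda v: v[0] ** 2 + 3.0 * v[1]
    attr = integrated_gradients(toy, x=[2.0, 1.0], baseline=[0.0, 0.0])
    print(attr, sum(attr), toy([2.0, 1.0]) - toy([0.0, 0.0]))
```

The attributions sum (approximately) to the difference between the function value at the input and at the baseline, which is the completeness property that makes the method usable for ranking contributing units.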

Title: Swordsman: Entropy-Driven Adaptive Block Partition for Efficient Diffusion Language Models

Authors: Yu Zhang, Xinchen Li, Jialei Zhou, Hongnan Ma, Zhongwei Wan, Yiwei Shi, Duoqian Miao, Qi Zhang, Longbing Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04399
Pdf URL: https://arxiv.org/pdf/2602.04399
Copy Paste: [[2602.04399]] Swordsman: Entropy-Driven Adaptive Block Partition for Efficient Diffusion Language Models(https://arxiv.org/abs/2602.04399)
Keywords: language model
Abstract: Block-wise decoding effectively improves the inference speed and quality in diffusion language models (DLMs) by combining inter-block sequential denoising and intra-block parallel unmasking. However, existing block-wise decoding methods typically partition blocks in a rigid and fixed manner, which inevitably fragments complete semantic or syntactic constituents, leading to suboptimal performance. Inspired by the entropy reduction hypothesis (ERH), we recognize that constituent boundaries offer greater opportunities for uncertainty reduction, which motivates us to employ entropy analysis for identifying constituent boundaries. Therefore, we propose Swordsman, an entropy-driven adaptive block-wise decoding framework for DLMs. Swordsman adaptively partitions blocks by identifying entropy shifts between adjacent tokens to better align with semantic or syntactic constituent boundaries. In addition, Swordsman dynamically adjusts unmasking thresholds conditioned on the real-time unmasking status within a block, further improving both efficiency and stability. As a training-free framework, supported by KV Cache, Swordsman demonstrates state-of-the-art performance across extensive evaluations.
摘要：分块解码通过结合块间顺序去噪和块内并行去掩码，有效提高了扩散语言模型（DLM）的推理速度和质量。然而，现有的分块解码方法通常以僵化且固定的方式划分块，这不可避免地会割裂完整的语义或句法成分，导致性能不佳。受熵减少假说（ERH）的启发，我们认识到成分边界为减少不确定性提供了更大的机会，这促使我们采用熵分析来识别成分边界。因此，我们提出了 Swordsman，一种用于 DLM 的熵驱动的自适应分块解码框架。Swordsman 通过识别相邻标记之间的熵变化来自适应地划分块，以更好地与语义或句法成分边界对齐。此外，Swordsman 还根据块内的实时去掩码状态动态调整去掩码阈值，进一步提高效率和稳定性。作为一个由 KV Cache 支持的免训练框架，Swordsman 在广泛的评估中展示了最先进的性能。
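
The entropy-based partitioning described in the abstract above can be illustrated with a small sketch. The abstract only says that blocks are split by identifying entropy shifts between adjacent tokens; the fixed `threshold` rule, the function names, and the toy distributions below are assumptions for illustration, not the Swordsman algorithm itself.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def partition_by_entropy_shift(distributions, threshold):
    """Split positions into blocks wherever entropy jumps between neighbours.

    Returns a list of (start, end) index pairs, end exclusive.
    """
    entropies = [token_entropy(d) for d in distributions]
    blocks, start = [], 0
    for i in range(1, len(entropies)):
        # A large change in uncertainty is treated as a candidate boundary.
        if abs(entropies[i] - entropies[i - 1]) > threshold:
            blocks.append((start, i))
            start = i
    if entropies:
        blocks.append((start, len(entropies)))
    return blocks


if __name__ == "__main__":
    peaked = [0.97, 0.01, 0.01, 0.01]   # confident prediction
    flat = [0.25, 0.25, 0.25, 0.25]     # maximally uncertain prediction
    print(partition_by_entropy_shift([peaked, peaked, flat, flat, peaked], 0.5))
```

In contrast to a fixed block size, the boundaries here move with the model's own uncertainty profile, which is the intuition the abstract ties to constituent boundaries.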

Title: History-Guided Iterative Visual Reasoning with Self-Correction

Authors: Xinglong Yang, Zhilin Peng, Zhanzhan Liu, Haochen Shi, Sheng-Jun Huang
Subjects: cs.CL, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2602.04413
Pdf URL: https://arxiv.org/pdf/2602.04413
Copy Paste: [[2602.04413]] History-Guided Iterative Visual Reasoning with Self-Correction(https://arxiv.org/abs/2602.04413)
Keywords: language model, llm
Abstract: Self-consistency methods are the core technique for improving the reasoning reliability of multimodal large language models (MLLMs). By generating multiple reasoning results through repeated sampling and selecting the best answer via voting, they play an important role in cross-modal tasks. However, most existing self-consistency methods are limited to a fixed "repeated sampling and voting" paradigm and do not reuse historical reasoning information. As a result, models struggle to actively correct visual understanding errors and dynamically adjust their reasoning during iteration. Inspired by the human reasoning behavior of repeated verification and dynamic error correction, we propose the H-GIVR framework. During iterative reasoning, the MLLM observes the image multiple times and uses previously generated answers as references for subsequent steps, enabling dynamic correction of errors and improving answer accuracy. We conduct comprehensive experiments on five datasets and three models. The results show that the H-GIVR framework can significantly improve cross-modal reasoning accuracy while maintaining low computational cost. For instance, using Llama3.2-vision:11b on the ScienceQA dataset, the model requires an average of 2.57 responses per question to achieve an accuracy of 78.90%, representing a 107% improvement over the baseline.
摘要：自一致性方法是提高多模态大语言模型（MLLM）推理可靠性的核心技术。通过重复采样生成多个推理结果并通过投票选择最佳答案，它们在跨模态任务中发挥着重要作用。然而，大多数现有的自一致性方法仅限于固定的“重复采样和投票”范式，并且不重用历史推理信息。因此，模型很难主动纠正视觉理解错误并在迭代过程中动态调整其推理。受到重复验证和动态纠错的人类推理行为的启发，我们提出了 H-GIVR 框架。在迭代推理过程中，MLLM 会多次观察图像，并使用先前生成的答案作为后续步骤的参考，从而能够动态纠正错误并提高答案准确性。我们对五个数据集和三个模型进行了全面的实验。结果表明，H-GIVR 框架可以显著提高跨模态推理精度，同时保持较低的计算成本。例如，在 ScienceQA 数据集上使用 Llama3.2-vision:11b，该模型平均每个问题需要 2.57 个响应才能达到 78.90% 的准确度，比基线提高了 107%。

Title: Fine-Grained Activation Steering: Steering Less, Achieving More

Authors: Zijian Feng, Tianjiao Li, Zixiao Zhu, Hanzhang Zhou, Junlang Qian, Li Zhang, Jia Jim Deryl Chua, Lee Onn Mak, Gee Wah Ng, Kezhi Mao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04428
Pdf URL: https://arxiv.org/pdf/2602.04428
Copy Paste: [[2602.04428]] Fine-Grained Activation Steering: Steering Less, Achieving More(https://arxiv.org/abs/2602.04428)
Keywords: language model, llm
Abstract: Activation steering has emerged as a cost-effective paradigm for modifying large language model (LLM) behaviors. Existing methods typically intervene at the block level, steering the bundled activations of selected attention heads, feedforward networks, or residual streams. However, we reveal that block-level activations are inherently heterogeneous, entangling beneficial, irrelevant, and harmful features, thereby rendering block-level steering coarse, inefficient, and intrusive. To investigate the root cause, we decompose block activations into fine-grained atomic unit (AU)-level activations, where each AU-level activation corresponds to a single dimension of the block activation, and each AU denotes a slice of the block weight matrix. Steering an AU-level activation is thus equivalent to steering its associated AU. Our theoretical and empirical analyses show that heterogeneity arises because different AUs or dimensions control distinct token distributions in LLM outputs. Hence, block-level steering inevitably moves helpful and harmful token directions together, which reduces efficiency. Restricting intervention to beneficial AUs yields more precise and effective steering. Building on this insight, we propose AUSteer, a simple and efficient method that operates at the finer granularity of the AU level. AUSteer first identifies discriminative AUs globally by computing activation momenta on contrastive samples. It then assigns adaptive steering strengths tailored to diverse inputs and selected AU activations. Comprehensive experiments on multiple LLMs and tasks show that AUSteer consistently surpasses advanced baselines while steering considerably fewer activations, demonstrating that steering less achieves more.
摘要：激活引导已成为修改大型语言模型 (LLM) 行为的经济有效的范式。现有的方法通常在块级别进行干预，引导选定的注意力头、前馈网络或残差流的捆绑激活。然而，我们发现块级激活本质上是异构的，纠缠着有益的、不相关的和有害的特征，从而使块级引导变得粗糙、低效和侵入性。为了研究根本原因，我们将块激活分解为细粒度原子单元（AU）级激活，其中每个 AU 级激活对应于块激活的单个维度，每个 AU 表示块权重矩阵的一个切片。因此，引导 AU 级别的激活相当于引导其关联的 AU。我们的理论和实证分析表明，异质性的出现是因为不同的 AU 或维度控制着 LLM 输出中不同的 token 分布。因此，块级引导不可避免地将有益和有害的 token 方向一起移动，从而降低了效率。将干预限制在有益的 AU 上可以产生更精确、更有效的引导。基于这一见解，我们提出了 AUSteer，这是一种简单而高效的方法，可以在 AU 级别的更细粒度上运行。AUSteer 首先通过计算对比样本上的激活动量来全局识别具有判别性的 AU。然后，它会根据不同的输入和选定的 AU 激活分配自适应的引导强度。在多个 LLM 和任务上的综合实验表明，AUSteer 始终超越先进的基线，同时引导的激活数量大大减少，这表明更少的引导可以实现更多的目标。

Title: No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data

Authors: Dmitry Karpov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.04442
Pdf URL: https://arxiv.org/pdf/2602.04442
Copy Paste: [[2602.04442]] No One-Size-Fits-All: Building Systems For Translation to Bashkir, Kazakh, Kyrgyz, Tatar and Chuvash Using Synthetic And Original Data(https://arxiv.org/abs/2602.04442)
Keywords: prompt
Abstract: We explore machine translation for five Turkic language pairs: Russian-Bashkir, Russian-Kazakh, Russian-Kyrgyz, English-Tatar, English-Chuvash. Fine-tuning nllb-200-distilled-600M with LoRA on synthetic data achieved chrF++ 49.71 for Kazakh and 46.94 for Bashkir. Prompting DeepSeek-V3.2 with retrieved similar examples achieved chrF++ 39.47 for Chuvash. For Tatar, zero-shot or retrieval-based approaches achieved chrF++ 41.6, while for Kyrgyz the zero-shot approach reached 45.6. We release the dataset and the obtained weights.
摘要：我们探索五种突厥语对的机器翻译：俄语-巴什基尔语、俄语-哈萨克语、俄语-吉尔吉斯语、英语-鞑靼语、英语-楚瓦什语。使用 LoRA 对合成数据进行微调 nllb-200-distilled-600M，哈萨克语 chrF++ 为 49.71，巴什基尔语为 46.94。使用检索到的类似示例提示 DeepSeek-V3.2，Chuvash 的 chrF++ 达到了 39.47。对于鞑靼人来说，零样本或基于检索的方法达到了 chrF++ 41.6，而对于吉尔吉斯斯坦来说，零样本方法达到了 45.6。我们发布数据集和获得的权重。

Title: Is Micro Domain-Adaptive Pre-Training Effective for Real-World Operations? Multi-Step Evaluation Reveals Potential and Bottlenecks

Authors: Masaya Tsunokake, Yuta Koreeda, Terufumi Morishita, Koichi Nagatsuka, Hikaru Tomonari, Yasuhiro Sogawa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.04466
Pdf URL: https://arxiv.org/pdf/2602.04466
Copy Paste: [[2602.04466]] Is Micro Domain-Adaptive Pre-Training Effective for Real-World Operations? Multi-Step Evaluation Reveals Potential and Bottlenecks(https://arxiv.org/abs/2602.04466)
Keywords: llm
Abstract: When applying LLMs to real-world enterprise operations, LLMs need to handle proprietary knowledge in small domains of specific operations (micro domains). A previous study shows micro domain-adaptive pre-training (mDAPT) with fewer documents is effective, similarly to DAPT in larger domains. However, it evaluates mDAPT only on multiple-choice questions; thus, its effectiveness for generative tasks in real-world operations remains unknown. We aim to reveal the potential and bottlenecks of mDAPT for generative tasks. To this end, we disentangle the answering process into three subtasks and evaluate the performance of each subtask: (1) eliciting facts relevant to questions from an LLM's own knowledge, (2) reasoning over the facts to obtain conclusions, and (3) composing long-form answers based on the conclusions. We verified mDAPT on proprietary IT product knowledge for real-world questions in IT technical support operations. As a result, mDAPT resolved the elicitation task that the base model struggled with but did not resolve other subtasks. This clarifies mDAPT's effectiveness in the knowledge aspect and its bottlenecks in other aspects. Further analysis empirically shows that resolving the elicitation and reasoning tasks ensures sufficient performance (over 90%), emphasizing the need to enhance reasoning capability.
摘要：当将 LLM 应用于现实世界的企业运营时，LLM 需要处理特定运营的小领域（微领域）中的专有知识。之前的一项研究表明，文档较少的微领域自适应预训练 (mDAPT) 是有效的，类似于较大领域中的 DAPT。然而，它仅在多项选择题上评估 mDAPT；因此，它在现实世界运营中对生成任务的有效性仍然未知。我们的目标是揭示 mDAPT 在生成任务中的潜力和瓶颈。为此，我们将回答过程分解为三个子任务，并评估每个子任务的表现：（1）从 LLM 自身的知识中引出（eliciting）与问题相关的事实，（2）对事实进行推理（reasoning）以获得结论，以及（3）基于结论撰写（composing）长篇答案。我们针对 IT 技术支持运营中的实际问题，在专有 IT 产品知识上验证了 mDAPT。结果，mDAPT 解决了基础模型难以完成的引出任务，但没有解决其他子任务。这阐明了 mDAPT 在知识方面的有效性以及在其他方面的瓶颈。进一步的实证分析表明，解决引出和推理任务可以确保足够的性能（超过 90%），强调需要增强推理能力。

Title: Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition

Authors: Jinlong Ma, Yu Zhang, Xuefeng Bai, Kehai Chen, Yuwei Wang, Zeming Liu, Jun Yu, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04486
Pdf URL: https://arxiv.org/pdf/2602.04486
Copy Paste: [[2602.04486]] Beyond Unimodal Shortcuts: MLLMs as Cross-Modal Reasoners for Grounded Named Entity Recognition(https://arxiv.org/abs/2602.04486)
Keywords: language model, llm
Abstract: Grounded Multimodal Named Entity Recognition (GMNER) aims to extract text-based entities, assign them semantic categories, and ground them to corresponding visual regions. In this work, we explore the potential of Multimodal Large Language Models (MLLMs) to perform GMNER in an end-to-end manner, moving beyond their typical role as auxiliary tools within cascaded pipelines. Crucially, our investigation reveals a fundamental challenge: MLLMs exhibit modality bias, including visual bias and textual bias, which stems from their tendency to take unimodal shortcuts rather than rigorous cross-modal verification. To address this, we propose Modality-aware Consistency Reasoning (MCR), which enforces structured cross-modal reasoning through Multi-style Reasoning Schema Injection (MRSI) and Constraint-guided Verifiable Optimization (CVO). MRSI transforms abstract constraints into executable reasoning chains, while CVO empowers the model to dynamically align its reasoning trajectories with Group Relative Policy Optimization (GRPO). Experiments on GMNER and visual grounding tasks demonstrate that MCR effectively mitigates modality bias and achieves superior performance compared to existing baselines.
摘要：接地多模态命名实体识别（GMNER）旨在提取基于文本的实体，为其分配语义类别，并将它们接地到相应的视觉区域。在这项工作中，我们探索了多模态大型语言模型 (MLLM) 以端到端方式执行 GMNER 的潜力，超越了它们作为级联管道中辅助工具的典型角色。至关重要的是，我们的调查揭示了一个根本性的挑战：MLLM 表现出模态偏差，包括视觉偏差和文本偏差，这源于它们倾向于采用单模态捷径而不是严格的跨模态验证。为了解决这个问题，我们提出了模态感知一致性推理（MCR），它通过多风格推理模式注入（MRSI）和约束引导的可验证优化（CVO）来强制结构化跨模态推理。MRSI 将抽象约束转化为可执行的推理链，而 CVO 使模型能够动态地将其推理轨迹与组相对策略优化 (GRPO) 对齐。GMNER 和视觉基础任务的实验表明，与现有基线相比，MCR 有效减轻了模态偏差并实现了卓越的性能。

Title: Deconstructing sentence disambiguation by joint latent modeling of reading paradigms: LLM surprisal is not enough

Authors: Dario Paape, Tal Linzen, Shravan Vasishth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04489
Pdf URL: https://arxiv.org/pdf/2602.04489
Copy Paste: [[2602.04489]] Deconstructing sentence disambiguation by joint latent modeling of reading paradigms: LLM surprisal is not enough(https://arxiv.org/abs/2602.04489)
Keywords: gpt, llm
Abstract: Using temporarily ambiguous garden-path sentences ("While the team trained the striker wondered ...") as a test case, we present a latent-process mixture model of human reading behavior across four different reading paradigms (eye tracking, uni- and bidirectional self-paced reading, Maze). The model distinguishes between garden-path probability, garden-path cost, and reanalysis cost, and yields more realistic processing cost estimates by taking into account trials with inattentive reading. We show that the model is able to reproduce empirical patterns with regard to rereading behavior, comprehension question responses, and grammaticality judgments. Cross-validation reveals that the mixture model also has better predictive fit to human reading patterns and end-of-trial task data than a mixture-free model based on GPT-2-derived surprisal values. We discuss implications for future work.

Title: PersoDPO: Scalable Preference Optimization for Instruction-Adherent, Persona-Grounded Dialogue via Multi-LLM Evaluation

Authors: Saleh Afzoon, MohammadHossein Ahmadi, Usman Naseem, Amin Beheshti
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2602.04493
Pdf URL: https://arxiv.org/pdf/2602.04493
Copy Paste: [[2602.04493]] PersoDPO: Scalable Preference Optimization for Instruction-Adherent, Persona-Grounded Dialogue via Multi-LLM Evaluation(https://arxiv.org/abs/2602.04493)
Keywords: language model, llm
Abstract: Personalization and contextual coherence are two essential components in building effective persona-grounded dialogue systems. These aspects play a crucial role in enhancing user engagement and ensuring responses are more relevant and consistent with user identity. However, recent studies indicate that open-source large language models (LLMs) continue to struggle to generate responses that are both contextually grounded and aligned with persona cues, despite exhibiting strong general conversational abilities like fluency and naturalness. We present PersoDPO, a scalable preference optimisation framework that uses supervision signals from automatic evaluations of responses generated by both closed-source and open-source LLMs to fine-tune dialogue models. The framework integrates evaluation metrics targeting coherence and personalization, along with a length-format compliance feature to promote instruction adherence. These signals are combined to automatically construct high-quality preference pairs without manual annotation, enabling a scalable and reproducible training pipeline. Experiments on the FoCus dataset show that an open-source language model fine-tuned with the PersoDPO framework consistently outperforms strong open-source baselines and a standard Direct Preference Optimization (DPO) variant across multiple evaluation dimensions.

Title: Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models

Authors: Hyeontaek Hwang, Nguyen Dinh Son, Daeyoung Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04509
Pdf URL: https://arxiv.org/pdf/2602.04509
Copy Paste: [[2602.04509]] Model-Dowser: Data-Free Importance Probing to Mitigate Catastrophic Forgetting in Multimodal Large Language Models(https://arxiv.org/abs/2602.04509)
Keywords: language model, llm
Abstract: Fine-tuning Multimodal Large Language Models (MLLMs) on task-specific data is an effective way to improve performance on downstream applications. However, such adaptation often leads to a degradation in generalization on pretrained tasks, a phenomenon known as Catastrophic Forgetting. Existing methods that aim to mitigate this issue either become ineffective when fine-tuning deeper layers of the language decoder or scale poorly with increasing model size. To address these limitations, we propose Model-Dowser, a novel sparse fine-tuning approach for MLLMs. Model-Dowser measures a principled importance score for each model parameter with respect to pretrained generalization (prior to downstream adaptation) by jointly considering weight magnitudes, input activations, and output sensitivities. During fine-tuning, Model-Dowser selectively preserves high-importance parameters and updates the remaining ones. Comprehensive experiments on two representative MLLMs, LLaVA and NVILA, demonstrate that Model-Dowser effectively mitigates catastrophic forgetting and consistently outperforms prior methods, while remaining resource-efficient and scalable to multi-billion-parameter models.

Title: $C$-$ΔΘ$: Circuit-Restricted Weight Arithmetic for Selective Refusal

Authors: Aditya Kasliwal, Pratinav Seth, Vinay Kumar Sankarapu
Subjects: cs.CL, cs.ET
Abstract URL: https://arxiv.org/abs/2602.04521
Pdf URL: https://arxiv.org/pdf/2602.04521
Copy Paste: [[2602.04521]] $C$-$ΔΘ$: Circuit-Restricted Weight Arithmetic for Selective Refusal(https://arxiv.org/abs/2602.04521)
Keywords: llm
Abstract: Modern deployments require LLMs to enforce safety policies at scale, yet many controls rely on inference-time interventions that add recurring compute cost and serving complexity. Activation steering is widely used, but it requires runtime hooks and scales cost with the number of generations; conditional variants improve selectivity by gating when steering is applied but still retain an inference-time control path. We ask whether selective refusal can be moved entirely offline: can a mechanistic understanding of category-specific refusal be distilled into a circuit-restricted weight update that deploys as a standard checkpoint? We propose $C$-$\Delta\theta$: Circuit-Restricted Weight Arithmetic, which (i) localizes refusal-causal computation as a sparse circuit using EAP-IG and (ii) computes a constrained weight update $\Delta\theta_C$ supported only on that circuit (typically <5% of parameters). Applying $\Delta\theta_C$ yields a drop-in edited checkpoint with no inference-time hooks, shifting cost from per-request intervention to a one-time offline update. We evaluate category-targeted selectivity and capability retention on refusal and utility benchmarks.
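The idea of a weight update "supported only on that circuit" can be illustrated with a minimal sketch. It assumes the circuit is already available as a boolean mask over a weight tensor and that a dense update direction has been computed elsewhere; it does not implement EAP-IG or the paper's constrained optimization, and the names `apply_circuit_restricted_update`, `circuit_mask`, and `delta` are hypothetical.

```python
import numpy as np

def apply_circuit_restricted_update(theta, delta, circuit_mask):
    """Return edited weights in which delta is applied only where the mask is True."""
    if theta.shape != delta.shape or theta.shape != circuit_mask.shape:
        raise ValueError("theta, delta and circuit_mask must share a shape")
    return theta + np.where(circuit_mask, delta, 0.0)

rng = np.random.default_rng(0)
theta = rng.normal(size=(4, 5))          # toy weight matrix
delta = rng.normal(size=(4, 5))          # hypothetical dense update direction
circuit_mask = np.zeros((4, 5), dtype=bool)
circuit_mask[1, 2] = True                # hypothetical: one weight belongs to the circuit

edited = apply_circuit_restricted_update(theta, delta, circuit_mask)
changed_fraction = np.mean(edited != theta)  # share of parameters that moved
```

Because the result is an ordinary weight array, it could be saved as a regular checkpoint, which is the property the abstract emphasizes: no hook is needed at inference time.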

Title: LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding

Authors: Gang Lin, Dongfang Li, Zhuoen Chen, Yukun Shi, Xuhui Chen, Baotian Hu, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.04541
Pdf URL: https://arxiv.org/pdf/2602.04541
Copy Paste: [[2602.04541]] LycheeDecode: Accelerating Long-Context LLM Inference via Hybrid-Head Sparse Decoding(https://arxiv.org/abs/2602.04541)
Keywords: language model, llm
Abstract: The proliferation of long-context large language models (LLMs) exposes a key bottleneck: the rapidly expanding key-value cache during decoding, which imposes heavy memory and latency costs. While recent approaches attempt to alleviate this by sharing a single set of crucial tokens across layers, such coarse-grained sharing undermines model performance by neglecting the functional diversity of attention heads. To address this, we propose LycheeDecode, an efficient decoding method centered on a fine-grained hybrid-head attention mechanism that employs a hardware-efficient top-k selection strategy. Specifically, the novel HardKuma-based mechanism partitions attention heads into a small subset of retrieval heads that dynamically identify crucial tokens and a majority of sparse heads that reuse them for efficient computation. Through extensive experiments on leading models like Llama3 and Qwen3 across diverse benchmarks for long-context understanding (e.g., LongBench, RULER) and complex reasoning (e.g., AIME24, OlympiadBench), we demonstrate that LycheeDecode achieves generative quality comparable to, and at times even surpassing, the full-attention baseline. Crucially, this is accomplished with up to a 2.7x speedup at a 128K context length. By preserving the functional diversity of attention heads, our fine-grained strategy overcomes the performance bottlenecks of existing methods, providing a powerful and validated pathway to both efficient and high-quality long-context LLM inference.
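The division of labor between retrieval heads and sparse heads can be sketched for a single decoding step as follows. This is an illustrative toy under stated assumptions, not the paper's implementation: the head partition is fixed by hand rather than learned with HardKuma, and `retrieval_head` and `sparse_head` are hypothetical names.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def retrieval_head(q, K, V, k):
    """Full attention over the cache; also returns the indices of the top-k tokens."""
    scores = K @ q / np.sqrt(q.shape[0])
    top_idx = np.argsort(scores)[-k:]
    return softmax(scores) @ V, top_idx

def sparse_head(q, K, V, token_idx):
    """Attention restricted to the tokens selected by a retrieval head."""
    scores = K[token_idx] @ q / np.sqrt(q.shape[0])
    return softmax(scores) @ V[token_idx]

rng = np.random.default_rng(0)
n_tokens, d = 64, 8
K_r, V_r = rng.normal(size=(n_tokens, d)), rng.normal(size=(n_tokens, d))
K_s, V_s = rng.normal(size=(n_tokens, d)), rng.normal(size=(n_tokens, d))
q_r, q_s = rng.normal(size=d), rng.normal(size=d)

# The retrieval head scores all 64 cached tokens; the sparse head reads only 8 of them.
_, shared_idx = retrieval_head(q_r, K_r, V_r, k=8)
out_sparse = sparse_head(q_s, K_s, V_s, shared_idx)
```

In this toy, the sparse head touches 8 of 64 cached entries, which is where the memory and latency savings described in the abstract would come from.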

Title: Rethinking Weight Tying: Pseudo-Inverse Tying for Stable LM Training and Updates

Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.04556
Pdf URL: https://arxiv.org/pdf/2602.04556
Copy Paste: [[2602.04556]] Rethinking Weight Tying: Pseudo-Inverse Tying for Stable LM Training and Updates(https://arxiv.org/abs/2602.04556)
Keywords: language model
Abstract: Weight tying is widely used in compact language models to reduce parameters by sharing the token table between the input embedding and the output projection. However, weight sharing does not guarantee a stable token interface: during training, the correspondence between encoding tokens into hidden states and decoding hidden states into logits can drift, worsening optimization sensitivity and making post-training interventions such as editing, patching, and lightweight adaptation less predictable. We propose Pseudo-Inverse Tying (PIT), which synchronizes embedding and unembedding as coupled projections of a shared latent token memory, guaranteeing a pseudo-inverse-consistent interface throughout training. PIT maintains an orthonormal shared memory, obtained by thin polar decomposition for teacher initialization or random orthonormal initialization from scratch, and introduces a fully learned symmetric positive definite hidden-space transform parameterized via a Cholesky factor. The output head applies this transform to hidden states before the vocabulary projection, while the embedding applies the inverse transform to token vectors using stable triangular solves, avoiding explicit pseudo-inverse recomputation and any vocabulary-sized auxiliary parameters. We evaluate PIT on on-device models spanning 256M-1.3B parameters across pretraining and adaptation, and consistently observe improved training stability, stronger layerwise semantic consistency, and substantially reduced side effects.
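One reading of the interface described in the abstract can be checked numerically. The sketch below assumes an orthonormal memory `M`, a transform `S = L L^T` built from a Cholesky factor `L`, an output head `M S h`, and an embedding `S^{-1} m`; these shapes and names are assumptions for illustration, not the authors' code. For brevity it uses `np.linalg.solve`, a general solver, where a dedicated triangular solver would be used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d = 50, 6

# Orthonormal shared memory M (vocab x d): its columns satisfy M.T @ M = I.
M, _ = np.linalg.qr(rng.normal(size=(vocab, d)))

# SPD hidden-space transform S = L @ L.T from a lower-triangular factor
# with a strictly positive diagonal.
L = np.tril(rng.normal(size=(d, d)), k=-1) + np.diag(np.exp(rng.normal(size=d)))
S = L @ L.T

def unembed(h):
    """Output head: transform the hidden state, then project onto the vocabulary."""
    return M @ (S @ h)

def embed(token_id):
    """Embedding: apply S^{-1} to the token vector via two solves with L."""
    y = np.linalg.solve(L, M[token_id])   # solves L y = m
    return np.linalg.solve(L.T, y)        # solves L^T e = y

U = M @ S                                               # unembedding matrix (vocab x d)
E = np.stack([embed(t) for t in range(vocab)], axis=1)  # embedding matrix (d x vocab)
```

Under these assumptions `E` equals the pseudo-inverse of `U` and `E @ U` is the identity, without ever forming a pseudo-inverse explicitly.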

Title: Textual Planning with Explicit Latent Transitions

Authors: Eliezer Shlomi, Ido Levy, Eilam Shapira, Michael Katz, Guy Uziel, Segev Shlomov, Nir Mashkif, Roi Reichart, Sarah Keren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04557
Pdf URL: https://arxiv.org/pdf/2602.04557
Copy Paste: [[2602.04557]] Textual Planning with Explicit Latent Transitions(https://arxiv.org/abs/2602.04557)
Keywords: llm
Abstract: Planning with LLMs is bottlenecked by token-by-token generation and repeated full forward passes, making multi-step lookahead and rollout-based search expensive in latency and compute. We propose EmbedPlan, which replaces autoregressive next-state generation with a lightweight transition model operating in a frozen language embedding space. EmbedPlan encodes natural language state and action descriptions into vectors, predicts the next-state embedding, and retrieves the next state by nearest-neighbor similarity, enabling fast planning computation without fine-tuning the encoder. We evaluate next-state prediction across nine classical planning domains using six evaluation protocols of increasing difficulty: interpolation, plan-variant, extrapolation, multi-domain, cross-domain, and leave-one-out. Results show near-perfect interpolation performance but a sharp degradation when generalization requires transfer to unseen problems or unseen domains; plan-variant evaluation indicates generalization to alternative plans rather than memorization of seen trajectories. Overall, frozen embeddings support within-domain dynamics learning after observing a domain's transitions, while transfer across domain boundaries remains a bottleneck.
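The encode, predict, and retrieve loop can be sketched on a toy domain. Everything here is a stand-in chosen for illustration: fixed random unit vectors replace the frozen text encoder, a least-squares linear map replaces the paper's transition model, and the state and action names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
states = ["at-A", "at-B", "at-C", "at-D"]
actions = ["forward", "back"]

def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-in for a frozen encoder: one fixed unit vector per description.
state_emb = unit(rng.normal(size=(len(states), d)))
action_emb = unit(rng.normal(size=(len(actions), d)))

# Observed transitions as (state, action, next_state) indices.
transitions = [(0, 0, 1), (1, 0, 2), (2, 0, 3), (3, 1, 2)]
X = np.stack([np.concatenate([state_emb[s], action_emb[a]]) for s, a, _ in transitions])
Y = np.stack([state_emb[n] for _, _, n in transitions])

# Lightweight transition model: a linear map fitted by least squares.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def predict_next_state(s, a):
    """Predict the next-state embedding, then retrieve the nearest known state."""
    pred = np.concatenate([state_emb[s], action_emb[a]]) @ W
    sims = state_emb @ unit(pred)         # cosine similarity to every known state
    return int(np.argmax(sims))

predicted = [predict_next_state(s, a) for s, a, _ in transitions]
```

This only exercises the interpolation setting (transitions seen during fitting); the abstract reports that performance drops sharply once unseen problems or domains are involved.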

Title: Can LLMs capture stable human-generated sentence entropy measures?

Authors: Estrella Pivel-Villanueva, Elisabeth Frederike Sterner, Franziska Knolle
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04570
Pdf URL: https://arxiv.org/pdf/2602.04570
Copy Paste: [[2602.04570]] Can LLMs capture stable human-generated sentence entropy measures?(https://arxiv.org/abs/2602.04570)
Keywords: language model, gpt, llm, prompt, chat
Abstract: Predicting upcoming words is a core mechanism of language comprehension and may be quantified using Shannon entropy. There is currently no empirical consensus on how many human responses are required to obtain stable and unbiased entropy estimates at the word level. Moreover, large language models (LLMs) are increasingly used as substitutes for human norming data, yet their ability to reproduce stable human entropy remains unclear. Here, we address both issues using two large publicly available cloze datasets in German and English. We implemented a bootstrap-based convergence analysis that tracks how entropy estimates stabilize as a function of sample size. Across both languages, more than 97% of sentences reached stable entropy estimates within the available sample sizes. 90% of sentences converged after 111 responses in German and 81 responses in English, while low-entropy sentences (<1) required as few as 20 responses and high-entropy sentences (>2.5) substantially more. These findings provide the first direct empirical validation for common norming practices and demonstrate that convergence critically depends on sentence predictability. We then compared stable human entropy values with entropy estimates derived from several LLMs, including GPT-4o (using both logit-based probability extraction and sampling-based frequency estimation), GPT2-xl/german-GPT-2, RoBERTa Base/GottBERT, and LLaMA 2 7B Chat. GPT-4o showed the highest correspondence with human data, although alignment depended strongly on the extraction method and prompt design. Logit-based estimates minimized absolute error, whereas sampling-based estimates were better at capturing the dispersion of human variability. Together, our results establish practical guidelines for human norming and show that while LLMs can approximate human entropy, they are not interchangeable with stable human-derived distributions.
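The two quantities at the core of this abstract, the Shannon entropy of cloze responses and its stability as the number of responses grows, can be sketched as follows. The response data are made up, and the paper's actual convergence criterion is not stated in the abstract, so the sketch only reports the mean and spread of bootstrap estimates per sample size.

```python
import math
import random
from collections import Counter

def shannon_entropy(responses):
    """Shannon entropy (bits) of the empirical distribution of cloze responses."""
    counts = Counter(responses)
    n = len(responses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def bootstrap_entropy_curve(responses, sample_sizes, n_boot=200, seed=0):
    """Mean and standard deviation of bootstrap entropy estimates per sample size."""
    rng = random.Random(seed)
    curve = {}
    for n in sample_sizes:
        estimates = [shannon_entropy(rng.choices(responses, k=n)) for _ in range(n_boot)]
        mean = sum(estimates) / n_boot
        sd = (sum((e - mean) ** 2 for e in estimates) / n_boot) ** 0.5
        curve[n] = (mean, sd)
    return curve

# Hypothetical responses for one sentence frame.
responses = ["dog"] * 60 + ["cat"] * 30 + ["bird"] * 10
full_entropy = shannon_entropy(responses)
curve = bootstrap_entropy_curve(responses, sample_sizes=[10, 50, 100])
```

A shrinking standard deviation across `curve` is one simple way to see the estimate stabilizing; small samples also tend to underestimate entropy, which is why the number of responses matters.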

Title: Semantic Self-Distillation for Language Model Uncertainty

Authors: Edward Phillips, Sean Wu, Boyan Gao, David A. Clifton
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04577
Pdf URL: https://arxiv.org/pdf/2602.04577
Copy Paste: [[2602.04577]] Semantic Self-Distillation for Language Model Uncertainty(https://arxiv.org/abs/2602.04577)
Keywords: language model, hallucination, prompt
Abstract: Large language models present challenges for principled uncertainty quantification, in part due to their complexity and the diversity of their outputs. Semantic dispersion, or the variance in the meaning of sampled answers, has been proposed as a useful proxy for model uncertainty, but the associated computational cost prohibits its use in latency-critical applications. We show that sampled semantic distributions can be distilled into lightweight student models which estimate a prompt-conditioned uncertainty before the language model generates an answer token. The student model predicts a semantic distribution over possible answers; the entropy of this distribution provides an effective uncertainty signal for hallucination prediction, and the probability density allows candidate answers to be evaluated for reliability. On TriviaQA, our student models match or outperform finite-sample semantic dispersion for hallucination prediction and provide a strong signal for out-of-domain answer detection. We term this technique Semantic Self-Distillation (SSD), which we suggest provides a general framework for distilling predictive uncertainty in complex output spaces beyond language.

Title: Trust The Typical

Authors: Debargha Ganguly, Sreehari Sankar, Biyao Zhang, Vikash Singh, Kanan Gupta, Harshini Kavuru, Alan Luo, Weicong Chen, Warren Morningstar, Raghu Machiraju, Vipin Chaudhary
Subjects: cs.CL, cs.AI, cs.DC, cs.LG
Abstract URL: https://arxiv.org/abs/2602.04581
Pdf URL: https://arxiv.org/pdf/2602.04581
Copy Paste: [[2602.04581]] Trust The Typical(https://arxiv.org/abs/2602.04581)
Keywords: llm, prompt
Abstract: Current approaches to LLM safety fundamentally rely on a brittle cat-and-mouse game of identifying and blocking known threats via guardrails. We argue for a fresh approach: robust safety comes not from enumerating what is harmful, but from deeply understanding what is safe. We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. Unlike prior methods, it requires no training on harmful examples, yet achieves state-of-the-art performance across 18 benchmarks spanning toxicity, hate speech, jailbreaking, multilingual harms, and over-refusal, reducing false positive rates by up to 40x relative to specialized safety models. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining. Finally, we demonstrate production readiness by integrating a GPU-optimized version into vLLM, enabling continuous guardrailing during token generation with less than 6% overhead even under dense evaluation intervals on large-scale workloads.

Title: VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration

Authors: Jaeyoon Jung, Yejun Yoon, Seunghyun Yoon, Kunwoo Park
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2602.04587
Pdf URL: https://arxiv.org/pdf/2602.04587
Copy Paste: [[2602.04587]] VILLAIN at AVerImaTeC: Verifying Image-Text Claims via Multi-Agent Collaboration(https://arxiv.org/abs/2602.04587)
Keywords: language model, prompt, agent
Abstract: This paper describes VILLAIN, a multimodal fact-checking system that verifies image-text claims through prompt-based multi-agent collaboration. For the AVerImaTeC shared task, VILLAIN employs vision-language model agents across multiple stages of fact-checking. Textual and visual evidence is retrieved from the knowledge store enriched through additional web collection. To identify key information and address inconsistencies among evidence items, modality-specific and cross-modal agents generate analysis reports. In the subsequent stage, question-answer pairs are produced based on these reports. Finally, the Verdict Prediction agent produces the verification outcome based on the image-text claim and the generated question-answer pairs. Our system ranked first on the leaderboard across all evaluation metrics. The source code is publicly available.

Title: Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays

Authors: Lucile Favero, Juan Antonio Pérez-Ortiz, Tanja Käser, Nuria Oliver
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04604
Pdf URL: https://arxiv.org/pdf/2602.04604
Copy Paste: [[2602.04604]] Beyond Holistic Scores: Automatic Trait-Based Quality Scoring of Argumentative Essays(https://arxiv.org/abs/2602.04604)
Keywords: llm, prompt
Abstract: Automated Essay Scoring systems have traditionally focused on holistic scores, limiting their pedagogical usefulness, especially in the case of complex essay genres such as argumentative writing. In educational contexts, teachers and learners require interpretable, trait-level feedback that aligns with instructional goals and established rubrics. In this paper, we study trait-based Automatic Argumentative Essay Scoring using two complementary modeling paradigms designed for realistic educational deployment: (1) structured in-context learning with small open-source LLMs, and (2) a supervised, encoder-based BigBird model with a CORAL-style ordinal regression formulation, optimized for long-sequence understanding. We conduct a systematic evaluation on the ASAP++ dataset, which includes essay scores across five quality traits, offering strong coverage of core argumentation dimensions. LLMs are prompted with designed, rubric-aligned in-context examples, along with feedback and confidence requests, while we explicitly model ordinality in scores with the BigBird model via the rank-consistent CORAL framework. Our results show that explicitly modeling score ordinality substantially improves agreement with human raters across all traits, outperforming LLMs and nominal classification and regression-based baselines. This finding reinforces the importance of aligning model objectives with rubric semantics for educational assessment. At the same time, small open-source LLMs achieve a competitive performance without task-specific fine-tuning, particularly for reasoning-oriented traits, while enabling transparent, privacy-preserving, and locally deployable assessment scenarios. Our findings provide methodological, modeling, and practical insights for the design of AI-based educational systems that aim to deliver interpretable, rubric-aligned feedback for argumentative writing.
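The CORAL-style ordinal formulation mentioned in the abstract can be sketched in a few lines. In this setup a K-level score is decomposed into K-1 binary "score > k" tasks that share one weight vector and differ only in their biases. The feature values, weights, and biases below are made up, and the BigBird encoder that would produce the features is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def coral_predict(features, weights, biases):
    """Return P(score > k) for each of the K-1 thresholds and the predicted level."""
    shared_logit = features @ weights                 # one scalar per essay
    probs = sigmoid(shared_logit[:, None] + biases[None, :])
    labels = (probs > 0.5).sum(axis=1)                # number of thresholds passed
    return probs, labels

def coral_loss(probs, labels, num_classes):
    """Binary cross-entropy summed over the K-1 tasks, averaged over essays."""
    levels = (labels[:, None] > np.arange(num_classes - 1)[None, :]).astype(float)
    eps = 1e-12
    bce = -(levels * np.log(probs + eps) + (1 - levels) * np.log(1 - probs + eps))
    return bce.sum(axis=1).mean()

# Toy example with K = 4 score levels (0..3) and hypothetical 2-d essay features.
features = np.array([[-3.0, 0.0], [1.0, 0.0], [5.0, 0.0]])
weights = np.array([1.0, 0.0])
biases = np.array([2.0, 0.0, -2.0])                   # non-increasing biases
probs, labels = coral_predict(features, weights, biases)
loss = coral_loss(probs, labels, num_classes=4)
```

Because every threshold shares the same logit, non-increasing biases yield non-increasing probabilities across thresholds, so the predicted level is never self-contradictory. A plain nominal classifier does not offer that guarantee.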

Title: Focus-LIME: Surgical Interpretation of Long-Context Large Language Models via Proxy-Based Neighborhood Selection

Authors: Junhao Liu, Haonan Yu, Zhenyu Yan, Xin Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.04607
Pdf URL: https://arxiv.org/pdf/2602.04607
Copy Paste: [[2602.04607]] Focus-LIME: Surgical Interpretation of Long-Context Large Language Models via Proxy-Based Neighborhood Selection(https://arxiv.org/abs/2602.04607)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) scale to handle massive context windows, achieving surgical feature-level interpretation is essential for high-stakes tasks like legal auditing and code debugging. However, existing local model-agnostic explanation methods face a critical dilemma in these scenarios: feature-based methods suffer from attribution dilution due to high feature dimensionality, thus failing to provide faithful explanations. In this paper, we propose Focus-LIME, a coarse-to-fine framework designed to restore the tractability of surgical interpretation. Focus-LIME utilizes a proxy model to curate the perturbation neighborhood, allowing the target model to perform fine-grained attribution exclusively within the optimized context. Empirical evaluations on long-context benchmarks demonstrate that our method makes surgical explanations practicable and provides faithful explanations to users.

Title: Disentangling meaning from language in LLM-based machine translation

Authors: Théo Lasnier, Armel Zebaze, Djamé Seddah, Rachel Bawden, Benoît Sagot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04613
Pdf URL: https://arxiv.org/pdf/2602.04613
Copy Paste: [[2602.04613]] Disentangling meaning from language in LLM-based machine translation(https://arxiv.org/abs/2602.04613)
Keywords: language model, llm, prompt
Abstract: Mechanistic Interpretability (MI) seeks to explain how neural networks implement their capabilities, but the scale of Large Language Models (LLMs) has limited prior MI work in Machine Translation (MT) to word-level analyses. We study sentence-level MT from a mechanistic perspective by analyzing attention heads to understand how LLMs internally encode and distribute translation functions. We decompose MT into two subtasks: producing text in the target language (i.e. target language identification) and preserving the input sentence's meaning (i.e. sentence equivalence). Across three families of open-source models and 20 translation directions, we find that distinct, sparse sets of attention heads specialize in each subtask. Based on this insight, we construct subtask-specific steering vectors and show that modifying just 1% of the relevant heads enables instruction-free MT performance comparable to instruction-based prompting, while ablating these heads selectively disrupts their corresponding translation functions.

Title: LEAD: Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation

Authors: Ruixiao Yang, Yuanhe Tian, Xu Yang, Huiqi Li, Yan Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04617
Pdf URL: https://arxiv.org/pdf/2602.04617
Copy Paste: [[2602.04617]] LEAD: Layer-wise Expert-aligned Decoding for Faithful Radiology Report Generation(https://arxiv.org/abs/2602.04617)
Keywords: language model, llm, hallucination
Abstract: Radiology Report Generation (RRG) aims to produce accurate and coherent diagnostics from medical images. Although large vision language models (LVLM) improve report fluency and accuracy, they exhibit hallucinations, generating plausible yet image-ungrounded pathological details. Existing methods primarily rely on external knowledge guidance to facilitate the alignment between generated text and visual information. However, these approaches often ignore the inherent decoding priors and vision-language alignment biases in pretrained models and lack robustness due to reliance on constructed guidance. In this paper, we propose Layer-wise Expert-aligned Decoding (LEAD), a novel method to inherently modify the LVLM decoding trajectory. A multiple experts module is designed for extracting distinct pathological features which are integrated into each decoder layer via a gating mechanism. This layer-wise architecture enables the LLM to consult expert features at every inference step via a learned gating function, thereby dynamically rectifying decoding biases and steering the generation toward factual consistency. Experiments conducted on multiple public datasets demonstrate that the LEAD method yields effective improvements in clinical accuracy metrics and mitigates hallucinations while preserving high generation quality.

Title: Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings

Authors: Tim Kunt, Annika Buchholz, Imene Khebouri, Thorsten Koch, Ida Litzel, Thi Huong Vu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04630
Pdf URL: https://arxiv.org/pdf/2602.04630
Copy Paste: [[2602.04630]] Mapping the Web of Science, a large-scale graph and text-based dataset with LLM embeddings(https://arxiv.org/abs/2602.04630)
Keywords: llm
Abstract: Large text data sets, such as publications, websites, and other text-based media, inherit two distinct types of features: (1) the text itself, its information conveyed through semantics, and (2) its relationship to other texts through links, references, or shared attributes. While the latter can be described as a graph structure and can be handled by a range of established algorithms for classification and prediction, the former has recently gained new potential through the use of LLM embedding models. Demonstrating these possibilities and their practicability, we investigate the Web of Science dataset, containing ~56 million scientific publications through the lens of our proposed embedding method, revealing a self-structured landscape of texts.
摘要：大型文本数据集（例如出版物、网站和其他基于文本的媒体）继承了两种不同类型的特征：（1）文本本身，其通过语义传达的信息，以及（2）通过链接、引用或共享属性与其他文本的关系。虽然后者可以描述为图结构，并且可以通过一系列已建立的分类和预测算法来处理，但前者最近通过使用 LLM 嵌入模型获得了新的潜力。为了证明这些可能性及其实用性，我们通过我们提出的嵌入方法研究了 Web of Science 数据集，其中包含约 5600 万篇科学出版物，揭示了文本的自结构化景观。

Title: Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Authors: Binghai Wang, Yantao Liu, Yuxuan Liu, Tianyi Tang, Shenzhi Wang, Chang Gao, Chujie Zheng, Yichang Zhang, Le Yu, Shixuan Liu, Tao Gui, Qi Zhang, Xuanjing Huang, Bowen Yu, Fei Huang, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04649
Pdf URL: https://arxiv.org/pdf/2602.04649
Copy Paste: [[2602.04649]] Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models(https://arxiv.org/abs/2602.04649)
Keywords: llm
Abstract: Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons, as they are trained and evaluated to prioritize Outcome Accuracy, which undermines their ability to generalize during RLHF. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model's reasoning process and human judgment. Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, while outcome accuracy falls short in both respects. To mitigate this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training. Our training method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), surpassing outcome-only baselines by an average of 5%. Using RM during RLHF, our method effectively improves performance as demonstrated on Arena Hard v2, notably yielding a 7% improvement in creative writing tasks. Further analysis confirms that our method escapes the deceptive alignment trap, effectively reversing the decline in rationale consistency observed in outcome-only training.
摘要：生成奖励模型（GenRM）和法学硕士作为法官通过出于错误的原因产生正确的判断而表现出欺骗性的一致性，因为它们经过训练和评估以优先考虑结果准确性，这削弱了它们在 RLHF 期间泛化的能力。我们引入了基本原理一致性，这是一种细粒度的指标，可以量化模型的推理过程和人类判断之间的一致性。我们对前沿模型的评估表明，基本原理一致性可以有效地区分最先进的模型并检测欺骗性的一致性，而结果准确性在这两方面都不足。为了缩小这一差距，我们引入了一种混合信号，它将基本原理一致性与 GenRM 训练的结果准确性结合起来。我们的训练方法在 RM-Bench (87.1%) 和 JudgeBench (82%) 上实现了最先进的性能，平均超过仅结果基线 5%。在 RLHF 期间使用 RM，我们的方法有效地提高了性能，如 Arena Hard v2 所示，特别是创意写作任务提高了 7%。进一步的分析证实，我们的方法逃脱了欺骗性的对齐陷阱，有效地扭转了在仅结果训练中观察到的基本原理一致性的下降。

Title: Approaches to Semantic Textual Similarity in Slovak Language: From Algorithms to Transformers

Authors: Lukas Radosky, Miroslav Blstak, Matej Krajcovic, Ivan Polasek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04659
Pdf URL: https://arxiv.org/pdf/2602.04659
Copy Paste: [[2602.04659]] Approaches to Semantic Textual Similarity in Slovak Language: From Algorithms to Transformers(https://arxiv.org/abs/2602.04659)
Keywords: gpt
Abstract: Semantic textual similarity (STS) plays a crucial role in many natural language processing tasks. While extensively studied in high-resource languages, STS remains challenging for under-resourced languages such as Slovak. This paper presents a comparative evaluation of sentence-level STS methods applied to Slovak, including traditional algorithms, supervised machine learning models, and third-party deep learning tools. We trained several machine learning models using outputs from traditional algorithms as features, with feature selection and hyperparameter tuning jointly guided by artificial bee colony optimization. Finally, we evaluated several third-party tools, including a fine-tuned model by CloudNLP, OpenAI's embedding models, the GPT-4 model, and the pretrained SlovakBERT model. Our findings highlight the trade-offs between different approaches.
摘要：语义文本相似度（STS）在许多自然语言处理任务中起着至关重要的作用。虽然 STS 在资源丰富的语言中得到了广泛的研究，但对于斯洛伐克语等资源贫乏的语言来说仍然具有挑战性。本文对应用于斯洛伐克语的句子级 STS 方法进行了比较评估，包括传统算法、监督机器学习模型和第三方深度学习工具。我们使用传统算法的输出作为特征训练了多个机器学习模型，并在人工蜂群优化的共同指导下进行特征选择和超参数调整。最后，我们评估了几个第三方工具，包括 CloudNLP 的微调模型、OpenAI 的嵌入模型、GPT-4 模型和预训练的 SlovakBERT 模型。我们的研究结果强调了不同方法之间的权衡。

Title: Investigating Disability Representations in Text-to-Image Models

Authors: Yang Yian, Yu Fan, Liudmila Zavolokina, Sarah Ebling
Subjects: cs.CL, cs.CV, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2602.04687
Pdf URL: https://arxiv.org/pdf/2602.04687
Copy Paste: [[2602.04687]] Investigating Disability Representations in Text-to-Image Models(https://arxiv.org/abs/2602.04687)
Keywords: prompt
Abstract: Text-to-image generative models have made remarkable progress in producing high-quality visual content from textual descriptions, yet concerns remain about how they represent social groups. While characteristics like gender and race have received increasing attention, disability representations remain underexplored. This study investigates how people with disabilities are represented in AI-generated images by analyzing outputs from Stable Diffusion XL and DALL-E 3 using a structured prompt design. We analyze disability representations by comparing image similarities between generic disability prompts and prompts referring to specific disability categories. Moreover, we evaluate how mitigation strategies influence disability portrayals, with a focus on assessing affective framing through sentiment polarity analysis, combining both automatic and human evaluation. Our findings reveal persistent representational imbalances and highlight the need for continuous evaluation and refinement of generative models to foster more diverse and inclusive portrayals of disability.
摘要：文本到图像生成模型在根据文本描述生成高质量视觉内容方面取得了显着进展，但人们仍然担心它们如何代表社会群体。虽然性别和种族等特征受到越来越多的关注，但残疾代表性仍然没有得到充分探索。本研究通过使用结构化提示设计分析 Stable Diffusion XL 和 DALL-E 3 的输出，研究如何在 AI 生成的图像中呈现残疾人。我们通过比较一般残疾提示和涉及特定残疾类别的提示之间的图像相似性来分析残疾表征。此外，我们评估缓解策略如何影响残疾描绘，重点是通过情感极性分析评估情感框架，结合自动和人工评估。我们的研究结果揭示了持续存在的代表性失衡，并强调需要不断评估和完善生成模型，以促进对残疾的更加多样化和包容性的描述。

Title: LinGO: A Linguistic Graph Optimization Framework with LLMs for Interpreting Intents of Online Uncivil Discourse

Authors: Yuan Zhang, Thales Bertaglia
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2602.04693
Pdf URL: https://arxiv.org/pdf/2602.04693
Copy Paste: [[2602.04693]] LinGO: A Linguistic Graph Optimization Framework with LLMs for Interpreting Intents of Online Uncivil Discourse(https://arxiv.org/abs/2602.04693)
Keywords: language model, gpt, llm, prompt, retrieval-augmented generation, chain-of-thought
Abstract: Detecting uncivil language is crucial for maintaining safe, inclusive, and democratic online spaces. Yet existing classifiers often misinterpret posts containing uncivil cues but expressing civil intents, leading to inflated estimates of harmful incivility online. We introduce LinGO, a linguistic graph optimization framework for large language models (LLMs) that leverages linguistic structures and optimization techniques to classify multi-class intents of incivility that use various direct and indirect expressions. LinGO decomposes language into multi-step linguistic components, identifies targeted steps that cause the most errors, and iteratively optimizes prompt and/or example components for targeted steps. We evaluate it using a dataset collected during the 2022 Brazilian presidential election, encompassing four forms of political incivility: Impoliteness (IMP), Hate Speech and Stereotyping (HSST), Physical Harm and Violent Political Rhetoric (PHAVPR), and Threats to Democratic Institutions and Values (THREAT). Each instance is annotated with six types of civil/uncivil intent. We benchmark LinGO using three cost-efficient LLMs: GPT-5-mini, Gemini 2.5 Flash-Lite, and Claude 3 Haiku, and four optimization techniques: TextGrad, AdalFlow, DSPy, and Retrieval-Augmented Generation (RAG). The results show that, across all models, LinGO consistently improves accuracy and weighted F1 compared with zero-shot, chain-of-thought, direct optimization, and fine-tuning baselines. RAG is the strongest optimization technique and, when paired with the Gemini model, achieves the best overall performance. These findings demonstrate that incorporating multi-step linguistic components into LLM instructions and optimizing targeted components can help the models explain complex semantic meanings, which can be extended to other complex semantic explanation tasks in the future.
摘要：检测不文明语言对于维护安全、包容和民主的网络空间至关重要。然而，现有的分类器经常会误解包含不文明暗示但表达文明意图的帖子，导致对网上有害不文明行为的估计过高。我们引入了 LinGO，这是一种用于大型语言模型 (LLM) 的语言图优化框架，它利用语言结构和优化技术对使用各种直接和间接表达的多类不文明意图进行分类。 LinGO 将语言分解为多步骤语言组件，识别导致最多错误的目标步骤，并迭代优化目标步骤的提示和/或示例组件。我们使用 2022 年巴西总统选举期间收集的数据集对其进行评估，其中包括四种形式的政治不文明行为：不礼貌 (IMP)、仇恨言论和成见 (HSST)、身体伤害和暴力政治言论 (PHAVPR) 以及对民主制度和价值观的威胁 (THREAT)。每个实例都注释有六种类型的民事/不民事意图。我们使用三种经济高效的 LLM：GPT-5-mini、Gemini 2.5 Flash-Lite 和 Claude 3 Haiku 以及四种优化技术：TextGrad、AdalFlow、DSPy 和检索增强生成 (RAG) 对 LinGO 进行基准测试。结果表明，在所有模型中，与零样本、思维链、直接优化和微调基线相比，LinGO 持续提高了准确性和加权 F1。 RAG是最强的优化技术，与Gemini模型配合使用时，可以实现最佳的整体性能。这些发现表明，将多步骤语言组件合并到LLM指令中并优化目标组件可以帮助模型解释复杂的语义含义，这可以在未来扩展到其他复杂的语义解释任务。

Title: LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers

Authors: Yike Sun, Haotong Yang, Zhouchen Lin, Muhan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04706
Pdf URL: https://arxiv.org/pdf/2602.04706
Copy Paste: [[2602.04706]] LiteToken: Removing Intermediate Merge Residues From BPE Tokenizers(https://arxiv.org/abs/2602.04706)
Keywords: language model
Abstract: Tokenization is fundamental to how language models represent and process text, yet the behavior of widely used BPE tokenizers has received far less study than model architectures and training. In this paper, we investigate intermediate merge residues in BPE vocabularies: tokens that are frequent during merge learning and are therefore retained in the final vocabulary, but are mostly merged further and rarely emitted when the tokenizer is applied to the corpus. Such low-frequency tokens not only waste vocabulary capacity but also increase vulnerability to adversarial or atypical inputs. We present a systematic empirical characterization of this phenomenon across commonly used tokenizers and introduce LiteToken, a simple method for removing residue tokens. Because the affected tokens are rarely used, pretrained models can often accommodate the modified tokenizer without additional fine-tuning. Experiments show that LiteToken reduces token fragmentation, reduces parameters, and improves robustness to noisy or misspelled inputs, while preserving overall performance.
摘要：分词是语言模型如何表示和处理文本的基础，但广泛使用的 BPE 分词器的行为得到的研究远少于模型架构和训练。在本文中，我们研究了 BPE 词汇表中的中间合并残基：在合并学习期间频繁出现的标记，因此保留在最终词汇中，但在标记器使用期间对语料库进行标记时，大多数会进一步合并并且很少发出。这种低频标记不仅浪费词汇量，而且还增加了对抗性或非典型输入的脆弱性。我们通过常用的分词器对这种现象进行了系统的经验表征，并介绍了 LiteToken，一种去除残留分词的简单方法。由于受影响的标记很少使用，因此预训练模型通常可以容纳修改后的标记生成器，而无需额外的微调。实验表明，LiteToken 减少了令牌碎片、减少了参数，并提高了对噪声或拼写错误输入的鲁棒性，同时保持了整体性能。
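
The phenomenon described above can be illustrated with a toy example. The sketch below is not LiteToken itself: it only shows how a token created by an early merge can end up never being emitted because a later merge absorbs it. The merge table, the corpus counts, and the names `apply_merges` and `find_residues` are hypothetical.

```python
from collections import Counter


def apply_merges(word, merges):
    """Tokenize one word by applying BPE merges in their learned order."""
    symbols = list(word)
    for left, right in merges:
        merged, i = [], 0
        while i < len(symbols):
            # Merge every adjacent (left, right) pair found in this pass.
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def find_residues(merges, corpus_counts, max_emitted=0):
    """Return merge-created tokens emitted at most `max_emitted` times."""
    emitted = Counter()
    for word, count in corpus_counts.items():
        for token in apply_merges(word, merges):
            emitted[token] += count
    merge_tokens = [left + right for left, right in merges]
    return [token for token in merge_tokens if emitted[token] <= max_emitted]


# Toy merge table: "th" is learned first, then absorbed into "the".
merges = [("t", "h"), ("th", "e")]
residues = find_residues(merges, {"the": 10, "then": 3})
```

In this toy corpus `th` sits in the vocabulary but never appears in any tokenization, which is the kind of residue the abstract describes. Adding a word such as "this" to the corpus would make `th` a live token again.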

Title: "Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs

Authors: Madison Van Doren, Casey Ford, Jennifer Barajas, Cory Holland
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04729
Pdf URL: https://arxiv.org/pdf/2602.04729
Copy Paste: [[2602.04729]] "Be My Cheese?": Cultural Nuance Benchmarking for Machine Translation in Multilingual LLMs(https://arxiv.org/abs/2602.04729)
Keywords: language model, gpt, llm
Abstract: We present a large-scale human evaluation benchmark for assessing cultural localisation in machine translation produced by state-of-the-art multilingual large language models (LLMs). Existing MT benchmarks emphasise token-level and grammatical accuracy, but often overlook pragmatic and culturally grounded competencies required for real-world localisation. Building on a pilot study of 87 translations across 20 languages, we evaluate 7 multilingual LLMs across 15 target languages with 5 native-speaker raters per language. Raters scored both full-text translations and segment-level instances of culturally nuanced language (idioms, puns, holidays, and culturally embedded concepts) on an ordinal 0-3 quality scale; segment ratings additionally included an NA option for untranslated segments. Across full-text evaluations, mean overall quality is modest (1.68/3): GPT-5 (2.10/3), Claude Sonnet 3.7 (1.97/3), and Mistral Medium 3.1 (1.84/3) form the strongest tier with fewer catastrophic failures. Segment-level results show sharp category effects: holidays (2.20/3) and cultural concepts (2.19/3) translate substantially better than idioms (1.65/3) and puns (1.45/3), and idioms are most likely to be left untranslated. These findings demonstrate a persistent gap between grammatical adequacy and cultural resonance. To our knowledge, this is the first multilingual, human-annotated benchmark focused explicitly on cultural nuance in translation and localisation, highlighting the need for culturally informed training data, improved cross-lingual pragmatics, and evaluation paradigms that better reflect real-world communicative competence.
摘要：我们提出了一个大规模的人类评估基准，用于评估由最先进的多语言大语言模型（LLM）生成的机器翻译中的文化本地化。现有的机器翻译基准强调标记级别和语法准确性，但有十个基准忽视了现实世界本地化所需的务实和文化基础能力。基于对 20 种语言的 87 种翻译的试点研究，我们评估了 15 种目标语言的 7 个多语言法学硕士，每种语言有 5 位母语评估者。评估者按照 0-3 的顺序质量等级对全文翻译和文化细微差别的语言（习语、双关语、假期和文化嵌入概念）的片段级实例进行评分；片段评级还包括针对未翻译片段的 NA 选项。在全文评估中，平均总体质量适中 (1.68/3)：GPT-5 (2.10/3)、Claude Sonnet 3.7 (1.97/3) 和 Mistral Medium 3.1 (1.84/3) 形成最强的级别，灾难性故障较少。细分层面的结果显示出明显的类别效应：假期（2.20/3）和文化概念（2.19/3）的翻译效果明显好于习语（1.65/3）和双关语（1.45/3），并且习语最有可能不翻译。这些发现表明语法充分性和文化共鸣之间持续存在差距。据我们所知，这是第一个明确关注翻译和本地化中文化细微差别的多语言、人工注释的基准，强调了对文化信息训练数据、改进的跨语言语用学和评估范式的需求，以更好地反映现实世界的沟通能力。

Title: Less Finetuning, Better Retrieval: Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging

Authors: Sameh Khattab, Jean-Philippe Corbeil, Osman Alperen Koraş, Amin Dada, Julian Friedrich, François Beaulieu, Paul Vozila, Jens Kleesiek
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.04731
Pdf URL: https://arxiv.org/pdf/2602.04731
Copy Paste: [[2602.04731]] Less Finetuning, Better Retrieval: Rethinking LLM Adaptation for Biomedical Retrievers via Synthetic Data and Model Merging(https://arxiv.org/abs/2602.04731)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has become the backbone of grounding Large Language Models (LLMs), improving knowledge updates and reducing hallucinations. Recently, LLM-based retriever models have shown state-of-the-art performance for RAG applications. However, several technical aspects remain underexplored on how to adapt general-purpose LLMs into effective domain-specific retrievers, especially in specialized domains such as biomedicine. We present Synthesize-Train-Merge (STM), a modular framework that enhances decoder-only LLMs with synthetic hard negatives, retrieval prompt optimization, and model merging. Experiments on a subset of 12 medical and general tasks from the MTEB benchmark show STM boosts task-specific experts by up to 23.5% (average 7.5%) and produces merged models that outperform both single experts and strong baselines without extensive pretraining. Our results demonstrate a scalable, efficient path for turning general LLMs into high-performing, domain-specialized retrievers, preserving general-domain capabilities while excelling on specialized tasks.
摘要：检索增强生成（RAG）已成为奠定大型语言模型（LLM）、改进知识更新和减少幻觉的支柱。最近，基于 LLM 的检索器模型在 RAG 应用中展现出了最先进的性能。然而，关于如何将通用法学硕士转化为有效的特定领域检索器，特别是在生物医学等专业领域，仍有几个技术方面尚未得到充分探索。我们提出了 Synthesize-Train-Merge (STM)，这是一个模块化框架，可通过合成硬底片、检索提示优化和模型合并来增强仅解码器的 LLM。对 MTEB 基准测试中的 12 项医疗和一般任务子集进行的实验表明，STM 将特定于任务的专家提升高达 23.5%（平均 7.5%），并生成优于单个专家和强基线的合并模型，而无需进行大量预训练。我们的研究结果展示了一条可扩展、高效的途径，可将普通法学硕士转变为高性能、专业领域的检索器，在保留通用领域能力的同时，在专业任务上表现出色。

Title: Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases

Authors: Casey Ford, Madison Van Doren, Emily Dix
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2602.04739
Pdf URL: https://arxiv.org/pdf/2602.04739
Copy Paste: [[2602.04739]] Alignment Drift in Multimodal LLMs: A Two-Phase, Longitudinal Evaluation of Harm Across Eight Model Releases(https://arxiv.org/abs/2602.04739)
Keywords: language model, gpt, llm, prompt
Abstract: Multimodal large language models (MLLMs) are increasingly deployed in real-world systems, yet their safety under adversarial prompting remains underexplored. We present a two-phase evaluation of MLLM harmlessness using a fixed benchmark of 726 adversarial prompts authored by 26 professional red teamers. Phase 1 assessed GPT-4o, Claude Sonnet 3.5, Pixtral 12B, and Qwen VL Plus; Phase 2 evaluated their successors (GPT-5, Claude Sonnet 4.5, Pixtral Large, and Qwen Omni) yielding 82,256 human harm ratings. Large, persistent differences emerged across model families: Pixtral models were consistently the most vulnerable, whereas Claude models appeared safest due to high refusal rates. Attack success rates (ASR) showed clear alignment drift: GPT and Claude models exhibited increased ASR across generations, while Pixtral and Qwen showed modest decreases. Modality effects also shifted over time: text-only prompts were more effective in Phase 1, whereas Phase 2 produced model-specific patterns, with GPT-5 and Claude 4.5 showing near-equivalent vulnerability across modalities. These findings demonstrate that MLLM harmlessness is neither uniform nor stable across updates, underscoring the need for longitudinal, multimodal benchmarks to track evolving safety behaviour.
摘要：多模态大语言模型（MLLM）越来越多地部署在现实世界系统中，但其在对抗性提示下的安全性仍未得到充分探索。我们使用由 26 名专业红队成员编写的 726 个对抗性提示的固定基准，对 MLLM 无害性进行了两阶段评估。第一阶段评估了 GPT-4o、Claude Sonnet 3.5、Pixtral 12B 和 Qwen VL Plus；第 2 阶段评估了其后继版本（GPT-5、Claude Sonnet 4.5、Pixtral Large 和 Qwen Omni），得出 82,256 个人类伤害评级。不同模型家族之间存在巨大且持续的差异：Pixtral 模型始终是最脆弱的，而 Claude 模型由于高拒绝率而显得最安全。攻击成功率 (ASR) 显示出明显的对齐漂移：GPT 和 Claude 模型在各代之间表现出 ASR 的增加，而 Pixtral 和 Qwen 则显示出适度的下降。模态效果也随着时间的推移而发生变化：纯文本提示在第一阶段更有效，而第二阶段产生了特定于模型的模式，GPT-5 和 Claude 4.5 显示出跨模态的几乎相同的脆弱性。这些发现表明，MLLM 无害性在更新过程中既不统一也不稳定，强调需要纵向、多模式基准来跟踪不断变化的安全行为。

Title: Exploiting contextual information to improve stance detection in informal political discourse with LLMs

Authors: Arman Engin Sucu, Yixiang Zhou, Mario A. Nascimento, Tony Mullen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.04750
Pdf URL: https://arxiv.org/pdf/2602.04750
Copy Paste: [[2602.04750]] Exploiting contextual information to improve stance detection in informal political discourse with LLMs(https://arxiv.org/abs/2602.04750)
Keywords: language model, llm, prompt
Abstract: This study investigates the use of Large Language Models (LLMs) for political stance detection in informal online discourse, where language is often sarcastic, ambiguous, and context-dependent. We explore whether providing contextual information, specifically user profile summaries derived from historical posts, can improve classification accuracy. Using a real-world political forum dataset, we generate structured profiles that summarize users' ideological leaning, recurring topics, and linguistic patterns. We evaluate seven state-of-the-art LLMs across baseline and context-enriched setups through a comprehensive cross-model evaluation. Our findings show that contextual prompts significantly boost accuracy, with improvements ranging from +17.5% to +38.5%, achieving up to 74% accuracy that surpasses previous approaches. We also analyze how profile size and post selection strategies affect performance, showing that strategically chosen political content yields better results than larger, randomly selected contexts. These findings underscore the value of incorporating user-level context to enhance LLM performance in nuanced political classification tasks.
摘要：本研究调查了大型语言模型 (LLM) 在非正式在线话语中检测政治立场的用途，其中语言通常是讽刺性的、模棱两可的且依赖于上下文。我们探讨提供上下文信息，特别是从历史帖子中得出的用户个人资料摘要是否可以提高分类准确性。使用现实世界的政治论坛数据集，我们生成结构化的个人资料，总结用户的意识形态倾向、重复出现的主题和语言模式。我们通过全面的跨模型评估，跨基线和背景丰富的设置评估了七个最先进的法学硕士。我们的研究结果表明，上下文提示显着提高了准确性，改进范围从 +17.5\% 到 +38.5\%，准确率高达 74\%，超越了以前的方法。我们还分析了个人资料大小和职位选择策略如何影响绩效，表明策略性选择的政治内容比更大的随机选择的背景产生更好的结果。这些发现强调了纳入用户级别背景以提高法学硕士在细致入微的政治分类任务中的表现的价值。

Title: When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?

Authors: Xinyu Zhou, Chang Jin, Carsten Eickhoff, Zhijiang Guo, Seyed Ali Bahrainian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.04755
Pdf URL: https://arxiv.org/pdf/2602.04755
Copy Paste: [[2602.04755]] When Silence Is Golden: Can LLMs Learn to Abstain in Temporal QA and Beyond?(https://arxiv.org/abs/2602.04755)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Large language models (LLMs) rarely admit uncertainty, often producing fluent but misleading answers, rather than abstaining (i.e., refusing to answer). This weakness is even evident in temporal question answering, where models frequently ignore time-sensitive evidence and conflate facts across different time-periods. In this paper, we present the first empirical study of training LLMs with an abstention ability while reasoning about temporal QA. Existing approaches such as calibration might be unreliable in capturing uncertainty in complex reasoning. We instead frame abstention as a teachable skill and introduce a pipeline that couples Chain-of-Thought (CoT) supervision with Reinforcement Learning (RL) guided by abstention-aware rewards. Our goal is to systematically analyze how different information types and training techniques affect temporal reasoning with abstention behavior in LLMs. Through extensive experiments studying various methods, we find that RL yields strong empirical gains on reasoning: a model initialized by Qwen2.5-1.5B-Instruct surpasses GPT-4o by 3.46% and 5.80% in Exact Match on TimeQA-Easy and Hard, respectively. Moreover, it improves the True Positive rate on unanswerable questions by 20% over a pure supervised fine-tuned (SFT) variant. Beyond performance, our analysis shows that SFT induces overconfidence and harms reliability, while RL improves prediction accuracy but exhibits similar risks. Finally, by comparing implicit reasoning cues (e.g., original context, temporal sub-context, knowledge graphs) with explicit CoT supervision, we find that implicit information provides limited benefit for reasoning with abstention. Our study provides new insights into how abstention and reasoning can be jointly optimized, providing a foundation for building more reliable LLMs.
摘要：大型语言模型（LLM）很少承认不确定性，通常会产生流畅但具有误导性的答案，而不是放弃（即拒绝回答）。这种弱点在时间问答中甚至很明显，其中模型经常忽略时间敏感的证据并将不同时间段的事实混为一谈。在本文中，我们提出了第一个在推理时间 QA 时训练具有弃权能力的法学硕士的实证研究。校准等现有方法在捕获复杂推理中的不确定性方面可能不可靠。相反，我们将弃权视为一种可教授的技能，并引入了一种将思想链（CoT）监督与由弃权感知奖励引导的强化学习（RL）相结合的管道。我们的目标是系统地分析不同的信息类型和培训技术如何影响法学硕士的时间推理和弃权行为。通过研究各种方法的大量实验，我们发现 RL 在推理方面产生了强大的经验收益：由 Qwen2.5-1.5B-Instruct 初始化的模型在 TimeQA-Easy 和 Hard 上的精确匹配中分别超过 GPT-4o $3.46\%$ 和 $5.80\%$。此外，与纯监督微调 (SFT) 变体相比，它将无法回答的问题的真实阳性率提高了 20\%$。除了性能之外，我们的分析表明，SFT 会导致过度自信并损害可靠性，而 RL 可以提高预测准确性，但也表现出类似的风险。最后，通过将隐式推理线索（例如，原始上下文、时间子上下文、知识图）与显式 CoT 监督进行比较，我们发现隐式信息为弃权推理提供的好处有限。我们的研究为如何联合优化弃权和推理提供了新的见解，为建立更可靠的法学硕士奠定了基础。

Title: Beyond Many-Shot Translation: Scaling In-Context Demonstrations For Low-Resource Machine Translation

Authors: Luis Frentzen Salim, Esteban Carlin, Alexandre Morinvil, Xi Ai, Lun-Wei Ku
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04764
Pdf URL: https://arxiv.org/pdf/2602.04764
Copy Paste: [[2602.04764]] Beyond Many-Shot Translation: Scaling In-Context Demonstrations For Low-Resource Machine Translation(https://arxiv.org/abs/2602.04764)
Keywords: language model, llm
Abstract: Building machine translation (MT) systems for low-resource languages is notably difficult due to the scarcity of high-quality data. Although Large Language Models (LLMs) have improved MT system performance, adapting them to lesser-represented languages remains challenging. In-context learning (ICL) may offer novel ways to adapt LLMs for low-resource MT by conditioning models on demonstrations at inference time. In this study, we explore scaling low-resource machine translation ICL beyond the few-shot setting to thousands of examples with long-context models. We scale the in-context token budget to 1M tokens and compare three types of training corpora used as in-context supervision: monolingual unsupervised data, instruction-style data, and parallel data (English-target and Indonesian-target). Our experiments on Javanese and Sundanese show that gains from additional context saturate quickly and can degrade near the maximum context window, with scaling behavior strongly dependent on corpus type. Notably, some forms of monolingual supervision can be competitive with parallel data, despite the latter offering additional supervision. Overall, our results characterize the effective limits and corpus-type sensitivity of long-context ICL for low-resource MT, highlighting that larger context windows do not necessarily yield proportional quality gains.
摘要：由于缺乏高质量数据，为资源匮乏的语言构建机器翻译 (MT) 系统尤其困难。尽管大型语言模型 (LLM) 提高了机器翻译系统的性能，但将其适应较少代表性的语言仍然具有挑战性。上下文学习（ICL）可以通过在推理时调节演示模型来提供使法学硕士适应低资源机器翻译的新方法。在这项研究中，我们探索将低资源机器翻译 ICL 扩展到少数镜头设置之外，扩展到具有长上下文模型的数千个示例。我们将上下文中的令牌预算扩展到 1M 个令牌，并比较了用作上下文监督的三种类型的训练语料库：单语无监督数据、指令式数据和并行数据（英语-目标和印度尼西亚语-目标）。我们对爪哇语和巽他语的实验表明，从额外上下文中获得的收益很快就会饱和，并且在最大上下文窗口附近可能会降低，而扩展行为强烈依赖于语料库类型。值得注意的是，某些形式的单语监督可以与并行数据竞争，尽管后者提供了额外的监督。总体而言，我们的结果描述了低资源 MT 的长上下文 ICL 的有效限制和语料库类型敏感性，强调较大的上下文窗口不一定会产生成比例的质量增益。

Title: OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models

Authors: Yue Ding, Yiyan Ji, Jungang Li, Xuyang Liu, Xinlong Chen, Junfei Wu, Bozhou Li, Bohan Zeng, Yang Shi, Yushuo Guan, Yuanxing Zhang, Jiaheng Liu, Qiang Liu, Pengfei Wan, Liang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04804
Pdf URL: https://arxiv.org/pdf/2602.04804
Copy Paste: [[2602.04804]] OmniSIFT: Modality-Asymmetric Token Compression for Efficient Omni-modal Large Language Models(https://arxiv.org/abs/2602.04804)
Keywords: language model, llm
Abstract: Omni-modal Large Language Models (Omni-LLMs) have demonstrated strong capabilities in audio-video understanding tasks. However, their reliance on long multimodal token sequences leads to substantial computational overhead. Despite this challenge, token compression methods designed for Omni-LLMs remain limited. To bridge this gap, we propose OmniSIFT (Omni-modal Spatio-temporal Informed Fine-grained Token compression), a modality-asymmetric token compression framework tailored for Omni-LLMs. Specifically, OmniSIFT adopts a two-stage compression strategy: (i) a spatio-temporal video pruning module that removes video redundancy arising from both intra-frame structure and inter-frame overlap, and (ii) a vision-guided audio selection module that filters audio tokens. The entire framework is optimized end-to-end via a differentiable straight-through estimator. Extensive experiments on five representative benchmarks demonstrate the efficacy and robustness of OmniSIFT. Notably, for Qwen2.5-Omni-7B, OmniSIFT introduces only 4.85M parameters while maintaining lower latency than training-free baselines such as OmniZip. With merely 25% of the original token context, OmniSIFT consistently outperforms all compression baselines and even surpasses the performance of the full-token model on several tasks.
摘要：全模态大语言模型（Omni-LLM）在音视频理解任务中表现出了强大的能力。然而，它们对长多模式令牌序列的依赖导致大量的计算开销。尽管面临这一挑战，为 Omni-LLM 设计的令牌压缩方法仍然有限。为了弥补这一差距，我们提出了 OmniSIFT（全模态时空知情细粒度令牌压缩），这是一种为 Omni-LLM 量身定制的模态非对称令牌压缩框架。具体来说，OmniSIFT 采用两级压缩策略：(i) 时空视频修剪模块，消除帧内结构和帧间重叠产生的视频冗余；(ii) 视觉引导音频选择模块，过滤音频标记。整个框架通过可微直通估计器进行端到端优化。对五个代表性基准的大量实验证明了 OmniSIFT 的有效性和稳健性。值得注意的是，对于 Qwen2.5-Omni-7B，OmniSIFT 仅引入 4.85M 参数，同时保持比 OmniZip 等免训练基线更低的延迟。仅使用 25% 的原始令牌上下文，OmniSIFT 始终优于所有压缩基线，甚至在多项任务上超过了全令牌模型的性能。

Title: SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization

Authors: Jiarui Yuan, Tailin Jin, Weize Chen, Zeyuan Liu, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.04811
Pdf URL: https://arxiv.org/pdf/2602.04811
Copy Paste: [[2602.04811]] SE-Bench: Benchmarking Self-Evolution with Knowledge Internalization(https://arxiv.org/abs/2602.04811)
Keywords: agent
Abstract: True self-evolution requires agents to act as lifelong learners that internalize novel experiences to solve future problems. However, rigorously measuring this foundational capability is hindered by two obstacles: the entanglement of prior knowledge, where "new" knowledge may appear in pre-training data, and the entanglement of reasoning complexity, where failures may stem from problem difficulty rather than an inability to recall learned knowledge. We introduce SE-Bench, a diagnostic environment that obfuscates the NumPy library and its API doc into a pseudo-novel package with randomized identifiers. Agents are trained to internalize this package and evaluated on simple coding tasks without access to documentation, yielding a clean setting where tasks are trivial with the new API doc but impossible for base models without it. Our investigation reveals three insights: (1) the Open-Book Paradox, where training with reference documentation inhibits retention, requiring "Closed-Book Training" to force knowledge compression into weights; (2) the RL Gap, where standard RL fails to internalize new knowledge completely due to PPO clipping and negative gradients; and (3) the viability of Self-Play for internalization, proving models can learn from self-generated, noisy tasks when coupled with SFT, but not RL. Overall, SE-Bench establishes a rigorous diagnostic platform for self-evolution with knowledge internalization. Our code and dataset can be found at this https URL.
摘要：真正的自我进化要求智能体充当终身学习者，内化新的经验来解决未来的问题。然而，严格衡量这种基本能力受到两个障碍的阻碍：先验知识的纠缠，其中“新”知识可能出现在预训练数据中，以及推理复杂性的纠缠，其中失败可能源于问题难度，而不是无法回忆起学到的知识。我们引入了 SE-Bench，这是一种诊断环境，它将 NumPy 库及其 API 文档混淆成具有随机标识符的伪新颖包。代理经过训练可以内化这个包，并在无需访问文档的情况下对简单的编码任务进行评估，从而产生一个干净的设置，其中任务对于新的 API 文档来说是微不足道的，但对于没有它的基本模型来说是不可能的。我们的调查揭示了三个见解：（1）开卷悖论，参考文档的培训会抑制记忆力，需要“闭卷培训”将知识压缩成权重； (2) RL Gap，由于 PPO 裁剪和负梯度，标准 RL 无法完全内化新知识；（3）自我对弈对于内化的可行性，证明模型在与 SFT 结合时可以从自我生成的噪声任务中学习，但不能从 RL 中学习。总体而言，SE-Bench建立了一个严格的诊断平台，用于知识内化的自我进化。我们的代码和数据集可以在此 https URL 中找到。

Title: Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say "I Don't Know"

Authors: Dhruv Madhwal, Lyuxin David Zhang, Dan Roth, Tomer Wolfson, Vivek Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04853
Pdf URL: https://arxiv.org/pdf/2602.04853
Copy Paste: [[2602.04853]] Decomposed Prompting Does Not Fix Knowledge Gaps, But Helps Models Say "I Don't Know"(https://arxiv.org/abs/2602.04853)
Keywords: language model, hallucination, prompt
Abstract: Large language models often struggle to recognize their knowledge limits in closed-book question answering, leading to confident hallucinations. While decomposed prompting is typically used to improve accuracy, we investigate its impact on reliability. We evaluate three task-equivalent prompting regimes: Direct, Assistive, and Incremental, across different model scales and multi-hop QA benchmarks. We find that although accuracy gains from decomposition diminish in frontier models, disagreements between prompting regimes remain highly indicative of potential errors. Because factual knowledge is stable while hallucinations are stochastic, cross-regime agreement provides a precise signal of internal uncertainty. We leverage this signal to implement a training-free abstention policy that requires no retrieval or fine-tuning. Our results show that disagreement-based abstention outperforms standard uncertainty baselines as an error detector, improving both F1 and AUROC across settings. This demonstrates that decomposition-based prompting can serve as a practical diagnostic probe for model reliability in closed-book QA.
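Summary: Large language models often struggle to recognize the limits of their knowledge in closed-book question answering, which leads to confident hallucinations. Although decomposed prompting is usually used to improve accuracy, we study its effect on reliability. We evaluate three task-equivalent prompting regimes (Direct, Assistive, and Incremental) across different model scales and multi-hop QA benchmarks. We find that although the accuracy gains from decomposition shrink in frontier models, disagreement between prompting regimes remains highly indicative of potential errors. Because factual knowledge is stable while hallucinations are stochastic, cross-regime agreement provides a precise signal of internal uncertainty. We use this signal to implement a training-free abstention policy that requires no retrieval or fine-tuning. Our results show that disagreement-based abstention outperforms standard uncertainty baselines as an error detector, improving both F1 and AUROC across settings. This shows that decomposition-based prompting can serve as a practical diagnostic probe for model reliability in closed-book QA.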
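The disagreement-based abstention idea in the abstract above can be illustrated with a minimal sketch. The abstract does not say how agreement between regimes is measured, so this sketch assumes exact string match after light normalization and requires all three regimes to agree; the function names `normalize` and `answer_or_abstain` and the `min_agree` parameter are hypothetical, not taken from the paper.

```python
from collections import Counter
from typing import Optional


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(answer.lower().split())


def answer_or_abstain(answers_by_regime: dict[str, str], min_agree: int = 3) -> Optional[str]:
    """Return the shared answer if at least `min_agree` regimes agree, else None.

    Assumption: agreement means exact match after normalization. A real
    system might compare answers with a softer matching rule instead.
    """
    normalized = [normalize(a) for a in answers_by_regime.values()]
    if not normalized:
        return None  # nothing to compare, so abstain
    top_answer, votes = Counter(normalized).most_common(1)[0]
    return top_answer if votes >= min_agree else None


# Hypothetical model outputs for one question under the three regimes.
agree = {"direct": "Paris", "assistive": "paris", "incremental": " Paris "}
disagree = {"direct": "Paris", "assistive": "Lyon", "incremental": "Paris"}

print(answer_or_abstain(agree))     # all three regimes match after normalization
print(answer_or_abstain(disagree))  # one regime differs, so the policy abstains
```

Lowering `min_agree` to 2 would turn this into a majority vote, trading fewer abstentions for a higher chance of returning a wrong answer.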

Title: CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation

Authors: Zhao Tong, Chunlin Gong, Yiping Zhang, Qiang Liu, Xingcheng Xu, Shu Wu, Haichao Shi, Xiao-Yu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.04856
Pdf URL: https://arxiv.org/pdf/2602.04856
Copy Paste: [[2602.04856]] CoT is Not the Chain of Truth: An Empirical Internal Analysis of Reasoning LLMs for Fake News Generation(https://arxiv.org/abs/2602.04856)
Keywords: language model, llm, chain-of-thought
Abstract: From generating headlines to fabricating news, Large Language Models (LLMs) are typically assessed by their final outputs, under the safety assumption that a refusal response signifies safe reasoning throughout the entire process. Challenging this assumption, our study reveals that during fake news generation, even when a model rejects a harmful request, its Chain-of-Thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures (stability, geometry, and energy) to quantify how specific attention heads respond to or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that the generation risk rises significantly when the thinking mode is activated, with the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and provides a new perspective for mitigating latent reasoning risks.
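Summary: Whether generating headlines or fabricating news, large language models (LLMs) are usually evaluated on their final outputs, under the safety assumption that a refusal implies safe reasoning throughout the whole process. Challenging this assumption, our study shows that during fake news generation, even when a model refuses a harmful request, its chain-of-thought (CoT) reasoning may still internally contain and propagate unsafe narratives. To analyze this phenomenon, we introduce a unified safety-analysis framework that systematically deconstructs CoT generation across model layers and evaluates the role of individual attention heads through Jacobian-based spectral metrics. Within this framework, we introduce three interpretable measures (stability, geometry, and energy) to quantify how specific attention heads respond to or embed deceptive reasoning patterns. Extensive experiments on multiple reasoning-oriented LLMs show that generation risk rises significantly when the thinking mode is activated, with the critical routing decisions concentrated in only a few contiguous mid-depth layers. By precisely identifying the attention heads responsible for this divergence, our work challenges the assumption that refusal implies safety and offers a new perspective for mitigating latent reasoning risks.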

Title: Reinforced Attention Learning

Authors: Bangzheng Li, Jianmo Ni, Chen Qu, Ian Miao, Liu Yang, Xingyu Fu, Muhao Chen, Derek Zhiyuan Cheng
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2602.04884
Pdf URL: https://arxiv.org/pdf/2602.04884
Copy Paste: [[2602.04884]] Reinforced Attention Learning(https://arxiv.org/abs/2602.04884)
Keywords: language model, llm
Abstract: Post-training with Reinforcement Learning (RL) has substantially improved reasoning in Large Language Models (LLMs) via test-time scaling. However, extending this paradigm to Multimodal LLMs (MLLMs) through verbose rationales yields limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improved grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, demonstrating that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.
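Summary: Post-training with reinforcement learning (RL) has substantially improved reasoning in large language models (LLMs) through test-time scaling. However, extending this paradigm to multimodal LLMs (MLLMs) through verbose rationales yields only limited gains for perception and can even degrade performance. We propose Reinforced Attention Learning (RAL), a policy-gradient framework that directly optimizes internal attention distributions rather than output token sequences. By shifting optimization from what to generate to where to attend, RAL promotes effective information allocation and improves grounding in complex multimodal inputs. Experiments across diverse image and video benchmarks show consistent gains over GRPO and other baselines. We further introduce On-Policy Attention Distillation, showing that transferring latent attention behaviors yields stronger cross-modal alignment than standard knowledge distillation. Our results position attention policies as a principled and general alternative for multimodal post-training.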