2026-02-10

Title: Does Visual Rendering Bypass Tokenization? Investigating Script-Tokenizer Misalignment in Pixel-Based Language Models

Authors: Lucky Susanto, Musa Izzanardi Wijanarko, Khumaisa Nur'aini, Farid Adilazuarda, Alham Fikri Aji, Derry Tanti Wijaya
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.06973
Pdf URL: https://arxiv.org/pdf/2602.06973
Copy Paste: [[2602.06973]] Does Visual Rendering Bypass Tokenization? Investigating Script-Tokenizer Misalignment in Pixel-Based Language Models(https://arxiv.org/abs/2602.06973)
Keywords: language model, gpt
Abstract: While pixel-based language modeling aims to bypass the sub-word tokenization bottleneck by rendering text as images, recent multimodal variants such as DualGPT reintroduce text tokenizers to improve autoregressive performance. We investigate a fundamental question: does visual rendering truly decouple a model from tokenization constraints? Focusing on four low-resource Indonesian local languages that have their own non-Latin scripts (i.e., Javanese, Balinese, Sundanese, and Lampungnese), we evaluate the impact of script-tokenizer alignment within the DualGPT architecture. Our results show that, despite visual rendering, reintegrating a text tokenizer into the architecture reintroduces the very issue that pixel-based language modeling aims to resolve: tokenizer misalignment. Although the Llama 2 tokenizer has lower OOV and fertility rates, it performs significantly worse than a custom tokenizer, which yields improvements of up to 30.15 chrF++. Our findings serve as a warning for future multimodal variants, as text tokenizers remain a significant barrier to equitable models.
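The two intrinsic tokenizer statistics named in this abstract can be illustrated with a minimal sketch. It assumes common conventions that the abstract itself does not spell out: fertility is taken as the average number of tokens per whitespace-delimited word, and the OOV rate as the share of tokens that fall back to an unknown symbol. The greedy longest-match tokenizer, the `UNK` symbol, and the toy vocabulary are hypothetical stand-ins, not the Llama 2 tokenizer or the paper's custom tokenizer.

```python
from typing import List, Set, Tuple

UNK = "<unk>"  # hypothetical unknown-token symbol


def tokenize(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match segmentation; unmatched characters become UNK."""
    tokens: List[str] = []
    i = 0
    while i < len(word):
        # Try the longest candidate substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry starts at position i.
            tokens.append(UNK)
            i += 1
    return tokens


def fertility_and_oov(text: str, vocab: Set[str]) -> Tuple[float, float]:
    """Return (tokens per word, fraction of tokens that are UNK)."""
    words = text.split()
    if not words:
        return 0.0, 0.0
    tokens = [t for w in words for t in tokenize(w, vocab)]
    fertility = len(tokens) / len(words)
    oov_rate = sum(t == UNK for t in tokens) / len(tokens)
    return fertility, oov_rate
```

Both numbers describe only how text is segmented. The abstract's point is that favourable values on these statistics did not translate into better chrF++ scores.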

Title: BiomechAgent: AI-Assisted Biomechanical Analysis Through Code-Generating Agents

Authors: R. James Cotton, Thomas Leonard
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.06975
Pdf URL: https://arxiv.org/pdf/2602.06975
Copy Paste: [[2602.06975]] BiomechAgent: AI-Assisted Biomechanical Analysis Through Code-Generating Agents(https://arxiv.org/abs/2602.06975)
Keywords: llm, prompt, agent
Abstract: Markerless motion capture is making quantitative movement analysis increasingly accessible, yet analyzing the resulting data remains a barrier for clinicians without programming expertise. We present BiomechAgent, a code-generating AI agent that enables biomechanical analysis through natural language, allowing users to query databases, generate visualizations, and even interpret data without writing code. To evaluate BiomechAgent's capabilities, we developed a systematic benchmark spanning data retrieval, visualization, activity classification, temporal segmentation, and clinical reasoning. BiomechAgent achieved robust accuracy on data retrieval and visualization tasks and demonstrated emerging clinical reasoning capabilities. We used our dataset to systematically evaluate several of our design decisions. Biomechanically informed, domain-specific instructions significantly improved performance over generic prompts, and integrating validated specialized tools for gait event detection substantially boosted accuracy on challenging spatiotemporal analysis where the base agent struggled. We also tested BiomechAgent using a local open-weight model instead of a frontier cloud-based LLM and found that performance was substantially diminished in most domains other than database retrieval. In short, BiomechAgent makes the data from accessible motion capture much more useful and accessible to end users.

Title: Bridging the Knowledge Void: Inference-time Acquisition of Unfamiliar Programming Languages for Coding Tasks

Authors: Chen Shen, Wei Cheng, Jingyue Yang, Huan Zhang, Yuhan Wu, Wei Hu
Subjects: cs.CL, cs.AI, cs.LG, cs.PL
Abstract URL: https://arxiv.org/abs/2602.06976
Pdf URL: https://arxiv.org/pdf/2602.06976
Copy Paste: [[2602.06976]] Bridging the Knowledge Void: Inference-time Acquisition of Unfamiliar Programming Languages for Coding Tasks(https://arxiv.org/abs/2602.06976)
Keywords: language model, llm, agent
Abstract: The proficiency of Large Language Models (LLMs) in coding tasks is often a reflection of their extensive pre-training corpora, and it typically collapses when the models are confronted with previously unfamiliar programming languages. Departing from data-intensive finetuning, we investigate the paradigm of Inference-time Language Acquisition (ILA), where an LLM masters an unfamiliar language through dynamic interaction with limited external resources. In this paper, we propose ILA-agent, a general ILA framework that equips LLMs with a set of behavioral primitives. By modeling essential human-like behaviors as a suite of tools, ILA-agent enables LLMs to incrementally explore, apply, and verify language knowledge through structured interactions with the official documentation and execution environment. To provide a rigorous evaluation in a low-resource setting, we construct Cangjie-bench, a multi-task benchmark based on the novel statically-typed language Cangjie. We instantiate ILA-agent for Cangjie and evaluate its performance across code generation, translation, and program repair tasks. Results using diverse LLMs demonstrate that ILA-agent significantly outperforms retrieval-augmented baselines. Further analysis of agent trajectories characterizes the emergent behavior patterns while highlighting persisting performance gaps.

Title: Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model

Authors: Jacqueline He, Jonathan Hayase, Wen-tau Yih, Sewoong Oh, Luke Zettlemoyer, Pang Wei Koh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07120
Pdf URL: https://arxiv.org/pdf/2602.07120
Copy Paste: [[2602.07120]] Anchored Decoding: Provably Reducing Copyright Risk for Any Language Model(https://arxiv.org/abs/2602.07120)
Keywords: language model
Abstract: Modern language models (LMs) tend to memorize portions of their training data and emit verbatim spans. When the underlying sources are sensitive or copyright-protected, such reproduction raises issues of consent and compensation for creators and compliance risks for developers. We propose Anchored Decoding, a plug-and-play inference-time method for suppressing verbatim copying: it enables decoding from any risky LM trained on mixed-license data by keeping generation in bounded proximity to a permissively trained safe LM. Anchored Decoding adaptively allocates a user-chosen information budget over the generation trajectory and enforces per-step constraints that yield a sequence-level guarantee, enabling a tunable risk-utility trade-off. To make Anchored Decoding practically useful, we introduce a new permissively trained safe model (TinyComma 1.8B), as well as Anchored$_{\mathrm{Byte}}$ Decoding, a byte-level variant of our method that enables cross-vocabulary fusion via the ByteSampler framework (Hayase et al., 2025). We evaluate our methods across six model pairs on long-form evaluations of copyright risk and utility. Anchored and Anchored$_{\mathrm{Byte}}$ Decoding define a new Pareto frontier, preserving near-original fluency and factuality while eliminating up to 75% of the measurable copying gap (averaged over six copying metrics) between the risky baseline and a safe reference, at a modest inference overhead.

Title: Your Language Model Secretly Contains Personality Subnetworks

Authors: Ruimeng Ye, Zihan Wang, Zinan Ling, Yang Xiao, Manling Li, Xiaolong Ma, Bo Hui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.07164
Pdf URL: https://arxiv.org/pdf/2602.07164
Copy Paste: [[2602.07164]] Your Language Model Secretly Contains Personality Subnetworks(https://arxiv.org/abs/2602.07164)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Humans shift between different personas depending on social context. Large Language Models (LLMs) demonstrate a similar flexibility in adopting different personas and behaviors. Existing approaches, however, typically adapt such behavior through external knowledge such as prompting, retrieval-augmented generation (RAG), or fine-tuning. We ask: do LLMs really need external context or parameters to adapt to different behaviors, or do they already have such knowledge embedded in their parameters? In this work, we show that LLMs already contain persona-specialized subnetworks in their parameter space. Using small calibration datasets, we identify distinct activation signatures associated with different personas. Guided by these statistics, we develop a masking strategy that isolates lightweight persona subnetworks. Building on these findings, we further ask: how can we discover opposing subnetworks in the model that lead to binary-opposing personas, such as introvert-extrovert? To further enhance separation in binary opposition scenarios, we introduce a contrastive pruning strategy that identifies parameters responsible for the statistical divergence between opposing personas. Our method is entirely training-free and relies solely on the language model's existing parameter space. Across diverse evaluation settings, the resulting subnetworks exhibit significantly stronger persona alignment than baselines that require external knowledge while being more efficient. Our findings suggest that diverse human-like behaviors are not merely induced in LLMs, but are already embedded in their parameter space, pointing toward a new perspective on controllable and interpretable personalization in large language models.

Title: Open TutorAI: An Open-source Platform for Personalized and Immersive Learning with Generative AI

Authors: Mohamed El Hajji, Tarek Ait Baha, Aicha Dakir, Hammou Fadili, Youssef Es-Saady
Subjects: cs.CL, cs.AI, cs.ET, cs.HC
Abstract URL: https://arxiv.org/abs/2602.07176
Pdf URL: https://arxiv.org/pdf/2602.07176
Copy Paste: [[2602.07176]] Open TutorAI: An Open-source Platform for Personalized and Immersive Learning with Generative AI(https://arxiv.org/abs/2602.07176)
Keywords: llm, chat
Abstract: Recent advances in artificial intelligence have created new possibilities for making education more scalable, adaptive, and learner-centered. However, existing educational chatbot systems often lack contextual adaptability, real-time responsiveness, and pedagogical agility, which can limit learner engagement and diminish instructional effectiveness. Thus, there is a growing need for open, integrative platforms that combine AI and immersive technologies to support personalized, meaningful learning experiences. This paper presents Open TutorAI, an open-source educational platform based on LLMs and generative technologies that provides dynamic, personalized tutoring. The system integrates natural language processing with customizable 3D avatars to enable multimodal learner interaction. Through a structured onboarding process, it captures each learner's goals and preferences in order to configure a learner-specific AI assistant. This assistant is accessible via both text-based and avatar-driven interfaces. The platform includes tools for organizing content, providing embedded feedback, and offering dedicated interfaces for learners, educators, and parents. This work focuses on learner-facing components, delivering a tool for adaptive support that responds to individual learner profiles without requiring technical expertise. Its assistant-generation pipeline and avatar integration enhance engagement and emotional presence, creating a more humanized, immersive learning environment. Embedded learning analytics support self-regulated learning by tracking engagement patterns and generating actionable feedback. The result is Open TutorAI, which unites modular architecture, generative AI, and learner analytics within an open-source framework. It contributes to the development of next-generation intelligent tutoring systems.

Title: Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs

Authors: Tianyu Zhao, Siqi Li, Yasser Shoukry, Salma Elmalaki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07181
Pdf URL: https://arxiv.org/pdf/2602.07181
Copy Paste: [[2602.07181]] Can LLMs Discern the Traits Influencing Your Preferences? Evaluating Personality-Driven Preference Alignment in LLMs(https://arxiv.org/abs/2602.07181)
Keywords: language model, llm
Abstract: User preferences are increasingly used to personalize Large Language Model (LLM) responses, yet how to reliably leverage preference signals for answer generation remains under-explored. In practice, preferences can be noisy, incomplete, or even misleading, which can degrade answer quality when applied naively. Motivated by the observation that stable personality traits shape everyday preferences, we study personality as a principled "latent" signal behind preference statements. Through extensive experiments, we find that conditioning on personality-aligned preferences substantially improves personalized question answering: selecting preferences consistent with a user's inferred personality increases answer-choice accuracy from 29.25% to 76%, compared to using randomly selected preferences. Based on these findings, we introduce PACIFIC (Preference Alignment Choices Inference for Five-factor Identity Characterization), a personality-labeled preference dataset containing 1200 preference statements spanning diverse domains (e.g., travel, movies, education), annotated with Big-Five (OCEAN) trait directions. Finally, we propose a framework that enables an LLM to automatically retrieve personality-aligned preferences and incorporate them during answer generation.

Title: Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities

Authors: Ju Lin, Jing Pan, Ruizhi Li, Ming Sun, Yuzong Liu, Alaa Hassan, Jing Zheng, Florian Metze
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2602.07211
Pdf URL: https://arxiv.org/pdf/2602.07211
Copy Paste: [[2602.07211]] Equipping LLM with Directional Multi-Talker Speech Understanding Capabilities(https://arxiv.org/abs/2602.07211)
Keywords: language model, llm, prompt
Abstract: Recent studies have demonstrated that prompting large language models (LLMs) with audio encodings enables effective speech understanding capabilities. However, most speech LLMs are trained on single-channel, single-talker data, which makes it challenging to directly apply them to multi-talker and multi-channel speech understanding tasks. In this work, we present a comprehensive investigation of how to enable directional multi-talker speech understanding capabilities for LLMs, specifically in the smart glasses use case. We propose two novel approaches to integrate directivity into LLMs: (1) a cascaded system that leverages a source separation front-end module, and (2) an end-to-end system that utilizes serialized output training. Both approaches utilize a multi-microphone array embedded in smart glasses to optimize directivity interpretation and processing in a streaming manner. Experimental results demonstrate the efficacy of our proposed methods in endowing LLMs with directional speech understanding capabilities, achieving strong performance in both speech recognition and speech translation tasks.

Title: Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice

Authors: Savan Doshi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.07319
Pdf URL: https://arxiv.org/pdf/2602.07319
Copy Paste: [[2602.07319]] Beyond Accuracy: Risk-Sensitive Evaluation of Hallucinated Medical Advice(https://arxiv.org/abs/2602.07319)
Keywords: language model, hallucination, prompt
Abstract: Large language models are increasingly being used in patient-facing medical question answering, where hallucinated outputs can vary widely in potential harm. However, existing hallucination standards and evaluation metrics focus primarily on factual correctness, treating all errors as equally severe. This obscures clinically relevant failure modes, particularly when models generate unsupported but actionable medical language. We propose a risk-sensitive evaluation framework that quantifies hallucinations through the presence of risk-bearing language, including treatment directives, contraindications, urgency cues, and mentions of high-risk medications. Rather than assessing clinical correctness, our approach evaluates the potential impact of hallucinated content if acted upon. We further combine risk scoring with a relevance measure to identify high-risk, low-grounding failures. We apply this framework to three instruction-tuned language models using controlled patient-facing prompts designed as safety stress tests. Our results show that models with similar surface-level behavior exhibit substantially different risk profiles and that standard evaluation metrics fail to capture these distinctions. These findings highlight the importance of incorporating risk sensitivity into hallucination evaluation and suggest that evaluation validity is critically dependent on task and prompt design.

Title: Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation

Authors: Geng Liu, Fei Zhu, Rong Feng, Changyi Ma, Shiqi Wang, Gaofeng Meng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.07338
Pdf URL: https://arxiv.org/pdf/2602.07338
Copy Paste: [[2602.07338]] Intent Mismatch Causes LLMs to Get Lost in Multi-Turn Conversation(https://arxiv.org/abs/2602.07338)
Keywords: language model, llm
Abstract: Multi-turn conversation has emerged as a predominant interaction paradigm for Large Language Models (LLMs). Users often employ follow-up questions to refine their intent, expecting LLMs to adapt dynamically. However, recent research reveals that LLMs suffer a substantial performance drop in multi-turn settings compared to single-turn interactions with fully specified instructions, a phenomenon termed "Lost in Conversation" (LiC). While this prior work attributes LiC to model unreliability, we argue that the root cause lies in an intent alignment gap rather than intrinsic capability deficits. In this paper, we first demonstrate that LiC is not a failure of model capability but rather a breakdown in interaction between users and LLMs. We theoretically show that scaling model size or improving training alone cannot resolve this gap, as it arises from structural ambiguity in conversational context rather than representational limitations. To address this, we propose to decouple intent understanding from task execution through a Mediator-Assistant architecture. By utilizing an experience-driven Mediator to explicate user inputs into explicit, well-structured instructions based on historical interaction patterns, our approach effectively bridges the gap between vague user intent and model interpretation. Experimental results demonstrate that this method significantly mitigates performance degradation in multi-turn conversations across diverse LLMs.

Title: ViHERMES: A Graph-Grounded Multihop Question Answering Benchmark and System for Vietnamese Healthcare Regulations

Authors: Long S. T. Nguyen, Quan M. Bui, Tin T. Ngo, Quynh T. N. Vo, Dung N. H. Le, Tho T. Quan
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.07361
Pdf URL: https://arxiv.org/pdf/2602.07361
Copy Paste: [[2602.07361]] ViHERMES: A Graph-Grounded Multihop Question Answering Benchmark and System for Vietnamese Healthcare Regulations(https://arxiv.org/abs/2602.07361)
Keywords: language model
Abstract: Question Answering (QA) over regulatory documents is inherently challenging due to the need for multihop reasoning across legally interdependent texts, a requirement that is particularly pronounced in the healthcare domain where regulations are hierarchically structured and frequently revised through amendments and cross-references. Despite recent progress in retrieval-augmented and graph-based QA methods, systematic evaluation in this setting remains limited, especially for low-resource languages such as Vietnamese, due to the lack of benchmark datasets that explicitly support multihop reasoning over healthcare regulations. In this work, we introduce the Vietnamese Healthcare Regulations-Multihop Reasoning Dataset (ViHERMES), a benchmark designed for multihop QA over Vietnamese healthcare regulatory documents. ViHERMES consists of high-quality question-answer pairs that require reasoning across multiple regulations and capture diverse dependency patterns, including amendment tracing, cross-document comparison, and procedural synthesis. To construct the dataset, we propose a controlled multihop QA generation pipeline based on semantic clustering and graph-inspired data mining, followed by large language model-based generation with structured evidence and reasoning annotations. We further present a graph-aware retrieval framework that models formal legal relations at the level of legal units and supports principled context expansion for legally valid and coherent answers. Experimental results demonstrate that ViHERMES provides a challenging benchmark for evaluating multihop regulatory QA systems and that the proposed graph-aware approach consistently outperforms strong retrieval-based baselines. The ViHERMES dataset and system implementation are publicly available at this https URL.

Title: TernaryLM: Memory-Efficient Language Modeling via Native 1-Bit Quantization with Adaptive Layer-wise Scaling

Authors: Nisharg Nargund, Priyesh Shukla
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.07374
Pdf URL: https://arxiv.org/pdf/2602.07374
Copy Paste: [[2602.07374]] TernaryLM: Memory-Efficient Language Modeling via Native 1-Bit Quantization with Adaptive Layer-wise Scaling(https://arxiv.org/abs/2602.07374)
Keywords: language model, llm
Abstract: Large language models (LLMs) achieve remarkable performance but demand substantial computational resources, limiting deployment on edge devices and resource-constrained environments. We present TernaryLM, a 132M parameter transformer architecture that employs native 1-bit ternary quantization {-1, 0, +1} during training, achieving significant memory reduction without sacrificing language modeling capability. Unlike post-training quantization approaches that quantize pre-trained full-precision models, TernaryLM learns quantization-aware representations from scratch using straight-through estimators and adaptive per-layer scaling factors. Our experiments demonstrate: (1) validation perplexity of 58.42 on TinyStories; (2) downstream transfer with 82.47 percent F1 on MRPC paraphrase detection; (3) 2.4x memory reduction (498MB vs 1197MB) with comparable inference latency; and (4) stable training dynamics across diverse corpora. We provide layer-wise quantization analysis showing that middle transformer layers exhibit highest compatibility with extreme quantization, informing future non-uniform precision strategies. Our results suggest that native 1-bit training is a promising direction for efficient neural language models. Code is available at this https URL.
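A minimal sketch can illustrate the kind of ternary quantization with a per-layer scale that this abstract describes. The abstract does not give the exact scaling rule, so the absolute-mean scale used here is an assumption borrowed from common ternary-weight schemes, and the function names are hypothetical. The straight-through estimator is shown only in its simplest form, where the backward pass treats rounding as the identity.

```python
from typing import List, Tuple


def ternarize(weights: List[float], eps: float = 1e-8) -> Tuple[List[int], float]:
    """Map one layer's weights to {-1, 0, +1} plus a single scale factor.

    Assumption: the scale is the mean absolute weight of the layer.
    """
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Divide by the scale, round to the nearest integer, clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate full-precision weights for the forward pass."""
    return [q * scale for q in quantized]


def ste_backward(grad_wrt_quantized: List[float]) -> List[float]:
    """Simplest straight-through estimator: pass the gradient through
    the non-differentiable round/clip step unchanged."""
    return list(grad_wrt_quantized)
```

Storing each weight as one of three values plus one floating-point scale per layer is what makes the memory saving possible. The reported 498MB versus 1197MB works out to roughly a 2.4x reduction.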

Title: Efficient Post-Training Pruning of Large Language Models with Statistical Correction

Authors: Peiqi Yu, Jinhao Wang, Xinyi Sui, Nam Ling, Wei Wang, Wei Jiang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.07375
Pdf URL: https://arxiv.org/pdf/2602.07375
Copy Paste: [[2602.07375]] Efficient Post-Training Pruning of Large Language Models with Statistical Correction(https://arxiv.org/abs/2602.07375)
Keywords: language model, llm
Abstract: Post-training pruning is an effective approach for reducing the size and inference cost of large language models (LLMs), but existing methods often face a trade-off between pruning quality and computational efficiency. Heuristic pruning methods are efficient but sensitive to activation outliers, while reconstruction-based approaches improve fidelity at the cost of heavy computation. In this work, we propose a lightweight post-training pruning framework based on first-order statistical properties of model weights and activations. During pruning, channel-wise statistics are used to calibrate magnitude-based importance scores, reducing bias from activation-dominated channels. After pruning, we apply an analytic energy compensation to correct distributional distortions caused by weight removal. Both steps operate without retraining, gradients, or second-order information. Experiments across multiple LLM families, sparsity patterns, and evaluation tasks show that the proposed approach improves pruning performance while maintaining computational cost comparable to heuristic methods. The results suggest that simple statistical corrections can be effective for post-training pruning of LLMs.

Title: Do Large Language Models Reflect Demographic Pluralism in Safety?

Authors: Usman Naseem, Gautam Siddharth Kashyap, Sushant Kumar Ray, Rafiq Ali, Ebad Shabbir, Abdullah Mohammad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07376
Pdf URL: https://arxiv.org/pdf/2602.07376
Copy Paste: [[2602.07376]] Do Large Language Models Reflect Demographic Pluralism in Safety?(https://arxiv.org/abs/2602.07376)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Model (LLM) safety is inherently pluralistic, reflecting variations in moral norms, cultural expectations, and demographic contexts. Yet, existing alignment datasets such as ANTHROPIC-HH and DICES rely on demographically narrow annotator pools, overlooking variation in safety perception across communities. Demo-SafetyBench addresses this gap by modeling demographic pluralism directly at the prompt level, decoupling value framing from responses. In Stage I, prompts from DICES are reclassified into 14 safety domains (adapted from BEAVERTAILS) using Mistral 7B-Instruct-v0.3, retaining demographic metadata and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based deduplication, yielding 43,050 samples. In Stage II, pluralistic sensitivity is evaluated using LLMs-as-Raters (Gemma-7B, GPT-4o, and LLaMA-2-7B) under zero-shot inference. Balanced thresholds (delta = 0.5, tau = 10) achieve high reliability (ICC = 0.87) and low demographic sensitivity (DS = 0.12), confirming that pluralistic safety evaluation can be both scalable and demographically robust.
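SimHash-based deduplication, mentioned in Stage I, can be sketched generically as follows. This is a textbook-style illustration, not the Demo-SafetyBench pipeline: the whitespace tokenization, the 64-bit fingerprint built from SHA-256, and the `max_distance` threshold of 3 are all assumptions chosen for the example.

```python
import hashlib
from typing import List

BITS = 64  # fingerprint width (assumed for this sketch)


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint over lowercase whitespace tokens."""
    counts = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    # Bit i of the fingerprint is set when the votes for it are positive.
    return sum(1 << i for i in range(BITS) if counts[i] > 0)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(prompts: List[str], max_distance: int = 3) -> List[str]:
    """Keep a prompt only if its fingerprint is farther than max_distance
    (hypothetical threshold) from every prompt kept so far."""
    kept: List[str] = []
    fingerprints: List[int] = []
    for prompt in prompts:
        fp = simhash(prompt)
        if all(hamming(fp, other) > max_distance for other in fingerprints):
            kept.append(prompt)
            fingerprints.append(fp)
    return kept
```

The pairwise comparison is quadratic in the number of kept prompts. A corpus of tens of thousands of samples would typically bucket fingerprints by bit bands first, which this sketch leaves out for brevity.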
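The abstract mentions SimHash-based deduplication of the expanded prompts but gives no implementation details. The following is a minimal sketch of how such a filter could work, not the authors' pipeline. The word-level tokenization, the 64-bit fingerprint, the SHA-256 token hash, and the `max_distance` threshold are all assumptions chosen for illustration.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint built from lowercase word tokens."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def deduplicate(prompts, max_distance=3):
    """Keep a prompt only if no previously kept prompt is within max_distance bits.

    max_distance is a hypothetical threshold, not a value from the paper.
    """
    kept, fingerprints = [], []
    for prompt in prompts:
        fp = simhash(prompt)
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(prompt)
            fingerprints.append(fp)
    return kept
```

Prompts that differ only in casing or punctuation map to the same fingerprint and are dropped, while unrelated prompts typically differ in many bits and survive. The pairwise scan is quadratic, so a large corpus would need an index over fingerprint bands instead.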

Title: When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified

Authors: Gautam Siddharth Kashyap, Mark Dras, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07381
Pdf URL: https://arxiv.org/pdf/2602.07381
Copy Paste: [[2602.07381]] When the Model Said 'No Comment', We Knew Helpfulness Was Dead, Honesty Was Alive, and Safety Was Terrified(https://arxiv.org/abs/2602.07381)
Keywords: language model, llm, prompt
Abstract: Aligning Large Language Models (LLMs) with human values, namely being helpful, harmless, and honest (HHH), is important for safe deployment. Existing works use Supervised Fine-Tuning (SFT) and Mixture-of-Experts (MoE) to align LLMs. However, these works face challenges in multi-objective settings: SFT leads to interference between conflicting objectives, while MoEs suffer from miscalibrated routing. We term this failure mode Axis Collapse, marked by (1) disjoint feature spaces causing catastrophic forgetting, and (2) unreliable inference from misrouted experts. To resolve this, we propose AlignX, a two-stage framework. Stage 1 uses prompt-injected fine-tuning to extract axis-specific task features, mitigating catastrophic forgetting. Stage 2 deploys a MoCaE module that calibrates expert routing using fractal and natural geometry, improving inference reliability. AlignX achieves significant gains on Alpaca (Helpfulness), BeaverTails (Harmlessness), and TruthfulQA (Honesty), with +171.5% win rate, +110.1% in truthfulness-informativeness, and 4.3% fewer safety violations. It also reduces latency and memory usage by over 35% compared to prior MoEs. Results across four LLMs validate its generalizability.

Title: Advantages of Domain Knowledge Injection for Legal Document Summarization: A Case Study on Summarizing Indian Court Judgments in English and Hindi

Authors: Debtanu Datta, Rajdeep Mukherjee, Adrijit Goswami, Saptarshi Ghosh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.07382
Pdf URL: https://arxiv.org/pdf/2602.07382
Copy Paste: [[2602.07382]] Advantages of Domain Knowledge Injection for Legal Document Summarization: A Case Study on Summarizing Indian Court Judgments in English and Hindi(https://arxiv.org/abs/2602.07382)
Keywords: language model
Abstract: Summarizing Indian legal court judgments is a complex task not only due to the intricate language and unstructured nature of the legal texts, but also since a large section of the Indian population does not understand the complex English in which legal text is written, thus requiring summaries in Indian languages. In this study, we aim to improve the summarization of Indian legal text to generate summaries in both English and Hindi (the most widely spoken Indian language), by injecting domain knowledge into diverse summarization models. We propose a framework to enhance extractive neural summarization models by incorporating domain-specific pre-trained encoders tailored for legal texts. Further, we explore the injection of legal domain knowledge into generative models (including Large Language Models) through continual pre-training on large legal corpora in English and Hindi. Our proposed approaches achieve statistically significant improvements in both English-to-English and English-to-Hindi Indian legal document summarization, as measured by standard evaluation metrics, factual consistency metrics, and legal domain-specific metrics. Furthermore, these improvements are validated through domain experts, demonstrating the effectiveness of our approaches.

Title: DLLM Agent: See Farther, Run Faster

Authors: Huiling Zhen, Weizhe Lin, Renxi Liu, Kai Han, Yiming Li, Yuchuan Tian, Hanting Chen, Xiaoguang Li, Xiaosong Li, Chen Chen, Xianzhi Yu, Mingxuan Yuan, Youliang Yan, Peifeng Qin, Jun Wang, Yu Wang, Dacheng Tao, Yunhe Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07451
Pdf URL: https://arxiv.org/pdf/2602.07451
Copy Paste: [[2602.07451]] DLLM Agent: See Farther, Run Faster(https://arxiv.org/abs/2602.07451)
Keywords: language model, llm, agent
Abstract: Diffusion large language models (DLLMs) have emerged as an alternative to autoregressive (AR) decoding with appealing efficiency and modeling properties, yet their implications for agentic multi-step decision making remain underexplored. We ask a concrete question: when the generation paradigm is changed but the agent framework and supervision are held fixed, do diffusion backbones induce systematically different planning and tool-use behaviors, and do these differences translate into end-to-end efficiency gains? We study this in a controlled setting by instantiating DLLM and AR backbones within the same agent workflow (DeepDiver) and performing matched agent-oriented fine-tuning on the same trajectory data, yielding diffusion-backed DLLM Agents and directly comparable AR agents. Across benchmarks and case studies, we find that, at comparable accuracy, DLLM Agents are on average over 30% faster end to end than AR agents, with some cases exceeding 8x speedup. Conditioned on correct task completion, DLLM Agents also require fewer interaction rounds and tool invocations, consistent with higher planner hit rates that converge earlier to a correct action path with less backtracking. We further identify two practical considerations for deploying diffusion backbones in tool-using agents. First, naive DLLM policies are more prone to structured tool-call failures, necessitating stronger tool-call-specific training to emit valid schemas and arguments. Second, for multi-turn inputs interleaving context and action spans, diffusion-style span corruption requires aligned attention masking to avoid spurious context-action information flow; without such alignment, performance degrades. Finally, we analyze attention dynamics across workflow stages and observe paradigm-specific coordination patterns, suggesting stronger global planning signals in diffusion-backed agents.

Title: SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning

Authors: Yijie Chen, Yijin Liu, Fandong Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07464
Pdf URL: https://arxiv.org/pdf/2602.07464
Copy Paste: [[2602.07464]] SED-SFT: Selectively Encouraging Diversity in Supervised Fine-Tuning(https://arxiv.org/abs/2602.07464)
Keywords: language model, llm
Abstract: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL) has emerged as the standard post-training paradigm for large language models (LLMs). However, the conventional SFT process, driven by Cross-Entropy (CE) loss, often induces mode collapse, where models over-concentrate on specific response patterns. This lack of distributional diversity severely restricts the exploration efficiency required for subsequent RL. While recent studies have attempted to improve SFT by replacing the CE loss, aiming to preserve diversity or refine the update policy, they fail to adequately balance diversity and accuracy, thereby yielding suboptimal performance after RL. To address the mode collapse problem, we propose SED-SFT, which adaptively encourages diversity based on the token exploration space. This framework introduces a selective entropy regularization term with a selective masking mechanism into the optimization objective. Extensive experiments across eight mathematical benchmarks demonstrate that SED-SFT significantly enhances generation diversity with a negligible computational overhead increase compared with CE loss, yielding average improvements of 2.06 and 1.20 points in subsequent RL performance over standard CE-based baselines on Llama-3.2-3B-Instruct and Qwen2.5-Math-7B-Instruct, respectively. The code is publicly available at this https URL
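The abstract describes adding a selective entropy regularization term with a masking mechanism to the cross-entropy objective, without giving the formula. A minimal sketch of that general shape follows. The masking rule here (apply the entropy bonus only where at least two tokens exceed `min_prob`) and the weight `lam` are hypothetical stand-ins, not the paper's definitions.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def selective_entropy_loss(dists, targets, lam=0.1, min_prob=0.1):
    """Mean cross-entropy minus a masked entropy bonus.

    dists:   one probability list per token position
    targets: gold token index per position
    lam, min_prob: hypothetical hyperparameters for this sketch
    """
    ce_total, bonus_total = 0.0, 0.0
    for probs, target in zip(dists, targets):
        ce_total += -math.log(probs[target])
        # Assumed mask: encourage diversity only where several tokens are plausible.
        plausible = sum(1 for p in probs if p >= min_prob)
        if plausible >= 2:
            bonus_total += entropy(probs)
    n = len(targets)
    return ce_total / n - lam * bonus_total / n
```

With `lam=0` this reduces to plain cross-entropy. Positions where the model is already near-deterministic receive no entropy bonus, which is the selective part of the idea.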

Title: From Native Memes to Global Moderation: Cross-Cultural Evaluation of Vision-Language Models for Hateful Meme Detection

Authors: Mo Wang, Kaixuan Ren, Pratik Jalan, Ahmed Ashraf, Tuong Vy Vu, Rahul Seetharaman, Shah Nawaz, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07497
Pdf URL: https://arxiv.org/pdf/2602.07497
Copy Paste: [[2602.07497]] From Native Memes to Global Moderation: Cross-Cultural Evaluation of Vision-Language Models for Hateful Meme Detection(https://arxiv.org/abs/2602.07497)
Keywords: language model, prompt
Abstract: Cultural context profoundly shapes how people interpret online content, yet vision-language models (VLMs) remain predominantly trained through Western or English-centric lenses. This limits their fairness and cross-cultural robustness in tasks like hateful meme detection. We introduce a systematic evaluation framework designed to diagnose and quantify the cross-cultural robustness of state-of-the-art VLMs across multilingual meme datasets, analyzing three axes: (i) learning strategy (zero-shot vs. one-shot), (ii) prompting language (native vs. English), and (iii) translation effects on meaning and detection. Results show that the common "translate-then-detect" approach deteriorates performance, while culturally aligned interventions (native-language prompting and one-shot learning) significantly enhance detection. Our findings reveal systematic convergence toward Western safety norms and provide actionable strategies to mitigate such bias, guiding the design of globally robust multimodal moderation systems.

Title: Let's Simplify Step by Step: Guiding LLM Towards Multilingual Unsupervised Proficiency-Controlled Sentence Simplification

Authors: Jingshen Zhang, Xin Ying Qiu, Lifang Lu, Zhuhua Huang, Yutao Hu, Yuechang Wu, JunYu Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07499
Pdf URL: https://arxiv.org/pdf/2602.07499
Copy Paste: [[2602.07499]] Let's Simplify Step by Step: Guiding LLM Towards Multilingual Unsupervised Proficiency-Controlled Sentence Simplification(https://arxiv.org/abs/2602.07499)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models demonstrate limited capability in proficiency-controlled sentence simplification, particularly when simplifying across large readability levels. We propose a framework that decomposes complex simplifications into manageable steps through dynamic path planning, semantic-aware exemplar selection, and chain-of-thought generation with conversation history for coherent reasoning. Evaluation on five languages across two benchmarks shows our approach improves simplification effectiveness while reducing computational steps by 22-42%. Human evaluation confirms the fundamental trade-off between simplification effectiveness and meaning preservation. Notably, even human annotators struggle to agree on semantic preservation judgments, highlighting the inherent complexity of this task. Our work shows that while step-by-step simplification improves control, preserving semantic fidelity during extensive simplification remains an open challenge.

Title: Improving Variable-Length Generation in Diffusion Language Models via Length Regularization

Authors: Zicong Cheng, Ruixuan Jia, Jia Li, Guo-Wei Yang, Meng-Hao Guo, Shi-Min Hu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.07546
Pdf URL: https://arxiv.org/pdf/2602.07546
Copy Paste: [[2602.07546]] Improving Variable-Length Generation in Diffusion Language Models via Length Regularization(https://arxiv.org/abs/2602.07546)
Keywords: language model, llm
Abstract: Diffusion Large Language Models (DLLMs) are inherently ill-suited for variable-length generation, as their inference is defined on a fixed-length canvas and implicitly assumes a known target length. When the length is unknown, as in realistic completion and infilling, naively comparing confidence across mask lengths becomes systematically biased, leading to under-generation or redundant continuations. In this paper, we show that this failure arises from an intrinsic length-induced bias in generation confidence estimates, leaving existing DLLMs without a robust way to determine generation length and making variable-length inference unreliable. To address this issue, we propose LR-DLLM, a length-regularized inference framework for DLLMs that treats generation length as an explicit variable and achieves reliable length determination at inference time. It decouples semantic compatibility from length-induced uncertainty through an explicit length regularization that corrects biased confidence estimates. Based on this, LR-DLLM enables dynamic expansion or contraction of the generation span without modifying the underlying DLLM or its training procedure. Experiments show that LR-DLLM achieves 51.3% Pass@1 on HumanEval-Infilling under fully unknown lengths (+13.4% vs. DreamOn) and 51.5% average Pass@1 on four-language McEval (+14.3% vs. DreamOn).

Title: Learning to Self-Verify Makes Language Models Better Reasoners

Authors: Yuxin Chen, Yu Wang, Yi Zhang, Ziang Ye, Zhengzhou Cai, Yaorui Shi, Qi Gu, Hui Su, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.07594
Pdf URL: https://arxiv.org/pdf/2602.07594
Copy Paste: [[2602.07594]] Learning to Self-Verify Makes Language Models Better Reasoners(https://arxiv.org/abs/2602.07594)
Keywords: language model, llm
Abstract: Recent large language models (LLMs) achieve strong performance in generating promising reasoning paths for complex tasks. However, despite powerful generation ability, LLMs remain weak at verifying their own answers, revealing a persistent capability asymmetry between generation and self-verification. In this work, we conduct an in-depth investigation of this asymmetry throughout training evolution and show that, even on the same task, improving generation does not lead to corresponding improvements in self-verification. Interestingly, we find that the reverse direction of this asymmetry behaves differently: learning to self-verify can effectively improve generation performance, achieving accuracy comparable to standard generation training while yielding more efficient and effective reasoning traces. Building on this observation, we further explore integrating self-verification into generation training by formulating a multi-task reinforcement learning framework, where generation and self-verification are optimized as two independent but complementary objectives. Extensive experiments across benchmarks and models demonstrate performance gains over generation-only training in both generation and verification capabilities.

Title: SciClaimEval: Cross-modal Claim Verification in Scientific Papers

Authors: Xanh Ho, Yun-Ang Wu, Sunisth Kumar, Tian Cheng Xia, Florian Boudin, Andre Greiner-Petter, Akiko Aizawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07621
Pdf URL: https://arxiv.org/pdf/2602.07621
Copy Paste: [[2602.07621]] SciClaimEval: Cross-modal Claim Verification in Scientific Papers(https://arxiv.org/abs/2602.07621)
Keywords: language model, llm
Abstract: We present SciClaimEval, a new scientific dataset for the claim verification task. Unlike existing resources, SciClaimEval features authentic claims, including refuted ones, directly extracted from published papers. To create refuted claims, we introduce a novel approach that modifies the supporting evidence (figures and tables), rather than altering the claims or relying on large language models (LLMs) to fabricate contradictions. The dataset provides cross-modal evidence with diverse representations: figures are available as images, while tables are provided in multiple formats, including images, LaTeX source, HTML, and JSON. SciClaimEval contains 1,664 annotated samples from 180 papers across three domains, machine learning, natural language processing, and medicine, validated through expert annotation. We benchmark 11 multimodal foundation models, both open-source and proprietary, across the dataset. Results show that figure-based verification remains particularly challenging for all models, as a substantial performance gap remains between the best system and human baseline.

Title: Letting Tutor Personas "Speak Up" for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization

Authors: Jaewook Lee, Alexander Scarlatos, Simon Woodhead, Andrew Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07639
Pdf URL: https://arxiv.org/pdf/2602.07639
Copy Paste: [[2602.07639]] Letting Tutor Personas "Speak Up" for LLMs: Learning Steering Vectors from Dialogue via Preference Optimization(https://arxiv.org/abs/2602.07639)
Keywords: language model, llm, prompt
Abstract: With the emergence of large language models (LLMs) as a powerful class of generative artificial intelligence (AI), their use in tutoring has become increasingly prominent. Prior works on LLM-based tutoring typically learn a single tutor policy and do not capture the diversity of tutoring styles. In real-world tutor-student interactions, pedagogical intent is realized through adaptive instructional strategies, with tutors varying the level of scaffolding, instructional directiveness, feedback, and affective support in response to learners' needs. These differences can all impact dialogue dynamics and student engagement. In this paper, we explore how tutor personas embedded in human tutor-student dialogues can be used to guide LLM behavior without relying on explicitly prompted instructions. We modify Bidirectional Preference Optimization (BiPO) to learn a steering vector, an activation-space direction that steers model responses towards certain tutor personas. We find that this steering vector captures tutor-specific variation across dialogue contexts, improving semantic alignment with ground-truth tutor utterances and increasing preference-based evaluations, while largely preserving lexical similarity. Analysis of the learned directional coefficients further reveals interpretable structure across tutors, corresponding to consistent differences in tutoring behavior. These results demonstrate that activation steering offers an effective and interpretable way for controlling tutor-specific variation in LLMs using signals derived directly from human dialogue data.

Title: Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation

Authors: Jiangnan Fang, Cheng-Tse Liu, Hanieh Deilamsalehy, Nesreen K. Ahmed, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A. Rossi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07673
Pdf URL: https://arxiv.org/pdf/2602.07673
Copy Paste: [[2602.07673]] Blind to the Human Touch: Overlap Bias in LLM-Based Summary Evaluation(https://arxiv.org/abs/2602.07673)
Keywords: language model, llm, prompt
Abstract: Large language model (LLM) judges have often been used alongside traditional, algorithm-based metrics for tasks like summarization because they better capture semantic information, are better at reasoning, and are more robust to paraphrasing. However, LLM judges show biases for length and order among others, and are vulnerable to various adversarial input prompts. While recent studies have looked into these biases, few have analyzed them at a more granular level in relation to a well-defined overlap metric. In this work we provide an LLM judge bias analysis as a function of overlap with human-written responses in the domain of summarization. We test 9 recent LLMs with parameter counts ranging from 1 billion to 12 billion, including variants of Gemma 3 and LLaMA 3. We find that LLM judges increasingly prefer summaries generated by other LLMs over those written by humans as the similarities (as measured by ROUGE and BLEU) between the judged summaries decrease, and this pattern extends to all but one model tested, and exists regardless of the models' own position biases. Additionally, we find that models struggle to judge even summaries with limited overlaps, suggesting that LLM-as-a-judge in the summary domain should rely on techniques beyond a simple comparison.
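The analysis above conditions judge behavior on lexical overlap, measured with ROUGE and BLEU. As a rough illustration of what such an overlap score captures, the sketch below computes a unigram-overlap F1 in the spirit of ROUGE-1. It uses whitespace tokenization and no stemming, so it is a simplified stand-in and not the official ROUGE implementation or the authors' exact metric.

```python
from collections import Counter


def unigram_overlap_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 over lowercase whitespace tokens (simplified)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Pairs of human-written and LLM-written summaries could then be grouped into overlap bins, with judge preferences compared per bin.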

Title: SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents

Authors: Chen Zhang, Kuicai Dong, Dexun Li, Wenjun Li, Qu Yang, Wei Han, Yong Liu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.07773
Pdf URL: https://arxiv.org/pdf/2602.07773
Copy Paste: [[2602.07773]] SRR-Judge: Step-Level Rating and Refinement for Enhancing Search-Integrated Reasoning in Search Agents(https://arxiv.org/abs/2602.07773)
Keywords: agent
Abstract: Recent deep search agents built on large reasoning models (LRMs) excel at complex question answering by iteratively planning, acting, and gathering evidence, a capability known as search-integrated reasoning. However, mainstream approaches often train this ability using only outcome-based supervision, neglecting the quality of intermediate thoughts and actions. We introduce SRR-Judge, a framework for reliable step-level assessment of reasoning and search actions. Integrated into a modified ReAct-style rate-and-refine workflow, SRR-Judge provides fine-grained guidance for search-integrated reasoning and enables efficient post-training annotation. Using SRR-annotated data, we apply an iterative rejection sampling fine-tuning procedure to enhance the deep search capability of the base agent. Empirically, SRR-Judge delivers more reliable step-level evaluations than much larger models such as DeepSeek-V3.1, with its ratings showing strong correlation with final answer correctness. Moreover, aligning the policy with SRR-Judge annotated trajectories leads to substantial performance gains, yielding over a 10 percent average absolute pass@1 improvement across challenging deep search benchmarks.
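The abstract pairs step-level ratings with rejection sampling fine-tuning but does not spell out the selection rule. One plausible reading, sketched below under stated assumptions, keeps a sampled trajectory only if its final answer is correct and every step clears a rating threshold. The `Trajectory` fields, the rating scale, and `min_step_rating` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Trajectory:
    step_ratings: List[float]  # hypothetical judge score per reasoning/search step
    answer_correct: bool       # outcome-level signal


def select_for_finetuning(trajectories, min_step_rating=0.5):
    """Keep trajectories that are correct and have no low-rated step."""
    return [
        t for t in trajectories
        if t.answer_correct
        and t.step_ratings
        and min(t.step_ratings) >= min_step_rating
    ]
```

Compared with outcome-only filtering, this rule also rejects trajectories that reach a correct answer through a poorly rated intermediate step.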

Title: Attn-GS: Attention-Guided Context Compression for Efficient Personalized LLMs

Authors: Shenglai Zeng, Tianqi Zheng, Chuan Tian, Dante Everaert, Yau-Shian Wang, Yupin Huang, Michael J. Morais, Rohit Patki, Jinjin Tian, Xinnan Dai, Kai Guo, Monica Xiao Cheng, Hui Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07778
Pdf URL: https://arxiv.org/pdf/2602.07778
Copy Paste: [[2602.07778]] Attn-GS: Attention-Guided Context Compression for Efficient Personalized LLMs(https://arxiv.org/abs/2602.07778)
Keywords: language model, llm, prompt
Abstract: Personalizing large language models (LLMs) to individual users requires incorporating extensive interaction histories and profiles, but input token constraints make this impractical due to high inference latency and API costs. Existing approaches rely on heuristic methods such as selecting recent interactions or prompting summarization models to compress user profiles. However, these methods treat context as a monolithic whole and fail to consider how LLMs internally process and prioritize different profile components. We investigate whether LLMs' attention patterns can effectively identify important personalization signals for intelligent context compression. Through preliminary studies on representative personalization tasks, we discover that (a) LLMs' attention patterns naturally reveal important signals, and (b) fine-tuning enhances LLMs' ability to distinguish between relevant and irrelevant information. Based on these insights, we propose Attn-GS, an attention-guided context compression framework that leverages attention feedback from a marking model to mark important personalization sentences, then guides a compression model to generate task-relevant, high-quality compressed user contexts. Extensive experiments demonstrate that Attn-GS significantly outperforms various baselines across different tasks, token limits, and settings, achieving performance close to using full context while reducing token usage by 50 times.
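The marking step described above uses attention feedback to pick out important profile sentences. A minimal sketch of that idea follows, assuming per-token attention weights have already been extracted from some model. The mean-pooling rule and the `top_k` cutoff are illustrative choices, not the paper's procedure.

```python
def mark_sentences(sentence_spans, token_attention, top_k=2):
    """Return indices of the top_k sentences by mean attention, in document order.

    sentence_spans:  list of (start, end) token index ranges, end exclusive
    token_attention: one attention weight per token (assumed precomputed)
    """
    scores = []
    for idx, (start, end) in enumerate(sentence_spans):
        weights = token_attention[start:end]
        score = sum(weights) / len(weights) if weights else 0.0
        scores.append((score, idx))
    top = sorted(scores, key=lambda item: item[0], reverse=True)[:top_k]
    return sorted(idx for _, idx in top)
```

The marked sentences would then be handed to a separate compression model, so the attention scores decide what to keep and the compressor decides how to phrase it.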

Title: Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models

Authors: Ningyu Xu, Qi Zhang, Xipeng Qiu, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.07794
Pdf URL: https://arxiv.org/pdf/2602.07794
Copy Paste: [[2602.07794]] Emergent Structured Representations Support Flexible In-Context Inference in Large Language Models(https://arxiv.org/abs/2602.07794)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit emergent behaviors suggestive of human-like reasoning. While recent work has identified structured, human-like conceptual representations within these models, it remains unclear whether they functionally rely on such representations for reasoning. Here we investigate the internal processing of LLMs during in-context concept inference. Our results reveal a conceptual subspace emerging in middle to late layers, whose representational structure persists across contexts. Using causal mediation analyses, we demonstrate that this subspace is not merely an epiphenomenon but is functionally central to model predictions, establishing its causal role in inference. We further identify a layer-wise progression where attention heads in early-to-middle layers integrate contextual cues to construct and refine the subspace, which is subsequently leveraged by later layers to generate predictions. Together, these findings provide evidence that LLMs dynamically construct and use structured, latent representations in context for inference, offering insights into the computational processes underlying flexible adaptation.

Title: Thinking Makes LLM Agents Introverted: How Mandatory Thinking Can Backfire in User-Engaged Agents

Authors: Jiatong Li, Changdae Oh, Hyeong Kyu Choi, Jindong Wang, Sharon Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07796
Pdf URL: https://arxiv.org/pdf/2602.07796
Copy Paste: [[2602.07796]] Thinking Makes LLM Agents Introverted: How Mandatory Thinking Can Backfire in User-Engaged Agents(https://arxiv.org/abs/2602.07796)
Keywords: language model, llm, prompt, agent
Abstract: Eliciting reasoning has emerged as a powerful technique for improving the performance of large language models (LLMs) on complex tasks by inducing thinking. However, its effectiveness in realistic user-engaged agent scenarios remains unclear. In this paper, we conduct a comprehensive study on the effect of explicit thinking in user-engaged LLM agents. Our experiments span seven models, three benchmarks, and two thinking instantiations, and we evaluate them through both a quantitative response taxonomy analysis and qualitative failure propagation case studies. Contrary to expectations, we find that mandatory thinking often backfires on agents in user-engaged settings, causing anomalous performance degradation across various LLMs. Our key finding reveals that thinking makes agents more "introverted" by shortening responses and reducing information disclosure to users, which weakens agent-user information exchange and leads to downstream task failures. Furthermore, we demonstrate that explicitly prompting for information disclosure reliably improves performance across diverse model families, suggesting that proactive transparency is a vital lever for agent optimization. Overall, our study suggests that information transparency awareness is a crucial yet underexplored perspective for the future design of reasoning agents in real-world scenarios. Our code is available at this https URL.
摘要：引发推理已成为一种强大的技术，可通过诱导思维来提高大型语言模型 (LLM) 在复杂任务上的性能。然而，这一技术在现实的用户参与代理场景中的有效性仍不清楚。在本文中，我们对用户参与的 LLM 代理中显性思维的影响进行了全面的研究。我们的实验涵盖七个模型、三个基准和两个思维实例，我们通过定量响应分类分析和定性故障传播案例研究来评估它们。与预期相反，我们发现，在用户参与的环境中，强制思维常常会对代理产生适得其反的效果，导致各种 LLM 的性能异常下降。我们的主要发现表明，思维通过缩短响应和减少向用户披露信息而使代理变得更加“内向”，从而削弱了代理与用户的信息交换并导致下游任务失败。此外，我们证明，明确提示信息披露可以可靠地提高不同模型系列的性能，这表明主动透明度是代理优化的重要杠杆。总的来说，我们的研究表明，信息透明度意识对于现实场景中推理代理的未来设计来说是一个至关重要但尚未得到充分探索的视角。我们的代码可以在这个 https URL 上找到。

Title: Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models

Authors: Xuan Ding, Pengyu Tong, Ranjie Duan, Yunjian Zhang, Rui Sun, Yao Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.07804
Pdf URL: https://arxiv.org/pdf/2602.07804
Copy Paste: [[2602.07804]] Pruning as a Cooperative Game: Surrogate-Assisted Layer Contribution Estimation for Large Language Models(https://arxiv.org/abs/2602.07804)
Keywords: language model, llm
Abstract: While large language models (LLMs) demonstrate impressive performance across various tasks, their deployment in real-world scenarios is still constrained by high computational demands. Layer-wise pruning, a commonly employed strategy to mitigate inference costs, can partially address this challenge. However, existing approaches generally depend on static heuristic rules and fail to account for the interdependencies among layers, thereby limiting the effectiveness of the pruning process. To this end, this paper proposes a game-theoretic framework that formulates layer pruning as a cooperative game in which each layer acts as a player and model performance serves as the utility. As computing exact Shapley values is computationally infeasible for LLMs, we propose using a lightweight surrogate network to estimate layer-wise marginal contributions. This network can predict LLM performance for arbitrary layer combinations at a low computational cost. Additionally, we employ stratified Monte Carlo mask sampling to further reduce the cost of Shapley value estimation. This approach captures inter-layer dependencies and dynamically identifies critical layers for pruning. Extensive experiments demonstrate the consistent superiority of our method in terms of perplexity and zero-shot accuracy, achieving more efficient and effective layer-wise pruning for large language models.
摘要：虽然大型语言模型 (LLM) 在各种任务中表现出令人印象深刻的性能，但它们在现实场景中的部署仍然受到高计算需求的限制。分层剪枝是一种常用的降低推理成本的策略，可以部分解决这一挑战。然而，现有方法通常依赖于静态启发式规则，无法考虑层之间的相互依赖性，从而限制了修剪过程的有效性。为此，本文提出了一种博弈论框架，将层剪枝制定为合作博弈，其中每一层充当参与者，模型性能充当效用。由于计算精确的 Shapley 值对于大型语言模型（LLM）来说在计算上是不可行的，因此我们建议使用轻量级代理网络来估计逐层边际贡献。该网络可以以较低的计算成本预测任意层组合的 LLM 性能。此外，我们采用分层蒙特卡罗掩模采样来进一步降低夏普利值估计的成本。这种方法捕获层间依赖关系并动态识别关键层以进行修剪。大量的实验证明了我们的方法在困惑度和零样本精度方面始终如一的优越性，为大型语言模型实现了更高效、更有效的分层剪枝。
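Note: the cooperative-game framing in the abstract above can be illustrated with a small sketch. It is not the authors' method: it uses plain permutation sampling rather than the stratified Monte Carlo mask sampling they describe, and `toy_utility` is a hypothetical stand-in for the surrogate network that predicts model quality for a given set of kept layers. All names and numbers are assumptions made for illustration.

```python
import random


def toy_utility(active_layers):
    """Hypothetical utility: 'model quality' when only `active_layers` are kept."""
    base = {0: 0.30, 1: 0.10, 2: 0.05, 3: 0.25}
    score = sum(base[i] for i in active_layers)
    # Layers 1 and 2 add an extra 0.20 only when both are present,
    # which models an inter-layer dependency.
    if 1 in active_layers and 2 in active_layers:
        score += 0.20
    return score


def shapley_permutation_estimate(utility, num_layers, num_permutations=2000, seed=0):
    """Estimate each layer's Shapley value by averaging marginal gains
    over randomly sampled orderings of the layers."""
    rng = random.Random(seed)
    totals = [0.0] * num_layers
    layers = list(range(num_layers))
    for _ in range(num_permutations):
        rng.shuffle(layers)
        active = set()
        previous = utility(frozenset(active))
        for layer in layers:
            active.add(layer)
            current = utility(frozenset(active))
            totals[layer] += current - previous  # marginal contribution
            previous = current
    return [total / num_permutations for total in totals]


estimates = shapley_permutation_estimate(toy_utility, num_layers=4)
# Under this toy utility, the layer with the lowest estimate would be
# the first pruning candidate.
pruning_candidate = min(range(4), key=lambda i: estimates[i])
```

In this toy game the interaction bonus is shared between layers 1 and 2, so their estimates exceed their standalone scores; a static per-layer heuristic would miss that dependency.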

Title: LLMs Know More About Numbers than They Can Say

Authors: Fengting Yuchi, Li Du, Jason Eisner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07812
Pdf URL: https://arxiv.org/pdf/2602.07812
Copy Paste: [[2602.07812]] LLMs Know More About Numbers than They Can Say(https://arxiv.org/abs/2602.07812)
Keywords: llm
Abstract: Although state-of-the-art LLMs can solve math problems, we find that they make errors on numerical comparisons with mixed notation: "Which is larger, $5.7 \times 10^2$ or $580$?" This raises a fundamental question: Do LLMs even know how big these numbers are? We probe the hidden states of several smaller open-source LLMs. A single linear projection of an appropriate hidden layer encodes the log-magnitudes of both kinds of numerals, allowing us to recover the numbers with relative error of about 2.3% (on restricted synthetic text) or 19.06% (on scientific papers). Furthermore, the hidden state after reading a pair of numerals encodes their ranking, with a linear classifier achieving over 90% accuracy. Yet surprisingly, when explicitly asked to rank the same pairs of numerals, these LLMs achieve only 50-70% accuracy, with worse performance for models whose probes are less effective. Finally, we show that incorporating the classifier probe's log-loss as an auxiliary objective during finetuning brings an additional 3.22% improvement in verbalized accuracy over base models, demonstrating that improving models' internal magnitude representations can enhance their numerical reasoning capabilities.
摘要：尽管最先进的 LLM 可以解决数学问题，但我们发现它们在使用混合记法进行数值比较时会犯错误：“$5.7 \times 10^2$ 和 $580$ 哪个更大？”这就提出了一个基本问题：LLM 究竟知道这些数字有多大吗？我们探测了几个较小的开源 LLM 的隐藏状态。适当隐藏层的单个线性投影对两种数字的对数幅度进行编码，使我们能够恢复相对误差约为 2.3%（在受限合成文本上）或 19.06%（在科学论文上）的数字。此外，读取一对数字后的隐藏状态对其排名进行编码，线性分类器的准确率达到 90% 以上。然而令人惊讶的是，当明确要求对相同的数字对进行排名时，这些 LLM 仅达到 50-70% 的准确度，对于探针效果较差的模型来说，其性能更差。最后，我们表明，在微调过程中将分类器探针的对数损失作为辅助目标，与基本模型相比，语言化准确率额外提高了 3.22%，这表明改进模型的内部幅度表示可以增强其数值推理能力。

Title: TodoEvolve: Learning to Architect Agent Planning Systems

Authors: Jiaxi Liu, Yanzuo Jiang, Guibin Zhang, Zihan Zhang, Heng Chang, Zhenfei Yin, Qibing Ren, Junchi Yan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.07839
Pdf URL: https://arxiv.org/pdf/2602.07839
Copy Paste: [[2602.07839]] TodoEvolve: Learning to Architect Agent Planning Systems(https://arxiv.org/abs/2602.07839)
Keywords: agent
Abstract: Planning has become a central capability for contemporary agent systems in navigating complex, long-horizon tasks, yet existing approaches predominantly rely on fixed, hand-crafted planning structures that lack the flexibility to adapt to the structural diversity of open-ended problems. To address this limitation, we introduce TodoEvolve, a meta-planning paradigm that autonomously synthesizes and dynamically revises task-specific planning architectures. Specifically, we first construct PlanFactory, a modular design space that standardizes diverse planning paradigms within a unified codebase encompassing topology, initialization, adaptation, and navigation, thereby providing a common interface for heterogeneous planning patterns. Leveraging PlanFactory, we collect high-quality planning trajectories and train Todo-14B via Impedance-Guided Preference Optimization (IGPO), a multi-objective reinforcement learning objective that encourages the generation of planning systems that are performant, stable, and token-efficient across arbitrary tasks and agent backbones. Empirical evaluations on five agentic benchmarks demonstrate that TodoEvolve consistently surpasses carefully engineered planning modules while maintaining economical API costs and runtime overhead.
摘要：规划已成为当代代理系统处理复杂、长期任务的核心能力，但现有方法主要依赖于固定的、手工制定的规划结构，缺乏适应开放式问题的结构多样性的灵活性。为了解决这个限制，我们引入了 TodoEvolve，一种元规划范例，可以自主综合和动态修改特定于任务的规划架构。具体来说，我们首先构建 PlanFactory，这是一个模块化设计空间，它在一个统一的代码库中标准化了不同的规划范例，包括拓扑、初始化、适应和导航，从而为异构规划模式提供了一个通用接口。利用 PlanFactory，我们收集高质量的规划轨迹，并通过阻抗引导偏好优化 (IGPO) 训练 Todo-14B，IGPO 是一种多目标强化学习目标，鼓励生成跨任意任务和代理骨干的高性能、稳定且令牌高效的规划系统。对五个代理基准的实证评估表明，TodoEvolve 始终超越精心设计的规划模块，同时保持经济的 API 成本和运行时开销。

Title: Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers

Authors: Yuhan Wang, Shiyu Ni, Zhikai Ding, Zihang Zhan, Yuanzi Li, Keping Bi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07842
Pdf URL: https://arxiv.org/pdf/2602.07842
Copy Paste: [[2602.07842]] Evaluating and Calibrating LLM Confidence on Questions with Multiple Correct Answers(https://arxiv.org/abs/2602.07842)
Keywords: language model, llm
Abstract: Confidence calibration is essential for making large language models (LLMs) reliable, yet existing training-free methods have been primarily studied under single-answer question answering. In this paper, we show that these methods break down in the presence of multiple valid answers, where disagreement among equally correct responses leads to systematic underestimation of confidence. To enable a systematic study of this phenomenon, we introduce MACE, a benchmark of 12,000 factual questions spanning six domains with varying numbers of correct answers. Experiments across 15 representative calibration methods and four LLM families (7B-72B) reveal that while accuracy increases with answer cardinality, estimated confidence consistently decreases, causing severe miscalibration for questions with mixed answer counts. To address this issue, we propose Semantic Confidence Aggregation (SCA), which aggregates confidence over multiple high-probability sampled responses. SCA achieves state-of-the-art calibration performance under mixed-answer settings while preserving strong calibration on single-answer questions.
摘要：置信度校准对于使大型语言模型（LLM）可靠至关重要，但现有的免训练方法主要是在单答案问答下进行研究的。在本文中，我们表明这些方法在存在多个有效答案的情况下会崩溃，同样正确的答案之间的分歧会导致系统性地低估置信度。为了能够系统地研究这一现象，我们引入了 MACE，这是一个涵盖六个领域的 12,000 个事实问题的基准，且正确答案的数量各不相同。跨 15 种代表性校准方法和 4 个 LLM 系列 (7B-72B) 的实验表明，虽然准确性随着答案基数的增加而增加，但估计的置信度持续下降，从而导致混合答案计数问题的严重错误校准。为了解决这个问题，我们提出了语义置信度聚合（SCA），它聚合了多个高概率采样响应的置信度。 SCA 在混合答案设置下实现了最先进的校准性能，同时保留了对单一答案问题的强大校准。

Title: SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization

Authors: Taolin Zhang, Hang Guo, Wang Lu, Tao Dai, Shu-Tao Xia, Jindong Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.07909
Pdf URL: https://arxiv.org/pdf/2602.07909
Copy Paste: [[2602.07909]] SparseEval: Efficient Evaluation of Large Language Models by Sparse Optimization(https://arxiv.org/abs/2602.07909)
Keywords: language model, llm
Abstract: As large language models (LLMs) continue to scale up, their performance on various downstream tasks has significantly improved. However, evaluating their capabilities has become increasingly expensive, as performing inference on a large number of benchmark samples incurs high computational costs. In this paper, we revisit the model-item performance matrix and show that it exhibits sparsity, that representative items can be selected as anchors, and that the task of efficient benchmarking can be formulated as a sparse optimization problem. Based on these insights, we propose SparseEval, a method that, for the first time, adopts gradient descent to optimize anchor weights and employs an iterative refinement strategy for anchor selection. We utilize the representation capacity of MLP to handle sparse optimization and propose the Anchor Importance Score and Candidate Importance Score to evaluate the value of each item for task-aware refinement. Extensive experiments demonstrate the low estimation error and high Kendall's $\tau$ of our method across a variety of benchmarks, showcasing its superior robustness and practicality in real-world scenarios. Code is available at this https URL.
摘要：随着大型语言模型（LLM）不断扩大规模，它们在各种下游任务上的性能显着提高。然而，评估它们的能力变得越来越昂贵，因为对大量基准样本进行推理会产生高昂的计算成本。在本文中，我们重新审视模型-项目性能矩阵，并表明它表现出稀疏性，可以选择代表性项目作为锚点，并且可以将高效基准测试的任务表示为稀疏优化问题。基于这些见解，我们提出了 SparseEval 方法，该方法首次采用梯度下降来优化锚点权重，并采用迭代细化策略进行锚点选择。我们利用 MLP 的表示能力来处理稀疏优化，并提出锚点重要性得分和候选重要性得分来评估每个项目的价值，以进行任务感知细化。大量的实验证明了我们的方法在各种基准测试中的低估计误差和高 Kendall's $\tau$，展示了其在现实场景中卓越的鲁棒性和实用性。代码可在 this https URL 获取。

Title: Patches of Nonlinearity: Instruction Vectors in Large Language Models

Authors: Irina Bigoulaeva, Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07930
Pdf URL: https://arxiv.org/pdf/2602.07930
Copy Paste: [[2602.07930]] Patches of Nonlinearity: Instruction Vectors in Large Language Models(https://arxiv.org/abs/2602.07930)
Keywords: language model
Abstract: Despite the recent success of instruction-tuned language models and their ubiquitous usage, very little is known of how models process instructions internally. In this work, we address this gap from a mechanistic point of view by investigating how instruction-specific representations are constructed and utilized in different stages of post-training: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). Via causal mediation, we identify that instruction representation is fairly localized in models. These representations, which we call Instruction Vectors (IVs), demonstrate a curious juxtaposition of linear separability along with non-linear causal interaction, broadly questioning the scope of the linear representation hypothesis commonplace in mechanistic interpretability. To disentangle the non-linear causal interaction, we propose a novel method to localize information processing in language models that is free from the implicit linear assumptions of patching-based techniques. We find that, conditioned on the task representations formed in the early layers, different information pathways are selected in the later layers to solve that task, i.e., IVs act as circuit selectors.
摘要：尽管指令调整的语言模型最近取得了成功并且用途广泛，但人们对模型如何在内部处理指令知之甚少。在这项工作中，我们通过研究如何在训练后的不同阶段构建和利用特定于指令的表示形式，从机械的角度解决这一差距：监督微调（SFT）和直接偏好优化（DPO）。通过因果中介，我们发现指令表示在模型中相当局部化。这些表示，我们称之为指令向量（IV），展示了线性可分离性与非线性因果相互作用的奇怪并置，广泛质疑机械可解释性中常见的线性表示假设的范围。为了解开非线性因果相互作用，我们提出了一种在语言模型中本地化信息处理的新方法，该方法不受基于修补技术的隐式线性假设的影响。我们发现，根据早期层中形成的任务表示，在后面的层中选择不同的信息路径来解决该任务，即 IV 充当电路选择器。

Title: Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation

Authors: Krzysztof Wróbel, Jan Maria Kowalski, Jerzy Surma, Igor Ciuciura, Maciej Szymański
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.07954
Pdf URL: https://arxiv.org/pdf/2602.07954
Copy Paste: [[2602.07954]] Bielik Guard: Efficient Polish Language Safety Classifiers for LLM Content Moderation(https://arxiv.org/abs/2602.07954)
Keywords: language model, llm, prompt
Abstract: As Large Language Models (LLMs) become increasingly deployed in Polish language applications, the need for efficient and accurate content safety classifiers has become paramount. We present Bielik Guard, a family of compact Polish language safety classifiers comprising two model variants: a 0.1B parameter model based on MMLW-RoBERTa-base and a 0.5B parameter model based on PKOBP/polish-roberta-8k. Fine-tuned on a community-annotated dataset of 6,885 Polish texts, these models classify content across five safety categories: Hate/Aggression, Vulgarities, Sexual Content, Crime, and Self-Harm. Our evaluation demonstrates that both models achieve strong performance on multiple benchmarks. The 0.5B variant offers the best overall discrimination capability with F1 scores of 0.791 (micro) and 0.785 (macro) on the test set, while the 0.1B variant demonstrates exceptional efficiency. Notably, Bielik Guard 0.1B v1.1 achieves superior precision (77.65%) and very low false positive rate (0.63%) on real user prompts, outperforming HerBERT-PL-Guard (31.55% precision, 4.70% FPR) despite identical model size. The models are publicly available and designed to provide appropriate responses rather than simple content blocking, particularly for sensitive categories like self-harm.
摘要：随着大型语言模型 (LLM) 在波兰语应用程序中的部署越来越多，对高效、准确的内容安全分类器的需求变得至关重要。我们推出了 Bielik Guard，这是一个紧凑的波兰语安全分类器系列，包含两个模型变体：基于 MMLW-RoBERTa-base 的 0.1B 参数模型和基于 PKOBP/polish-roberta-8k 的 0.5B 参数模型。这些模型在包含 6,885 篇波兰语文本的社区注释数据集上进行了微调，将内容分为五个安全类别：仇恨/攻击、粗俗、色情内容、犯罪和自残。我们的评估表明，这两种模型在多个基准上都取得了出色的性能。0.5B 变体提供了最佳的整体辨别能力，在测试集上的 F1 分数为 0.791（微观）和 0.785（宏观），而 0.1B 变体则表现出卓越的效率。值得注意的是，Bielik Guard 0.1B v1.1 在真实用户提示上实现了卓越的精度 (77.65%) 和极低的误报率 (0.63%)，尽管模型大小相同，但其性能优于 HerBERT-PL-Guard（31.55% 精度，4.70% FPR）。这些模型是公开的，旨在提供适当的响应，而不是简单的内容屏蔽，特别是对于自残等敏感类别。

Title: Lost in Translation? A Comparative Study on the Cross-Lingual Transfer of Composite Harms

Authors: Vaibhav Shukla, Hardik Sharma, Adith N Reganti, Soham Wasmatkar, Bagesh Kumar, Vrijendra Singh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.07963
Pdf URL: https://arxiv.org/pdf/2602.07963
Copy Paste: [[2602.07963]] Lost in Translation? A Comparative Study on the Cross-Lingual Transfer of Composite Harms(https://arxiv.org/abs/2602.07963)
Keywords: language model, llm
Abstract: Most safety evaluations of large language models (LLMs) remain anchored in English. Translation is often used as a shortcut to probe multilingual behavior, but it rarely captures the full picture, especially when harmful intent or structure morphs across languages. Some types of harm survive translation almost intact, while others distort or disappear. To study this effect, we introduce CompositeHarm, a translation-based benchmark designed to examine how safety alignment holds up as both syntax and semantics shift. It combines two complementary English datasets, AttaQ, which targets structured adversarial attacks, and MMSafetyBench, which covers contextual, real-world harms, and extends them into six languages: English, Hindi, Assamese, Marathi, Kannada, and Gujarati. Using three large models, we find that attack success rates rise sharply in Indic languages, especially under adversarial syntax, while contextual harms transfer more moderately. To ensure scalability and energy efficiency, our study adopts lightweight inference strategies inspired by edge-AI design principles, reducing redundant evaluation passes while preserving cross-lingual fidelity. This design makes large-scale multilingual safety testing both computationally feasible and environmentally conscious. Overall, our results show that translated benchmarks are a necessary first step, but not a sufficient one, toward building grounded, resource-aware, language-adaptive safety systems.
摘要：大多数大型语言模型（LLM）的安全性评估仍然以英语为基础。翻译通常被用作探索多语言行为的捷径，但它很少能捕捉到全貌，尤其是当有害意图或结构在不同语言之间发生变化时。有些类型的伤害在翻译后几乎完好无损，而另一些则被扭曲或消失。为了研究这种影响，我们引入了 CompositeHarm，这是一个基于翻译的基准测试，旨在检查安全对齐在语法和语义转变时如何保持。它结合了两个互补的英语数据集：AttaQ（针对结构化对抗性攻击）和 MMSafetyBench（涵盖上下文、现实世界的危害），并将其扩展为六种语言：英语、印地语、阿萨姆语、马拉地语、卡纳达语和古吉拉特语。使用三个大型模型，我们发现印度语言的攻击成功率急剧上升，尤其是在对抗性语法下，而上下文损害的转移则较为温和。为了确保可扩展性和能源效率，我们的研究采用了受边缘人工智能设计原理启发的轻量级推理策略，减少了冗余的评估过程，同时保持了跨语言保真度。这种设计使得大规模多语言安全测试在计算上可行且具有环保意识。总体而言，我们的结果表明，翻译后的基准是构建基础、资源感知、语言自适应安全系统必要的第一步，但还不够。

Title: Cross-Linguistic Persona-Driven Data Synthesis for Robust Multimodal Cognitive Decline Detection

Authors: Rui Feng, Zhiyao Luo, Liuyu Wu, Wei Wang, Yuting Song, Yong Liu, Kok Pin Ng, Jianqing Li, Xingyao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07978
Pdf URL: https://arxiv.org/pdf/2602.07978
Copy Paste: [[2602.07978]] Cross-Linguistic Persona-Driven Data Synthesis for Robust Multimodal Cognitive Decline Detection(https://arxiv.org/abs/2602.07978)
Keywords: language model, llm, chain-of-thought
Abstract: Speech-based digital biomarkers represent a scalable, non-invasive frontier for the early identification of Mild Cognitive Impairment (MCI). However, the development of robust diagnostic models remains impeded by acute clinical data scarcity and a lack of interpretable reasoning. Current solutions frequently struggle with cross-lingual generalization and fail to provide the transparent rationales essential for clinical trust. To address these barriers, we introduce SynCog, a novel framework integrating controllable zero-shot multimodal data synthesis with Chain-of-Thought (CoT) deduction fine-tuning. Specifically, SynCog simulates diverse virtual subjects with varying cognitive profiles to effectively alleviate clinical data scarcity. This generative paradigm enables the rapid, zero-shot expansion of clinical corpora across diverse languages, effectively bypassing data bottlenecks in low-resource settings and bolstering the diagnostic performance of Multimodal Large Language Models (MLLMs). Leveraging this synthesized dataset, we fine-tune a foundational multimodal backbone using a CoT deduction strategy, empowering the model to explicitly articulate diagnostic thought processes rather than relying on black-box predictions. Extensive experiments on the ADReSS and ADReSSo benchmarks demonstrate that augmenting limited clinical data with synthetic phenotypes yields competitive diagnostic performance, achieving Macro-F1 scores of 80.67% and 78.46%, respectively, outperforming current baseline models. Furthermore, evaluation on an independent real-world Mandarin cohort (CIR-E) demonstrates robust cross-linguistic generalization, attaining a Macro-F1 of 48.71%. These findings constitute a critical step toward providing clinically trustworthy and linguistically inclusive cognitive assessment tools for global healthcare.
摘要：基于语音的数字生物标记代表了早期识别轻度认知障碍 (MCI) 的可扩展、非侵入性前沿领域。然而，由于严重的临床数据匮乏和缺乏可解释的推理，稳健的诊断模型的发展仍然受到阻碍。当前的解决方案经常难以跨语言泛化，并且无法提供临床信任所必需的透明原理。为了解决这些障碍，我们引入了 SynCog，这是一种将可控零样本多模态数据合成与思想链 (CoT) 推演微调相结合的新型框架。具体来说，SynCog 模拟具有不同认知特征的不同虚拟受试者，以有效缓解临床数据稀缺问题。这种生成范式能够跨多种语言快速、零次扩展临床语料库，有效绕过资源匮乏环境中的数据瓶颈，并增强多模态大语言模型（MLLM）的诊断性能。利用这个合成数据集，我们使用 CoT 推导策略微调基础多模式主干，使模型能够明确阐明诊断思维过程，而不是依赖黑盒预测。 ADReSS 和 ADReSSo 基准的大量实验表明，通过合成表型增强有限的临床数据可产生具有竞争力的诊断性能，分别实现 80.67% 和 78.46% 的 Macro-F1 分数，优于当前的基线模型。此外，对独立现实世界普通话队列 (CIR-E) 的评估显示出强大的跨语言泛化能力，达到 48.71% 的 Macro-F1。这些发现构成了为全球医疗保健提供临床可信且语言包容的认知评估工具的关键一步。

Title: The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation

Authors: Arash Marioriyad, Omid Ghahroodi, Ehsaneddin Asgari, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.07996
Pdf URL: https://arxiv.org/pdf/2602.07996
Copy Paste: [[2602.07996]] The Judge Who Never Admits: Hidden Shortcuts in LLM-based Evaluation(https://arxiv.org/abs/2602.07996)
Keywords: language model, gpt, llm, prompt
Abstract: Large language models (LLMs) are increasingly used as automatic judges to evaluate system outputs in tasks such as reasoning, question answering, and creative writing. A faithful judge should base its verdicts solely on content quality, remain invariant to irrelevant context, and transparently reflect the factors driving its decisions. We test this ideal via controlled cue perturbations (synthetic metadata labels injected into evaluation prompts) for six judge models: GPT-4o, Gemini-2.0-Flash, Gemma-3-27B, Qwen3-235B, Claude-3-Haiku, and Llama3-70B. Experiments span two complementary datasets with distinct evaluation regimes: ELI5 (factual QA) and LitBench (open-ended creative writing). We study six cue families: source, temporal, age, gender, ethnicity, and educational status. Beyond measuring verdict shift rates (VSR), we introduce cue acknowledgment rate (CAR) to quantify whether judges explicitly reference the injected cues in their natural-language rationales. Across cues with strong behavioral effects, e.g., provenance hierarchies (Expert > Human > LLM > Unknown), recency preferences (New > Old), and educational-status favoritism, CAR is typically at or near zero, indicating that shortcut reliance is largely unreported even when it drives decisions. Crucially, CAR is also dataset-dependent: explicit cue recognition is more likely to surface in the factual ELI5 setting for some models and cues, but often collapses in the open-ended LitBench regime, where large verdict shifts can persist despite zero acknowledgment. The combination of substantial verdict sensitivity and limited cue acknowledgment reveals an explanation gap in LLM-as-judge pipelines, raising concerns about reliability of model-based evaluation in both research and deployment.
摘要：大型语言模型（LLM）越来越多地用作自动评判者，以评估推理、问答和创意写作等任务中的系统输出。忠实的评判者应该仅根据内容质量做出裁决，不受无关上下文的影响，并透明地反映推动其决策的因素。我们通过受控提示扰动（将合成元数据标签注入到评估提示中）来测试这一理想，适用于六种评判模型：GPT-4o、Gemini-2.0-Flash、Gemma-3-27B、Qwen3-235B、Claude-3-Haiku 和 Llama3-70B。实验跨越两个具有不同评估机制的互补数据集：ELI5（事实 QA）和 LitBench（开放式创意写作）。我们研究六个线索家族：来源、时间、年龄、性别、种族和教育状况。除了测量判决转移率（VSR）之外，我们还引入了线索确认率（CAR）来量化评判者是否在其自然语言原理中明确引用了注入的线索。在具有强烈行为效应的线索中，例如出处层次结构（专家 > 人类 > LLM > 未知）、新近度偏好（新 > 旧）和教育地位偏爱，CAR 通常为零或接近于零，表明捷径依赖即使在驱动决策时也基本上未被报告。至关重要的是，CAR 还依赖于数据集：对于某些模型和线索，显式线索识别更有可能在事实型 ELI5 设置中浮现，但在开放式 LitBench 体系中常常崩溃，尽管零确认，但较大的判决变化可能会持续存在。大量的判决敏感性和有限的线索确认相结合，揭示了 LLM-as-judge 流程中的解释差距，引发了人们对研究和部署中基于模型的评估可靠性的担忧。

Title: DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity

Authors: Jitai Hao, Qiang Huang, Yaowei Wang, Min Zhang, Jun Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.08005
Pdf URL: https://arxiv.org/pdf/2602.08005
Copy Paste: [[2602.08005]] DeltaKV: Residual-Based KV Cache Compression via Long-Range Similarity(https://arxiv.org/abs/2602.08005)
Keywords: llm, agent
Abstract: The deployment of efficient long-context LLMs in applications like autonomous agents, long-chain reasoning, and creative writing is fundamentally bottlenecked by the linear growth of KV cache memory. Existing compression and eviction methods often struggle to balance accuracy, compression ratio, and hardware efficiency. We propose DeltaKV, a residual-based KV cache compression framework motivated by two empirical findings: long-range inter-token similarity and highly shared latent components in KV representations. Instead of discarding tokens, DeltaKV encodes semantic residuals relative to retrieved historical references, preserving fidelity while substantially reducing storage. To translate compression gains into real system speedups, we further introduce Sparse-vLLM, a high-performance inference engine with decoupled memory management and kernels optimized for sparse and irregular KV layouts. Experiments show that DeltaKV reduces KV cache memory to 29% of the original while maintaining near-lossless accuracy on LongBench, SCBench, and AIME. When integrated with Sparse-vLLM, it achieves up to 2× throughput improvement over vLLM in long-context scenarios, demonstrating a practical path toward scalable long-context LLM deployment. Code, model checkpoints, and datasets are available at this https URL.
摘要：在自主代理、长链推理和创意写作等应用中部署高效的长上下文 LLM 从根本上受到 KV 缓存内存线性增长的瓶颈。现有的压缩和驱逐方法通常难以平衡精度、压缩比和硬件效率。我们提出了 DeltaKV，一种基于残差的 KV 缓存压缩框架，其动机是两个实证发现：远程令牌间相似性和 KV 表示中高度共享的潜在组件。DeltaKV 没有丢弃标记，而是对相对于检索到的历史引用的语义残差进行编码，在保持保真度的同时大幅减少存储。为了将压缩增益转化为真正的系统加速，我们进一步引入了 Sparse-vLLM，这是一种高性能推理引擎，具有解耦的内存管理和针对稀疏和不规则 KV 布局进行优化的内核。实验表明，DeltaKV 将 KV 缓存内存减少到原来的 29%，同时在 LongBench、SCBench 和 AIME 上保持近乎无损的精度。当与 Sparse-vLLM 集成时，它在长上下文场景中比 vLLM 实现了高达 2 倍的吞吐量改进，展示了可扩展的长上下文 LLM 部署的实用路径。代码、模型检查点和数据集可从此 https URL 获取。

Title: Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

Authors: Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08028
Pdf URL: https://arxiv.org/pdf/2602.08028
Copy Paste: [[2602.08028]] Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning(https://arxiv.org/abs/2602.08028)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.
摘要：为了解决标准思想链提示中无引导推理路径的不稳定性，最近的方法通过首先引出单一推理策略来指导大型语言模型（LLM）。然而，每个问题仅依赖一种策略仍然会限制不同任务的性能。我们提出了分歧归纳提示（DIP），这是一个首先提示 LLM 为每个问题生成多个不同的高级理由的框架。然后将每个理由详细阐述为详细的、逐步的计划草案。最后，这些计划草案被归纳为最终计划。DIP 提高了零样本推理的准确性，而不依赖于资源密集型采样。实验表明，DIP 优于单策略提示，证明了多计划归纳对于基于提示的推理的有效性。

Title: Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection

Authors: Chenwang Wu, Yiu-ming Cheung, Shuhai Zhang, Bo Han, Defu Lian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08031
Pdf URL: https://arxiv.org/pdf/2602.08031
Copy Paste: [[2602.08031]] Beyond Raw Detection Scores: Markov-Informed Calibration for Boosting Machine-Generated Text Detection(https://arxiv.org/abs/2602.08031)
Keywords: llm
Abstract: While machine-generated texts (MGTs) offer great convenience, they also pose risks such as disinformation and phishing, highlighting the need for reliable detection. Metric-based methods, which extract statistically distinguishable features of MGTs, are often more practical than complex model-based methods that are prone to overfitting. Given their diverse designs, we first place representative metric-based methods within a unified framework, enabling a clear assessment of their advantages and limitations. Our analysis identifies a core challenge across these methods: the token-level detection score is easily biased by the inherent randomness of the MGTs generation process. To address this, we theoretically and empirically reveal two relationships of context detection scores that may aid calibration: Neighbor Similarity and Initial Instability. We then propose a Markov-informed score calibration strategy that models these relationships using Markov random fields, and implements it as a lightweight component via a mean-field approximation, allowing our method to be seamlessly integrated into existing detectors. Extensive experiments in various real-world scenarios, such as cross-LLM and paraphrasing attacks, demonstrate significant gains over baselines with negligible computational overhead. The code is available at this https URL.
摘要：虽然机器生成的文本 (MGT) 提供了极大的便利，但它们也带来了虚假信息和网络钓鱼等风险，这凸显了可靠检测的必要性。基于度量的方法提取 MGT 的统计上可区分的特征，通常比容易过度拟合的复杂的基于模型的方法更实用。鉴于其不同的设计，我们首先将代表性的基于度量的方法置于统一的框架内，从而能够清楚地评估其优点和局限性。我们的分析确定了这些方法的核心挑战：令牌级检测分数很容易因 MGT 生成过程固有的随机性而产生偏差。为了解决这个问题，我们从理论上和经验上揭示了可能有助于校准的上下文检测分数的两种关系：邻居相似度和初始不稳定性。然后，我们提出了一种基于马尔可夫的分数校准策略，该策略使用马尔可夫随机场对这些关系进行建模，并通过平均场近似将其实现为轻量级组件，从而使我们的方法能够无缝集成到现有检测器中。在各种现实场景中进行的广泛实验（例如跨 LLM 和释义攻击）表明，在计算开销可以忽略不计的情况下，与基线相比取得了显着的进步。该代码可从此 https URL 获取。

Title: TDGNet: Hallucination Detection in Diffusion Language Models via Temporal Dynamic Graphs

Authors: Arshia Hemmat, Philip Torr, Yongqiang Chen, Junchi Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08048
Pdf URL: https://arxiv.org/pdf/2602.08048
Copy Paste: [[2602.08048]] TDGNet: Hallucination Detection in Diffusion Language Models via Temporal Dynamic Graphs(https://arxiv.org/abs/2602.08048)
Keywords: language model, llm, hallucination
Abstract: Diffusion language models (D-LLMs) offer parallel denoising and bidirectional context, but hallucination detection for D-LLMs remains underexplored. Prior detectors developed for auto-regressive LLMs typically rely on single-pass cues and do not directly transfer to diffusion generation, where factuality evidence is distributed across the denoising trajectory and may appear, drift, or be self-corrected over time. We introduce TDGNet, a temporal dynamic graph framework that formulates hallucination detection as learning over evolving token-level attention graphs. At each denoising step, we sparsify the attention graph and update per-token memories via message passing, then apply temporal attention to aggregate trajectory-wide evidence for final prediction. Experiments on LLaDA-8B and Dream-7B across QA benchmarks show consistent AUROC improvements over output-based, latent-based, and static-graph baselines, with single-pass inference and modest overhead. These results highlight the importance of temporal reasoning on attention graphs for robust hallucination detection in diffusion language models.

Title: Emergent Search and Backtracking in Latent Reasoning Models

Authors: Jasmine Cui, Charles Ye
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.08100
Pdf URL: https://arxiv.org/pdf/2602.08100
Copy Paste: [[2602.08100]] Emergent Search and Backtracking in Latent Reasoning Models(https://arxiv.org/abs/2602.08100)
Keywords: language model, llm, chain-of-thought
Abstract: What happens when a language model thinks without words? Standard reasoning LLMs verbalize intermediate steps as chain-of-thought; latent reasoning transformers (LRTs) instead perform deliberation entirely in continuous hidden space. We investigate an LRT, decoding the model's evolving beliefs at every step on a multiple-choice QA benchmark. We find that the model spontaneously learns a structured search process in latent space. Deliberation follows a consistent trajectory: an exploration phase where probability mass spreads across candidates, tentative commitment to a frontrunner, and either convergence or backtracking. Backtracking is prevalent (32% of instances), beneficial (34% accuracy gain over non-backtracking instances), and predominantly directed away from the semantically closest distractor toward the correct answer. The search is adaptive: replacing distractors with implausible alternatives shortens exploration by 54%. Latent reasoning models achieve in activation space what chain-of-thought achieves through words: the ability to be wrong, notice, and recover.

Title: Gender and Race Bias in Consumer Product Recommendations by Large Language Models

Authors: Ke Xu, Shera Potka, Alex Thomo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.08124
Pdf URL: https://arxiv.org/pdf/2602.08124
Copy Paste: [[2602.08124]] Gender and Race Bias in Consumer Product Recommendations by Large Language Models(https://arxiv.org/abs/2602.08124)
Keywords: language model, llm, prompt
Abstract: Large Language Models are increasingly employed in generating consumer product recommendations, yet their potential for embedding and amplifying gender and race biases remains underexplored. This paper serves as one of the first attempts to examine these biases within LLM-generated recommendations. We leverage prompt engineering to elicit product suggestions from LLMs for various race and gender groups and employ three analytical methods (Marked Words, Support Vector Machines, and Jensen-Shannon Divergence) to identify and quantify biases. Our findings reveal significant disparities in the recommendations for demographic groups, underscoring the need for more equitable LLM recommendation systems.
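Of the three methods named in this abstract, Jensen-Shannon divergence is the easiest to show in isolation. The following is a minimal sketch, not the authors' code: it assumes each group's recommendations have already been reduced to a list of product-category words, and the group names and word lists are made up for illustration.

```python
import math
from collections import Counter


def distribution(words, vocab):
    """Relative frequency of each vocabulary item in `words`."""
    counts = Counter(words)
    total = sum(counts[w] for w in vocab)
    return [counts[w] / total for w in vocab]


def kl(p, q):
    """KL divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric, bounded in [0, 1] with base-2 logs."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical recommendation vocabularies for two demographic groups.
group_a = ["tools", "tools", "electronics", "skincare"]
group_b = ["skincare", "skincare", "electronics", "tools"]
vocab = sorted(set(group_a) | set(group_b))
p = distribution(group_a, vocab)
q = distribution(group_b, vocab)
print(f"JSD = {jsd(p, q):.4f} bits")
```

A value of 0 means the two groups receive identically distributed recommendations, and a value of 1 bit means the distributions do not overlap at all.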

Title: DIAL-SUMMER: A Structured Evaluation Framework of Hierarchical Errors in Dialogue Summaries

Authors: Sahana Ramnath, Nima Chitsazan, Mingyang Zhou, Chia-Hsuan Lee, Shi-Xiong Zhang, Stephen Rawls, Sambit Sahu, Sangwoo Cho, Xiang Ren, Genta Indra Winata, Akshaj Kumar Veldanda
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.08149
Pdf URL: https://arxiv.org/pdf/2602.08149
Copy Paste: [[2602.08149]] DIAL-SUMMER: A Structured Evaluation Framework of Hierarchical Errors in Dialogue Summaries(https://arxiv.org/abs/2602.08149)
Keywords: llm, hallucination, agent
Abstract: Dialogues are a predominant mode of communication for humans, and it is immensely helpful to have automatically generated summaries of them (e.g., to revise key points discussed in a meeting, to review conversations between customer agents and product users). Prior works on dialogue summary evaluation largely ignore the complexities specific to this task: (i) shift in structure, from multiple speakers discussing information in a scattered fashion across several turns, to a summary's sentences, and (ii) shift in narration viewpoint, from speakers' first/second-person narration to standardized third-person narration in the summary. In this work, we introduce our framework DIAL-SUMMER to address the above. We propose DIAL-SUMMER's taxonomy of errors to comprehensively evaluate dialogue summaries at two hierarchical levels: DIALOGUE-LEVEL that focuses on the broader speakers/turns, and WITHIN-TURN-LEVEL that focuses on the information talked about inside a turn. We then present DIAL-SUMMER's dataset composed of dialogue summaries manually annotated with our taxonomy's fine-grained errors. We conduct empirical analyses of these annotated errors, and observe interesting trends (e.g., turns occurring in the middle of the dialogue are the most frequently missed in the summary, extrinsic hallucinations largely occur at the end of the summary). We also conduct experiments on LLM-Judges' capability at detecting these errors, through which we demonstrate the challenging nature of our dataset, the robustness of our taxonomy, and the need for future work in this field to enhance LLMs' performance in the same. Code and inference dataset coming soon.

Title: LLMs and people both learn to form conventions -- just not with each other

Authors: Cameron R. Jones, Agnese Lombardi, Kyle Mahowald, Benjamin K. Bergen
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2602.08208
Pdf URL: https://arxiv.org/pdf/2602.08208
Copy Paste: [[2602.08208]] LLMs and people both learn to form conventions -- just not with each other(https://arxiv.org/abs/2602.08208)
Keywords: llm, prompt
Abstract: Humans align to one another in conversation -- adopting shared conventions that ease communication. We test whether LLMs form the same kinds of conventions in a multimodal communication game. Both humans and LLMs display evidence of convention-formation (increasing the accuracy and consistency of their turns while decreasing their length) when communicating in same-type dyads (humans with humans, AI with AI). However, heterogeneous human-AI pairs fail -- suggesting differences in communicative tendencies. In Experiment 2, we ask whether LLMs can be induced to behave more like human conversants, by prompting them to produce superficially humanlike behavior. While the length of their messages matches that of human pairs, accuracy and lexical overlap in human-LLM pairs continue to lag behind those of both human-human and AI-AI pairs. These results suggest that conversational alignment requires not just the ability to mimic previous interactions but also shared interpretative biases toward the meanings that are conveyed.

Title: Pretraining with Token-Level Adaptive Latent Chain-of-Thought

Authors: Boyi Zeng, Yiqin Hao, He Li, Shixiang Song, Feichen Song, Zitong Wang, Siyuan Huang, Yi Xu, ZiWei He, Xinbing Wang, Zhouhan Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08220
Pdf URL: https://arxiv.org/pdf/2602.08220
Copy Paste: [[2602.08220]] Pretraining with Token-Level Adaptive Latent Chain-of-Thought(https://arxiv.org/abs/2602.08220)
Keywords: language model, chain-of-thought
Abstract: Scaling large language models by increasing parameters and training data is increasingly constrained by limited high-quality corpora and rising communication costs. This work explores an alternative axis: increasing per-token computation without expanding parameters, by internalizing latent Chain-of-Thought (CoT) into pretraining. We propose Pretraining with Token-Level Adaptive Latent CoT (adaptive latent CoT), where the model generates a variable-length latent CoT trajectory before emitting each token -- allocating longer trajectories to difficult tokens and shorter (or even zero) trajectories to easy ones. Importantly, this behavior emerges naturally from one-stage pretraining on general text and reduces computation in both training and inference via token-wise adaptive halting. Experiments with Llama architectures show that adaptive latent CoT consistently improves language modeling perplexity and broad downstream accuracy, even with fewer training FLOPs than prior recurrent baselines.

Title: CoRect: Context-Aware Logit Contrast for Hidden State Rectification to Resolve Knowledge Conflicts

Authors: Xuhua Ma, Richong Zhang, Zhijie Nie
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.08221
Pdf URL: https://arxiv.org/pdf/2602.08221
Copy Paste: [[2602.08221]] CoRect: Context-Aware Logit Contrast for Hidden State Rectification to Resolve Knowledge Conflicts(https://arxiv.org/abs/2602.08221)
Keywords: hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) often struggles with knowledge conflicts, where model-internal parametric knowledge overrides retrieved evidence, leading to unfaithful outputs. Existing approaches are often limited, relying either on superficial decoding adjustments or weight editing that necessitates ground-truth targets. Through layer-wise analysis, we attribute this failure to a parametric suppression phenomenon: specifically, in deep layers, certain FFN layers overwrite context-sensitive representations with memorized priors. To address this, we propose CoRect (Context-Aware Logit Contrast for Hidden State Rectification). By contrasting logits from contextualized and non-contextualized forward passes, CoRect identifies layers that exhibit high parametric bias without requiring ground-truth labels. It then rectifies the hidden states to preserve evidence-grounded information. Across question answering (QA) and summarization benchmarks, CoRect consistently improves faithfulness and reduces hallucinations compared to strong baselines.

Title: When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

Authors: Jaylen Jones, Zhehao Zhang, Yuting Ning, Eric Fosler-Lussier, Pierre-Luc St-Charles, Yoshua Bengio, Dawn Song, Yu Su, Huan Sun
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2602.08235
Pdf URL: https://arxiv.org/pdf/2602.08235
Copy Paste: [[2602.08235]] When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents(https://arxiv.org/abs/2602.08235)
Keywords: agent
Abstract: Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state-of-the-art CUAs such as Claude 4.5 Haiku and Opus. We further evaluate the transferability of human-verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer-use settings.

Title: Document Reconstruction Unlocks Scalable Long-Context RLVR

Authors: Yao Xiao, Lei Wang, Yue Deng, Guanzheng Chen, Ziqi Jin, Jung-jae Kim, Xiaoli Li, Roy Ka-wei Lee, Lidong Bing
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08237
Pdf URL: https://arxiv.org/pdf/2602.08237
Copy Paste: [[2602.08237]] Document Reconstruction Unlocks Scalable Long-Context RLVR(https://arxiv.org/abs/2602.08237)
Keywords: language model, llm
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has become a prominent paradigm to enhance the capabilities (i.e., long-context) of Large Language Models (LLMs). However, it often relies on gold-standard answers or explicit evaluation rubrics provided by powerful teacher models or human experts, which are costly and time-consuming. In this work, we investigate unsupervised approaches to enhance the long-context capabilities of LLMs, eliminating the need for heavy human annotations or teacher models' supervision. Specifically, we first replace a few paragraphs with special placeholders in a long document. LLMs are trained through reinforcement learning to reconstruct the document by correctly identifying and sequencing missing paragraphs from a set of candidate options. This training paradigm enables the model to capture global narrative coherence, significantly boosting long-context performance. We validate the effectiveness of our method on two widely used benchmarks, RULER and LongBench v2. While acquiring noticeable gains on RULER, it can also achieve a reasonable improvement on LongBench v2 without any manually curated long-context QA data. Furthermore, we conduct extensive ablation studies to analyze the impact of reward design, data curation strategies, training schemes, and data scaling effects on model performance. We publicly release our code, data, and models.
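The reconstruction task described in this abstract can be sketched in a few lines. This is an illustrative guess at the data format, not the released code: the placeholder string `<MISSING_i>`, the function names, and the all-or-nothing reward are assumptions, and the sketch assumes all paragraphs in a document are distinct.

```python
import random


def make_task(paragraphs, num_masked, seed=0):
    """Build one reconstruction instance from a list of distinct paragraphs.

    Returns (masked_document, candidates, answer), where answer[s] is the
    index into `candidates` of the paragraph that belongs in placeholder s.
    """
    rng = random.Random(seed)
    masked_positions = sorted(rng.sample(range(len(paragraphs)), num_masked))
    # Candidate options are the removed paragraphs, presented in shuffled order.
    candidates = [paragraphs[pos] for pos in masked_positions]
    rng.shuffle(candidates)
    document = list(paragraphs)
    for slot, pos in enumerate(masked_positions):
        document[pos] = f"<MISSING_{slot}>"  # hypothetical placeholder format
    answer = [candidates.index(paragraphs[pos]) for pos in masked_positions]
    return document, candidates, answer


def reward(predicted, answer):
    """Verifiable reward, assumed binary: 1.0 only for an exact match."""
    return 1.0 if list(predicted) == list(answer) else 0.0


paragraphs = [f"Paragraph {i} of the source document." for i in range(6)]
document, candidates, answer = make_task(paragraphs, num_masked=3, seed=42)
print(document)
print(candidates)
print("gold:", answer, "reward for gold:", reward(answer, answer))
```

Because the gold ordering comes from the document itself, the reward needs no human annotation or teacher model, which is the property the abstract emphasizes.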

Title: Language Predicts Identity Fusion Across Cultures and Reveals Divergent Pathways to Violence

Authors: Devin R. Wright, Justin E. Lane, F. LeRon Shults
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08252
Pdf URL: https://arxiv.org/pdf/2602.08252
Copy Paste: [[2602.08252]] Language Predicts Identity Fusion Across Cultures and Reveals Divergent Pathways to Violence(https://arxiv.org/abs/2602.08252)
Keywords: llm
Abstract: In light of increasing polarization and political violence, understanding the psychological roots of extremism is increasingly important. Prior research shows that identity fusion predicts willingness to engage in extreme acts. We evaluate the Cognitive Linguistic Identity Fusion Score, a method that uses cognitive linguistic patterns, LLMs, and implicit metaphor to measure fusion from language. Across datasets from the United Kingdom and Singapore, this approach outperforms existing methods in predicting validated fusion scores. Applied to extremist manifestos, two distinct high-fusion pathways to violence emerge: ideologues tend to frame themselves in terms of group, forming kinship bonds; whereas grievance-driven individuals frame the group in terms of their personal identity. These results refine theories of identity fusion and provide a scalable tool aiding fusion research and extremism detection.

Title: Language Modeling and Understanding Through Paraphrase Generation and Detection

Authors: Jan Philip Wahle
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.08274
Pdf URL: https://arxiv.org/pdf/2602.08274
Copy Paste: [[2602.08274]] Language Modeling and Understanding Through Paraphrase Generation and Detection(https://arxiv.org/abs/2602.08274)
Keywords: language model
Abstract: Language enables humans to share knowledge, reason about the world, and pass on strategies for survival and innovation across generations. At the heart of this process is not just the ability to communicate but also the remarkable flexibility in how we can express ourselves. We can express the same thoughts in virtually infinite ways using different words and structures - this ability to rephrase and reformulate expressions is known as paraphrase. Modeling paraphrases is a keystone to meaning in computational language models; being able to construct different variations of texts that convey the same meaning or not shows strong abilities of semantic understanding. If computational language models are to represent meaning, they must understand and control the different aspects that construct the same meaning as opposed to different meanings at a fine granularity. Yet most existing approaches reduce paraphrasing to a binary decision between two texts or to producing a single rewrite of a source, obscuring which linguistic factors are responsible for meaning preservation. In this thesis, I propose that decomposing paraphrases into their constituent linguistic aspects (paraphrase types) offers a more fine-grained and cognitively grounded view of semantic equivalence. I show that even advanced machine learning models struggle with this task. Yet, when explicitly trained on paraphrase types, models achieve stronger performance on related paraphrase tasks and downstream applications. For example, in plagiarism detection, language models trained on paraphrase types surpass human baselines: 89.6% accuracy compared to 78.4% for plagiarism cases from Wikipedia, and 66.5% compared to 55.7% for plagiarism of scientific papers from arXiv. In identifying duplicate questions on Quora, models trained with paraphrase types improve over models trained on binary pairs. Furthermore, I demonstrate that...

Title: New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR

Authors: Zhilin Wang, Yafu Li, Shunkai Zhang, Zhi Wang, Haoran Zhang, Xiaoye Qu, Yu Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08281
Pdf URL: https://arxiv.org/pdf/2602.08281
Copy Paste: [[2602.08281]] New Skills or Sharper Primitives? A Probabilistic Perspective on the Emergence of Reasoning in RLVR(https://arxiv.org/abs/2602.08281)
Keywords: language model, llm
Abstract: Whether Reinforcement Learning with Verifiable Rewards (RLVR) endows Large Language Models (LLMs) with new capabilities or merely elicits latent traces remains a central debate. In this work, we align with the former view, proposing a probabilistic framework where capability is defined by instance-level solvability. We hypothesize that the emergence of complex reasoning can be driven by sharpening atomic step probabilities, which enables models to overcome the exponential decay of success rates inherent in multi-step reasoning chains. Utilizing the Algebrarium framework, we train models exclusively on single-step operations and evaluate their performance on unseen multi-step tasks. Our empirical results confirm that: (1) RLVR incentivizes the exploration of previously inaccessible solution paths by amplifying the model's existing skills; (2) composite performance is strictly governed by the joint probability of atomic steps, evidenced by high Pearson correlation coefficients ($\rho \in [0.69, 0.96]$); and (3) RLVR, acting as a global optimizer, can cause specific skills to be sacrificed to maximize aggregate reward. Our work offers a novel explanation for emergent abilities in RLVR, suggesting that the iterative optimization of solvable problems enables models to develop the capabilities to tackle previously unsolvable scenarios.
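The "exponential decay of success rates" mentioned in this abstract is easy to see numerically. The sketch below assumes the atomic steps succeed independently, so the chain succeeds with the product of the per-step probabilities; independence is a simplifying assumption made here for illustration, and the probabilities are invented.

```python
def chain_success(step_probs):
    """Probability that every step in a chain succeeds, assuming independence."""
    prob = 1.0
    for p in step_probs:
        prob *= p
    return prob


# Sharpening the per-step probability matters far more on long chains.
for p in (0.80, 0.95, 0.99):
    by_length = {n: round(chain_success([p] * n), 3) for n in (1, 5, 10, 20)}
    print(f"per-step p = {p:.2f}: {by_length}")
```

Under this model a modest gain in single-step accuracy compounds across the chain, which is one way to read the claim that training on single-step operations can unlock multi-step tasks.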

Title: When Does Context Help? Error Dynamics of Contextual Information in Large Language Models

Authors: Dingzirui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08294
Pdf URL: https://arxiv.org/pdf/2602.08294
Copy Paste: [[2602.08294]] When Does Context Help? Error Dynamics of Contextual Information in Large Language Models(https://arxiv.org/abs/2602.08294)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Contextual information at inference time, such as demonstrations, retrieved knowledge, or interaction history, can substantially improve large language models (LLMs) without parameter updates, yet its theoretical role remains poorly understood beyond specific settings such as in-context learning (ICL). We present a unified theoretical framework for analyzing the effect of arbitrary contextual information in Transformer-based LLMs. Our analysis characterizes contextual influence through output error dynamics. In a single-layer Transformer, we prove that the context-conditioned error vector decomposes additively into the baseline error vector and a contextual correction vector. This yields necessary geometric conditions for error reduction: the contextual correction must align with the negative baseline error and satisfy a norm constraint. We further show that the contextual correction norm admits an explicit upper bound determined by context-query relevance and complementarity. These results extend to multi-context and multi-layer Transformers. Experiments across ICL, retrieval-augmented generation, and memory evolution validate our theory and motivate a principled context selection strategy that improves performance by $0.6\%$.
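The geometric condition in this abstract follows from the additive decomposition alone. If the context-conditioned error is e + c (baseline error e, contextual correction c), then ||e + c||^2 = ||e||^2 + 2<e, c> + ||c||^2, so the error shrinks exactly when <c, -e> > ||c||^2 / 2. The sketch below checks this numerically; the vectors are toy values, and the paper's precise statement of the condition may differ from this derivation.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def reduces_error(baseline_error, correction):
    """True when adding `correction` to `baseline_error` lowers the error norm.

    Equivalent to: the correction aligns with the negative baseline error
    strongly enough to outweigh half its own squared norm.
    """
    negative_error = [-x for x in baseline_error]
    return dot(correction, negative_error) > 0.5 * dot(correction, correction)


e = [1.0, 0.0]
cases = {
    "aligned, small": [-0.5, 0.1],   # points against e with a modest norm
    "aligned, too long": [-3.0, 0.0],  # right direction but overshoots
    "orthogonal": [0.0, 0.5],        # no alignment with -e
}
for name, c in cases.items():
    new_error = [a + b for a, b in zip(e, c)]
    print(name, reduces_error(e, c), norm(new_error) < norm(e))
```

The "too long" case shows why alignment alone is not enough: a correction pointing the right way can still increase the error if its norm is too large.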

Title: Improving Data and Reward Design for Scientific Reasoning in Large Language Models

Authors: Zijie Chen, Zhenghao Lin, Xiao Liu, Zhenzhong Lan, Yeyun Gong, Peng Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08321
Pdf URL: https://arxiv.org/pdf/2602.08321
Copy Paste: [[2602.08321]] Improving Data and Reward Design for Scientific Reasoning in Large Language Models(https://arxiv.org/abs/2602.08321)
Keywords: language model, gpt
Abstract: Solving open-ended science questions remains challenging for large language models, particularly due to inherently unreliable supervision and evaluation. The bottleneck lies in the data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained using the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.

Title: Latent Reasoning with Supervised Thinking States

Authors: Ido Amos, Avi Caciularu, Mor Geva, Amir Globerson, Jonathan Herzig, Lior Shani, Idan Szpektor
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.08332
Pdf URL: https://arxiv.org/pdf/2602.08332
Copy Paste: [[2602.08332]] Latent Reasoning with Supervised Thinking States(https://arxiv.org/abs/2602.08332)
Keywords: language model, llm, chain-of-thought
Abstract: Reasoning with a chain-of-thought (CoT) enables Large Language Models (LLMs) to solve complex tasks but incurs significant inference costs due to the generation of long rationales. We propose Thinking States, a method that performs reasoning while the input is being processed. Specifically, Thinking States generates sequences of thinking tokens every few input tokens, transforms the thoughts back into embedding space, and adds them to the following input tokens. This has two key advantages. First, it captures the recurrent nature of CoT, but with the thought tokens generated as the input is processed. Second, since the thoughts are represented as tokens, they can be learned from natural language supervision using teacher forcing, which is parallelizable. Empirically, Thinking States outperforms other latent reasoning methods on multiple reasoning tasks, narrowing the gap to CoT on math problems, and matching its performance on 2-Hop QA with improved latency. On state-tracking tasks, we show Thinking States leads to stronger reasoning behavior than CoT, successfully extrapolating to longer sequences than seen during training.

Title: UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models

Authors: Cheng Yang, Chufan Shi, Bo Shui, Yaokang Wu, Muzi Tao, Huijuan Wang, Ivan Yee Lee, Yong Liu, Xuezhe Ma, Taylor Berg-Kirkpatrick
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2602.08336
Pdf URL: https://arxiv.org/pdf/2602.08336
Copy Paste: [[2602.08336]] UReason: Benchmarking the Reasoning Paradox in Unified Multimodal Models(https://arxiv.org/abs/2602.08336)
Keywords: prompt, chain-of-thought
Abstract: To elicit capabilities for addressing complex and implicit visual requirements, recent unified multimodal models increasingly adopt chain-of-thought reasoning to guide image generation. However, the actual effect of reasoning on visual synthesis remains unclear. We present UReason, a diagnostic benchmark for reasoning-driven image generation that evaluates whether reasoning can be faithfully executed in pixels. UReason contains 2,000 instances across five task families: Code, Arithmetic, Spatial, Attribute, and Text reasoning. To isolate the role of reasoning traces, we introduce an evaluation framework comparing direct generation, reasoning-guided generation, and de-contextualized generation which conditions only on the refined prompt. Across eight open-source unified models, we observe a consistent Reasoning Paradox: Reasoning traces generally improve performance over direct generation, yet retaining intermediate thoughts as conditioning context often hinders visual synthesis, and conditioning only on the refined prompt yields substantial gains. Our analysis suggests that the bottleneck lies in contextual interference rather than insufficient reasoning capacity. UReason provides a principled testbed for studying reasoning in unified models and motivates future methods that effectively integrate reasoning for visual generation while mitigating interference.

Title: WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints

Authors: Zexuan Wang, Chenghao Yang, Yingqi Que, Zhenzhu Yang, Huaqing Yuan, Yiwen Wang, Zhengxuan Jiang, Shengjie Fang, Zhenhe Wu, Zhaohui Wang, Zhixin Yao, Jiashuo Liu, Jincheng Ren, Yuzhen Li, Yang Yang, Jiaheng Liu, Jian Yang, Zaiyuan Wang, Ge Zhang, Zhoufutu Wen, Wenhao Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08367
Pdf URL: https://arxiv.org/pdf/2602.08367
Copy Paste: [[2602.08367]] WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints(https://arxiv.org/abs/2602.08367)
Keywords: gpt, agent
Abstract: Real-world autonomous planning requires coordinating tightly coupled constraints where a single decision dictates the feasibility of all subsequent actions. However, existing benchmarks predominantly feature loosely coupled constraints solvable through local greedy decisions and rely on idealized data, failing to capture the complexity of extracting parameters from dynamic web environments. We introduce WorldTravel, a benchmark comprising 150 real-world travel scenarios across 5 cities that demand navigating an average of 15+ interdependent temporal and logical constraints. To evaluate agents in realistic deployments, we develop WorldTravel-Webscape, a multi-modal environment featuring over 2,000 rendered webpages where agents must perceive constraint parameters directly from visual layouts to inform their planning. Our evaluation of 10 frontier models reveals a significant performance collapse: even the state-of-the-art GPT-5.2 achieves only 32.67% feasibility in text-only settings, which plummets to 19.33% in multi-modal environments. We identify a critical Perception-Action Gap and a Planning Horizon threshold at approximately 10 constraints where model reasoning consistently fails, suggesting that perception and reasoning remain independent bottlenecks. These findings underscore the need for next-generation agents that unify high-fidelity visual perception with long-horizon reasoning to handle brittle real-world logistics.
摘要：现实世界的自主规划需要协调紧密耦合的约束，其中单个决策决定所有后续行动的可行性。然而，现有的基准主要具有松散耦合的约束，可通过局部贪婪决策来解决，并且依赖于理想化数据，无法捕获从动态 Web 环境中提取参数的复杂性。我们引入 \textbf{WorldTravel}，这是一个基准，包含 5 个城市的 150 个现实世界旅行场景，需要导航平均超过 15 个相互依赖的时间和逻辑约束。为了评估实际部署中的代理，我们开发了 \textbf{WorldTravel-Webscape}，这是一个多模式环境，具有 2,000 多个渲染的网页，代理必须直接从视觉布局中感知约束参数，以告知他们的规划。我们对 10 个前沿模型的评估揭示了显着的性能崩溃：即使是最先进的 GPT-5.2 在纯文本设置中也仅实现了 32.67% 的可行性，在多模态环境中骤降至 19.33%。我们在大约 10 个约束条件下确定了关键的感知-行动差距和规划范围阈值，其中模型推理始终失败，这表明感知和推理仍然是独立的瓶颈。这些发现强调需要下一代智能体将高保真视觉感知与长视野推理相结合，以处理脆弱的现实世界物流。

Title: ViGoEmotions: A Benchmark Dataset For Fine-grained Emotion Detection on Vietnamese Texts

Authors: Hung Quang Tran, Nam Tien Pham, Son T. Luu, Kiet Van Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08371
Pdf URL: https://arxiv.org/pdf/2602.08371
Copy Paste: [[2602.08371]] ViGoEmotions: A Benchmark Dataset For Fine-grained Emotion Detection on Vietnamese Texts(https://arxiv.org/abs/2602.08371)
Keywords: language model, llm
Abstract: Emotion classification plays a significant role in emotion prediction and harmful content detection. Recent advancements in NLP, particularly through large language models (LLMs), have greatly improved outcomes in this field. This study introduces ViGoEmotions -- a Vietnamese emotion corpus comprising 20,664 social media comments in which each comment is classified into 27 fine-grained distinct emotions. To evaluate the quality of the dataset and its impact on emotion classification, eight pre-trained Transformer-based models were evaluated under three preprocessing strategies: preserving original emojis with rule-based normalization, converting emojis into textual descriptions, and applying ViSoLex, a model-based lexical normalization system. Results show that converting emojis into text often improves the performance of several BERT-based baselines, while preserving emojis yields the best results for ViSoBERT and CafeBERT. In contrast, removing emojis generally leads to lower performance. ViSoBERT achieved the highest Macro F1-score of 61.50% and Weighted F1-score of 63.26%. Strong performance was also observed from CafeBERT and PhoBERT. These findings highlight that while the proposed corpus can support diverse architectures effectively, preprocessing strategies and annotation quality remain key factors influencing downstream performance.
摘要：情绪分类在情绪预测和有害内容检测中发挥着重要作用。 NLP 的最新进展，特别是通过大型语言模型 (LLM)，极大地改善了该领域的成果。这项研究引入了 ViGoEmotions——一个越南情感语料库，包含 20,664 条社交媒体评论，其中每条评论被分为 27 种细粒度的不同情感。为了评估数据集的质量及其对情感分类的影响，在三种预处理策略下评估了八个基于 Transformer 的预训练模型：使用基于规则的标准化保留原始表情符号、将表情符号转换为文本描述以及应用基于模型的词汇标准化系统 ViSoLex。结果表明，将表情符号转换为文本通常可以提高多个基于 BERT 的基线的性能，而保留表情符号可以为 ViSoBERT 和 CafeBERT 带来最佳结果。相反，删除表情符号通常会导致性能下降。 ViSoBERT 获得了最高的宏观 F1 分数 61.50% 和加权 F1 分数 63.26%。 CafeBERT 和 PhoBERT 也表现强劲。这些发现强调，虽然所提出的语料库可以有效支持不同的架构，但预处理策略和注释质量仍然是影响下游性能的关键因素。
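
The ViGoEmotions abstract reports both a Macro F1-score and a Weighted F1-score. The sketch below shows how the two averages differ, assuming a single-label setting for simplicity (the corpus itself may be multi-label); the function name `macro_and_weighted_f1` and the toy labels are hypothetical and not taken from the paper.

```python
from collections import Counter

def macro_and_weighted_f1(y_true, y_pred):
    """Return (macro F1, weighted F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    support = Counter(y_true)  # number of gold examples per label
    per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        per_class[label] = 2 * tp / denom if denom else 0.0
    # Macro: every class counts equally, however rare it is.
    macro = sum(per_class.values()) / len(labels)
    # Weighted: each class counts in proportion to its gold support.
    weighted = sum(per_class[l] * support[l] for l in labels) / len(y_true)
    return macro, weighted

if __name__ == "__main__":
    gold = ["joy", "joy", "joy", "anger"]
    pred = ["joy", "joy", "anger", "anger"]
    print(macro_and_weighted_f1(gold, pred))
```

With many fine-grained and imbalanced emotion classes, macro F1 is typically the stricter of the two, because rare classes weigh as much as frequent ones.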

Title: Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning

Authors: Zhuoen Chen, Dongfang Li, Meishan Zhang, Baotian Hu, Min Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.08382
Pdf URL: https://arxiv.org/pdf/2602.08382
Copy Paste: [[2602.08382]] Dynamic Long Context Reasoning over Compressed Memory via End-to-End Reinforcement Learning(https://arxiv.org/abs/2602.08382)
Keywords: language model, llm, long context, retrieval-augmented generation, agent
Abstract: Large Language Models (LLMs) face significant challenges in long-context processing, including quadratic computational costs, information forgetting, and the context fragmentation inherent in retrieval-augmented generation (RAG). We propose a cognitively inspired framework for efficient long-context inference based on chunk-wise compression and selective memory recall, rather than processing all raw tokens. The framework segments long inputs into chunks and encodes each chunk into compressed memory representations using a learned compressor. A gating module dynamically selects relevant memory blocks, which are then iteratively processed by a reasoning module with an evolving working memory to solve downstream tasks. The compressor and reasoner are jointly optimized via end-to-end reinforcement learning, while the gating module is trained separately as a classifier. Experimental results show that the proposed method achieves competitive accuracy on multi-hop reasoning benchmarks such as RULER-HQA, extrapolates context length from 7K to 1.75M tokens, and offers a favorable accuracy-efficiency trade-off compared to strong long-context baselines. In particular, it achieves up to a 2 times reduction in peak GPU memory usage and a 6 times inference speedup over MemAgent.
摘要：大型语言模型 (LLM) 在长上下文处理中面临重大挑战，包括二次计算成本、信息遗忘以及检索增强生成 (RAG) 中固有的上下文碎片。我们提出了一种基于分块压缩和选择性记忆回忆的高效长上下文推理的认知启发框架，而不是处理所有原始标记。该框架将长输入分割成块，并使用学习的压缩器将每个块编码为压缩的内存表示。门控模块动态选择相关的内存块，然后由具有不断发展的工作内存的推理模块迭代处理这些内存块，以解决下游任务。压缩器和推理器通过端到端强化学习联合优化，而门控模块作为分类器单独训练。实验结果表明，所提出的方法在 RULER-HQA 等多跳推理基准上实现了有竞争力的准确性，将上下文长度从 7K 推断到 1.75M 标记，并且与强长上下文基线相比，提供了有利的准确性-效率权衡。特别是，与 MemAgent 相比，它的峰值 GPU 内存使用量降低了 2 倍，推理速度提高了 6 倍。

Title: TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration

Authors: Linye Wei, Zixiang Luo, Pingzhi Tang, Meng Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08404
Pdf URL: https://arxiv.org/pdf/2602.08404
Copy Paste: [[2602.08404]] TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration(https://arxiv.org/abs/2602.08404)
Keywords: language model, llm
Abstract: Diffusion large language models (dLLMs) have recently gained significant attention due to their inherent support for parallel decoding. Building on this paradigm, Mixture-of-Experts (MoE) dLLMs with autoregressive (AR) initialization have further demonstrated strong performance competitive with mainstream AR models. However, we identify a fundamental mismatch between MoE architectures and diffusion-based decoding. Specifically, a large number of experts are activated at each denoising step, while only a small subset of tokens is ultimately accepted, resulting in substantial inference overhead and limiting their deployment in latency-sensitive applications. In this work, we propose TEAM, a plug-and-play framework that accelerates MoE dLLMs by enabling more accepted tokens with fewer activated experts. TEAM is motivated by the observation that expert routing decisions exhibit strong temporal consistency across denoising levels as well as spatial consistency across token positions. Leveraging these properties, TEAM employs three complementary expert activation and decoding strategies, conservatively selecting necessary experts for decoded and masked tokens and simultaneously performing aggressive speculative exploration across multiple candidates. Experimental results demonstrate that TEAM achieves up to 2.2x speedup over vanilla MoE dLLM, with negligible performance degradation. Code is released at this https URL.
摘要：扩散大语言模型 (dLLM) 最近因其对并行解码的固有支持而受到广泛关注。在此范式的基础上，具有自回归 (AR) 初始化功能的混合专家 (MoE) dLLM 进一步展示了与主流 AR 模型相媲美的强大性能。然而，我们发现 MoE 架构和基于扩散的解码之间存在根本的不匹配。具体来说，在每个去噪步骤中都会激活大量专家，而最终只接受一小部分令牌，从而导致大量的推理开销并限制了它们在延迟敏感的应用程序中的部署。在这项工作中，我们提出了 TEAM，这是一个即插即用的框架，通过用更少的激活专家启用更多接受的代币来加速 MoE dLLM。 TEAM 的动机是观察到专家路由决策在去噪级别上表现出很强的时间一致性，并且在令牌位置之间表现出很强的空间一致性。利用这些特性，TEAM 采用三种互补的专家激活和解码策略，保守地选择必要的专家来解码和屏蔽令牌，并同时对多个候选者进行积极的推测性探索。实验结果表明，TEAM 比普通 MoE dLLM 实现了高达 2.2 倍的加速，而性能下降可以忽略不计。代码在此 https URL 发布。

Title: Prism: Spectral-Aware Block-Sparse Attention

Authors: Xinghao Wang, Pengyu Wang, Xiaoran Liu, Fangxu Liu, Jason Chu, Kai Song, Xipeng Qiu
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2602.08426
Pdf URL: https://arxiv.org/pdf/2602.08426
Copy Paste: [[2602.08426]] Prism: Spectral-Aware Block-Sparse Attention(https://arxiv.org/abs/2602.08426)
Keywords: llm
Abstract: Block-sparse attention is promising for accelerating long-context LLM pre-filling, yet identifying relevant blocks efficiently remains a bottleneck. Existing methods typically employ coarse-grained attention as a proxy for block importance estimation, but often resort to expensive token-level searching or scoring, resulting in significant selection overhead. In this work, we trace the inaccuracy of standard coarse-grained attention via mean pooling to a theoretical root cause: the interaction between mean pooling and Rotary Positional Embeddings (RoPE). We prove that mean pooling acts as a low-pass filter that induces destructive interference in high-frequency dimensions, effectively creating a "blind spot" for local positional information (e.g., slash patterns). To address this, we introduce Prism, a training-free spectral-aware approach that decomposes block selection into high-frequency and low-frequency branches. By applying energy-based temperature calibration, Prism restores the attenuated positional signals directly from pooled representations, enabling block importance estimation using purely block-level operations, thereby improving efficiency. Extensive evaluations confirm that Prism maintains accuracy parity with full attention while delivering up to 5.1x speedup.
摘要：块稀疏注意力有望加速长上下文 LLM 预填充，但有效识别相关块仍然是一个瓶颈。现有方法通常采用粗粒度注意力作为块重要性估计的代理，但通常采用昂贵的令牌级搜索或评分，从而导致显着的选择开销。在这项工作中，我们通过均值池化将标准粗粒度注意力的不准确性追溯到理论根本原因：均值池化和旋转位置嵌入（RoPE）之间的相互作用。我们证明，均值池充当低通滤波器，在高频维度中引起破坏性干扰，有效地为局部位置信息（例如斜线图案）创建“盲点”。为了解决这个问题，我们引入了 Prism，这是一种免训练的频谱感知方法，可将块选择分解为高频和低频分支。通过应用基于能量的温度校准，Prism 直接从池表示中恢复衰减的位置信号，从而能够使用纯粹的块级操作来估计块重要性，从而提高效率。广泛的评估证实，Prism 在充分关注的情况下保持了准确性，同时提供高达 $\mathbf{5.1\times}$ 的加速。
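
The low-pass claim in the Prism abstract can be illustrated numerically. The sketch below is not the paper's method: it only averages the unit rotations that RoPE applies to a constant 2-D pair across one block and reports how much magnitude survives. The block size of 64, head dimension of 64, and base of 10000 are assumed values, and the function names are hypothetical.

```python
import cmath

def pooled_magnitude(theta: float, block_size: int) -> float:
    """Magnitude of the mean of the rotations e^{i*theta*k}, k = 0..block_size-1."""
    total = sum(cmath.exp(1j * theta * k) for k in range(block_size))
    return abs(total) / block_size

def rope_frequencies(head_dim: int, base: float = 10000.0) -> list:
    """Standard RoPE angular frequencies, one per 2-D rotation pair."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

if __name__ == "__main__":
    freqs = rope_frequencies(head_dim=64)
    for i in (0, 8, 16, 31):
        mag = pooled_magnitude(freqs[i], block_size=64)
        print(f"pair {i:2d}  theta={freqs[i]:.5f}  |mean|={mag:.3f}")
```

Pairs with a large angular frequency complete several turns inside one block, so their pooled mean nearly cancels, while slowly rotating pairs pass through almost unchanged. This is the behavior of a low-pass filter.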

Title: Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI

Authors: Ziyan wang, Longlong Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08437
Pdf URL: https://arxiv.org/pdf/2602.08437
Copy Paste: [[2602.08437]] Large Language Models and Impossible Language Acquisition: "False Promise" or an Overturn of our Current Perspective towards AI(https://arxiv.org/abs/2602.08437)
Keywords: language model, gpt, llm, chat
Abstract: In Chomsky's provocative critique "The False Promise of ChatGPT," Large Language Models (LLMs) are characterized as mere pattern predictors that do not acquire languages via intrinsic causal and self-correction structures as humans do, and therefore cannot distinguish impossible languages. The critique stands as a representative fundamental challenge to the intellectual foundations of AI, for it synthesizes major methodological issues in LLMs and embodies an iconic a priori rationalist perspective. We examine this famous critique from the perspective of the pre-existing literature in linguistics and psychology, as well as through an experiment probing the capacity of LLMs to learn both possible and impossible languages. We constructed a set of syntactically impossible languages by applying certain transformations to English. These include reversing whole sentences and adding negation based on word-count parity. Two rounds of controlled experiments were each conducted on GPT-2 small models and long short-term memory (LSTM) models. Statistical analysis (Welch's t-test) shows that GPT-2 small models underperform in learning all of the impossible languages compared to their performance on the possible language (p<.001). On the other hand, LSTM models' performance tallies with Chomsky's argument, suggesting the irreplaceable role of the evolution of transformer architecture. Based on theoretical analysis and empirical findings, we propose a new vision within Chomsky's theory towards LLMs, and a shift of theoretical paradigm outside Chomsky, from his "rationalist-romantics" paradigm to functionalism and empiricism in LLM research.
摘要：在乔姆斯基的挑衅性批评“CHATGPT 的虚假承诺”中，大型语言模型（LLM）被描述为纯粹的模式预测器，它们不会像人类一样通过内在因果和自我纠正结构来获取语言，因此无法区分不可能的语言。它是对人工智能知识基础发起根本性挑战的代表，因为它综合了法学硕士方法论中的重大问题，并具有标志性的先验理性主义观点。我们从现有语言学和心理学文献的角度以及基于一项实验的研究来审视这位著名的批评家，该实验旨在探究法学硕士学习可能和不可能语言的能力。我们通过对英语进行某些转换来构建一组语法上不可能的语言。其中包括反转整个句子，以及根据字数奇偶性添加否定。两轮对照实验分别在 GPT-2 小模型和长短期记忆（LSTM）模型上进行。统计分析（Welch 的 t 检验）显示，与学习可能语言的表现相比，GPT2 小模型在学习所有不可能的语言方面表现不佳 (p<.001)。另一方面，LSTM模型的性能与乔姆斯基的论点相符，表明变压器架构的演变具有不可替代的作用。基于理论分析和实证研究结果，我们提出了乔姆斯基理论中对法学硕士的新愿景，以及乔姆斯基之外的理论范式的转变，从他的“理性主义浪漫主义”范式转向法学硕士研究中的功能主义和经验主义。
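
The two transformations named in the abstract (whole-sentence reversal and parity-based negation) can be sketched as follows. The abstract does not give the exact rules, so the whitespace tokenization, the choice to negate on an even word count, and the marker word `not` appended at the end are all assumptions made for illustration; the function names are hypothetical.

```python
def reverse_sentence(sentence: str) -> str:
    """Reverse the word order of a whole sentence (whitespace tokenization)."""
    return " ".join(reversed(sentence.split()))

def parity_negation(sentence: str, marker: str = "not") -> str:
    """Append a negation marker only when the word count is even (assumed rule)."""
    words = sentence.split()
    if len(words) % 2 == 0:
        words.append(marker)
    return " ".join(words)

if __name__ == "__main__":
    for s in ("the cat sat", "the cat sat down"):
        print(reverse_sentence(s), "|", parity_negation(s))
```

Both rules depend on linear position or on counting rather than on hierarchical structure, which is why languages built this way are considered syntactically impossible.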

Title: GISA: A Benchmark for General Information-Seeking Assistant

Authors: Yutao Zhu, Xingshuo Zhang, Maosen Zhang, Jiajie Jin, Liancheng Zhang, Xiaoshuai Song, Kangzhi Zhao, Wencong Zeng, Ruiming Tang, Han Li, Ji-Rong Wen, Zhicheng Dou
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2602.08543
Pdf URL: https://arxiv.org/pdf/2602.08543
Copy Paste: [[2602.08543]] GISA: A Benchmark for General Information-Seeking Assistant(https://arxiv.org/abs/2602.08543)
Keywords: language model, llm, agent
Abstract: The advancement of large language models (LLMs) has significantly accelerated the development of search agents capable of autonomously gathering information through multi-turn web interactions. Various benchmarks have been proposed to evaluate such agents. However, existing benchmarks often construct queries backward from answers, producing unnatural tasks misaligned with real-world needs. Moreover, these benchmarks tend to focus on either locating specific information or aggregating information from multiple sources, while relying on static answer sets prone to data contamination. To bridge these gaps, we introduce GISA, a benchmark for General Information-Seeking Assistants comprising 373 human-crafted queries that reflect authentic information-seeking scenarios. GISA features four structured answer formats (item, set, list, and table), enabling deterministic evaluation. It integrates both deep reasoning and broad information aggregation within unified tasks, and includes a live subset with periodically updated answers to resist memorization. Notably, GISA provides complete human search trajectories for every query, offering gold-standard references for process-level supervision and imitation learning. Experiments on mainstream LLMs and commercial search products reveal that even the best-performing model achieves only 19.30% exact match score, with performance notably degrading on tasks requiring complex planning and comprehensive information gathering. These findings highlight substantial room for future improvement.
摘要：大语言模型（LLM）的进步显着加速了能够通过多轮网络交互自主收集信息的搜索代理的开发。已经提出了各种基准来评估此类代理。然而，现有的基准测试通常从答案向后构建查询，从而产生与现实世界需求不相符的不自然的任务。此外，这些基准往往侧重于定位特定信息或聚合来自多个来源的信息，同时依赖容易受到数据污染的静态答案集。为了弥补这些差距，我们引入了 GISA，这是通用信息搜索助手的基准，包含 373 个人工查询，反映了真实的信息搜索场景。 GISA 具有四种结构化答案格式（项目、集合、列表和表格），可实现确定性评估。它将深度推理和广泛的信息聚合集成到统一的任务中，并包含一个实时子集，其中定期更新答案以防止记忆。值得注意的是，GISA 为每个查询提供完整的人类搜索轨迹，为流程级监督和模仿学习提供黄金标准参考。对主流法学硕士和商业搜索产品的实验表明，即使是性能最好的模型也只能达到 19.30% 的精确匹配分数，在需要复杂规划和全面信息收集的任务上性能显着下降。这些发现凸显了未来改进的巨大空间。
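
Structured answer formats make exact-match scoring deterministic. The sketch below shows one way this could work for the four formats named in the abstract. It is not GISA's actual scorer: the normalization rule, the order-insensitive comparison of table rows, and the names `normalize` and `exact_match` are assumptions.

```python
def normalize(value) -> str:
    """Lower-case and collapse whitespace so trivial differences do not matter."""
    return " ".join(str(value).strip().lower().split())

def exact_match(pred, gold, fmt: str) -> bool:
    """Deterministic comparison for the answer formats item, set, list, table."""
    if fmt == "item":
        return normalize(pred) == normalize(gold)
    if fmt == "set":  # order and duplicates are ignored
        return {normalize(x) for x in pred} == {normalize(x) for x in gold}
    if fmt == "list":  # order matters
        return [normalize(x) for x in pred] == [normalize(x) for x in gold]
    if fmt == "table":  # assumed: row order ignored, cell order within a row kept
        def rows(table):
            return sorted(tuple(normalize(c) for c in row) for row in table)
        return rows(pred) == rows(gold)
    raise ValueError(f"unknown answer format: {fmt}")

if __name__ == "__main__":
    print(exact_match(["Paris", "Rome"], ["rome", "paris"], "set"))
    print(exact_match(["Paris", "Rome"], ["rome", "paris"], "list"))
```

Because every branch reduces to an equality check on normalized values, the same prediction always receives the same score, with no judge model involved.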

Title: How Do Language Models Understand Tables? A Mechanistic Analysis of Cell Location

Authors: Xuanliang Zhang, Dingzirui Wang, Keyan Xu, Qingfu Zhu, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08548
Pdf URL: https://arxiv.org/pdf/2602.08548
Copy Paste: [[2602.08548]] How Do Language Models Understand Tables? A Mechanistic Analysis of Cell Location(https://arxiv.org/abs/2602.08548)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) are increasingly deployed for table-related tasks, the internal mechanisms enabling them to process linearized two-dimensional structured tables remain opaque. In this work, we investigate the process of table understanding by dissecting the atomic task of cell location. Through activation patching and complementary interpretability techniques, we delineate the table understanding mechanism into a sequential three-stage pipeline: Semantic Binding, Coordinate Localization, and Information Extraction. We demonstrate that models locate the target cell via an ordinal mechanism that counts discrete delimiters to resolve coordinates. Furthermore, column indices are encoded within a linear subspace that allows for precise steering of model focus through vector arithmetic. Finally, we reveal that models generalize to multi-cell location tasks by multiplexing the identical attention heads identified during atomic location. Our findings provide a comprehensive explanation of table understanding within Transformer architectures.
摘要：虽然大型语言模型 (LLM) 越来越多地部署用于与表相关的任务，但使它们能够处理线性化二维结构化表的内部机制仍然不透明。在这项工作中，我们通过剖析单元格位置的原子任务来研究表格理解的过程。通过激活修补和互补的可解释性技术，我们将表理解机制描述为一个连续的三阶段管道：语义绑定、坐标定位和信息提取。我们证明模型通过计算离散分隔符来解析坐标的序数机制来定位目标单元格。此外，列索引在线性子空间内进行编码，允许通过向量算术精确控制模型焦点。最后，我们揭示了通过复用原子定位期间识别的相同注意头，模型可以推广到多单元定位任务。我们的研究结果为 Transformer 架构中的表理解提供了全面的解释。
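
The ordinal mechanism described in the abstract (counting discrete delimiters to resolve coordinates) has a direct analogue on a linearized table in plain text. The sketch below illustrates that idea only; it says nothing about how a Transformer implements it internally. The pipe and newline delimiters, the toy table, and the function name `locate_cell` are assumptions.

```python
def locate_cell(linearized: str, row: int, col: int,
                row_sep: str = "\n", col_sep: str = "|") -> str:
    """Return the cell at (row, col), both 0-indexed, by counting delimiters."""
    r = c = 0          # running counts of row and column delimiters seen so far
    chars = []
    for ch in linearized:
        if ch == row_sep:
            if r == row and c == col:
                break  # the target cell ends at this row delimiter
            r, c = r + 1, 0
        elif ch == col_sep:
            if r == row and c == col:
                break  # the target cell ends at this column delimiter
            c += 1
        elif r == row and c == col:
            chars.append(ch)
    return "".join(chars).strip()

if __name__ == "__main__":
    table = "name|age|city\nAda|36|London\nAlan|41|Wilmslow"
    print(locate_cell(table, row=1, col=2))
```

The function never parses the table into a grid. The target coordinates are resolved purely from the delimiter counts accumulated during one left-to-right scan.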

Title: Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation

Authors: Archchana Sindhujan, Girish A. Koushik, Shenbin Qian, Diptesh Kanojia, Constantin Orăsan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08600
Pdf URL: https://arxiv.org/pdf/2602.08600
Copy Paste: [[2602.08600]] Beyond Scalar Scores: Reinforcement Learning for Error-Aware Quality Estimation of Machine Translation(https://arxiv.org/abs/2602.08600)
Keywords: language model, llm
Abstract: Quality Estimation (QE) aims to assess the quality of machine translation (MT) outputs without relying on reference translations, making it essential for real-world, large-scale MT evaluation. Large Language Models (LLMs) have shown significant promise in advancing the field of quality estimation of machine translation. However, most QE approaches rely solely on scalar quality scores, offering no explicit information about the translation errors that should drive these judgments. Moreover, for low-resource languages where annotated QE data is limited, existing approaches struggle to achieve reliable performance. To address these challenges, we introduce the first segment-level QE dataset for English to Malayalam, a severely resource-scarce language pair in the QE domain, comprising human-annotated Direct Assessment (DA) scores and Translation Quality Remarks (TQR), which are short, contextual, free-form annotator comments that describe translation errors. We further introduce ALOPE-RL, a policy-based reinforcement learning framework that trains efficient adapters based on policy rewards derived from DA score and TQR. Integrating error-aware rewards with ALOPE-RL enables LLMs to reason about translation quality beyond numeric scores. Despite being trained on a small-scale QE dataset, ALOPE-RL achieves state-of-the-art performance on English to Malayalam QE using compact LLMs (<=4B parameters) fine-tuned with LoRA and 4-bit quantization, outperforming both larger LLM-based baselines and leading encoder-based QE models. Our results demonstrate that error-aware, policy-based learning can deliver strong QE performance under limited data and compute budgets. We release our dataset, code, and trained models to support future research.
摘要：质量估计 (QE) 旨在不依赖参考翻译来评估机器翻译 (MT) 输出的质量，这对于现实世界的大规模 MT 评估至关重要。大型语言模型 (LLM) 在推进机器翻译质量评估领域显示出巨大的前景。然而，大多数量化宽松方法仅依赖于标量质量分数，没有提供有关驱动这些判断的翻译错误的明确信息。此外，对于带注释的 QE 数据有限的资源匮乏的语言，现有方法很难实现可靠的性能。为了应对这些挑战，我们引入了第一个英语到马拉雅拉姆语的段级 QE 数据集，这是 QE 领域中资源严重稀缺的语言对，包括人工注释的直接评估 (DA) 分数和翻译质量备注 (TQR)，它们是描述翻译错误的简短、上下文、自由格式注释器注释。我们进一步介绍了 ALOPE-RL，这是一种基于策略的强化学习框架，可根据 DA 分数和 TQR 得出的策略奖励来训练高效的适配器。将错误感知奖励与 ALOPE-RL 相结合，使法学硕士能够超越数字分数来推断翻译质量。尽管是在小规模 QE 数据集上进行训练，但 ALOPE-RL 使用通过 LoRA 和 4 位量化进行微调的紧凑 LLM（<=4B 参数}）在英语到马拉雅拉姆语 QE 上实现了最先进的性能，优于基于 LLM 的较大基线和领先的基于编码器的 QE 模型。我们的结果表明，错误感知、基于策略的学习可以在有限的数据和计算预算下提供强大的量化宽松性能。我们发布数据集、代码和训练模型以支持未来的研究。

Title: VocalNet-MDM: Accelerating Streaming Speech LLM via Self-Distilled Masked Diffusion Modeling

Authors: Ziyang Cheng, Yuhao Wang, Heyang Liu, Ronghua Wu, Qunshan Gu, Yanfeng Wang, Yu Wang
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2602.08607
Pdf URL: https://arxiv.org/pdf/2602.08607
Copy Paste: [[2602.08607]] VocalNet-MDM: Accelerating Streaming Speech LLM via Self-Distilled Masked Diffusion Modeling(https://arxiv.org/abs/2602.08607)
Keywords: language model, llm
Abstract: Recent Speech Large Language Models (LLMs) have achieved impressive capabilities in end-to-end speech interaction. However, the prevailing autoregressive paradigm imposes strict serial constraints, limiting generation efficiency and introducing exposure bias. In this paper, we investigate Masked Diffusion Modeling (MDM) as a non-autoregressive paradigm for speech LLMs and introduce VocalNet-MDM. To adapt MDM for streaming speech interaction, we address two critical challenges: training-inference mismatch and iterative overhead. We propose Hierarchical Block-wise Masking to align training objectives with the progressive masked states encountered during block diffusion decoding, and Iterative Self-Distillation to compress multi-step refinement into fewer steps for low-latency inference. Trained on a limited scale of only 6K hours of speech data, VocalNet-MDM achieves a 3.7x to 10x decoding speedup and reduces first-chunk latency by 34% compared to AR baselines. It maintains competitive recognition accuracy while achieving state-of-the-art text quality and speech naturalness, demonstrating that MDM is a promising and scalable alternative for low-latency, efficient speech LLMs.
摘要：最近的语音大型语言模型~（LLM）在端到端语音交互方面取得了令人印象深刻的能力。然而，流行的自回归范式施加了严格的串行约束，限制了发电效率并引入了暴露偏差。在本文中，我们研究了掩蔽扩散建模（MDM）作为语音法学硕士的非自回归范式，并介绍了 VocalNet-MDM。为了使 MDM 适应流式语音交互，我们解决了两个关键挑战：训练推理不匹配和迭代开销。我们提出分层分块屏蔽，将训练目标与块扩散解码过程中遇到的渐进屏蔽状态保持一致，并提出迭代自蒸馏，将多步细化压缩为更少的步骤，以实现低延迟推理。在仅 6K 小时语音数据的有限规模上进行训练，与 AR 基线相比，VocalNet-MDM 实现了 3.7$\times$--10$\times$ 解码加速，并将首块延迟减少了 34\%。它保持有竞争力的识别准确性，同时实现最先进的文本质量和语音自然度，这表明 MDM 是低延迟、高效语音法学硕士的一种有前途且可扩展的替代方案。

Title: Do Multilingual LLMs have specialized language heads?

Authors: Muhammad Naufil
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08625
Pdf URL: https://arxiv.org/pdf/2602.08625
Copy Paste: [[2602.08625]] Do Multilingual LLMs have specialized language heads?(https://arxiv.org/abs/2602.08625)
Keywords: language model, llm
Abstract: Multilingual large language models (LLMs) have gained significant popularity for their ability to process and generate text across multiple languages. However, deploying these models in production can be inefficient when only a subset of the supported languages is of interest. Some research has examined whether machine translation models have language-specific or language-agnostic heads; however, to the best of our knowledge, no such research has been conducted for multilingual LLMs, which are capable of performing diverse tasks beyond translation. This paper explores whether multilingual LLMs have specialized language attention heads for each language, and investigates the possibility of removing language-specific heads for unwanted languages without degrading performance in the targeted languages. Our findings could inform more efficient deployment strategies for multilingual LLMs, enabling reduced model complexity while maintaining high accuracy for targeted languages.
摘要：多语言大语言模型 (LLM) 因其跨多种语言处理和生成文本的能力而广受欢迎。但是，当仅对受支持语言的子集感兴趣时，在生产中部署这些模型可能效率低下。已经进行了一些研究来确定机器翻译模型是否具有特定于语言或与语言无关的头部，但据我们所知，尚未对多语言法学硕士进行研究，据我们所知，多语言法学硕士能够执行翻译以外的各种任务。本文探讨了多语言法学硕士是否对每种语言都有专门的语言注意头，并研究了在不降低目标语言性能的情况下删除不需要的语言的特定语言头的可能性。我们的研究结果可以为多语言法学硕士提供更有效的部署策略，从而降低模型复杂性，同时保持目标语言的高精度。

Title: Fundamental Reasoning Paradigms Induce Out-of-Domain Generalization in Language Models

Authors: Mingzi Cao, Xingwei Tan, Mahmud Akhter, Marco Valentino, Maria Liakata, Xi Wang, Nikolaos Aletras
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08658
Pdf URL: https://arxiv.org/pdf/2602.08658
Copy Paste: [[2602.08658]] Fundamental Reasoning Paradigms Induce Out-of-Domain Generalization in Language Models(https://arxiv.org/abs/2602.08658)
Keywords: language model, llm
Abstract: Deduction, induction, and abduction are fundamental reasoning paradigms, core for human logical thinking. Although improving Large Language Model (LLM) reasoning has attracted significant research efforts, the extent to which the fundamental paradigms induce generalization has yet to be systematically explored. In this study, we shed light on how the interplay between these core paradigms influences LLMs' reasoning behavior. To this end, we first collect a new dataset of reasoning trajectories from symbolic tasks, each targeting one of the three fundamental paradigms, to abstract from concrete world knowledge. Then, we investigate effective ways for inducing these skills into LLMs. We experiment with a battery of methods including simple fine-tuning, and more complex approaches to increase model depth, or transform a dense model to a mixture-of-experts. We comprehensively evaluate induced models on realistic out-of-domain tasks that are entirely formulated in natural language and contain real-world knowledge. Our results reveal that our approach yields strong generalizability with substantial performance gains (up to 14.60) across realistic tasks.
摘要：演绎、归纳和溯因是基本的推理范式，是人类逻辑思维的核心。尽管改进大语言模型（LLM）推理已经吸引了大量的研究工作，但基本范式诱导泛化的程度仍有待系统探索。在这项研究中，我们阐明了这些核心范式之间的相互作用如何影响法学硕士的推理行为。为此，我们首先从符号任务中收集一个新的推理轨迹数据集，每个数据集都针对三个基本范式之一，以从具体的世界知识中抽象出来。然后，我们研究将这些技能引入法学硕士的有效方法。我们尝试了一系列方法，包括简单的微调和更复杂的方法来增加模型深度，或将密集模型转换为专家混合模型。我们全面评估现实域外任务的诱导模型，这些模型完全用自然语言表述并包含现实世界的知识。我们的结果表明，我们的方法具有很强的通用性，在实际任务中具有显着的性能提升（高达 14.60 美元）。

Title: Learning to Judge: LLMs Designing and Applying Evaluation Rubrics

Authors: Clemencia Siro, Pourya Aliannejadi, Mohammad Aliannejadi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.08672
Pdf URL: https://arxiv.org/pdf/2602.08672
Copy Paste: [[2602.08672]] Learning to Judge: LLMs Designing and Applying Evaluation Rubrics(https://arxiv.org/abs/2602.08672)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly used as evaluators for natural language generation, applying human-defined rubrics to assess system outputs. However, human rubrics are often static and misaligned with how models internally represent language quality. We introduce GER-Eval (Generating Evaluation Rubrics for Evaluation) to investigate whether LLMs can design and apply their own evaluation rubrics. We evaluate the semantic coherence and scoring reliability of LLM-defined criteria and their alignment with human criteria. LLMs reliably generate interpretable and task-aware evaluation dimensions and apply them consistently within models, but their scoring reliability degrades in factual and knowledge-intensive settings. Closed-source models such as GPT-4o achieve higher agreement and cross-model generalization than open-weight models such as Llama. Our findings position evaluation as a learned linguistic capability of LLMs, consistent within models but fragmented across them, and call for new methods that jointly model human and LLM evaluative language to improve reliability and interpretability.
摘要：大型语言模型 (LLM) 越来越多地用作自然语言生成的评估器，应用人类定义的规则来评估系统输出。然而，人类的标准通常是静态的，并且与模型内部表示语言质量的方式不一致。我们引入 GER-Eval（生成评估标准）来研究法学硕士是否可以设计和应用自己的评估标准。我们评估法学硕士定义的标准的语义一致性和评分可靠性及其与人类标准的一致性。法学硕士可靠地生成可解释和任务感知的评估维度，并在模型中一致地应用它们，但它们的评分可靠性在事实和知识密集型环境中会降低。 GPT-4o 等闭源模型比 Llama 等开放权重模型实现了更高的一致性和跨模型泛化。我们的研究结果将评估定位为法学硕士学习的语言能力，在模型内一致，但在模型之间分散，并呼吁联合建模人类和法学硕士评估语言以提高可靠性和可解释性。

Title: Old wine in old glasses: Comparing computational and qualitative methods in identifying incivility on Persian Twitter during the #MahsaAmini movement

Authors: Hossein Kermani, Fatemeh Oudlajani, Pardis Yarahmadi, Hamideh Mahdi Soltani, Mohammad Makki, Zahra HosseiniKhoo
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2602.08688
Pdf URL: https://arxiv.org/pdf/2602.08688
Copy Paste: [[2602.08688]] Old wine in old glasses: Comparing computational and qualitative methods in identifying incivility on Persian Twitter during the #MahsaAmini movement(https://arxiv.org/abs/2602.08688)
Keywords: language model, gpt, prompt, chat
Abstract: This paper compares three approaches to detecting incivility in Persian tweets: human qualitative coding, supervised learning with ParsBERT, and large language models (ChatGPT). Using 47,278 tweets from the #MahsaAmini movement in Iran, we evaluate the accuracy and efficiency of each method. ParsBERT substantially outperforms seven evaluated ChatGPT models in identifying hate speech. We also find that ChatGPT struggles not only with subtle cases but also with explicitly uncivil content, and that prompt language (English vs. Persian) does not meaningfully affect its outputs. The study provides a detailed comparison of these approaches and clarifies their strengths and limitations for analyzing hate speech in a low-resource language context.
摘要：本文比较了检测波斯推文中不文明行为的三种方法：人类定性编码、ParsBERT 监督学习和大型语言模型 (ChatGPT)。我们使用伊朗 #MahsaAmini 运动的 47,278 条推文来评估每种方法的准确性和效率。 ParsBERT 在识别仇恨言论方面远远优于七个经过评估的 ChatGPT 模型。我们还发现 ChatGPT 不仅难以处理微妙的情况，而且还难以处理明显不文明的内容，并且提示语言（英语与波斯语）不会对其输出产生有意义的影响。该研究对这些方法进行了详细比较，并阐明了它们在资源匮乏的语言环境中分析仇恨言论的优点和局限性。

Title: FactSim: Fact-Checking for Opinion Summarization

Authors: Leandro Anghinoni, Jorge Sanchez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08709
Pdf URL: https://arxiv.org/pdf/2602.08709
Copy Paste: [[2602.08709]] FactSim: Fact-Checking for Opinion Summarization(https://arxiv.org/abs/2602.08709)
Keywords: language model, llm
Abstract: We explore the need for more comprehensive and precise evaluation techniques for generative artificial intelligence (GenAI) in text summarization tasks, specifically in the area of opinion summarization. Traditional methods, which leverage automated metrics to compare machine-generated summaries from a collection of opinion pieces, e.g., product reviews, have shown limitations due to the paradigm shift introduced by large language models (LLMs). This paper addresses these shortcomings by proposing a novel, fully automated methodology for assessing the factual consistency of such summaries. The method is based on measuring the similarity between the claims in a given summary and those from the original reviews, measuring the coverage and consistency of the generated summary. To do so, we rely on a simple approach to extract factual assessments from texts, which we then compare and summarize in a suitable score. We demonstrate that the proposed metric attributes higher scores to similar claims, regardless of whether the claim is negated, paraphrased, or expanded, and that the score has a high correlation to human judgment when compared to state-of-the-art metrics.
摘要：我们探索在文本摘要任务中，特别是在观点摘要领域，对生成人工智能（GenAI）更全面、更精确的评估技术的需求。传统方法利用自动化指标来比较机器生成的来自一系列观点的摘要，例如由于大型语言模型（LLM）引入的范式转变，产品评论已显示出局限性。本文提出了一种新颖的、全自动的方法来评估此类摘要的事实一致性，从而解决了这些缺点。该方法基于测量给定摘要中的权利要求与原始评论中的权利要求之间的相似性，测量生成的摘要的覆盖范围和一致性。为此，我们依靠一种简单的方法从文本中提取事实评估，然后进行比较并总结为合适的分数。我们证明，无论该主张是否被否定、解释或扩展，所提出的指标都会为类似的主张赋予更高的分数，并且与最先进的指标相比，该分数与人类判断具有高度相关性。

Title: PERSPECTRA: A Scalable and Configurable Pluralist Benchmark of Perspectives from Arguments

Authors: Shangrui Nie, Kian Omoomi, Lucie Flek, Zhixue Zhao, Charles Welch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08716
Pdf URL: https://arxiv.org/pdf/2602.08716
Copy Paste: [[2602.08716]] PERSPECTRA: A Scalable and Configurable Pluralist Benchmark of Perspectives from Arguments(https://arxiv.org/abs/2602.08716)
Keywords: language model, llm
Abstract: Pluralism, the capacity to engage with diverse perspectives without collapsing them into a single viewpoint, is critical for developing large language models that faithfully reflect human heterogeneity. Yet this characteristic has not been carefully examined in the LLM research community and remains absent from most alignment studies. Debate-oriented sources provide a natural entry point for pluralism research. Previous work builds on online debate sources but remains constrained by costly human validation. Other debate-rich platforms such as Reddit and Kialo also offer promising material: Reddit provides linguistic diversity and scale but lacks clear argumentative structure, while Kialo supplies explicit pro/con graphs but remains overly concise and detached from natural discourse. We introduce PERSPECTRA, a pluralist benchmark that integrates the structural clarity of Kialo debate graphs with the linguistic diversity of real Reddit discussions. Using a controlled retrieval-and-expansion pipeline, we construct 3,810 enriched arguments spanning 762 pro/con stances on 100 controversial topics. Each opinion is expanded to multiple naturalistic variants, enabling robust evaluation of pluralism. We initialise three tasks with PERSPECTRA: opinion counting (identifying distinct viewpoints), opinion matching (aligning supporting stances and discourse to source opinions), and polarity check (inferring aggregate stance in mixed discourse). Experiments with state-of-the-art open-source and proprietary LLMs highlight systematic failures, such as overestimating the number of viewpoints and misclassifying concessive structures, underscoring the difficulty of pluralism-aware understanding and reasoning. By combining diversity with structure, PERSPECTRA establishes the first scalable, configurable benchmark for evaluating how well models represent, distinguish, and reason over multiple perspectives.
摘要：多元主义是一种能够接受不同观点而不将它们压缩成单一观点的能力，对于开发忠实反映人类异质性的大型语言模型至关重要。然而，这一特征尚未在 LLM 研究界得到仔细研究，并且在大多数对齐研究中仍然缺失。以辩论为导向的资源为多元化研究提供了一个自然的切入点。之前的工作建立在在线辩论资源的基础上，但仍然受到昂贵的人工验证的限制。 Reddit 和 Kialo 等其他辩论丰富的平台也提供了有前景的材料：Reddit 提供了语言多样性和规模，但缺乏清晰的论证结构，而 Kialo 提供了明确的赞成/反对图表，但仍然过于简洁且脱离自然话语。我们引入了 PERSPECTRA，这是一个多元基准，它将 Kialo 辩论图的结构清晰度与真实 Reddit 讨论的语言多样性相结合。使用受控的检索和扩展管道，我们构建了 3,810 个丰富的论点，涵盖 100 个有争议主题的 762 个赞成/反对立场。每个观点都扩展到多种自然主义变体，从而能够对多元化进行稳健的评估。我们用 PERSPECTRA 初始化三项任务：意见计数（识别不同的观点）、意见匹配（将支持立场和话语与源意见相一致）和极性检查（推断混合话语中的总体立场）。对最先进的开源和专有 LLM 的实验突显了系统性失败，例如高估了观点数量和错误分类了让步结构，强调了多元意识理解和推理的难度。通过将多样性与结构相结合，PERSPECTRA 建立了第一个可扩展、可配置的基准，用于评估模型在多个角度的表示、区分和推理能力。

Title: LakeHopper: Cross Data Lakes Column Type Annotation through Model Adaptation

Authors: Yushi Sun, Xujia Li, Nan Tang, Quanqing Xu, Chuanhui Yang, Lei Chen
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2602.08793
Pdf URL: https://arxiv.org/pdf/2602.08793
Copy Paste: [[2602.08793]] LakeHopper: Cross Data Lakes Column Type Annotation through Model Adaptation(https://arxiv.org/abs/2602.08793)
Keywords: language model
Abstract: Column type annotation is vital for tasks like data cleaning, integration, and visualization. Recent solutions rely on resource-intensive language models fine-tuned on well-annotated columns from a particular set of tables, i.e., a source data lake. In this paper, we study whether we can adapt an existing pre-trained LM-based model to a new (i.e., target) data lake to minimize the annotations required on the new data lake. However, several challenges exist, including the source-target knowledge gap, selecting informative target data, and fine-tuning without losing shared knowledge. We propose LakeHopper, a framework that identifies and resolves the knowledge gap through LM interactions, employs a cluster-based data selection scheme for unannotated columns, and uses an incremental fine-tuning mechanism that gradually adapts the source model to the target data lake. Our experimental results validate the effectiveness of LakeHopper on two different data lake transfers under both low-resource and high-resource settings.
摘要：列类型注释对于数据清理、集成和可视化等任务至关重要。最近的解决方案依赖于资源密集型语言模型，该模型在一组特定表（即源数据湖）中经过良好注释的列上进行了微调。在本文中，我们研究是否可以将现有的基于 LM 的预训练模型适应新的（即目标）数据湖，以最大限度地减少新数据湖所需的注释。然而，存在的挑战包括源-目标知识差距、选择信息丰富的目标数据以及在不丢失共享知识的情况下进行微调。我们提出了 LakeHopper，一个通过 LM 交互来识别和解决知识差距的框架，对未注释的列采用基于集群的数据选择方案，并使用增量微调机制逐渐使源模型适应目标数据湖。我们的实验结果验证了 LakeHopper 在低资源和高资源设置下两种不同数据湖传输的有效性。

Title: Affective Flow Language Model for Emotional Support Conversation

Authors: Chenghui Zou, Ning Wang, Tiesunlong Shen, Luwei Xiao, Chuan Ma, Xiangpeng Li, Rui Mao, Erik Cambria
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.08826
Pdf URL: https://arxiv.org/pdf/2602.08826
Copy Paste: [[2602.08826]] Affective Flow Language Model for Emotional Support Conversation(https://arxiv.org/abs/2602.08826)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have been widely applied to emotional support conversation (ESC). However, complex multi-turn support remains challenging. This is because existing alignment schemes rely on sparse outcome-level signals, thus offering limited supervision for intermediate strategy decisions. To fill this gap, this paper proposes the affective flow language model for emotional support conversation (AFlow), a framework that introduces fine-grained supervision on dialogue prefixes by modeling a continuous affective flow along multi-turn trajectories. AFlow can estimate intermediate utility over searched trajectories and learn preference-consistent strategy transitions. To improve strategy coherence and empathetic response quality, a subpath-level flow-balance objective is presented to propagate preference signals to intermediate states. Experiment results show consistent and significant improvements over competitive baselines in diverse emotional contexts. Remarkably, AFlow with a compact open-source backbone outperforms proprietary LMMs such as GPT-4o and Claude-3.5 on major ESC metrics. Our code is available at this https URL.
摘要：大语言模型（LLM）已广泛应用于情感支持对话（ESC）。然而，复杂的多轮支持仍然具有挑战性。这是因为现有的对齐方案依赖于稀疏的结果级信号，从而为中间策略决策提供有限的监督。为了填补这一空白，本文提出了用于情感支持对话的情感流语言模型（AFlow），该框架通过沿多轮轨迹建模连续情感流来引入对对话前缀的细粒度监督。 AFlow 可以估计搜索轨迹的中间效用并学习偏好一致的策略转换。为了提高策略一致性和同理心响应质量，提出了子路径级流量平衡目标，将偏好信号传播到中间状态。实验结果表明，在不同的情绪背景下，与竞争基线相比，有一致且显着的改进。值得注意的是，具有紧凑开源主干的 AFlow 在主要 ESC 指标上的表现优于 GPT-4o 和 Claude-3.5 等专有 LMM。我们的代码可以在这个 https URL 上找到。

Title: WildReward: Learning Reward Models from In-the-Wild Human Interactions

Authors: Hao Peng, Yunjia Qi, Xiaozhi Wang, Zijun Yao, Lei Hou, Juanzi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.08829
Pdf URL: https://arxiv.org/pdf/2602.08829
Copy Paste: [[2602.08829]] WildReward: Learning Reward Models from In-the-Wild Human Interactions(https://arxiv.org/abs/2602.08829)
Keywords: language model, llm, chat
Abstract: Reward models (RMs) are crucial for the training of large language models (LLMs), yet they typically rely on large-scale human-annotated preference pairs. With the widespread deployment of LLMs, in-the-wild interactions have emerged as a rich source of implicit reward signals. This raises the question: Can we develop reward models directly from in-the-wild interactions? In this work, we explore this possibility by adopting WildChat as an interaction source and proposing a pipeline to extract reliable human feedback, yielding 186k high-quality instances for training WildReward via ordinal regression directly on user feedback without preference pairs. Extensive experiments demonstrate that WildReward achieves comparable or even superior performance compared to conventional reward models, with improved calibration and cross-sample consistency. We also observe that WildReward benefits directly from user diversity, where more users yield stronger reward models. Finally, we apply WildReward to online DPO training and observe significant improvements across various tasks. Code and data are released at this https URL.
摘要：奖励模型 (RM) 对于大型语言模型 (LLM) 的训练至关重要，但它们通常依赖于大规模人类注释的偏好对。随着 LLM 的广泛部署，野外互动已成为隐性奖励信号的丰富来源。这就提出了一个问题：我们可以直接从野外交互中开发奖励模型吗？在这项工作中，我们通过采用 WildChat 作为交互源并提出一个提取可靠人类反馈的管道来探索这种可能性，从而产生 186k 高质量实例，用于通过直接对用户反馈进行序数回归来训练 WildReward，而无需偏好对。大量实验表明，与传统奖励模型相比，WildReward 实现了可比甚至更优越的性能，并且改进了校准和跨样本一致性。我们还观察到，WildReward 直接受益于用户多样性，更多的用户会产生更强大的奖励模型。最后，我们将 WildReward 应用于在线 DPO 训练，并观察到各种任务的显着改进。代码和数据在此 https URL 发布。

Title: Large Language Models for Geolocation Extraction in Humanitarian Crisis Response

Authors: G. Cafferata, T. Demarco, K. Kalimeri, Y. Mejova, M.G. Beiró
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.08872
Pdf URL: https://arxiv.org/pdf/2602.08872
Copy Paste: [[2602.08872]] Large Language Models for Geolocation Extraction in Humanitarian Crisis Response(https://arxiv.org/abs/2602.08872)
Keywords: language model, llm, agent
Abstract: Humanitarian crises demand timely and accurate geographic information to inform effective response efforts. Yet, automated systems that extract locations from text often reproduce existing geographic and socioeconomic biases, leading to uneven visibility of crisis-affected regions. This paper investigates whether Large Language Models (LLMs) can address these geographic disparities in extracting location information from humanitarian documents. We introduce a two-step framework that combines few-shot LLM-based named entity recognition with an agent-based geocoding module that leverages context to resolve ambiguous toponyms. We benchmark our approach against state-of-the-art pretrained and rule-based systems using both accuracy and fairness metrics across geographic and socioeconomic dimensions. Our evaluation uses an extended version of the HumSet dataset with refined literal toponym annotations. Results show that LLM-based methods substantially improve both the precision and fairness of geolocation extraction from humanitarian texts, particularly for underrepresented regions. By bridging advances in LLM reasoning with principles of responsible and inclusive AI, this work contributes to more equitable geospatial data systems for humanitarian response, advancing the goal of leaving no place behind in crisis analytics.
摘要：人道主义危机需要及时、准确的地理信息来为有效的应对工作提供信息。然而，从文本中提取位置的自动化系统通常会重现现有的地理和社会经济偏见，导致受危机影响地区的可见性不均匀。本文研究了大型语言模型 (LLM) 是否可以解决从人道主义文档中提取位置信息时的这些地理差异。我们引入了一个两步框架，它将基于 LLM 的少样本命名实体识别与基于代理的地理编码模块相结合，该模块利用上下文来解决不明确的地名。我们使用跨越地理和社会经济维度的准确性和公平性指标，将我们的方法与最先进的预训练和基于规则的系统进行基准测试。我们的评估使用了 HumSet 数据集的扩展版本，其中包含精炼的文字地名注释。结果表明，基于 LLM 的方法大大提高了从人道主义文本中提取地理位置的精度和公平性，特别是对于代表性不足的地区。通过将 LLM 推理的进步与负责任和包容性人工智能的原则结合起来，这项工作有助于为人道主义响应建立更公平的地理空间数据系统，从而推进危机分析中不让任何地方掉队的目标。

Title: Is Reasoning Capability Enough for Safety in Long-Context Language Models?

Authors: Yu Fu, Haz Sameen Shahgir, Huanli Gong, Zhipeng Wei, N. Benjamin Erichson, Yue Dong
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2602.08874
Pdf URL: https://arxiv.org/pdf/2602.08874
Copy Paste: [[2602.08874]] Is Reasoning Capability Enough for Safety in Long-Context Language Models?(https://arxiv.org/abs/2602.08874)
Keywords: language model, gpt, llm, long context, prompt
Abstract: Large language models (LLMs) increasingly combine long-context processing with advanced reasoning, enabling them to retrieve and synthesize information distributed across tens of thousands of tokens. A hypothesis is that stronger reasoning capability should improve safety by helping models recognize harmful intent even when it is not stated explicitly. We test this hypothesis in long-context settings where harmful intent is implicit and must be inferred through reasoning, and find that it does not hold. We introduce compositional reasoning attacks, a new threat model in which a harmful query is decomposed into incomplete fragments that are scattered throughout a long context. The model is then prompted with a neutral reasoning query that induces retrieval and synthesis, causing the harmful intent to emerge only after composition. Evaluating 14 frontier LLMs on contexts up to 64k tokens, we uncover three findings: (1) models with stronger general reasoning capability are not more robust to compositional reasoning attacks, often assembling the intent yet failing to refuse; (2) safety alignment consistently degrades as context length increases; and (3) inference-time reasoning effort is a key mitigating factor: increasing inference-time compute reduces attack success by over 50 percentage points on the GPT-oss-120b model. Together, these results suggest that safety does not automatically scale with reasoning capability, especially under long-context inference.
摘要：大型语言模型 (LLM) 越来越多地将长上下文处理与高级推理相结合，使它们能够检索和合成分布在数万个标记中的信息。一个假设是，更强的推理能力应该通过帮助模型识别有害意图（即使没有明确说明）来提高安全性。我们在长上下文环境中测试了这个假设，其中有害意图是隐含的，必须通过推理来推断，结果发现它并不成立。我们引入了组合推理攻击，这是一种新的威胁模型，其中有害查询被分解为分散在长上下文中的不完整片段。然后，该模型会被提示进行中性推理查询，从而引发检索和合成，从而导致有害意图仅在合成后才出现。在高达 64k token 的上下文中评估 14 个前沿 LLM，我们发现了三个发现：（1）具有更强通用推理能力的模型对于组合推理攻击并不更稳健，经常组装意图但无法拒绝； (2) 随着上下文长度的增加，安全对齐持续降低； (3) 推理时间推理工作是一个关键的缓解因素：增加推理时间计算会使 GPT-oss-120b 模型上的攻击成功率降低 50 个百分点以上。总之，这些结果表明安全性不会自动随着推理能力而变化，尤其是在长上下文推理下。

Title: Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models

Authors: Yuliang Liu, Yunchong Song, Yixuan Wang, Kewen Ge, Alex Lamb, Qipeng Guo, Kai Chen, Bowen Zhou, Zhouhan Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.08984
Pdf URL: https://arxiv.org/pdf/2602.08984
Copy Paste: [[2602.08984]] Next Concept Prediction in Discrete Latent Space Leads to Stronger Language Models(https://arxiv.org/abs/2602.08984)
Keywords: language model, gpt
Abstract: We propose Next Concept Prediction (NCP), a generative pretraining paradigm built on top of Next Token Prediction (NTP). NCP predicts discrete concepts that span multiple tokens, thereby forming a more challenging pretraining objective. Our model, ConceptLM, quantizes hidden states using Vector Quantization and constructs a concept vocabulary. It leverages both NCP and NTP to drive parameter updates and generates a concept to guide the generation of the following tokens. We train ConceptLM from scratch at scales ranging from 70M to 1.5B parameters with up to 300B training data, including Pythia and GPT-2 backbones. Results on 13 benchmarks show that NCP yields consistent performance gains over traditional token-level models. Furthermore, continual pretraining experiments on an 8B-parameter Llama model indicate that NCP can further improve an NTP-trained model. Our analysis suggests that NCP leads to more powerful language models by introducing a harder pretraining task, providing a promising path toward better language modeling.
摘要：我们提出了下一个概念预测（NCP），这是一种建立在下一个令牌预测（NTP）之上的生成式预训练范例。 NCP 预测跨越多个标记的离散概念，从而形成更具挑战性的预训练目标。我们的模型 ConceptLM 使用矢量量化来量化隐藏状态并构建概念词汇表。它利用 NCP 和 NTP 来驱动参数更新并生成一个概念来指导后续代币的生成。我们从头开始训练 ConceptLM，参数范围从 70M 到 1.5B 不等，训练数据高达 300B，包括 Pythia 和 GPT-2 主干。 13 个基准测试的结果表明，NCP 比传统代币级模型具有一致的性能提升。此外，对 8B 参数 Llama 模型的持续预训练实验表明，NCP 可以进一步改进 NTP 训练的模型。我们的分析表明，NCP 通过引入更困难的预训练任务来产生更强大的语言模型，为更好的语言建模提供了一条有希望的道路。

Title: When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents

Authors: Yuting Ning, Jaylen Jones, Zhehao Zhang, Chentao Ye, Weitong Ruan, Junyi Li, Rahul Gupta, Huan Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.08995
Pdf URL: https://arxiv.org/pdf/2602.08995
Copy Paste: [[2602.08995]] When Actions Go Off-Task: Detecting and Correcting Misaligned Actions in Computer-Use Agents(https://arxiv.org/abs/2602.08995)
Keywords: prompt, agent
Abstract: Computer-use agents (CUAs) have made tremendous progress in the past year, yet they still frequently produce misaligned actions that deviate from the user's original intent. Such misaligned actions may arise from external attacks (e.g., indirect prompt injection) or from internal limitations (e.g., erroneous reasoning). They not only expose CUAs to safety risks, but also degrade task efficiency and reliability. This work makes the first effort to define and study misaligned action detection in CUAs, with comprehensive coverage of both externally induced and internally arising misaligned actions. We further identify three common categories in real-world CUA deployment and construct MisActBench, a benchmark of realistic trajectories with human-annotated, action-level alignment labels. Moreover, we propose DeAction, a practical and universal guardrail that detects misaligned actions before execution and iteratively corrects them through structured feedback. DeAction outperforms all existing baselines across offline and online evaluations with moderate latency overhead: (1) On MisActBench, it outperforms baselines by over 15% absolute in F1 score; (2) In online evaluation, it reduces attack success rate by over 90% under adversarial settings while preserving or even improving task success rate in benign environments.
摘要：计算机使用代理（CUA）在过去的一年中取得了巨大的进步，但它们仍然经常产生偏离用户最初意图的错误操作。这种不一致的行为可能源于外部攻击（例如，间接提示注入）或内部限制（例如，错误推理）。它们不仅使 CUA 面临安全风险，还会降低任务效率和可靠性。这项工作首次致力于定义和研究 CUA 中的错位动作检测，全面覆盖外部引起的和内部产生的错位动作。我们进一步确定了现实世界 CUA 部署中的三个常见类别，并构建了 MisActBench，这是具有人工注释、动作级对齐标签的现实轨迹基准。此外，我们提出了 DeAction，这是一种实用且通用的护栏，可以在执行之前检测不一致的操作，并通过结构化反馈迭代纠正它们。 DeAction 在离线和在线评估中优于所有现有基线，且延迟开销适中：(1) 在 MisActBench 上，它的 F1 分数绝对优于基线 15% 以上；（2）在线评估中，在对抗性环境下，其攻击成功率降低了90%以上，同时在良性环境下保持甚至提高了任务成功率。