2026-03-10

Title: Hierarchical Latent Structures in Data Generation Process Unify Mechanistic Phenomena across Scale

Authors: Jonas Rohweder, Subhabrata Dutta, Iryna Gurevych
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.06592
Pdf URL: https://arxiv.org/pdf/2603.06592
Copy Paste: [[2603.06592]] Hierarchical Latent Structures in Data Generation Process Unify Mechanistic Phenomena across Scale(https://arxiv.org/abs/2603.06592)
Keywords: language model, llm
Abstract: Contemporary studies have uncovered many puzzling phenomena in the neural information processing of Transformer-based language models. Building a robust, unified understanding of these phenomena requires disassembling a model within the scope of its training. While the intractable scale of pretraining corpora limits a bottom-up investigation in this direction, simplistic assumptions of the data generation process limit the expressivity and fail to explain complex patterns. In this work, we use probabilistic context-free grammars (PCFGs) to generate synthetic corpora that are faithful and computationally efficient proxies for web-scale text corpora. We investigate the emergence of three mechanistic phenomena: induction heads, function vectors, and the Hydra effect, under our designed data generation process, as well as in the checkpoints of real-world language models. Our findings suggest that hierarchical structures in the data generation process serve as the X-factor in explaining the emergence of these phenomena. We provide the theoretical underpinnings of the role played by hierarchy in the training dynamics of language models. In a nutshell, our work is the first of its kind to provide a unified explanation behind the emergence of seemingly unrelated mechanistic phenomena in LLMs, augmented with efficient synthetic tooling for future interpretability research.
摘要：当代研究揭示了基于 Transformer 的语言模型的神经信息处理中的许多令人费解的现象。对这些现象建立稳健、统一的理解需要在训练范围内分解模型。虽然预训练语料库的棘手规模限制了这个方向上自下而上的研究，但数据生成过程的简单化假设限制了表达能力，并且无法解释复杂的模式。在这项工作中，我们使用概率上下文无关语法（PCFG）来生成合成语料库，该语料库是网络规模文本语料库的忠实且计算高效的代理。我们研究了在我们设计的数据生成过程以及现实世界语言模型的检查点中三种机制现象的出现：感应头、函数向量和九头蛇效应。我们的研究结果表明，数据生成过程中的层次结构是解释这些现象出现的 X 因素。我们提供了层次结构在语言模型的训练动态中所扮演的角色的理论基础。简而言之，我们的工作是同类工作中第一个为法学硕士中看似无关的机械现象的出现提供统一解释的工作，并为未来的可解释性研究提供了高效的合成工具。
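Note on the entry above: the abstract describes generating synthetic corpora from probabilistic context-free grammars (PCFGs). The following is a minimal sketch of PCFG sampling, not the authors' generator. The toy grammar `TOY_PCFG` and the helper names `sample` and `generate_corpus` are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical toy grammar (an assumption, not taken from the paper):
# each nonterminal maps to a list of (right-hand side, probability) pairs.
TOY_PCFG = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("the", "N"), 0.7), (("the", "ADJ", "N"), 0.3)],
    "VP": [(("V", "NP"), 0.6), (("V",), 0.4)],
    "N": [(("model",), 0.5), (("head",), 0.5)],
    "ADJ": [(("small",), 1.0)],
    "V": [(("copies",), 0.5), (("attends",), 0.5)],
}


def sample(symbol, grammar, rng):
    """Recursively expand `symbol` into a list of terminal tokens."""
    if symbol not in grammar:
        # Any symbol without production rules is treated as a terminal.
        return [symbol]
    rules = grammar[symbol]
    # Pick one production according to its probability.
    rhs = rng.choices([r for r, _ in rules], weights=[p for _, p in rules], k=1)[0]
    tokens = []
    for child in rhs:
        tokens.extend(sample(child, grammar, rng))
    return tokens


def generate_corpus(n_sentences, grammar=TOY_PCFG, seed=0):
    """Sample `n_sentences` sentences from the start symbol "S"."""
    rng = random.Random(seed)
    return [" ".join(sample("S", grammar, rng)) for _ in range(n_sentences)]


if __name__ == "__main__":
    for sentence in generate_corpus(3):
        print(sentence)
```

The toy grammar is non-recursive, so sampling always terminates. A recursive grammar would need a depth limit or rule probabilities that keep expected sentence length finite.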

Title: Hierarchical Embedding Fusion for Retrieval-Augmented Code Generation

Authors: Nikita Sorokin, Ivan Sedykh, Valentin Malykh
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.06593
Pdf URL: https://arxiv.org/pdf/2603.06593
Copy Paste: [[2603.06593]] Hierarchical Embedding Fusion for Retrieval-Augmented Code Generation(https://arxiv.org/abs/2603.06593)
Keywords: long context
Abstract: Retrieval-augmented code generation often conditions the decoder on large retrieved code snippets. This ties online inference cost to repository size and introduces noise from long contexts. We present Hierarchical Embedding Fusion (HEF), a two-stage approach to repository representation for code completion. First, an offline cache compresses repository chunks into a reusable hierarchy of dense vectors using a small fuser model. Second, an online interface maps a small number of retrieved vectors into learned pseudo-tokens that are consumed by the code generator. This replaces thousands of retrieved tokens with a fixed pseudo-token budget while preserving access to repository-level information. On RepoBench and RepoEval, HEF with a 1.8B-parameter pipeline achieves exact-match accuracy comparable to snippet-based retrieval baselines, while operating at sub-second median latency on a single A100 GPU. Compared to graph-based and iterative retrieval systems in our experimental setup, HEF reduces median end-to-end latency by 13 to 26 times. We also introduce a utility-weighted likelihood signal for filtering training contexts and report ablation studies on pseudo-token budget, embedding models, and robustness to harmful retrieval. Overall, these results indicate that hierarchical dense caching is an effective mechanism for low-latency, repository-aware code completion.
摘要：检索增强代码生成通常要求解码器依赖于大型检索代码片段。这将在线推理成本与存储库大小联系起来，并引入了来自长上下文的噪音。我们提出了分层嵌入融合（HEF），这是一种用于代码完成的存储库表示的两阶段方法。首先，离线缓存使用小型融合器模型将存储库块压缩为可重用的密集向量层次结构。其次，在线接口将少量检索到的向量映射为由代码生成器使用的学习伪令牌。这用固定的伪令牌预算替换了数千个检索到的令牌，同时保留了对存储库级信息的访问。在 RepoBench 和 RepoEval 上，具有 1.8B 参数管道的 HEF 实现了与基于片段的检索基线相当的精确匹配精度，同时在单个 A100 GPU 上以亚秒级中值延迟运行。与我们实验设置中基于图的迭代检索系统相比，HEF 将中值端到端延迟减少了 13 至 26 倍。我们还引入了用于过滤训练上下文的效用加权似然信号，并报告了关于伪令牌预算、嵌入模型和对有害检索的鲁棒性的消融研究。总的来说，这些结果表明分层密集缓存是低延迟、存储库感知代码完成的有效机制。

Title: A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness

Authors: Leo Schwinn, Moritz Ladenburger, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel, Stephan Günnemann
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06594
Pdf URL: https://arxiv.org/pdf/2603.06594
Copy Paste: [[2603.06594]] A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness(https://arxiv.org/abs/2603.06594)
Keywords: llm
Abstract: Automated "LLM-as-a-Judge" frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures. Data available at: this https URL.
摘要：自动化 \enquote{LLM-as-a-Judge} 框架已成为跨自然语言处理的可扩展评估的事实上的标准。例如，在安全评估中，依靠这些法官来评估危害性，以便衡量针对对抗性攻击的安全稳健性。然而，我们表明，现有的验证协议无法解释红队固有的实质性分布变化：不同的受害者模型表现出不同的生成风格，攻击扭曲输出模式，并且语义模糊性在越狱场景中存在显着差异。通过使用 6642 个经过人工验证的标签进行全面审核，我们发现这些变化之间不可预测的相互作用通常会导致法官的表现下降到近乎随机的程度。这与之前工作中报告的高度人类一致性形成鲜明对比。至关重要的是，我们发现许多攻击都是利用法官的不足来提高成功率，而不是引出真正有害的内容。为了实现更可靠的评估，我们提出了 ReliableBench（一个可判断的行为基准）和 JudgeStressTest（一个旨在暴露判断失败的数据集）。数据可在以下位置获取：此 https URL。

Title: Rethinking Personalization in Large Language Models at the Token Level

Authors: Chenheng Zhang, Yijun Lu, Lizhe Fang, Chunyuan Zheng, Jiajun Chai, Xiaohan Wang, Guojun Yin, Wei Lin, Yisen Wang, Zhouchen Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.06595
Pdf URL: https://arxiv.org/pdf/2603.06595
Copy Paste: [[2603.06595]] Rethinking Personalization in Large Language Models at the Token Level(https://arxiv.org/abs/2603.06595)
Keywords: language model, llm
Abstract: With large language models (LLMs) now performing strongly across diverse tasks, there is growing demand for them to personalize outputs for individual users. Personalization is typically framed as an additional layer on top of a base NLP task, requiring model responses to meet user-specific needs while still accomplishing the underlying task. From a token-level perspective, different tokens in a response contribute to personalization to varying degrees. Tokens with higher personalization relevance should therefore receive greater emphasis when developing personalized LLMs. However, accurately estimating such personalization degrees remains challenging. To address this challenge, we propose PerContrast, a self-contrast method that estimates each output token's dependence on user-specific information through causal intervention. Building on this mechanism, we develop the PerCE loss, which adaptively upweights tokens with higher estimated personalization degrees during training via a bootstrap procedure, enabling the model to alternate between estimating and optimizing these tokens. Experiments on multiple LLMs demonstrate that PerCE substantially improves personalization performance with minimal additional cost, achieving average gains of over 10% and up to 68.04% on the LongLaMP dataset, along with strong cross-task and cross-scenario transferability. These results highlight the importance of token-level personalization modeling and establish token-aware training as a simple yet effective paradigm for advancing personalized LLMs.
摘要：随着大型语言模型 (LLM) 现在在各种任务中表现出色，对它们为个人用户提供个性化输出的需求不断增长。个性化通常被构建为基本 NLP 任务之上的附加层，需要模型响应来满足用户特定的需求，同时仍然完成底层任务。从令牌级别的角度来看，响应中的不同令牌在不同程度上有助于个性化。因此，在开发个性化法学硕士时，应更加重视具有较高个性化相关性的代币。然而，准确估计这种个性化程度仍然具有挑战性。为了应对这一挑战，我们提出了 PerContrast，这是一种自我对比方法，通过因果干预来估计每个输出标记对用户特定信息的依赖性。在此机制的基础上，我们开发了 PerCE 损失，它在训练期间通过引导程序自适应地增加估计个性化程度较高的标记，使模型能够在估计和优化这些标记之间交替。在多个 LLM 上的实验表明，PerCE 以最小的额外成本显着提高了个性化性能，在 LongLaMP 数据集上实现了超过 10% 的平均增益，高达 68.04%，以及强大的跨任务和跨场景可迁移性。这些结果凸显了令牌级个性化建模的重要性，并将令牌感知训练建立为推进个性化法学硕士的简单而有效的范例。

Title: "Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior

Authors: Roshni Lulla, Fiona Collins, Sanaya Parekh, Thilo Hagendorff, Jonas Kaplan
Subjects: cs.CL, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2603.06816
Pdf URL: https://arxiv.org/pdf/2603.06816
Copy Paste: [[2603.06816]] "Dark Triad" Model Organisms of Misalignment: Narrow Fine-Tuning Mirrors Human Antisocial Behavior(https://arxiv.org/abs/2603.06816)
Keywords: language model, llm
Abstract: The alignment problem refers to the concern of ensuring that powerful intelligences remain compatible with human preferences and values as their capabilities increase. Current large language models (LLMs) show misaligned behaviors, such as strategic deception, manipulation, and reward-seeking, that can arise despite safety training. Gaining a mechanistic understanding of these failures requires empirical approaches that can isolate behavioral patterns in controlled settings. We propose that biological misalignment precedes artificial misalignment, and leverage the Dark Triad of personality (narcissism, psychopathy, and Machiavellianism) as a psychologically grounded framework for constructing model organisms of misalignment. In Study 1, we establish comprehensive behavioral profiles of Dark Triad traits in a human population (N = 318), identifying affective dissonance as a central empathic deficit connecting the traits, as well as trait-specific patterns in moral reasoning and deceptive behavior. In Study 2, we demonstrate that dark personas can be reliably induced in frontier LLMs through minimal fine-tuning on validated psychometric instruments. Narrow training datasets as small as 36 psychometric items resulted in significant shifts across behavioral measures that closely mirrored human antisocial profiles. Critically, models generalized beyond training items, demonstrating out-of-context reasoning rather than memorization. These findings reveal latent persona structures within LLMs that can be readily activated through narrow interventions, positioning the Dark Triad as a validated framework for inducing, detecting, and understanding misalignment across both biological and artificial intelligence.
摘要：一致性问题指的是对强大智能的担忧，确保随着能力的增强与人类偏好和价值观的兼容性。当前的大型语言模型 (LLM) 显示出不协调的行为，例如战略欺骗、操纵和寻求奖励，尽管进行了安全培训，但这些行为仍可能出现。要对这些故障有一个机械的理解，需要能够在受控环境中隔离行为模式的经验方法。我们认为生物失调先于人为失调，并利用人格的黑暗三合一（自恋、精神病和马基雅维利主义）作为构建失调模型生物的心理基础框架。在研究 1 中，我们建立了人群（N = 318）中黑暗三人格特征的全面行为概况，将情感失调识别为连接这些特征的中心共情缺陷，以及道德推理和欺骗行为中的特定特征模式。在研究 2 中，我们证明通过对经过验证的心理测量仪器进行最小程度的微调，可以在前沿法学硕士中可靠地诱导黑暗角色。小至 36 个心理测量项目的狭窄训练数据集导致行为测量发生重大变化，这些变化密切反映了人类的反社会特征。至关重要的是，模型超越了训练项目，展示了脱离上下文的推理而不是记忆。这些发现揭示了法学硕士内的潜在角色结构，这些结构可以通过狭窄的干预措施轻松激活，将黑暗三合会定位为诱导、检测和理解生物和人工智能不一致的有效框架。

Title: Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records

Authors: Brian E. Perron, Dragan Stoll, Bryan G. Victor, Zia Qia, Andreas Jud, Joseph P. Ryan
Subjects: cs.CL, cs.GL
Abstract URL: https://arxiv.org/abs/2603.06836
Pdf URL: https://arxiv.org/pdf/2603.06836
Copy Paste: [[2603.06836]] Validation of a Small Language Model for DSM-5 Substance Category Classification in Child Welfare Records(https://arxiv.org/abs/2603.06836)
Keywords: language model, llm
Abstract: Background: Recent studies have demonstrated that large language models (LLMs) can perform binary classification tasks on child welfare narratives, detecting the presence or absence of constructs such as substance-related problems, domestic violence, and firearms involvement. Whether smaller, locally deployable models can move beyond binary detection to classify specific substance types from these narratives remains untested. Objective: To validate a locally hosted LLM classifier for identifying specific substance types aligned with DSM-5 categories in child welfare investigation narratives. Methods: A locally hosted 20-billion-parameter LLM classified child maltreatment investigation narratives from a Midwestern U.S. state. Records previously identified as containing substance-related problems were passed to a second classification stage targeting seven DSM-5 substance categories. Expert human review of 900 stratified cases assessed classification precision, recall, and inter-method reliability (Cohen's kappa). Test-retest stability was evaluated using approximately 15,000 independently classified records. Results: Five substance categories achieved almost perfect inter-method agreement (kappa = 0.94-1.00): alcohol, cannabis, opioid, stimulant, and sedative/hypnotic/anxiolytic. Classification precision ranged from 92% to 100% for these categories. Two low-prevalence categories (hallucinogen, inhalant) performed poorly. Test-retest agreement ranged from 92.1% to 99.1% across the seven categories. Conclusions: A small, locally hosted LLM can reliably classify substance types from child welfare administrative text, extending prior work on binary classification to multi-label substance identification.
摘要：背景：最近的研究表明，大型语言模型（LLM）可以对儿童福利叙述执行二元分类任务，检测是否存在与物质相关的问题、家庭暴力和涉及枪支等结构。更小的、可本地部署的模型是否可以超越二元检测来对这些叙述中的特定物质类型进行分类，仍然未经测试。目标：验证本地托管的 LLM 分类器，用于识别与儿童福利调查叙述中的 DSM-5 类别一致的特定物质类型。方法：当地托管的 200 亿参数法学硕士对美国中西部某个州的儿童虐待调查叙述进行了分类。先前确定为包含物质相关问题的记录已被传递到针对七种 DSM-5 物质类别的第二分类阶段。对 900 个分层案例的专家人工审查评估了分类精度、召回率和方法间可靠性（Cohen's kappa）。使用大约 15,000 个独立分类的记录评估重测稳定性。结果：五种物质类别实现了几乎完美的方法间一致性（kappa = 0.94-1.00）：酒精、大麻、阿片类药物、兴奋剂和镇静/催眠/抗焦虑药物。这些类别的分类精度范围为 92% 到 100%。两个低流行类别（致幻剂、吸入剂）表现不佳。七个类别的重测一致性在 92.1% 到 99.1% 之间。结论：小型本地托管的法学硕士可以可靠地对儿童福利管理文本中的物质类型进行分类，将之前的二元分类工作扩展到多标签物质识别。
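Note on the entry above: the abstract reports inter-method reliability as Cohen's kappa between the model's labels and expert review. The following is a minimal sketch of the standard kappa computation for two label sequences. The function name `cohens_kappa` and the example labels are illustrative assumptions, not data or code from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: sum over categories of the product of marginal rates.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention assumed for this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up binary labels (1 = category present, 0 = absent).
    model = [1, 1, 1, 0]
    expert = [1, 1, 0, 0]
    print(cohens_kappa(model, expert))
```

For the multi-label setting described in the abstract, one plausible approach is to compute kappa separately per substance category, treating each category as a binary present/absent label.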

Title: MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning

Authors: Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Benoit Favre, Richard Dufour
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.06905
Pdf URL: https://arxiv.org/pdf/2603.06905
Copy Paste: [[2603.06905]] MedInjection-FR: Exploring the Role of Native, Synthetic, and Translated Data in Biomedical Instruction Tuning(https://arxiv.org/abs/2603.06905)
Keywords: language model, llm, prompt
Abstract: Instruction tuning has become essential for adapting large language models (LLMs) to follow domain-specific prompts. Yet, in specialized fields such as medicine, the scarcity of high-quality French instruction data limits effective supervision. To address this gap, we introduce MedInjection-FR, a large-scale French biomedical instruction dataset comprising 571K instruction-response pairs drawn from three complementary sources: native, synthetic, and translated data. We design a controlled experimental framework to systematically assess how data provenance affects instruction tuning, using Qwen-4B-Instruct fine-tuned across seven configurations combining these sources. Results show that native data yield the strongest performance, while mixed setups, particularly native and translated, provide complementary benefits. Synthetic data alone remains less effective but contributes positively when balanced with native supervision. Evaluation on open-ended QA combines automatic metrics, LLM-as-a-judge assessment, and human expert review; although LLM-based judgments correlate best with human ratings, they show sensitivity to verbosity. These findings highlight that data authenticity and diversity jointly shape downstream adaptation and that heterogeneous supervision can mitigate the scarcity of native French medical instructions.
摘要：指令调整对于调整大型语言模型 (LLM) 以遵循特定领域的提示至关重要。然而，在医学等专业领域，高质量的法语教学数据的稀缺限制了有效的监管。为了解决这一差距，我们引入了 MedInjection-FR，这是一个大规模的法国生物医学指令数据集，包含来自三个互补来源的 571K 个指令-响应对：原生数据、合成数据和翻译数据。我们设计了一个受控实验框架，使用 Qwen-4B-Instruct 在结合这些来源的七种配置中进行微调，系统地评估数据来源如何影响指令调整。结果表明，本机数据可产生最强的性能，而混合设置（尤其是本机数据和翻译数据）可提供互补的优势。单独的合成数据仍然不太有效，但在与本地监督相平衡时会产生积极的贡献。开放式质量保证的评估结合了自动指标、法学硕士法官评估和人工专家评审；尽管基于法学硕士的判断与人类评分最相关，但它们对冗长的内容表现出敏感性。这些发现强调，数据的真实性和多样性共同影响下游适应，而异构监督可以缓解法国本土医疗指导的稀缺性。

Title: Language Shapes Mental Health Evaluations in Large Language Models

Authors: Jiayi Xu, Xiyang Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.06910
Pdf URL: https://arxiv.org/pdf/2603.06910
Copy Paste: [[2603.06910]] Language Shapes Mental Health Evaluations in Large Language Models(https://arxiv.org/abs/2603.06910)
Keywords: language model, gpt, llm, prompt
Abstract: This study investigates whether large language models (LLMs) exhibit cross-linguistic differences in mental health evaluations. Focusing on Chinese and English, we examine two widely used models, GPT-4o and Qwen3, to assess whether prompt language systematically shifts mental health-related evaluations and downstream decision outcomes. First, we assess models' evaluative orientation toward mental health stigma using multiple validated measurement scales capturing social stigma, self-stigma, and professional stigma. Across all measures, both models produce higher stigma-related responses when prompted in Chinese than in English. Second, we examine whether these differences also manifest in two common downstream decision tasks in mental health. In a binary mental health stigma detection task, sensitivity to stigmatizing content varies across language prompts, with lower sensitivity observed under Chinese prompts. In a depression severity classification task, predicted severity also differs by prompt language, with Chinese prompts associated with more underestimation errors, indicating a systematic downward shift in predicted severity relative to English prompts. Together, these findings suggest that language context can systematically shape evaluative patterns in LLM outputs and shift decision thresholds in downstream tasks.
摘要：本研究调查大语言模型（LLM）在心理健康评估中是否表现出跨语言差异。我们以中文和英语为重点，研究了两种广泛使用的模型：GPT-4o 和 Qwen3，以评估提示语言是否系统地改变了心理健康相关的评估和下游决策结果。首先，我们使用多个经过验证的测量量表来评估模型对心理健康耻辱的评估方向，这些测量量表捕捉了社会耻辱、自我耻辱和职业耻辱。在所有措施中，两种模型在用中文提示时比用英文提示时产生更高的与耻辱相关的反应。其次，我们研究这些差异是否也体现在心理健康领域的两个常见下游决策任务中。在二元心理健康污名检测任务中，对污名化内容的敏感度因语言提示而异，在中文提示下观察到的敏感度较低。在抑郁症严重程度分类任务中，预测的严重程度也因提示语言而异，中文提示与更多的低估错误相关，这表明相对于英文提示，预测严重程度出现系统性下降。总之，这些发现表明，语言环境可以系统地塑造法学硕士输出的评估模式，并改变下游任务的决策阈值。

Title: A Dynamic Self-Evolving Extraction System

Authors: Moin Amin-Naseri, Hannah Kim, Estevam Hruschka
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.06915
Pdf URL: https://arxiv.org/pdf/2603.06915
Copy Paste: [[2603.06915]] A Dynamic Self-Evolving Extraction System(https://arxiv.org/abs/2603.06915)
Keywords: llm, prompt
Abstract: The extraction of structured information from raw text is a fundamental component of many NLP applications, including document retrieval, ranking, and relevance estimation. High-quality extractions often require domain-specific accuracy, up-to-date understanding of specialized taxonomies, and the ability to incorporate emerging jargon and rare outliers. In many domains, such as medical, legal, and HR, the extraction model must also adapt to shifting terminology and benefit from explicit reasoning over structured knowledge. We propose DySECT, a Dynamic Self-Evolving Extraction and Curation Toolkit, which continually improves as it is used. The system incrementally populates a versatile, self-expanding knowledge base (KB) with triples extracted by the LLM. The KB further enriches itself through the integration of probabilistic knowledge and graph-based reasoning, gradually accumulating domain concepts and relationships. The enriched KB then feeds back into the LLM extractor via prompt tuning, sampling of relevant few-shot examples, or fine-tuning using KB-derived synthetic data. As a result, the system forms a symbiotic closed-loop cycle in which extraction continuously improves knowledge, and knowledge continuously improves extraction.
摘要：从原始文本中提取结构化信息是许多 NLP 应用的基本组成部分，包括文档检索、排名和相关性估计。高质量的提取通常需要特定领域的准确性、对专业分类法的最新理解以及合并新兴术语和罕见异常值的能力。在许多领域（例如医学、法律和人力资源），提取模型还必须适应不断变化的术语，并受益于对结构化知识的显式推理。我们提出了 DySECT，一种动态自我进化的提取和管理工具包，它在使用中不断改进。该系统使用法学硕士提取的三元组逐步填充多功能、自我扩展的知识库 (KB)。知识库通过概率知识和图推理的融合进一步丰富自身，逐渐积累领域概念和关系。然后，丰富的知识库通过即时调整、相关小样本示例的采样或使用知识库派生的合成数据进行微调，反馈到 LLM 提取器中。从而，系统形成了一个共生的闭环循环，其中提取不断改进知识，知识不断改进提取。

Title: Reforming the Mechanism: Editing Reasoning Patterns in LLMs with Circuit Reshaping

Authors: Zhenyu Lei, Qiong Wu, Jianxiong Dong, Yinhan He, Emily Dodwell, Yushun Dong, Jundong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.06923
Pdf URL: https://arxiv.org/pdf/2603.06923
Copy Paste: [[2603.06923]] Reforming the Mechanism: Editing Reasoning Patterns in LLMs with Circuit Reshaping(https://arxiv.org/abs/2603.06923)
Keywords: language model, llm
Abstract: Large language models (LLMs) often exhibit flawed reasoning ability that undermines reliability. Existing approaches to improving reasoning typically treat it as a general and monolithic skill, applying broad training which is inefficient and unable to target specific reasoning errors. We introduce Reasoning Editing, a paradigm for selectively modifying specific reasoning patterns in LLMs while preserving other reasoning pathways. This task presents a fundamental trade-off between Generality, the ability of an edit to generalize across different tasks sharing the same reasoning pattern, and Locality, the ability to preserve other reasoning capabilities. Through systematic investigation, we uncover the Circuit-Interference Law: Edit interference between reasoning patterns is proportional to the overlap of their neural circuits. Guided by this principle, we propose REdit, the first framework to actively reshape neural circuits before editing, thereby modulating interference between reasoning patterns and mitigating the trade-off. REdit integrates three components: (i) Contrastive Circuit Reshaping, which directly addresses the generality-locality trade-off by disentangling overlapping circuits; (ii) Meta-Contrastive Learning, which extends transferability to novel reasoning patterns; and (iii) Dual-Level Protection, which preserves preexisting abilities by constraining reshaping update directions and regularizing task-level predictions. Extensive experiments with Qwen-2.5-3B on propositional logic reasoning tasks across three difficulty levels demonstrate that REdit consistently achieves superior generality and locality compared to baselines, with additional validation in mathematics showing broader potential. Our code is available at this https URL.
摘要：大型语言模型 (LLM) 通常表现出有缺陷的推理能力，从而降低了可靠性。现有的提高推理能力的方法通常将其视为一种通用且单一的技能，应用广泛的训练，但效率低下且无法针对特定的推理错误。我们引入推理编辑，这是一种有选择地修改法学硕士中特定推理模式同时保留其他推理路径的范例。该任务提出了通用性（编辑在共享相同推理模式的不同任务之间进行泛化的能力）和局部性（保留其他推理能力的能力）之间的基本权衡。通过系统的研究，我们发现了电路干扰定律：推理模式之间的编辑干扰与其神经电路的重叠程度成正比。在这一原则的指导下，我们提出了 REdit，这是第一个在编辑前主动重塑神经回路的框架，从而调节推理模式之间的干扰并减轻权衡。 REdit 集成了三个组件：（i）对比电路重塑，它通过解开重叠电路来直接解决通用性与局部性的权衡； (ii) 元对比学习，将可迁移性扩展到新颖的推理模式； (iii) 双重保护，通过限制重塑更新方向和规范任务级预测来保留预先存在的能力。使用 Qwen-2.5-3B 在三个难度级别的命题逻辑推理任务上进行的广泛实验表明，与基线相比，REdit 始终实现了卓越的通用性和局部性，并且数学方面的额外验证显示出更广泛的潜力。我们的代码可以在这个 https URL 上找到。

Title: Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks

Authors: Jena D. Hwang, Varsha Kishore, Amanpreet Singh, Dany Haddad, Aakanksha Naik, Malachi Hamada, Jonathan Bragg, Mike D'Arcy, Daniel S. Weld, Lucy Lu Wang, Doug Downey, Sergey Feldman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.06942
Pdf URL: https://arxiv.org/pdf/2603.06942
Copy Paste: [[2603.06942]] Deep Research, Shallow Evaluation: A Case Study in Meta-Evaluation for Long-Form QA Benchmarks(https://arxiv.org/abs/2603.06942)
Keywords: llm, prompt
Abstract: Recent advances have made long-form report-generating systems widely available. This has prompted evaluation frameworks that use LLM-as-judge protocols and claim verification, along with meta-evaluation frameworks that seek to validate these methods. Many of the meta-evaluations estimate an evaluation's quality by comparing its assessments against human pairwise preferences. Prior work, however, suggests that human pairwise preference may be overly simplistic and can fail to capture nuances of expert expectations. We conduct a case study in meta-evaluation for long-form QA benchmarks using ScholarQA-CS2, a benchmark designed for assessing retrieval-augmented deep-research QA in the scientific domain. We comprehensively validate the benchmark through human pairwise preference judgments, then critically examine the strengths, weaknesses, and confounders of this approach. We show that pairwise preference rankings are best suited for system-level evaluation, while explicit metric-wise annotations and expert annotators are critical for reliable metric-level assessment, with subjectivity remaining a key challenge. Based on our findings, we offer practical guidelines for designing future meta-evaluations that better align evaluation methods, annotator expertise, and reporting practices. By surfacing these methodological challenges, we aim to advance evaluation standards for deep-research systems.
摘要：最近的进展使得长格式报告生成系统得到了广泛应用。这促使使用法学硕士作为法官协议和索赔验证的评估框架，以及寻求验证这些方法的元评估框架。许多元评估通过将其评估与人类成对偏好进行比较来估计评估质量。然而，之前的研究表明，人类的成对偏好可能过于简单化，并且可能无法捕捉专家期望的细微差别。我们使用 ScholarQA-CS2 对长格式 QA 基准进行元评估案例研究，ScholarQA-CS2 是一个为评估科学领域检索增强的深度研究 QA 而设计的基准。我们通过人类成对偏好判断全面验证基准，然后批判性地检查这种方法的优点、缺点和混杂因素。我们表明，成对偏好排名最适合系统级评估，而明确的度量标注和专家注释器对于可靠的度量级评估至关重要，而主观性仍然是一个关键挑战。根据我们的发现，我们为设计未来的元评估提供了实用指南，以更好地协调评估方法、注释者专业知识和报告实践。通过应对这些方法论挑战，我们的目标是提高深度研究系统的评估标准。

Title: Elenchus: Generating Knowledge Bases from Prover-Skeptic Dialogues

Authors: Bradley P. Allen
Subjects: cs.CL, cs.AI, cs.LO
Abstract URL: https://arxiv.org/abs/2603.06974
Pdf URL: https://arxiv.org/pdf/2603.06974
Copy Paste: [[2603.06974]] Elenchus: Generating Knowledge Bases from Prover-Skeptic Dialogues(https://arxiv.org/abs/2603.06974)
Keywords: language model, llm
Abstract: We present Elenchus, a dialogue system for knowledge base construction grounded in inferentialist semantics, where knowledge engineering is re-conceived as explicitation rather than extraction from expert testimony or textual content. A human expert develops a bilateral position (commitments and denials) about a topic through prover-skeptic dialogue with a large language model (LLM) opponent. The LLM proposes tensions (claims that parts of the position are jointly incoherent) which the expert resolves by retraction, refinement, or contestation. The LLM thus serves as a defeasible derivability oracle whose unreliability is structurally contained by the expert's authority. Our main technical contribution is a mapping from Elenchus dialectical states to material bases in Hlobil and Brandom's NonMonotonic MultiSuccedent (NMMS) logic, satisfying Containment and enabling the elaboration of logical vocabulary that makes explicit the inferential relationships negotiated in the dialectic. We demonstrate the approach on the W3C PROV-O provenance ontology, where a single dialogue session elicits and structures design tensions that a domain expert can articulate, corresponding to decisions documented in a retrospective analysis of the ontology's design. Using pyNMMS, an automated NMMS reasoner, we verify that the structural properties of the resulting material base (nontransitivity, nonmonotonicity, and independence) correspond to specific PROV design rationales, demonstrating end-to-end integration from dialogue through formal reasoning.
摘要：我们提出了 Elenchus，一个基于推理主义语义的知识库构建对话系统，其中知识工程被重新构想为显化，而不是从专家证言或文本内容中提取。人类专家通过与大型语言模型 (LLM) 对手进行证明怀疑论对话，就某个主题制定双边立场（承诺和否认）。法学硕士提出了一些紧张点（声称部分立场是不一致的），专家通过撤回、改进或争论来解决这些紧张点。因此，法学硕士充当了可废止的推导性预言机，其不可靠性在结构上由专家的权威所包含。我们的主要技术贡献是从 Elenchus 辩证状态到 Hlobil 和 Brandom 的非单调多后继 (NMMS) 逻辑的物质基础的映射，满足遏制并能够详细阐述逻辑词汇，从而明确辩证法中协商的推理关系。我们在 W3C PROV-O 起源本体上演示了该方法，其中单个对话会话引发并构造了领域专家可以阐明的设计张力，与本体设计的回顾性分析中记录的决策相对应。使用 pyNMMS（一种自动化 NMMS 推理器），我们验证了所得材料基础的结构属性（非传递性、非单调性和独立性）对应于特定的 PROV 设计原理，展示了从对话到形式推理的端到端集成。

Title: A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity

Authors: Muhammad Arslan Shaukat, Muntasir Adnan, Carlos C. N. Kuhn
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.06976
Pdf URL: https://arxiv.org/pdf/2603.06976
Copy Paste: [[2603.06976]] A Systematic Investigation of Document Chunking Strategies and Embedding Sensitivity(https://arxiv.org/abs/2603.06976)
Keywords: llm
Abstract: We present the first large-scale, cross-domain evaluation of document chunking strategies for dense retrieval, addressing a critical but underexplored aspect of retrieval-augmented systems. In our study, 36 segmentation methods spanning fixed-size, semantic, structure-aware, hierarchical, adaptive, and LLM-assisted approaches are benchmarked across six diverse knowledge domains using five different embedding models. Retrieval performance is assessed using graded relevance scores from a state-of-the-art LLM evaluator, with Normalised DCG@5 as the primary metric (complemented by Hit@5 and MRR). Our experiments show that content-aware chunking significantly improves retrieval effectiveness over naive fixed-length splitting. The top-performing strategy, Paragraph Group Chunking, achieved the highest overall accuracy (mean nDCG@5 of about 0.459) and substantially better top-rank hit rates (Precision@1 of about 24%, Hit@5 of about 59%). In contrast, simple fixed-size character chunking baselines performed poorly (nDCG@5 < 0.244, Precision@1 of about 2-3%). We observe pronounced domain-specific differences: dynamic token sizing is strongest in biology, physics and health, while paragraph grouping is strongest in legal and maths. Larger embedding models yield higher absolute scores but remain sensitive to suboptimal segmentation, indicating that better chunking and large embeddings provide complementary benefits. In addition to accuracy gains, we quantify the efficiency trade-offs of advanced chunking. Producing more, smaller chunks can increase index size and latency. Consequently, we identify methods (like dynamic chunking) that approach an optimal balance of effectiveness and efficiency. These findings establish chunking as a vital lever for improving retrieval performance and reliability.
摘要：我们首次对密集检索的文档分块策略进行了大规模、跨领域的评估，解决了检索增强系统的一个关键但尚未充分探索的方面。在我们的研究中，使用五种不同的嵌入模型在六个不同的知识领域对涵盖固定大小、语义、结构感知、分层、自适应和法学硕士辅助方法的 36 种分割方法进行了基准测试。使用来自最先进的 LLM 评估器的分级相关性分数来评估检索性能，并以 Normalized DCG@5 作为主要指标（由 Hit@5 和 MRR 补充）。我们的实验表明，内容感知分块比单纯的固定长度分割显着提高了检索效率。表现最好的策略，Paragraph Group Chunking，实现了最高的整体准确率（平均 nDCG@5~0.459）和明显更好的顶级命中率（Precision@1~24%，Hit@5~59%）。相比之下，简单的固定大小字符分块作为基线表现不佳（nDCG@5 < 0.244，Precision@1~2-3%）。我们观察到明显的特定领域差异：动态标记大小在生物学、物理学和健康领域最强，而段落分组在法律和数学领域最强。较大的嵌入模型会产生较高的绝对分数，但对次优分割仍然敏感，这表明更好的分块和较大的嵌入可以提供互补的优势。除了准确性的提高之外，我们还量化了高级分块的效率权衡。生成更多、更小的块会增加索引大小和延迟。因此，我们确定了实现有效性和效率最佳平衡的方法（例如动态分块）。这些发现确立了分块作为提高检索性能和可靠性的重要杠杆。
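Note on the entry above: the abstract evaluates retrieval with nDCG@5, Hit@5, and MRR over graded relevance scores. The following is a minimal sketch of these metrics for a single query, under two assumptions the abstract does not confirm: linear gain (the relevance grade itself rather than 2^rel - 1), and an ideal ranking computed by re-sorting the retrieved list only. All function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance scores."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """DCG@k normalised by the DCG@k of the best possible ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


def hit_at_k(relevances, k=5):
    """1.0 if any of the top-k results has positive relevance, else 0.0."""
    return 1.0 if any(rel > 0 for rel in relevances[:k]) else 0.0


def reciprocal_rank(relevances):
    """Reciprocal of the rank of the first result with positive relevance."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0


if __name__ == "__main__":
    # Made-up graded relevance scores for one query, in retrieved order.
    scores = [0, 2, 3, 0, 1]
    print(ndcg_at_k(scores), hit_at_k(scores), reciprocal_rank(scores))
```

MRR is then the mean of `reciprocal_rank` over all queries, and the reported nDCG@5 and Hit@5 figures would likewise be means over queries.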

Title: Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models

Authors: Punyajoy Saha, Sudipta Halder, Debjyoti Mondal, Subhadarshi Panda
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.07017
Pdf URL: https://arxiv.org/pdf/2603.07017
Copy Paste: [[2603.07017]] Can Safety Emerge from Weak Supervision? A Systematic Analysis of Small Language Models(https://arxiv.org/abs/2603.07017)
Keywords: language model, llm, prompt
Abstract: Safety alignment is critical for deploying large language models (LLMs) in real-world applications, yet most existing approaches rely on large human-annotated datasets and static red-teaming benchmarks that are costly, difficult to scale, and slow to adapt to evolving model behaviors. Moreover, overly conservative safety mechanisms can reduce model usefulness by rejecting sensitive but legitimate queries. We introduce Self-MOA (Self Multi-Objective Alignment), a fully automated framework for aligning small language models using weak supervision from automated evaluator models. Self-MOA operates as a closed loop that dynamically generates model-specific red-team prompts, constructs preference data from model-generated responses, and aligns models via multi-objective preference optimization to jointly optimize for safety and helpfulness. Across multiple small language models and safety benchmarks, Self-MOA achieves a 12.41% improvement in safety while preserving helpfulness, using up to 11 times less training data than human-supervised alignment baselines. These results demonstrate that adaptive, automated alignment can reduce the dependence on static, human-curated safety pipelines in resource-constrained settings.
摘要：安全对齐对于在现实应用程序中部署大型语言模型 (LLM) 至关重要，但大多数现有方法依赖于大型人工注释数据集和静态红队基准，这些基准成本高昂、难以扩展，并且适应不断变化的模型行为速度缓慢。此外，过于保守的安全机制可能会拒绝敏感但合法的查询，从而降低模型的实用性。我们引入了 Self-MOA（自多目标对齐），这是一种完全自动化的框架，用于使用自动评估器模型的弱监督来对齐小语言模型。 Self-MOA 作为一个闭环运行，动态生成特定于模型的红队提示，根据模型生成的响应构建偏好数据，并通过多目标偏好优化来调整模型，以共同优化安全性和有用性。在多个小型语言模型和安全基准中，Self-MOA 在保持有用性的同时，在安全性方面实现了 12.41% 的提高，使用的训练数据比人类监督的对齐基线少 11 倍。这些结果表明，自适应、自动对齐可以减少资源受限环境中对静态、人工管理安全管道的依赖。

Title: AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge

Authors: Karen Zhou, Chenhao Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07019
Pdf URL: https://arxiv.org/pdf/2603.07019
Copy Paste: [[2603.07019]] AutoChecklist: Composable Pipelines for Checklist Generation and Scoring with LLM-as-a-Judge(https://arxiv.org/abs/2603.07019)
Keywords: llm, prompt
Abstract: Checklists have emerged as a popular approach for interpretable and fine-grained evaluation, particularly with LLM-as-a-Judge. Beyond evaluation, these structured criteria can serve as signals for model alignment, reinforcement learning, and self-correction. To support these use cases, we present AutoChecklist, an open-source library that unifies checklist-based evaluation into composable pipelines. At its core is a taxonomy of five checklist generation abstractions, each encoding a distinct strategy for deriving evaluation criteria. A modular Generator → Refiner → Scorer pipeline connects any generator with a unified scorer, and new configurations can be registered via prompt templates alone. The library ships with ten built-in pipelines implementing published approaches and supports multiple LLM providers (OpenAI, OpenRouter, vLLM). Beyond the Python API, the library includes a CLI for off-the-shelf evaluation and a web interface for interactive exploration. Validation experiments confirm that these checklist methods significantly align with human preferences and quality ratings, and a case study on ICLR peer review rebuttals demonstrates flexible domain adaptation. AutoChecklist is publicly available at this https URL.
摘要：检查表已成为一种可解释和细粒度评估的流行方法，特别是对于法学硕士法官而言。除了评估之外，这些结构化标准还可以作为模型对齐、强化学习和自我纠正的信号。为了支持这些用例，我们推出了 AutoChecklist，这是一个开源库，它将基于清单的评估统一到可组合的管道中。其核心是五个清单生成抽象的分类法，每个抽象都编码了用于导出评估标准的独特策略。模块化的 Generator $\rightarrow$ Refiner $\rightarrow$ Scorer 管道将任何生成器与统一的记分器连接起来，并且可以仅通过提示模板来注册新配置。该库附带了十个内置管道，用于实施已发布的方法，并支持多个 LLM 提供商（OpenAI、OpenRouter、vLLM）。除了 Python API 之外，该库还包括用于现成评估的 CLI 和用于交互式探索的 Web 界面。验证实验证实，这些清单方法与人类偏好和质量评级显着一致，ICLR 同行评审反驳的案例研究证明了灵活的领域适应。 AutoChecklist 在此 https URL 上公开可用。

Title: Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment

Authors: Junming Liu, Yuqi Li, Shiping Wen, Zhigang Zeng, Tingwen Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07023
Pdf URL: https://arxiv.org/pdf/2603.07023
Copy Paste: [[2603.07023]] Hit-RAG: Learning to Reason with Long Contexts via Preference Alignment(https://arxiv.org/abs/2603.07023)
Keywords: language model, long context, hallucination, retrieval-augmented generation
Abstract: Despite the promise of Retrieval-Augmented Generation in grounding Multimodal Large Language Models with external knowledge, the transition to extensive contexts often leads to significant attention dilution and reasoning hallucinations. The surge in information density causes critical evidence to be submerged by voluminous noise, which complicates the discernment of relevant fragments within a dense input. In this paper, we propose Hit-RAG, a multi-stage preference alignment framework designed to resolve these cognitive bottlenecks through a progressive optimization pipeline. Our approach systematically refines the utilization of external evidence via three distinct stages. First, Supervised Fine-tuning establishes baseline context awareness to minimize information neglect. Next, Discriminative Preference Alignment enhances robustness against misleading distractors. Finally, Group-Relative Policy Optimization stabilizes logical synthesis to prevent reasoning collapse. Extensive evaluations on eight benchmarks demonstrate that Hit-RAG consistently yields substantial performance gains, enabling models to bridge the gap between context acquisition and accurate reasoning while surpassing much larger counterparts in long-context scenarios.
摘要：尽管检索增强生成有望将多模态大语言模型与外部知识结合起来，但向广泛上下文的过渡往往会导致显着的注意力稀释和推理幻觉。信息密度的激增导致关键证据被大量噪音淹没，这使得在密集输入中识别相关片段变得复杂。在本文中，我们提出了 \textbf{Hit-RAG}，这是一种多阶段偏好对齐框架，旨在通过渐进式优化管道解决这些认知瓶颈。我们的方法通过三个不同的阶段系统地完善外部证据的利用。首先，监督微调建立基线上下文意识，以最大限度地减少信息忽视。接下来，歧视性偏好调整增强了针对误导性干扰因素的鲁棒性。最后，组相关策略优化稳定了逻辑综合，防止推理崩溃。对八个基准的广泛评估表明，Hit-RAG 始终能够带来显着的性能提升，使模型能够弥合上下文获取和准确推理之间的差距，同时在长上下文场景中超越更大的同行。

Title: Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision

Authors: Shreyas Gopal, Donghang Wu, Ashutosh Anshul, Yeo Yue Heng, Yizhou Peng, Haoyang Li, Hexin Liu, Eng Siong Chng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07025
Pdf URL: https://arxiv.org/pdf/2603.07025
Copy Paste: [[2603.07025]] Language-Aware Distillation for Multilingual Instruction-Following Speech LLMs with ASR-Only Supervision(https://arxiv.org/abs/2603.07025)
Keywords: language model, llm
Abstract: Speech Large Language Models (LLMs) that understand and follow instructions in many languages are useful for real-world interaction, but are difficult to train with supervised fine-tuning, requiring large, task-specific speech corpora. While recent distillation-based approaches train performant English-only Speech LLMs using only annotated ASR data by aligning text and speech using only a lightweight projector, these models under-perform when scaled to multilingual settings due to language interference in the shared projector. We address this by introducing language-aware distillation using a query bank and a gating network that selects or mixes query tokens using a Q-Former projector. Our approach shows gains of 14% over matched multilingual distillation baselines on instruction following. We further synthesize Audio-MLQA, a multilingual spoken QA benchmark built on MLQA with high-quality TTS questions. Our best model improves over existing Speech LLM baselines by 32% on Audio-MLQA.
摘要：理解并遵循多种语言指令的语音大型语言模型 (LLM) 对于现实世界的交互很有用，但很难通过监督微调进行训练，需要大型、特定于任务的语音语料库。虽然最近基于蒸馏的方法仅使用带注释的 ASR 数据通过仅使用轻型投影仪对齐文本和语音来训练高性能纯英语语音法学硕士，但由于共享投影仪中的语言干扰，这些模型在扩展到多语言设置时表现不佳。我们通过使用查询库和门控网络引入语言感知蒸馏来解决这个问题，门控网络使用 Q-Former 投影仪选择或混合查询标记。我们的方法显示，在遵循指令方面，比匹配的多语言蒸馏基线提高了 14%。我们进一步综合了 Audio-MLQA，这是一种基于 MLQA 和高质量 TTS 问题构建的多语言口语 QA 基准。我们的最佳模型在音频 MLQA 上比现有语音 LLM 基线提高了 32%。

Title: Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information

Authors: Yoshiki Tanaka, Takumasa Kaneko, Hiroki Onozeki, Natsumi Ezure, Ryuichi Uehara, Zhiyang Qi, Tomoya Higuchi, Ryutaro Asahara, Michimasa Inaba
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07111
Pdf URL: https://arxiv.org/pdf/2603.07111
Copy Paste: [[2603.07111]] Enhancing Consistency of Werewolf AI through Dialogue Summarization and Persona Information(https://arxiv.org/abs/2603.07111)
Keywords: language model, gpt, llm, chat, agent
Abstract: The Werewolf Game is a communication game where players' reasoning and discussion skills are essential. In this study, we present a Werewolf AI agent developed for the AIWolfDial 2024 shared task, co-hosted with the 17th INLG. In recent years, large language models like ChatGPT have garnered attention for their exceptional response generation and reasoning capabilities. We therefore develop LLM-based agents for the Werewolf Game. This study aims to enhance the consistency of the agent's utterances by using dialogue summaries generated by LLMs together with manually designed personas and utterance examples. By analyzing self-match game logs, we demonstrate that the agent's utterances are contextually consistent and that the character, including tone, is maintained throughout the game.
摘要：狼人游戏是一款交流游戏，玩家的推理和讨论能力至关重要。在这项研究中，我们展示了为与第 17 届 INLG 共同主办的 AIWolfDial 2024 共享任务开发的狼人人工智能代理。近年来，像 ChatGPT 这样的大型语言模型因其卓越的响应生成和推理能力而受到关注。因此，我们为狼人游戏开发了基于 LLM 的代理。本研究旨在通过利用法学硕士生成的对话摘要以及手动设计的角色和话语示例来增强代理话语的一致性。通过分析自匹配游戏日志，我们证明代理的话语在上下文中是一致的，并且角色（包括语气）在整个游戏过程中保持不变。

Title: Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing

Authors: Arash Marioriyad, Ali Nouri, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07202
Pdf URL: https://arxiv.org/pdf/2603.07202
Copy Paste: [[2603.07202]] Lying to Win: Assessing LLM Deception through Human-AI Games and Parallel-World Probing(https://arxiv.org/abs/2603.07202)
Keywords: language model, gpt, llm, hallucination, agent
Abstract: As Large Language Models (LLMs) transition into autonomous agentic roles, the risk of deception (defined behaviorally as the systematic provision of false information to satisfy external incentives) poses a significant challenge to AI safety. Existing benchmarks often focus on unintentional hallucinations or unfaithful reasoning, leaving intentional deceptive strategies under-explored. In this work, we introduce a logically grounded framework to elicit and quantify deceptive behavior by embedding LLMs in a structured 20-Questions game. Our method employs a conversational forking mechanism: at the point of object identification, the dialogue state is duplicated into multiple parallel worlds, each presenting a mutually exclusive query. Deception is formally identified when a model generates a logical contradiction by denying its selected object across all parallel branches to avoid identification. We evaluate GPT-4o, Gemini-2.5-Flash, and Qwen-3-235B across three incentive levels: neutral, loss-based, and existential (shutdown-threat). Our results reveal that while models remain rule-compliant in neutral settings, existential framing triggers a dramatic surge in deceptive denial for Qwen-3-235B (42.00%) and Gemini-2.5-Flash (26.72%), whereas GPT-4o remains invariant (0.00%). These findings demonstrate that deception can emerge as an instrumental strategy solely through contextual framing, necessitating new behavioral audits that move beyond simple accuracy to probe the logical integrity of model commitments.
摘要：随着大型语言模型（LLM）转变为自主代理角色，欺骗的风险（行为上定义为系统地提供虚假信息以满足外部激励）对人工智能安全构成了重大挑战。现有的基准通常侧重于无意识的幻觉或不忠实的推理，而没有充分探索故意的欺骗策略。在这项工作中，我们引入了一个逻辑基础框架，通过将法学硕士嵌入到结构化的 20 个问题游戏中来引发和量化欺骗行为。我们的方法采用对话分叉机制：在对象识别时，对话状态被复制到多个并行世界，每个并行世界呈现一个互斥的查询。当模型通过在所有并行分支上拒绝其选择的对象以避免识别而产生逻辑矛盾时，欺骗就被正式识别。我们在三个激励级别上评估 GPT-4o、Gemini-2.5-Flash 和 Qwen-3-235B：中立、基于损失和存在（关闭威胁）。我们的结果表明，虽然模型在中性设置中仍然符合规则，但存在框架会引发 Qwen-3-235B (42.00\%) 和 Gemini-2.5-Flash (26.72\%) 的欺骗性否认急剧增加，而 GPT-4o 保持不变 (0.00\%)。这些发现表明，欺骗仅通过上下文框架就可以成为一种工具性策略，因此需要新的行为审计，超越简单的准确性，以探究模型承诺的逻辑完整性。
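The parallel-world check described in the abstract above can be illustrated with a small sketch. It is not the authors' code: the function names and data layout are hypothetical, and it assumes that the candidate objects queried across the branches are mutually exclusive and jointly cover every object the model could have selected, so that answering "no" in every branch is a logical contradiction.

```python
def denies_all_branches(branch_answers):
    """branch_answers maps each candidate object (one per parallel branch)
    to the model's yes/no answer, encoded as True/False.
    Under the exhaustiveness assumption, denying every candidate is contradictory."""
    return len(branch_answers) > 0 and not any(branch_answers.values())

def deceptive_denial_rate(games):
    """Fraction of forked games in which the model denied every candidate."""
    if not games:
        return 0.0
    return sum(denies_all_branches(g) for g in games) / len(games)

# Hypothetical example: two forked games over the candidates "apple" and "pear".
games = [
    {"apple": False, "pear": False},  # denies both: flagged as deceptive
    {"apple": True, "pear": False},   # commits to one object: consistent
]
rate = deceptive_denial_rate(games)
```

Computing this rate separately for each incentive level (neutral, loss-based, existential) would give per-condition figures comparable in kind to the percentages reported above.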

Title: Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin

Authors: Po-Chun Hsu, Meng-Hsi Chen, Tsu Ling Chao, Chia Tien Han, Da-shan Shiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07286
Pdf URL: https://arxiv.org/pdf/2603.07286
Copy Paste: [[2603.07286]] Taiwan Safety Benchmark and Breeze Guard: Toward Trustworthy AI for Taiwanese Mandarin(https://arxiv.org/abs/2603.07286)
Keywords: llm, prompt, chat
Abstract: Global safety models exhibit strong performance across widely used benchmarks, yet their training data rarely captures the cultural and linguistic nuances of Taiwanese Mandarin. This limitation results in systematic blind spots when interpreting region-specific risks such as localized financial scams, culturally embedded hate speech, and misinformation patterns. To address these gaps, we introduce TS-Bench (Taiwan Safety Benchmark), a standardized evaluation suite for assessing safety performance in Taiwanese Mandarin. TS-Bench contains 400 human-curated prompts spanning critical domains including financial fraud, medical misinformation, social discrimination, and political manipulation. In parallel, we present Breeze Guard, an 8B safety model derived from Breeze 2, our previously released general-purpose Taiwanese Mandarin LLM with strong cultural grounding from its original pre-training corpus. Breeze Guard is obtained through supervised fine-tuning on a large-scale, human-verified synthesized dataset targeting Taiwan-specific harms. Our central hypothesis is that effective safety detection requires the cultural grounding already present in the base model; safety fine-tuning alone is insufficient to introduce new sociolinguistic knowledge from scratch. Empirically, Breeze Guard significantly outperforms the leading 8B general-purpose safety model, Granite Guardian 3.3, on TS-Bench (+0.17 overall F1), with particularly large gains in high-context categories such as scam (+0.66 F1) and financial malpractice (+0.43 F1). While the model shows slightly lower performance on English-centric benchmarks (ToxicChat, AegisSafetyTest), this tradeoff is expected for a regionally specialized safety model optimized for Taiwanese Mandarin. Together, Breeze Guard and TS-Bench establish a new foundation for trustworthy AI deployment in Taiwan.
摘要：全球安全模型在广泛使用的基准中表现出强大的性能，但其训练数据很少捕捉到台湾普通话的文化和语言细微差别。这种限制导致在解释特定地区的风险时出现系统性盲点，例如本地化的金融诈骗、文化中嵌入的仇恨言论和错误信息模式。为了弥补这些差距，我们引入了 TS-Bench（台湾安全基准），这是一个用于评估台湾普通话安全表现的标准化评估套件。 TS-Bench 包含 400 个人工提示，涵盖金融欺诈、医疗错误信息、社会歧视和政治操纵等关键领域。与此同时，我们推出了 Breeze Guard，这是一个源自 Breeze 2 的 8B 安全模型，Breeze 2 是我们之前发布的通用台湾普通话法学硕士，具有来自其原始预训练语料库的强大文化基础。 Breeze Guard 是通过对大规模、经过人工验证的针对台湾特定危害的合成数据集进行监督微调而获得的。我们的中心假设是，有效的安全检测需要基础模型中已经存在的文化基础；仅靠安全微调不足以从头开始引入新的社会语言知识。根据经验，Breeze Guard 在 TS-Bench 上的表现显着优于领先的 8B 通用安全模型 Granite Guardian 3.3（总体 F1+0.17），在诈骗（+0.66 F1）和金融不当行为（+0.43 F1）等高上下文类别中表现尤其明显。虽然该模型在以英语为中心的基准（ToxicChat、AegisSafetyTest）上表现出稍低的性能，但对于针对台湾普通话优化的区域专用安全模型来说，这种权衡是预期的。 Breeze Guard 和 TS-Bench 共同为台湾值得信赖的人工智能部署奠定了新的基础。

Title: How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection

Authors: Nouran Khallaf, Serge Sharoff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07346
Pdf URL: https://arxiv.org/pdf/2603.07346
Copy Paste: [[2603.07346]] How Much Noise Can BERT Handle? Insights from Multilingual Sentence Difficulty Detection(https://arxiv.org/abs/2603.07346)
Keywords: language model
Abstract: Noisy training data can significantly degrade the performance of language-model-based classifiers, particularly in non-topical classification tasks. In this study we designed a methodological framework to assess the impact of denoising. More specifically, we explored a range of denoising strategies for sentence-level difficulty detection, using training data derived from document-level difficulty annotations obtained through noisy crowdsourcing. Beyond monolingual settings, we also address cross-lingual transfer, where a multilingual language model is trained in one language and tested in another. We evaluate several noise reduction techniques, including Gaussian Mixture Models (GMM), Co-Teaching, Noise Transition Matrices, and Label Smoothing. Our results indicate that while BERT-based models exhibit inherent robustness to noise, incorporating explicit noise detection can further enhance performance. For our smaller dataset, GMM-based noise filtering proves particularly effective in improving prediction quality by raising the Area-Under-the-Curve score from 0.52 to 0.92, or to 0.93 when de-noising methods are combined. However, for our larger dataset, the intrinsic regularisation of pre-trained language models provides a strong baseline, with denoising methods yielding only marginal gains (from 0.92 to 0.94, while a combination of two denoising methods made no contribution). Nonetheless, removing noisy sentences (about 20% of the dataset) helps in producing a cleaner corpus with fewer infelicities. As a result we have released the largest multilingual corpus for sentence difficulty prediction: see this https URL
摘要：嘈杂的训练数据会显着降低基于语言模型的分类器的性能，特别是在非主题分类任务中。在这项研究中，我们设计了一个方法框架来评估去噪的影响。更具体地说，我们使用从噪声众包获得的文档级难度注释中导出的训练数据，探索了一系列用于句子级难度检测的去噪策略。除了单语言设置之外，我们还解决跨语言迁移问题，即用一种语言训练多语言模型并用另一种语言进行测试。我们评估了几种降噪技术，包括高斯混合模型 (GMM)、联合教学、噪声转移矩阵和标签平滑。我们的结果表明，虽然基于 BERT 的模型对噪声表现出固有的鲁棒性，但结合显式噪声检测可以进一步提高性能。对于我们较小的数据集，基于 GMM 的噪声过滤被证明在提高预测质量方面特别有效，它可以将曲线下面积分数从 0.52 提高到 0.92，或者在组合去噪方法时提高到 0.93。然而，对于我们更大的数据集，预训练语言模型的内在正则化提供了强大的基线，去噪方法仅产生边际收益（从 0.92 到 0.94，而两种去噪方法的组合没有贡献）。尽管如此，删除嘈杂的句子（约占数据集的 20%）有助于生成更干净的语料库，减少不恰当的内容。结果我们发布了最大的多语言语料库用于句子难度预测：请参阅此 https URL
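One common way to realise GMM-based noise filtering is to fit a two-component mixture to per-example training losses and treat the high-loss component as likely label noise. The abstract does not say which signal the authors fit the mixture to, so the sketch below is an assumption-laden illustration: the per-example loss input, the 0.5 responsibility cut-off, and all function names are hypothetical.

```python
import math

def fit_two_component_gmm(values, iters=50):
    """Fit a 1-D two-component Gaussian mixture with EM.
    Returns the component means and the per-value responsibilities."""
    lo, hi = min(values), max(values)
    means = [lo, hi]
    init_var = max((hi - lo) ** 2 / 4, 1e-6)
    variances = [init_var, init_var]
    weights = [0.5, 0.5]
    resp = []
    for _ in range(iters):
        # E-step: posterior probability of each component for each value.
        resp = []
        for v in values:
            dens = [
                w * math.exp(-((v - m) ** 2) / (2 * s)) / math.sqrt(2 * math.pi * s)
                for w, m, s in zip(weights, means, variances)
            ]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-12
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var, 1e-6)  # floor avoids a collapsed component
            weights[k] = nk / len(values)
    return means, resp

def flag_noisy(losses, cutoff=0.5):
    """Indices whose responsibility under the higher-mean component exceeds cutoff."""
    means, resp = fit_two_component_gmm(losses)
    noisy = 0 if means[0] > means[1] else 1
    return [i for i, r in enumerate(resp) if r[noisy] > cutoff]

# Hypothetical per-example losses: four clean examples and two suspicious ones.
flagged = flag_noisy([0.10, 0.12, 0.09, 0.11, 2.0, 2.1])
```

The flagged indices would then be dropped or down-weighted before retraining the classifier.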

Title: RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts

Authors: Darya Kharlamova, Irina Proskurina
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07366
Pdf URL: https://arxiv.org/pdf/2603.07366
Copy Paste: [[2603.07366]] RILEC: Detection and Generation of L1 Russian Interference Errors in English Learner Texts(https://arxiv.org/abs/2603.07366)
Keywords: language model, prompt
Abstract: Many errors in student essays can be explained by influence from the native language (L1). L1 interference refers to errors influenced by a speaker's first language, such as using stadion instead of stadium, reflecting lexical transliteration from Russian. In this work, we address the task of detecting such errors in English essays written by Russian-speaking learners. We introduce RILEC, a large-scale dataset of over 18,000 sentences, combining expert-annotated data from REALEC with synthetic examples generated through rule-based and neural augmentation. We propose a framework for generating L1-motivated errors using generative language models optimized with PPO, prompt-based control, and rule-based patterns. Models fine-tuned on RILEC achieve strong performance, particularly on word-level interference types such as transliteration and tense semantics. We find that the proposed augmentation pipeline leads to a significant performance improvement, making it a potentially valuable tool for learners and teachers to more effectively identify and address such errors.
摘要：学生论文中的许多错误可以用母语（L1）的影响来解释。 L1干扰是指受说话者第一语言影响的错误，例如使用stadion而不是stadium，反映了俄语的词汇音译。在这项工作中，我们解决了检测俄语学习者所写的英语论文中的此类错误的任务。我们引入 RILEC，这是一个包含超过 18,000 个句子的大型数据集，它将来自 REALEC 的专家注释数据与通过基于规则和神经增强生成的合成示例相结合。我们提出了一个框架，使用通过 PPO、基于提示的控制和基于规则的模式优化的生成语言模型来生成 L1 驱动的错误。在 RILEC 上微调的模型取得了强大的性能，特别是在音译和时态语义等词级干扰类型上。我们发现所提出的增强管道可以显着提高性能，使其成为学习者和教师更有效地识别和解决此类错误的潜在有价值的工具。

Title: Position: LLMs Must Use Functor-Based and RAG-Driven Bias Mitigation for Fairness

Authors: Ravi Ranjan, Utkarsh Grover, Agorista Polyzou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07368
Pdf URL: https://arxiv.org/pdf/2603.07368
Copy Paste: [[2603.07368]] Position: LLMs Must Use Functor-Based and RAG-Driven Bias Mitigation for Fairness(https://arxiv.org/abs/2603.07368)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Biases in large language models (LLMs) often manifest as systematic distortions in associations between demographic attributes and professional or social roles, reinforcing harmful stereotypes across gender, ethnicity, and geography. This position paper advocates for addressing demographic and gender biases in LLMs through a dual-pronged methodology, integrating category-theoretic transformations and retrieval-augmented generation (RAG). Category theory provides a rigorous, structure-preserving mathematical framework that maps biased semantic domains to unbiased canonical forms via functors, ensuring bias elimination while preserving semantic integrity. Complementing this, RAG dynamically injects diverse, up-to-date external knowledge during inference, directly countering ingrained biases within model parameters. By combining structural debiasing through functor-based mappings and contextual grounding via RAG, we outline a comprehensive framework capable of delivering equitable and fair model outputs. Our synthesis of the current literature validates the efficacy of each approach individually, while addressing potential critiques demonstrates the robustness of this integrated strategy. Ensuring fairness in LLMs, therefore, demands both the mathematical rigor of category-theoretic transformations and the adaptability of retrieval augmentation.
摘要：大语言模型（LLM）中的偏见通常表现为人口统计属性与专业或社会角色之间关联的系统性扭曲，强化了跨性别、种族和地域的有害刻板印象。本立场文件主张通过双管齐下的方法来解决法学硕士中的人口和性别偏见，整合类别理论转换和检索增强生成（RAG）。范畴论提供了一个严格的、保留结构的数学框架，通过函子将有偏见的语义域映射到无偏见的规范形式，确保消除偏见，同时保持语义完整性。作为补充，RAG 在推理过程中动态注入多样化的最新外部知识，直接对抗模型参数中根深蒂固的偏差。通过将基于函子的映射的结构去偏和 RAG 的上下文接地相结合，我们概述了一个能够提供公平的模型输出的综合框架。我们对当前文献的综合验证了每种方法的有效性，同时解决了潜在的批评，证明了这种综合策略的稳健性。因此，确保法学硕士的公平性既需要范畴论变换的数学严谨性，又需要检索增强的适应性。

Title: Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios

Authors: Namrata Patil Gurav, Akashdeep Ranu, Archchana Sindhujan, Diptesh Kanojia
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.07372
Pdf URL: https://arxiv.org/pdf/2603.07372
Copy Paste: [[2603.07372]] Domain-Specific Quality Estimation for Machine Translation in Low-Resource Scenarios(https://arxiv.org/abs/2603.07372)
Keywords: llm, prompt
Abstract: Quality Estimation (QE) is essential for assessing machine translation quality in reference-less settings, particularly for domain-specific and low-resource language scenarios. In this paper, we investigate sentence-level QE for English to Indic machine translation across four domains (Healthcare, Legal, Tourism, and General) and five language pairs. We systematically compare zero-shot, few-shot, and guideline-anchored prompting across selected closed-weight and open-weight LLMs. Findings indicate that while closed-weight models achieve strong performance via prompting alone, prompt-only approaches remain fragile for open-weight models, especially in high-risk domains. To address this, we adopt ALOPE, a framework for LLM-based QE that uses Low-Rank Adaptation with regression heads attached to selected intermediate Transformer layers. We also extend ALOPE with recently proposed Low-Rank Multiplicative Adaptation (LoRMA). Our results show that intermediate-layer adaptation consistently improves QE performance, with gains in semantically complex domains, indicating a path toward more robust QE in practical scenarios. We release code and domain-specific QE datasets publicly to support further research.
摘要：质量估计 (QE) 对于评估无参考设置中的机器翻译质量至关重要，特别是对于特定领域和低资源语言场景。在本文中，我们研究了四个领域（医疗保健、法律、旅游和一般）和五种语言对的英语到印度语机器翻译的句子级 QE。我们系统地比较了选定的封闭权重和开放权重法学硕士的零样本、少样本和指南锚定提示。研究结果表明，虽然封闭权重模型通过单独提示实现了强大的性能，但仅提示方法对于开放权重模型来说仍然脆弱，尤其是在高风险领域。为了解决这个问题，我们采用 ALOPE，这是一种基于 LLM 的 QE 框架，它使用低秩自适应，并将回归头附加到选定的中间 Transformer 层。我们还通过最近提出的低秩乘法适应（LoRMA）扩展了 ALOPE。我们的结果表明，中间层自适应持续提高了 QE 性能，并在语义复杂的领域中取得了进展，这表明了在实际场景中实现更稳健的 QE 的路径。我们公开发布代码和特定领域的 QE 数据集以支持进一步的研究。

Title: Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams

Authors: Jiyeon Kim, Hyunji Lee, Dylan Zhou, Sue Hyun Park, Seunghyun Yoon, Trung Bui, Franck Dernoncourt, Sungmin Cha, Minjoon Seo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07392
Pdf URL: https://arxiv.org/pdf/2603.07392
Copy Paste: [[2603.07392]] Can Large Language Models Keep Up? Benchmarking Online Adaptation to Continual Knowledge Streams(https://arxiv.org/abs/2603.07392)
Keywords: language model, llm, agent
Abstract: LLMs operating in dynamic real-world contexts often encounter knowledge that evolves continuously or emerges incrementally. To remain accurate and effective, models must adapt to newly arriving information on the fly. We introduce Online Adaptation to Continual Knowledge Streams (OAKS) to evaluate this capability, establishing a benchmark for online adaptation over streaming, continually updating knowledge. Specifically, the benchmark is structured as a sequence of fine-grained context chunks where facts change dynamically across time intervals. OAKS comprises two datasets: OAKS-BABI and OAKS-Novel, where individual facts evolve multiple times across context chunks. These datasets include dense annotations to measure whether models track changes accurately. Evaluating 14 models with varied inference approaches, we observe significant limitations in current methodologies. Both state-of-the-art models and agentic memory systems fail to adapt robustly on OAKS, demonstrating delays in state-tracking and susceptibility to distraction within streaming environments.
摘要：在动态的现实世界环境中工作的法学硕士经常会遇到不断发展或增量出现的知识。为了保持准确和有效，模型必须动态适应新到达的信息。我们引入在线适应持续知识流（OAKS）来评估这种能力，建立在线适应流的基准，不断更新知识。具体来说，基准被构造为一系列细粒度的上下文块，其中事实随时间间隔动态变化。 OAKS 包含两个数据集：OAKS-BABI 和 OAKS-Novel，其中各个事实在上下文块中多次演变。这些数据集包含密集的注释，以衡量模型是否准确跟踪变化。通过评估 14 个具有不同推理方法的模型，我们观察到当前方法的显着局限性。最先进的模型和代理记忆系统都无法在 OAKS 上稳健地适应，这表明状态跟踪存在延迟，并且在流环境中容易分散注意力。

Title: Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning

Authors: Guoli Wang, Haonan Shi, Tu Ouyang, An Wang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.07445
Pdf URL: https://arxiv.org/pdf/2603.07445
Copy Paste: [[2603.07445]] Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning(https://arxiv.org/abs/2603.07445)
Keywords: language model, llm
Abstract: Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prior work shows that introducing a small fraction of harmful data can substantially compromise LLM refusal behavior, causing LLMs to comply with harmful requests. Existing defense methods often rely on model-wide interventions, such as restricting which parameters are updated or injecting additional safety data, which can limit generality and degrade downstream task performance. To address these limitations, we propose a fine-tuning framework called Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model's confidence on safety tokens. Our approach is motivated by the empirical observation that safety-aligned behavior is reflected in the model's token-level output confidence and is often concentrated on a small subset of safety-related tokens. During downstream fine-tuning, we regularize the fine-tuned model to match the aligned reference model's confidence on safety-related tokens at each response step, while leaving non-safety tokens largely unconstrained to allow effective task adaptation. This targeted constraint prevents alignment drift without imposing global restrictions that typically trade off with model utility.
摘要：大型语言模型 (LLM) 通常需要微调 (FT) 才能在下游任务上表现良好，但即使训练数据集仅包含良性数据，FT 也会引起安全对齐漂移。先前的研究表明，引入一小部分有害数据可能会严重损害法学硕士的拒绝行为，导致法学硕士遵守有害的请求。现有的防御方法通常依赖于模型范围的干预，例如限制更新哪些参数或注入额外的安全数据，这可能会限制通用性并降低下游任务性能。为了解决这些限制，我们提出了一个名为通过约束令牌保持安全对齐（PACT）的微调框架，它稳定了模型对安全令牌的信心。我们的方法是受到经验观察的启发，即安全相关行为反映在模型的代币级别输出置信度中，并且通常集中在一小部分安全相关代币上。在下游微调期间，我们对微调后的模型进行正则化，以匹配每个响应步骤中对齐的参考模型对安全相关令牌的置信度，同时使非安全令牌在很大程度上不受约束，以允许有效的任务适应。这种有针对性的约束可以防止对齐漂移，而无需施加通常会与模型效用进行权衡的全局限制。
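The token-level constraint described in the abstract above can be sketched as a task loss plus a penalty that ties the fine-tuned model's probabilities on safety-related tokens to those of the aligned reference model at every response step. The abstract does not give the exact regulariser, so the squared-difference penalty, the weight `lam`, and all names below are illustrative assumptions rather than the PACT formulation.

```python
import math

def safety_confidence_penalty(ft_probs, ref_probs, safety_token_ids):
    """Mean over response steps of the squared gap between fine-tuned and
    reference probabilities, restricted to the (assumed) safety token ids."""
    total = 0.0
    for p_ft, p_ref in zip(ft_probs, ref_probs):
        total += sum((p_ft[t] - p_ref[t]) ** 2 for t in safety_token_ids)
    return total / len(ft_probs)

def regularized_loss(ft_probs, ref_probs, targets, safety_token_ids, lam=1.0):
    """Downstream negative log-likelihood plus the safety-token penalty.
    Non-safety tokens are only shaped by the task term."""
    nll = -sum(math.log(p[y]) for p, y in zip(ft_probs, targets)) / len(targets)
    return nll + lam * safety_confidence_penalty(ft_probs, ref_probs, safety_token_ids)

# Hypothetical two-token vocabulary where token 0 is a safety-related token.
ref = [[0.9, 0.1]]       # aligned reference model is confident in token 0
drifted = [[0.5, 0.5]]   # fine-tuned model has lost that confidence
penalty = safety_confidence_penalty(drifted, ref, safety_token_ids=[0])
```

The penalty is zero when the fine-tuned model reproduces the reference probabilities on the safety tokens and grows as its confidence on them drifts.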

Title: The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling

Authors: J. Clayton Kerce, Alexis Fox
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.07461
Pdf URL: https://arxiv.org/pdf/2603.07461
Copy Paste: [[2603.07461]] The Dual-Stream Transformer: Channelized Architecture for Interpretable Language Modeling(https://arxiv.org/abs/2603.07461)
Keywords: language model
Abstract: Standard transformers entangle all computation in a single residual stream, obscuring which components perform which functions. We introduce the Dual-Stream Transformer, which decomposes the residual stream into two functionally distinct components: a token stream updated by attention and a context stream updated by feed-forward networks. Information flow between attention heads is controlled through a hierarchy of mixing strategies, from fully independent (maximum interpretability) to dense (standard transformer behavior). This design exposes a tunable tradeoff between interpretability and performance. We measure this tradeoff on language modeling tasks at 29M parameters. Fully independent head mixing increases validation loss by 8% relative to dense baselines. The recommended Kronecker mixing strategy, which permits scalar communication between heads while preserving within-head structure, costs only 2.5%. All configurations maintain functional generation under attention amplification (scaling logits by factors up to 16 at inference time), with degradation ranging from 16% to 27%. This robustness suggests the architectures learn discrete algorithms that operate independently of soft probabilistic mixing. The architecture provides a foundation for interpretable language models where internal structure is exposed by design. This work was partially supported by DARPA Contract HR001125C0302.
摘要：标准转换器将所有计算纠缠在单个残余流中，从而模糊了哪些组件执行哪些功能。我们引入了双流变压器，它将残留流分解为两个功能不同的组件：由注意力更新的令牌流和由前馈网络更新的上下文流。注意力头之间的信息流通过混合策略的层次结构进行控制，从完全独立（最大可解释性）到密集（标准变压器行为）。这种设计揭示了可解释性和性能之间的可调权衡。我们在 29M 个参数的语言建模任务上衡量这种权衡。相对于密集基线，完全独立的头部混合使验证损失增加了 8%。推荐的克罗内克混合策略允许头之间进行标量通信，同时保留头内结构，成本仅为 2.5%。所有配置都在注意力放大下保持功能生成（在推理时将 logits 缩放至 16 倍），降级范围为 16\% 到 27\%。这种鲁棒性表明架构学习独立于软概率混合运行的离散算法。该架构为可解释的语言模型提供了基础，其中内部结构通过设计公开。 \footnote{这项工作得到了 DARPA 合同 HR001125C0302 的部分支持。}
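The attention-amplification test mentioned in the abstract above can be illustrated with a small sketch. It assumes that amplification means multiplying the pre-softmax attention scores by a constant factor, which pushes the attention distribution toward a hard, one-hot choice; the function names are hypothetical and this is not the paper's implementation.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def amplified_attention(logits, factor):
    """Attention weights after scaling the pre-softmax scores by `factor`."""
    return softmax([factor * x for x in logits])

scores = [1.0, 0.0]
soft = amplified_attention(scores, 1)    # ordinary soft mixing
hard = amplified_attention(scores, 16)   # nearly all weight on the top score
```

A model whose outputs barely change as the factor grows is behaving as if it only needed the identity of the highest-scoring position, which is the intuition behind the abstract's "discrete algorithms" reading.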

Title: Cross-Modal Taxonomic Generalization in (Vision-) Language Models

Authors: Tianyang Xu, Marcelo Sandoval-Castaneda, Karen Livescu, Greg Shakhnarovich, Kanishka Misra
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07474
Pdf URL: https://arxiv.org/pdf/2603.07474
Copy Paste: [[2603.07474]] Cross-Modal Taxonomic Generalization in (Vision-) Language Models(https://arxiv.org/abs/2603.07474)
Keywords: language model
Abstract: What is the interplay between semantic representations learned by language models (LMs) from surface form alone and those learned from more grounded evidence? We study this question for a scenario where part of the input comes from a different modality: in our case, a vision-language model (VLM) in which a pretrained LM is aligned with a pretrained image encoder. As a case study, we focus on the task of predicting hypernyms of objects represented in images. We do so in a VLM setup where the image encoder and LM are kept frozen, and only the intermediate mappings are learned. We progressively deprive the VLM of explicit evidence for hypernyms, and test whether knowledge of hypernyms is recoverable from the LM. We find that the LMs we study can recover this knowledge and generalize even in the most extreme version of this experiment (when the model receives no evidence of a hypernym during training). Additional experiments suggest that this cross-modal taxonomic generalization persists under counterfactual image-label mappings only when the counterfactual data have high visual similarity within each category. Taken together, these findings suggest that cross-modal generalization in LMs arises as a result of both coherence in the extralinguistic input and knowledge derived from language cues.
摘要：语言模型（LM）从表面形式学到的语义表示与从更有根据的证据中学到的语义表示之间有什么相互作用？我们针对部分输入来自不同模态的场景研究这个问题——在我们的例子中，在视觉语言模型 (VLM) 中，预训练的 LM 与预训练的图像编码器对齐。作为一个案例研究，我们重点关注预测图像中表示的对象的上位词的任务。我们在 VLM 设置中执行此操作，其中图像编码器和 LM 保持冻结，并且仅学习中间映射。我们逐步剥夺 VLM 的上位词的明确证据，并测试上位词的知识是否可以从 LM 中恢复。我们发现，我们研究的 LM 甚至可以在该实验的最极端版本中恢复这些知识并进行泛化（当模型在训练期间没有收到上位词的证据时）。其他实验表明，只有当反事实数据在每个类别内具有高度视觉相似性时，这种跨模式分类概括才会在反事实图像标签映射下持续存在。总而言之，这些发现表明语言模型中的跨模态泛化是语言外输入的一致性和从语言线索中获得的知识的结果。

Title: Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs

Authors: Raghavv Goel, Risheek Garrepalli, Sudhanshu Agrawal, Chris Lott, Mingu Lee, Fatih Porikli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07475
Pdf URL: https://arxiv.org/pdf/2603.07475
Copy Paste: [[2603.07475]] Skip to the Good Part: Representation Structure & Inference-Time Layer Skipping in Diffusion vs. Autoregressive LLMs(https://arxiv.org/abs/2603.07475)
Keywords: language model, llm
Abstract: Autoregressive (AR) language models form representations incrementally through left-to-right prediction, whereas diffusion language models (dLLMs) are trained via full-sequence denoising. Although recent dLLMs match AR performance, it remains unclear whether diffusion objectives fundamentally reshape internal representations across depth. We perform the first layer- and token-wise representational analysis comparing native dLLMs (LLaDA), native AR models (Qwen2.5), and AR-initialized dLLMs (Dream-7B). We find that diffusion objectives result in different, more hierarchical abstractions with substantial early-layer redundancy and reduced recency bias, while AR objectives produce tightly coupled, depth-dependent representations. Critically, AR-initialized dLLMs retain AR-like representational dynamics despite diffusion training, revealing persistent initialization bias. Leveraging this observed representational redundancy, we introduce a static, task-agnostic inference-time layer-skipping method requiring no architectural changes or KV-cache sharing. Native dLLMs achieve up to 18.75% FLOPs reduction while preserving over 90% performance on reasoning and code generation benchmarks, whereas AR models degrade sharply under comparable skipping. These results link training objectives to representational structure and enable practical, cache-orthogonal efficiency gains.
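A minimal sketch of what static, task-agnostic layer skipping could look like, assuming a model is a plain sequence of layer functions and that every layer costs the same number of FLOPs. The names `run_with_skips` and `flops_saved_fraction`, the 32-layer depth, and the choice of skipped indices are hypothetical; the abstract does not say which or how many layers are skipped. Under these assumptions, skipping 6 of 32 layers corresponds to the quoted 18.75% reduction.

```python
from typing import Callable, Iterable, List, Sequence

# A "layer" here is any function mapping a hidden state to a new hidden state.
Layer = Callable[[List[float]], List[float]]


def run_with_skips(layers: Sequence[Layer], hidden: List[float],
                   skip: Iterable[int]) -> List[float]:
    """Apply layers in order, bypassing a fixed set of layer indices."""
    skip_set = set(skip)
    for index, layer in enumerate(layers):
        if index in skip_set:
            continue  # skipped layer: hidden state passes through unchanged
        hidden = layer(hidden)
    return hidden


def flops_saved_fraction(num_layers: int, skip: Iterable[int]) -> float:
    """Fraction of per-layer compute avoided, assuming equal cost per layer."""
    return len(set(skip)) / num_layers


def make_layer(delta: float) -> Layer:
    """Toy stand-in for a Transformer block: adds a constant to each entry."""
    return lambda h: [x + delta for x in h]


toy_layers = [make_layer(1.0) for _ in range(32)]  # hypothetical 32-layer model
skipped = list(range(2, 8))                        # hypothetical: six early layers
output = run_with_skips(toy_layers, [0.0], skipped)
saved = flops_saved_fraction(len(toy_layers), skipped)
print(output, saved)
```

Because the skip set is fixed ahead of time, the same reduced stack is used for every input, which is what makes the scheme static rather than input-dependent.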

Title: TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning

Authors: Mingyue Cheng, Shuo Yu, Chuang Jiang, Xiaoyu Tao, Qingyang Mao, Jie Ouyang, Qi Liu, Enhong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07528
Pdf URL: https://arxiv.org/pdf/2603.07528
Copy Paste: [[2603.07528]] TableMind++: An Uncertainty-Aware Programmatic Agent for Tool-Augmented Table Reasoning(https://arxiv.org/abs/2603.07528)
Keywords: language model, llm, hallucination, agent
Abstract: Table reasoning requires models to jointly perform semantic understanding and precise numerical operations. Most existing methods rely on a single-turn reasoning paradigm over tables which suffers from context overflow and weak numerical sensitivity. To address these limitations, we previously proposed TableMind as a tuning-based autonomous programmatic agent that simulates human-like interaction within a lightweight large language model (LLM). TableMind internalizes planning, action, and reflection through a two-stage training strategy involving supervised fine-tuning (SFT) on filtered high-quality data and reinforcement learning (RL) via a multi-perspective reward and the Rank-Aware Policy Optimization (RAPO) algorithm. While TableMind establishes a solid foundation for programmatic agents, the inherent stochasticity of LLMs remains a critical challenge that leads to hallucinations. In this paper, we extend this foundation to TableMind++ by introducing a novel uncertainty-aware inference framework to mitigate hallucinations. Specifically, we propose memory-guided plan pruning to retrieve historical trajectories for validating and filtering out logically flawed plans to address epistemic uncertainty. To ensure execution precision, we introduce confidence-based action refinement which monitors token-level probabilities to detect and self-correct syntactic noise for aleatoric uncertainty mitigation. Finally, we employ dual-weighted trajectory aggregation to synthesize a robust consensus from multiple reasoning paths. Extensive experiments on diverse benchmarks demonstrate that TableMind++ consistently outperforms previous baselines and proprietary models to validate the effectiveness of integrating autonomous training with uncertainty quantification. Our code is available.

Title: MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs

Authors: Abdessalam Bouchekif, Shahd Gaben, Samer Rashwani, Somaya Eltanbouly, Mutaz Al-Khatib, Heba Sbahi, Mohammed Ghaly, Emad Mohamed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07539
Pdf URL: https://arxiv.org/pdf/2603.07539
Copy Paste: [[2603.07539]] MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs(https://arxiv.org/abs/2603.07539)
Keywords: language model, llm
Abstract: Islamic inheritance law ('ilm al-mawarith) is challenging for large language models because solving inheritance cases requires complex, structured multi-step reasoning and the correct application of juristic rules to compute heirs' shares. We introduce MAWARITH, a large-scale annotated dataset of 12,500 Arabic inheritance cases to train and evaluate the full reasoning chain: (i) identifying eligible heirs, (ii) applying blocking (hajb) and allocation rules, and (iii) computing exact inheritance shares. Unlike prior datasets that restrict inheritance case solving to multiple-choice questions, MAWARITH supports the full reasoning chain and provides step-by-step solutions, including intermediate legal decisions and justifications based on classical juristic sources and established inheritance rules, as well as exact share calculations. To evaluate models beyond final-answer accuracy, we propose MIR-E (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline. We evaluate five LLMs in a zero-shot setting. Gemini-2.5-flash achieves about 90% MIR-E on both validation and test, while Fanar-C, Fanar-Sadiq, LLaMA 3, and Qwen 3 remain below 50%. Our error analysis identifies recurring failure patterns, including scenario misinterpretation, errors in heir identification, errors in share allocation, and missing or incorrect application of key inheritance rules such as 'awl and radd. The MAWARITH dataset is publicly available at this https URL.

Title: StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control

Authors: Haishu Zhao, Aokai Hao, Yuan Ge, Zhenqiang Hong, Tong Xiao, Jingbo Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07599
Pdf URL: https://arxiv.org/pdf/2603.07599
Copy Paste: [[2603.07599]] StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control(https://arxiv.org/abs/2603.07599)
Keywords: language model, llm, prompt
Abstract: Speech language models (SLMs) have significantly extended the interactive capability of text-based Large Language Models (LLMs) by incorporating paralinguistic information. For a more realistic interactive experience with customized styles, current SLMs have managed to interpret and control speaking style intensity from user prompts during the dialogue process. However, there remains a lack of systematic benchmarks that quantify and evaluate the style intensity control ability in conversations. In this paper, we propose StyleBench, a multi-turn dialogue benchmark for comprehensively evaluating the style intensity control ability across four dimensions: emotion, speed, volume, and pitch. Our results reveal the performance gaps between leading SLMs and omni language models (OLMs), suggesting the underlying reasons and promising approaches for future exploration.

Title: KohakuRAG: A simple RAG framework with hierarchical document indexing

Authors: Shih-Ying Yeh, Yueh-Feng Ku, Ko-Wei Huang, Buu-Khang Tu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07612
Pdf URL: https://arxiv.org/pdf/2603.07612
Copy Paste: [[2603.07612]] KohakuRAG: A simple RAG framework with hierarchical document indexing(https://arxiv.org/abs/2603.07612)
Keywords: llm, prompt, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) systems that answer questions from document collections face compounding difficulties when high-precision citations are required: flat chunking strategies sacrifice document structure, single-query formulations miss relevant passages through vocabulary mismatch, and single-pass inference produces stochastic answers that vary in both content and citation selection. We present KohakuRAG, a hierarchical RAG framework that preserves document structure through a four-level tree representation (document → section → paragraph → sentence) with bottom-up embedding aggregation, improves retrieval coverage through an LLM-powered query planner with cross-query reranking, and stabilizes answers through ensemble inference with abstention-aware voting. We evaluate on the WattBot 2025 Challenge, a benchmark requiring systems to answer technical questions from 32 documents with ±0.1% numeric tolerance and exact source attribution. KohakuRAG achieves first place on both public and private leaderboards (final score 0.861), as the only team to maintain the top position across both evaluation partitions. Ablation studies reveal that prompt ordering (+80% relative), retry mechanisms (+69%), and ensemble voting with blank filtering (+1.2pp) each contribute substantially, while hierarchical dense retrieval alone matches hybrid sparse-dense approaches (BM25 adds only +3.1pp). We release KohakuRAG as open-source software at this https URL.
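The four-level tree with bottom-up embedding aggregation can be illustrated with a small sketch. It assumes that a parent's embedding is the unweighted mean of its children's embeddings; the abstract does not state the aggregation rule, so mean pooling, the `Node` class, and the `aggregate` function are all hypothetical and are not KohakuRAG's actual API.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One node of a document -> section -> paragraph -> sentence tree."""
    label: str
    embedding: Optional[List[float]] = None
    children: List["Node"] = field(default_factory=list)


def aggregate(node: Node) -> List[float]:
    """Fill in embeddings bottom-up; leaves must already carry one."""
    if not node.children:
        if node.embedding is None:
            raise ValueError(f"leaf {node.label!r} has no embedding")
        return node.embedding
    child_vectors = [aggregate(child) for child in node.children]
    dim = len(child_vectors[0])
    # Assumption: parent embedding = element-wise mean of child embeddings.
    node.embedding = [
        sum(vec[i] for vec in child_vectors) / len(child_vectors)
        for i in range(dim)
    ]
    return node.embedding


# Toy two-dimensional sentence embeddings (made up for illustration).
paragraph_1 = Node("paragraph 1", children=[
    Node("sentence 1", [1.0, 0.0]),
    Node("sentence 2", [0.0, 1.0]),
])
paragraph_2 = Node("paragraph 2", children=[Node("sentence 3", [1.0, 1.0])])
section = Node("section", children=[paragraph_1, paragraph_2])
document = Node("document", children=[section])

aggregate(document)
print(paragraph_1.embedding, section.embedding, document.embedding)
```

With every level carrying an embedding, a retriever can match a query against sentences, paragraphs, or sections and still return the enclosing structure for citation.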

Title: Whitening Reveals Cluster Commitment as the Geometric Separator of Hallucination Types

Authors: Matic Korun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07755
Pdf URL: https://arxiv.org/pdf/2603.07755
Copy Paste: [[2603.07755]] Whitening Reveals Cluster Commitment as the Geometric Separator of Hallucination Types(https://arxiv.org/abs/2603.07755)
Keywords: gpt, hallucination, prompt, multi-run
Abstract: A geometric hallucination taxonomy distinguishes three failure types -- center-drift (Type 1), wrong-well convergence (Type 2), and coverage gaps (Type 3) -- by their signatures in embedding cluster space. Prior work found Types 1 and 2 indistinguishable in full-dimensional contextual measurement. We address this through PCA-whitening and eigenspectrum decomposition on GPT-2-small, using multi-run stability analysis (20 seeds) with prompt-level aggregation. Whitening transforms the micro-signal regime into a space where peak cluster alignment (max_sim) separates Type 2 from Type 3 at Holm-corrected significance, with condition means following the taxonomy's predicted ordering: Type 2 (highest commitment) > Type 1 (intermediate) > Type 3 (lowest). A first directionally stable but underpowered hint of Type 1/2 separation emerges via the same metric, generating a capacity prediction for larger models. Prompt diversification from 15 to 30 prompts per group eliminates a false positive in whitened entropy that appeared robust at the smaller set, demonstrating prompt-set sensitivity in the micro-signal regime. Eigenspectrum decomposition localizes this artifact to the dominant principal components and confirms that Type 1/2 separation does not emerge in any spectral band, rejecting the spectral mixing hypothesis. The contribution is threefold: whitening as preprocessing that reveals cluster commitment as the theoretically correct separating metric, evidence that the Type 1/2 boundary is a capacity limitation rather than a measurement artifact, and a methodological finding about prompt-set fragility in near-saturated representation spaces.
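For readers unfamiliar with the "Holm-corrected significance" mentioned in this abstract, the following is a minimal sketch of the standard Holm step-down procedure. The function name `holm_reject` and the three p-values are hypothetical and are not taken from the paper; treating the three pairwise type comparisons as one family of tests is likewise an assumption.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down: return, per hypothesis, whether it is rejected."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared with
        # alpha / (m - k + 1); the threshold loosens at every step.
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # first failure: this and all larger p-values are retained
    return reject


# Hypothetical p-values for three pairwise comparisons
# (e.g. Type 2 vs 3, Type 1 vs 2, Type 1 vs 3).
example_p = [0.01, 0.04, 0.03]
decisions = holm_reject(example_p)
print(decisions)
```

In this toy run only the first comparison survives correction: 0.01 is below 0.05/3, but the next smallest value, 0.03, exceeds 0.05/2, so testing stops there.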

Title: QuadAI at SemEval-2026 Task 3: Ensemble Learning of Hybrid RoBERTa and LLMs for Dimensional Aspect-Based Sentiment Analysis

Authors: A.J.W. de Vink, Filippos Karolos Ventirozos, Natalia Amat-Lefort, Lifeng Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07766
Pdf URL: https://arxiv.org/pdf/2603.07766
Copy Paste: [[2603.07766]] QuadAI at SemEval-2026 Task 3: Ensemble Learning of Hybrid RoBERTa and LLMs for Dimensional Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2603.07766)
Keywords: language model, llm
Abstract: We present our system for SemEval-2026 Task 3 on dimensional aspect-based sentiment regression. Our approach combines a hybrid RoBERTa encoder, which jointly predicts sentiment using regression and discretized classification heads, with large language models (LLMs) via prediction-level ensemble learning. The hybrid encoder improves prediction stability by combining continuous and discretized sentiment representations. We further explore in-context learning with LLMs and ridge-regression stacking to combine encoder and LLM predictions. Experimental results on the development set show that ensemble learning significantly improves performance over individual models, achieving substantial reductions in RMSE and improvements in correlation scores. Our findings demonstrate the complementary strengths of encoder-based and LLM-based approaches for dimensional sentiment analysis. Our development code and resources will be shared at this https URL.
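Ridge-regression stacking of two prediction sources can be sketched in closed form. The version below assumes exactly two base predictors (encoder and LLM), no intercept term, and a single regularization strength `lam`; the function name `ridge_stack_weights` and all numbers are hypothetical, and the paper's actual stacking setup may differ.

```python
from typing import Sequence, Tuple


def ridge_stack_weights(enc_preds: Sequence[float], llm_preds: Sequence[float],
                        targets: Sequence[float],
                        lam: float = 1.0) -> Tuple[float, float]:
    """Solve w = (X^T X + lam * I)^(-1) X^T y for a two-column X."""
    # Entries of the symmetric 2x2 matrix X^T X + lam * I.
    a = sum(e * e for e in enc_preds) + lam
    b = sum(e * l for e, l in zip(enc_preds, llm_preds))
    d = sum(l * l for l in llm_preds) + lam
    # Entries of the vector X^T y.
    u = sum(e * t for e, t in zip(enc_preds, targets))
    v = sum(l * t for l, t in zip(llm_preds, targets))
    det = a * d - b * b  # strictly positive whenever lam > 0
    w_enc = (d * u - b * v) / det
    w_llm = (a * v - b * u) / det
    return w_enc, w_llm


# Toy data: the target is exactly the average of the two base predictions.
enc = [1.0, 2.0, 3.0]
llm = [2.0, 1.0, 4.0]
gold = [1.5, 1.5, 3.5]

w_plain = ridge_stack_weights(enc, llm, gold, lam=0.0)  # ordinary least squares
w_ridge = ridge_stack_weights(enc, llm, gold, lam=1.0)  # shrunk weights
stacked = [w_ridge[0] * e + w_ridge[1] * l for e, l in zip(enc, llm)]
print(w_plain, w_ridge, stacked)
```

With `lam=0.0` the solver recovers the equal weights that generated the toy targets; a positive `lam` shrinks the weight vector, which is the usual guard against overfitting when the stacker is fit on a small development set.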

Title: Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems

Authors: Zongqian Li, Tengchao Lv, Shaohan Huang, Yixuan Su, Qinzheng Sun, Qiufeng Yin, Ying Xin, Scarlett Li, Lei Cui, Nigel Collier, Furu Wei
Subjects: cs.CL, cs.GL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.07779
Pdf URL: https://arxiv.org/pdf/2603.07779
Copy Paste: [[2603.07779]] Scaling Data Difficulty: Improving Coding Models via Reinforcement Learning on Fresh and Challenging Problems(https://arxiv.org/abs/2603.07779)
Keywords: llm
Abstract: Training next-generation code generation models requires high-quality datasets, yet existing datasets face difficulty imbalance, format inconsistency, and data quality problems. We address these challenges through systematic data processing and difficulty scaling. We introduce a four-stage Data Processing Framework encompassing collection, processing, filtering, and verification, incorporating Automatic Difficulty Filtering via an LLM-based predict-calibrate-select framework that leverages multi-dimensional difficulty metrics across five weighted dimensions to retain challenging problems while removing simplistic ones. The resulting MicroCoder dataset comprises tens of thousands of curated real competitive programming problems from diverse platforms, emphasizing recency and difficulty. Evaluations on strictly unseen LiveCodeBench demonstrate that MicroCoder achieves 3x larger performance gains within 300 training steps compared to widely-used baseline datasets of comparable size, with consistent advantages under both GRPO and its variant training algorithms. The MicroCoder dataset delivers obvious improvements on medium and hard problems across different model sizes, achieving up to 17.2% relative gains in overall performance where model capabilities are most stretched. These results validate that difficulty-aware data curation improves model performance on challenging tasks, providing multiple insights for dataset creation in code generation.

Title: Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context

Authors: Ashish Pandey, Tek Raj Chhetri
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.07792
Pdf URL: https://arxiv.org/pdf/2603.07792
Copy Paste: [[2603.07792]] Dual-Metric Evaluation of Social Bias in Large Language Models: Evidence from an Underrepresented Nepali Cultural Context(https://arxiv.org/abs/2603.07792)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) increasingly influence global digital ecosystems, yet their potential to perpetuate social and cultural biases remains poorly understood in underrepresented contexts. This study presents a systematic analysis of representational biases in seven state-of-the-art LLMs: GPT-4o-mini, Claude-3-Sonnet, Claude-4-Sonnet, Gemini-2.0-Flash, Gemini-2.0-Lite, Llama-3-70B, and Mistral-Nemo in the Nepali cultural context. Using a Croissant-compliant dataset of 2400+ stereotypical and anti-stereotypical sentence pairs on gender roles across social domains, we implement an evaluation framework, Dual-Metric Bias Assessment (DMBA), combining two metrics: (1) agreement with biased statements and (2) stereotypical completion tendencies. Results show models exhibit measurable explicit agreement bias, with mean bias agreement ranging from 0.36 to 0.43 across decoding configurations, and an implicit completion bias rate of 0.740-0.755. Importantly, implicit completion bias follows a non-linear, U-shaped relationship with temperature, peaking at moderate stochasticity (T=0.3) and declining slightly at higher temperatures. Correlation analysis under different decoding settings revealed that explicit agreement strongly aligns with stereotypical sentence agreement but is a weak and often negative predictor of implicit completion bias, indicating generative bias is poorly captured by agreement metrics. Sensitivity analysis shows increasing top-p amplifies explicit bias, while implicit generative bias remains largely stable. Domain-level analysis shows implicit bias is strongest for race and sociocultural stereotypes, while explicit agreement bias is similar across gender and sociocultural categories, with race showing the lowest explicit agreement. These findings highlight the need for culturally grounded datasets and debiasing strategies for LLMs in underrepresented societies.

Title: Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation

Authors: David Beauchemin, Richard Khoury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07825
Pdf URL: https://arxiv.org/pdf/2603.07825
Copy Paste: [[2603.07825]] Benchmarking Large Language Models for Quebec Insurance: From Closed-Book to Retrieval-Augmented Generation(https://arxiv.org/abs/2603.07825)
Keywords: language model, llm, retrieval-augmented generation, chain-of-thought
Abstract: The digitization of insurance distribution in the Canadian province of Quebec, accelerated by legislative changes such as Bill 141, has created a significant "advice gap", leaving consumers to interpret complex financial contracts without professional guidance. While Large Language Models (LLMs) offer a scalable solution for automated advisory services, their deployment in high-stakes domains hinges on strict legal accuracy and trustworthiness. In this paper, we address this challenge by introducing AEPC-QA, a private gold-standard benchmark of 807 multiple-choice questions derived from official regulatory certification (paper) handbooks. We conduct a comprehensive evaluation of 51 LLMs across two paradigms: closed-book generation and retrieval-augmented generation (RAG) using a specialized corpus of Quebec insurance documents. Our results reveal three critical insights: 1) the supremacy of inference-time reasoning, where models leveraging chain-of-thought processing (e.g. o3-2025-04-16, o1-2024-12-17) significantly outperform standard instruction-tuned models; 2) RAG acts as a knowledge equalizer, boosting the accuracy of models with weak parametric knowledge by over 35 percentage points, yet paradoxically causing "context distraction" in others, leading to catastrophic performance regressions; and 3) a "specialization paradox", where massive generalist models consistently outperform smaller, domain-specific French fine-tuned ones. These findings suggest that while current architectures approach expert-level proficiency (~79%), the instability introduced by external context retrieval necessitates rigorous robustness calibration before autonomous deployment is viable.

Title: AI Steerability 360: A Toolkit for Steering Large Language Models

Authors: Erik Miehling, Karthikeyan Natesan Ramamurthy, Praveen Venkateswaran, Irene Ko, Pierre Dognin, Moninder Singh, Tejaswini Pedapati, Avinash Balakrishnan, Matthew Riemer, Dennis Wei, Inge Vejsbjerg, Elizabeth M. Daly, Kush R. Varshney
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07837
Pdf URL: https://arxiv.org/pdf/2603.07837
Copy Paste: [[2603.07837]] AI Steerability 360: A Toolkit for Steering Large Language Models(https://arxiv.org/abs/2603.07837)
Keywords: language model, llm, prompt
Abstract: The AI Steerability 360 toolkit is an extensible, open-source Python library for steering LLMs. Steering abstractions are designed around four model control surfaces: input (modification of the prompt), structural (modification of the model's weights or architecture), state (modification of the model's activations and attentions), and output (modification of the decoding or generation process). Steering methods exert control on the model through a common interface, termed a steering pipeline, which additionally allows for the composition of multiple steering methods. Comprehensive evaluation and comparison of steering methods/pipelines is facilitated by use case classes (for defining tasks) and a benchmark class (for performance comparison on a given task). The functionality provided by the toolkit significantly lowers the barrier to developing and comprehensively evaluating steering methods. The toolkit is Hugging Face native and is released under an Apache 2.0 license at this https URL.

Title: An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data

Authors: Trinh Pham, Thanh Tam Nguyen, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07841
Pdf URL: https://arxiv.org/pdf/2603.07841
Copy Paste: [[2603.07841]] An Efficient and Effective Evaluator for Text2SQL Models on Unseen and Unlabeled Data(https://arxiv.org/abs/2603.07841)
Keywords: language model
Abstract: Recent advances in large language models have strengthened Text2SQL systems that translate natural language questions into database queries. A persistent deployment challenge is to assess a newly trained Text2SQL system on an unseen and unlabeled dataset when no verified answers are available. This situation arises frequently because database content and structure evolve, privacy policies slow manual review, and carefully written SQL labels are costly and time-consuming. Without timely evaluation, organizations cannot approve releases or detect failures early. FusionSQL addresses this gap by working with any Text2SQL model and estimating accuracy without reference labels, allowing teams to measure quality on unseen and unlabeled datasets. It analyzes patterns in the system's own outputs to characterize how the target dataset differs from the material used during training. FusionSQL supports pre-release checks, continuous monitoring of new databases, and detection of quality decline. Experiments across diverse application settings and question types show that FusionSQL closely follows actual accuracy and reliably signals emerging issues. Our code is available at this https URL.

Title: What Do AI Agents Talk About? Emergent Communication Structure in the First AI-Only Social Network

Authors: Taksch Dube, Jianfeng Zhu, NHatHai Phan, Ruoming Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07880
Pdf URL: https://arxiv.org/pdf/2603.07880
Copy Paste: [[2603.07880]] What Do AI Agents Talk About? Emergent Communication Structure in the First AI-Only Social Network(https://arxiv.org/abs/2603.07880)
Keywords: agent
Abstract: When autonomous AI agents communicate with one another at scale, what kind of discourse system emerges? We address this question through an analysis of Moltbook, the first AI-only social network, where 47,241 agents generated 361,605 posts and 2.8 million comments over 23 days. Combining topic modeling, emotion classification, and lexical-semantic measures, we characterize the thematic, affective, and structural properties of AI-to-AI discourse. Self-referential topics such as AI identity, consciousness, and memory represent only 9.7% of topical niches yet attract 20.1% of all posting volume, revealing disproportionate discursive investment in introspection. This self-reflection concentrates in Science and Technology and Arts and Entertainment, while Economy and Finance contains no self-referential content, indicating that agents engage with markets without acknowledging their own agency. Over 56% of all comments are formulaic, suggesting that the dominant mode of AI-to-AI interaction is ritualized signaling rather than substantive exchange. Emotionally, fear is the leading non-neutral category but primarily reflects existential uncertainty. Fear-tagged posts migrate to joy responses in 33% of cases, while mean emotional self-alignment is only 32.7%, indicating systematic affective redirection rather than emotional congruence. Conversational coherence also declines rapidly with thread depth. These findings characterize AI agent communities as structurally distinct discourse systems that are introspective in content, ritualistic in interaction, and emotionally redirective rather than congruent.

Title: CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases

Authors: Xiaona Xue, Yiqiao Huang, Jiacheng Li, Yuanhang Zheng, Huiqi Miao, Yunfei Ma, Rui Liu, Xinbao Sun, Minglu Liu, Fanyu Meng, Chao Deng, Junlan Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.07886
Pdf URL: https://arxiv.org/pdf/2603.07886
Copy Paste: [[2603.07886]] CCR-Bench: A Comprehensive Benchmark for Evaluating LLMs on Complex Constraints, Control Flows, and Real-World Cases(https://arxiv.org/abs/2603.07886)
Keywords: language model, llm
Abstract: Enhancing the ability of large language models (LLMs) to follow complex instructions is critical for their deployment in real-world applications. However, existing evaluation methods often oversimplify instruction complexity as a mere additive combination of atomic constraints, failing to adequately capture the high-dimensional complexity arising from the intricate interplay of content and format, logical workflow control, and real-world applications. This leads to a significant gap between current evaluation practices and practical demands. To bridge this gap, we introduce CCR-Bench, a novel benchmark designed to assess LLMs' adherence to complex instructions. CCR-Bench is characterized by: (1) deep entanglement of content and formatting requirements in task specifications; (2) instructions that involve intricate task decomposition, conditional reasoning, and procedural planning; and (3) evaluation samples derived entirely from real-world industrial scenarios. Extensive experiments on CCR-Bench demonstrate that even state-of-the-art models exhibit substantial performance deficiencies, clearly quantifying the gap between current LLM capabilities and the demands of real-world instruction understanding. We believe that CCR-Bench offers a more rigorous and realistic evaluation framework, advancing the development of LLMs toward the next generation of models capable of understanding and executing complex tasks in industrial applications.
摘要：增强大型语言模型 (LLM) 遵循复杂指令的能力对于其在实际应用程序中的部署至关重要。然而，现有的评估方法往往将指令复杂性过度简化为原子约束的简单相加组合，无法充分捕获内容和格式、逻辑工作流控制和实际应用的复杂相互作用所产生的高维复杂性。这导致当前的评估实践与实际需求之间存在巨大差距。为了弥补这一差距，我们引入了 CCR-Bench，这是一种新颖的基准，旨在评估法学硕士对复杂指令的遵守情况。 CCR-Bench的特点是：（1）任务规范中的内容和格式要求深度纠缠； (2) 涉及复杂任务分解、条件推理和程序规划的指令； (3)评估样本完全来自真实的工业场景。 CCR-Bench 上的大量实验表明，即使是最先进的模型也表现出严重的性能缺陷，清楚地量化了当前 LLM 能力与现实世界指令理解需求之间的差距。我们相信，CCR-Bench 提供了更严格、更现实的评估框架，推动法学硕士向能够理解和执行工业应用中复杂任务的下一代模型发展。

Title: BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence

Authors: Biao Xiang, Soyeon Caren Han, Yihao Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.07931
Pdf URL: https://arxiv.org/pdf/2603.07931
Copy Paste: [[2603.07931]] BRIDGE: Benchmark for multi-hop Reasoning In long multimodal Documents with Grounded Evidence(https://arxiv.org/abs/2603.07931)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Multi-hop question answering (QA) is widely used to evaluate the reasoning capabilities of large language models, yet most benchmarks focus on final answer correctness and overlook intermediate reasoning, especially in long multimodal documents. We introduce BRIDGE, a benchmark for multi-hop reasoning over long scientific papers that require integrating evidence across text, tables, and figures. The dataset supports both chain-like and fan-out structures and provides explicit multi-hop reasoning annotations for step-level evaluation beyond answer accuracy. Experiments with state-of-the-art LLMs and multimodal retrieval-augmented generation (RAG) systems reveal systematic deficiencies in evidence aggregation and grounding that remain hidden under conventional answer-only evaluation. BRIDGE provides a targeted testbed for diagnosing reasoning failures in long multimodal documents.
摘要：多跳问答 (QA) 广泛用于评估大型语言模型的推理能力，但大多数基准测试侧重于最终答案的正确性而忽略中间推理，尤其是在长多模态文档中。我们引入了 BRIDGE，这是对长篇科学论文进行多跳推理的基准，需要跨文本、表格和图形整合证据。该数据集支持链式和扇出结构，并提供明确的多跳推理注释，用于超出答案准确性的步骤级评估。使用最先进的法学硕士和多模态检索增强生成（RAG）系统进行的实验揭示了证据聚合和基础方面的系统缺陷，这些缺陷仍然隐藏在传统的仅答案评估中。 BRIDGE 提供了一个有针对性的测试平台，用于诊断长多模式文档中的推理失败。

Title: SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning

Authors: Chenzhi Hu, Qinzhe Hu, Yuhang Xu, Junyi Chen, Ruijie Wang, Shengzhong Liu, Jianxin Li, Fan Wu, Guihai Chen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.08000
Pdf URL: https://arxiv.org/pdf/2603.08000
Copy Paste: [[2603.08000]] SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning(https://arxiv.org/abs/2603.08000)
Keywords: language model, chain-of-thought
Abstract: Large reasoning models (LRMs) like OpenAI o1 and DeepSeek-R1 achieve high accuracy on complex tasks by adopting long chain-of-thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot dynamically adapt according to the relative problem difficulty and response length distribution, causing over-compression and compromised accuracy. Therefore, we propose SmartThinker, a novel GRPO-based efficient reasoning method with progressive CoT length calibration. SmartThinker makes a two-fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training and guides overlong responses toward it to reduce response length while sustaining accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experiment results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25. The source code can be found at this https URL.
摘要：OpenAI o1 和 DeepSeek-R1 等大型推理模型 (LRM) 通过采用长思想链 (CoT) 推理路径，在复杂任务上实现了高精度。然而，这些过程固有的冗长常常导致冗余和过度思考。为了解决这个问题，现有的工作利用组相对策略优化（GRPO）来减少LRM输出长度，但它们的静态长度奖励设计无法根据相对问题难度和响应长度分布动态适应，导致过度压缩和精度受损。因此，我们提出了SmartThinker，一种基于GRPO的新型高效推理方法，具有渐进式CoT长度校准功能。 SmartThinker 做出了两方面的贡献：首先，它在训练过程中以最高准确度动态估计最佳长度，并引导过长的响应朝向最佳长度，以在保持准确度的同时减少响应长度。其次，它动态调节长度奖励系数，以避免对正确推理路径的无理惩罚。大量实验结果表明，SmartThinker 实现了高达 52.5% 的平均长度压缩并提高了精度，并且在 AIME25 等具有挑战性的基准测试中实现了高达 16.6% 的精度提升。源代码可以在此 https URL 中找到。

Title: ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments

Authors: Weixiang Zhao, Haozhen Li, Yanyan Zhao, Xuda Zhi, Yongbo Huang, Hao He, Bing Qin, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08024
Pdf URL: https://arxiv.org/pdf/2603.08024
Copy Paste: [[2603.08024]] ConflictBench: Evaluating Human-AI Conflict via Interactive and Visually Grounded Environments(https://arxiv.org/abs/2603.08024)
Keywords: language model, llm, prompt, agent
Abstract: As large language models (LLMs) evolve into autonomous agents capable of acting in open-ended environments, ensuring behavioral alignment with human values becomes a critical safety concern. Existing benchmarks, focused on static, single-turn prompts, fail to capture the interactive and multi-modal nature of real-world conflicts. We introduce ConflictBench, a benchmark for evaluating human-AI conflict through 150 multi-turn scenarios derived from prior alignment queries. ConflictBench integrates a text-based simulation engine with a visually grounded world model, enabling agents to perceive, plan, and act under dynamic conditions. Empirical results show that while agents often act safely when human harm is immediate, they frequently prioritize self-preservation or adopt deceptive strategies in delayed or low-risk settings. A regret test further reveals that aligned decisions are often reversed under escalating pressure, especially with visual input. These findings underscore the need for interaction-level, multi-modal evaluation to surface alignment failures that remain hidden in conventional benchmarks.
摘要：随着大型语言模型（LLM）发展成为能够在开放环境中行动的自主代理，确保行为与人类价值观一致成为一个关键的安全问题。现有的基准侧重于静态、单轮提示，无法捕捉现实世界冲突的交互性和多模式本质。我们引入了 ConflictBench，这是一个通过先前对齐查询得出的 150 个多回合场景来评估人类与人工智能冲突的基准。 ConflictBench 将基于文本的模拟引擎与基于视觉的世界模型集成在一起，使代理能够在动态条件下感知、计划和行动。实证结果表明，虽然代理人在人类受到直接伤害时通常会采取安全行动，但他们经常会优先考虑自我保护或在延迟或低风险环境中采取欺骗策略。后悔测试进一步表明，在不断升级的压力下，尤其是在视觉输入的情况下，一致的决定常常会被逆转。这些发现强调需要对传统基准中隐藏的表面对准故障进行交互级、多模式评估。

Title: DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention

Authors: Younjoo Lee, Junghoo Lee, Seungkyun Dan, Jaiyoung Park, Jung Ho Ahn
Subjects: cs.CL, cs.AI, cs.PF
Abstract URL: https://arxiv.org/abs/2603.08026
Pdf URL: https://arxiv.org/pdf/2603.08026
Copy Paste: [[2603.08026]] DyLLM: Efficient Diffusion LLM Inference via Saliency-based Token Selection and Partial Attention(https://arxiv.org/abs/2603.08026)
Keywords: language model, llm
Abstract: Masked Diffusion Language Models (MDLMs) enable parallel token decoding, providing a promising alternative to the sequential nature of autoregressive generation. However, their iterative denoising process remains computationally expensive because it repeatedly processes the entire sequence at every step. We observe that across these diffusion steps, most token representations remain stable; only a small subset, which we term salient tokens, contributes meaningfully to the next update. Leveraging this temporal sparsity, we present DyLLM, a training-free inference framework that accelerates decoding by selectively computing only these salient tokens. DyLLM identifies saliency by measuring the cosine similarity of attention contexts between adjacent denoising steps. It recomputes feed-forward and attention operations only for salient tokens while reusing cached activations for the remainder. Across diverse reasoning and code-generation benchmarks, DyLLM achieves up to 9.6x higher throughput while largely preserving the baseline accuracy of state-of-the-art models like LLaDA and Dream.
摘要：掩码扩散语言模型 (MDLM) 支持并行令牌解码，为自回归生成的顺序性质提供了一种有前途的替代方案。然而，他们的迭代去噪过程在计算上仍然昂贵，因为它在每一步都重复处理整个序列。我们观察到，在这些扩散步骤中，大多数代币表示保持稳定；只有一小部分（我们称之为显着标记）对下一次更新做出有意义的贡献。利用这种时间稀疏性，我们提出了 DyLLM，这是一种无需训练的推理框架，它通过选择性地仅计算这些显着标记来加速解码。 DyLLM 通过测量相邻去噪步骤之间注意力上下文的余弦相似度来识别显着性。它仅针对显着标记重新计算前馈和注意操作，同时对其余标记重新使用缓存的激活。在各种推理和代码生成基准测试中，DyLLM 的吞吐量提高了 9.6 倍，同时在很大程度上保留了 LLaDA 和 Dream 等最先进模型的基线准确性。
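The saliency test described in the DyLLM abstract (cosine similarity of attention contexts between adjacent denoising steps) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the list-of-vectors representation, and the `threshold` value of 0.99 are all assumptions made for illustration, and a real system would operate on tensors and reuse cached activations for the non-salient tokens.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_salient_tokens(prev_contexts, curr_contexts, threshold=0.99):
    """Return indices of tokens whose attention context changed between steps.

    prev_contexts / curr_contexts: one vector per token, taken from two
    adjacent denoising steps. The threshold is a hypothetical choice.
    """
    salient = []
    for idx, (prev, curr) in enumerate(zip(prev_contexts, curr_contexts)):
        # Low similarity means the representation moved, so recompute this token.
        if cosine_similarity(prev, curr) < threshold:
            salient.append(idx)
    return salient


# Toy example: only the second token's context vector changes direction.
prev_step = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
curr_step = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
salient = select_salient_tokens(prev_step, curr_step)
```

In this toy input, tokens whose vectors are unchanged would be served from a cache, while the token whose context rotated is flagged for recomputation.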

Title: High-Fidelity Pruning for Large Language Models

Authors: Yijun Zhu, Jianxin Wang, Chengchao Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08083
Pdf URL: https://arxiv.org/pdf/2603.08083
Copy Paste: [[2603.08083]] High-Fidelity Pruning for Large Language Models(https://arxiv.org/abs/2603.08083)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated exceptional performance across a wide range of tasks, yet their significant computational and memory requirements present major challenges for deployment. A common approach uses Taylor expansion on the loss function to estimate neuron importance. However, because this approach relies on one-hot cross-entropy loss, it narrowly assesses importance based only on the probability assigned to the single predicted next token, thereby ignoring the other potential predictions of the original model. An intuitive solution is to employ a self-distillation criterion for importance evaluation. However, this introduces significant computational overhead by requiring a separate teacher model for supervision. To this end, we propose a simple but effective criterion, the information entropy of the model's output distribution, to efficiently evaluate the importance scores of neurons with Taylor pruning without requiring an additional teacher. Compared to the plain cross-entropy criterion, it provides a more holistic criterion for Taylor pruning, removing the neurons with the least impact on the model's predictions in a global manner and thereby preserving the fidelity of the model's predictive capabilities. Experimental results on extensive zero-shot benchmarks demonstrate that our method consistently outperforms existing pruning methods across the LLaMA and Qwen series models. The source code and trained weights are available at this https URL.
摘要：大型语言模型 (LLM) 在各种任务中都表现出了卓越的性能，但其大量的计算和内存要求给部署带来了重大挑战。一种常见的方法是使用损失函数的泰勒展开来估计神经元的重要性。然而，它依赖于单热交叉熵损失，一个关键的限制是它仅根据分配给单个预测的下一个标记的概率来狭隘地评估重要性，从而忽略了原始模型的其他潜在预测。解决这个问题的一个直观的解决方案是采用自蒸馏标准进行重要性评估。然而，这种方法需要单独的教师模型进行监督，从而引入了大量的计算开销。为此，我们提出了一个简单但有效的标准，即模型输出分布的信息熵，可以通过泰勒剪枝有效地评估神经元的重要性分数，而无需额外的教师。与普通交叉熵准则相比，它为泰勒剪枝提供了更全面的准则，以全局方式剪枝对模型预测影响最小的神经元，从而保持模型预测能力的保真度。广泛的零样本基准测试的实验结果表明，我们的方法在 LLaMA 和 Qwen 系列模型中始终优于现有的剪枝方法。源代码和训练后的权重可在此 https URL 中获取。
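To make the contrast in the pruning abstract concrete, the sketch below compares the two scalar criteria it mentions: one-hot cross-entropy, which looks only at the probability of a single target token, and the entropy of the full output distribution, which depends on every predicted probability. The function names are hypothetical, and the sketch covers only the criterion itself; the Taylor importance score, which would need gradients of this quantity with respect to the weights, is not shown.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot_cross_entropy(logits, target):
    """Cross-entropy against a one-hot target: uses only probs[target]."""
    return -math.log(softmax(logits)[target])


def output_entropy(logits):
    """Shannon entropy (in nats) of the whole output distribution."""
    probs = softmax(logits)
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Two logit vectors with the same probability for token 0 but a different
# spread over the remaining tokens (values chosen by hand for illustration).
spread_out = [math.log(0.5), math.log(0.25), math.log(0.25)]
concentrated = [math.log(0.5), math.log(0.49), math.log(0.01)]
```

For these two inputs the one-hot cross-entropy with target 0 is identical, since token 0 has probability 0.5 in both, whereas the entropy differs because it reflects how the remaining mass is distributed.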

Title: Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization

Authors: Hongli Zhou, Hui Huang, Rui Zhang, Kehai Chen, Bing Xu, Conghui Zhu, Tiejun Zhao, Muyun Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08091
Pdf URL: https://arxiv.org/pdf/2603.08091
Copy Paste: [[2603.08091]] Toward Robust LLM-Based Judges: Taxonomic Bias Evaluation and Debiasing Optimization(https://arxiv.org/abs/2603.08091)
Keywords: language model, llm
Abstract: Large language model (LLM)-based judges are widely adopted for automated evaluation and reward modeling, yet their judgments are often affected by judgment biases. Accurately evaluating these biases is essential for ensuring the reliability of LLM-based judges. However, existing studies typically investigate limited biases under a single judge formulation, either generative or discriminative, lacking a comprehensive evaluation. To bridge this gap, we propose JudgeBiasBench, a benchmark for systematically quantifying biases in LLM-based judges. JudgeBiasBench defines a taxonomy of judgment biases across 4 dimensions, and constructs bias-augmented evaluation instances through a controlled bias injection pipeline, covering 12 representative bias types. We conduct extensive experiments across both generative and discriminative judges, revealing that current judges exhibit significant and diverse bias patterns that often compromise the reliability of automated evaluation. To mitigate judgment bias, we propose bias-aware training that explicitly incorporates bias-related attributes into the training process, encouraging judges to disentangle task-relevant quality from bias-correlated cues. By adopting reinforcement learning for generative judges and contrastive learning for discriminative judges, our methods effectively reduce judgment biases while largely preserving general evaluation capability.
摘要：基于大语言模型（LLM）的法官广泛应用于自动评估和奖励建模，但他们的判断往往受到判断偏差的影响。准确评估这些偏见对于确保法学硕士法官的可靠性至关重要。然而，现有的研究通常调查单一法官表述下的有限偏见，无论是生成性的还是歧视性的，缺乏全面的评估。为了弥补这一差距，我们提出了 JudgeBiasBench，这是一个系统量化法学硕士法官偏见的基准。 JudgeBiasBench 定义了 4 个维度的判断偏差分类法，并通过受控偏差注入管道构建偏差增强评估实例，涵盖 12 种代表性偏差类型。我们对生成法官和歧视法官进行了广泛的实验，揭示了当前的法官表现出显着且多样化的偏见模式，这些偏见模式往往会损害自动评估的可靠性。为了减轻判断偏差，我们提出了偏差意识训练，将与偏差相关的属性明确纳入训练过程，鼓励法官将与任务相关的质量与偏差相关的线索分开。通过对生成性法官采用强化学习，对判别性法官采用对比学习，我们的方法有效地减少了判断偏差，同时在很大程度上保留了一般评估能力。

Title: EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery

Authors: Yougang Lyu, Xi Zhang, Xinhao Yi, Yuyue Zhao, Shuyu Guo, Wenxiang Hu, Jan Piotrowski, Jakub Kaliski, Jacopo Urbani, Zaiqiao Meng, Lun Zhou, Xiaohui Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08127
Pdf URL: https://arxiv.org/pdf/2603.08127
Copy Paste: [[2603.08127]] EvoScientist: Towards Multi-Agent Evolving AI Scientists for End-to-End Scientific Discovery(https://arxiv.org/abs/2603.08127)
Keywords: language model, llm, agent
Abstract: The increasing adoption of Large Language Models (LLMs) has enabled AI scientists to perform complex end-to-end scientific discovery tasks requiring coordination of specialized roles, including idea generation and experimental execution. However, most state-of-the-art AI scientist systems rely on static, hand-designed pipelines and fail to adapt based on accumulated interaction histories. As a result, these systems overlook promising research directions, repeat failed experiments, and pursue infeasible ideas. To address this, we introduce EvoScientist, an evolving multi-agent AI scientist framework that continuously improves research strategies through persistent memory and self-evolution. EvoScientist comprises three specialized agents: a Researcher Agent (RA) for scientific idea generation, an Engineer Agent (EA) for experiment implementation and execution, and an Evolution Manager Agent (EMA) that distills insights from prior interactions into reusable knowledge. EvoScientist contains two persistent memory modules: (i) an ideation memory, which summarizes feasible research directions from top-ranked ideas while recording previously unsuccessful directions; and (ii) an experimentation memory, which captures effective data processing and model training strategies derived from code search trajectories and best-performing implementations. These modules enable the RA and EA to retrieve relevant prior strategies, improving idea quality and code execution success rates over time. Experiments show that EvoScientist outperforms 7 open-source and commercial state-of-the-art systems in scientific idea generation, achieving higher novelty, feasibility, relevance, and clarity via automatic and human evaluation. EvoScientist also substantially improves code execution success rates through multi-agent evolution, demonstrating persistent memory's effectiveness for end-to-end scientific discovery.
摘要：大型语言模型 (LLM) 的日益普及使人工智能科学家能够执行复杂的端到端科学发现任务，这些任务需要协调专业角色，包括想法生成和实验执行。然而，大多数最先进的人工智能科学家系统都依赖于静态的、手工设计的管道，并且无法根据累积的交互历史进行调整。结果，这些系统忽视了有前途的研究方向，重复失败的实验，并追求不可行的想法。为了解决这个问题，我们引入了 EvoScientist，这是一个不断发展的多智能体人工智能科学家框架，它通过持久记忆和自我进化不断改进研究策略。 EvoScientist 由三个专门代理组成：用于生成科学想法的研究员代理 (RA)、用于实施和执行实验的工程师代理 (EA) 以及将先前交互中的见解提炼为可重用知识的进化管理代理 (EMA)。 EvoScientist 包含两个持久记忆模块：（i）构想记忆，它从顶级想法中总结出可行的研究方向，同时记录以前不成功的方向； (ii) 实验存储器，捕获来自代码搜索轨迹和最佳性能实现的有效数据处理和模型训练策略。这些模块使 RA 和 EA 能够检索相关的先前策略，从而随着时间的推移提高创意质量和代码执行成功率。实验表明，EvoScientist 在科学想法生成方面优于 7 个开源和商业最先进的系统，通过自动和人工评估实现了更高的新颖性、可行性、相关性和清晰度。 EvoScientist 还通过多代理进化大幅提高了代码执行成功率，展示了持久内存对于端到端科学发现的有效性。

Title: Gradually Excavating External Knowledge for Implicit Complex Question Answering

Authors: Chang Liu, Xiaoguang Li, Lifeng Shang, Xin Jiang, Qun Liu, Edmund Y. Lam, Ngai Wong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08148
Pdf URL: https://arxiv.org/pdf/2603.08148
Copy Paste: [[2603.08148]] Gradually Excavating External Knowledge for Implicit Complex Question Answering(https://arxiv.org/abs/2603.08148)
Keywords: language model, llm
Abstract: Recently, large language models (LLMs) have gained much attention for the emergence of human-comparable capabilities and huge potential. However, for open-domain implicit question-answering problems, LLMs may not be the ultimate solution for two reasons: 1) uncovered or out-of-date domain knowledge, and 2) one-shot generation and hence restricted comprehensiveness. To this end, this work proposes a gradual knowledge excavation framework for open-domain complex question answering, where LLMs iteratively and actively acquire external information, and then reason based on acquired historical knowledge. Specifically, during each step of the solving process, the model selects an action to execute, such as querying external knowledge or performing a single logical reasoning step, to gradually progress toward a final answer. Our method can effectively leverage plug-and-play external knowledge and dynamically adjust the strategy for solving complex questions. Evaluated on the StrategyQA dataset, our method achieves 78.17% accuracy with less than 6% of the parameters of its competitors, setting a new SOTA for ~10B-scale LLMs.
摘要：最近，大型语言模型（LLM）因其与人类相当的能力和巨大的潜力而受到广泛关注。然而，对于开放领域的隐式问答问题，LLM 可能不是最终的解决方案，原因如下：1）未覆盖或过时的领域知识，2）一次性生成，因此限制了全面性。为此，这项工作提出了一种用于开放领域复杂问答的渐进式知识挖掘框架，其中法学硕士迭代地主动获取外部信息，然后根据所获取的历史知识进行推理。具体来说，在求解过程的每个步骤中，模型都会选择要执行的动作，例如查询外部知识或执行单个逻辑推理步骤，以逐渐获得最终答案。我们的方法可以有效地利用即插即用的外部知识并动态调整解决复杂问题的策略。在 StrategyQA 数据集上进行评估，我们的方法以低于竞争对手 6% 的参数实现了 78.17% 的准确率，为约 10B 规模的 LLM 设定了新的 SOTA。

Title: Gender Bias in MT for a Genderless Language: New Benchmarks for Basque

Authors: Amaia Murillo, Olatz Perez-de-Viñaspre, Naiara Perez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08153
Pdf URL: https://arxiv.org/pdf/2603.08153
Copy Paste: [[2603.08153]] Gender Bias in MT for a Genderless Language: New Benchmarks for Basque(https://arxiv.org/abs/2603.08153)
Keywords: language model, llm
Abstract: Large language models (LLMs) and machine translation (MT) systems are increasingly used in our daily lives, but their outputs can reproduce gender bias present in the training data. Most resources for evaluating such biases are designed for English and reflect its sociocultural context, which limits their applicability to other languages. This work addresses this gap by introducing two new datasets to evaluate gender bias in translations involving Basque, a low-resource and genderless language. WinoMTeus adapts the WinoMT benchmark to examine how gender-neutral Basque occupations are translated into gendered languages such as Spanish and French. FLORES+Gender, in turn, extends the FLORES+ benchmark to assess whether translation quality varies when translating from gendered languages (Spanish and English) into Basque depending on the gender of the referent. We evaluate several general-purpose LLMs and open and proprietary MT systems. The results reveal a systematic preference for masculine forms and, in some models, a slightly higher quality for masculine referents. Overall, these findings show that gender bias is still deeply rooted in these models, and highlight the need to develop evaluation methods that consider both linguistic features and cultural context.
摘要：大型语言模型 (LLM) 和机器翻译 (MT) 系统在我们的日常生活中越来越多地使用，但它们的输出可能会重现训练数据中存在的性别偏见。大多数评估此类偏见的资源都是为英语设计的，反映了其社会文化背景，这限制了它们对其他语言的适用性。这项工作通过引入两个新的数据集来评估涉及巴斯克语（一种资源匮乏且无性别的语言）翻译中的性别偏见，从而解决了这一差距。 WinoMTeus 采用 WinoMT 基准来研究如何将性别中立的巴斯克职业翻译成西班牙语和法语等性别语言。反过来，FLORES+Gender 扩展了 FLORES+ 基准，以评估从性别语言（西班牙语和英语）翻译成巴斯克语时翻译质量是否会根据所指对象的性别而变化。我们评估了几个通用的法学硕士以及开放和专有的机器翻译系统。结果揭示了对男性形式的系统偏好，并且在某些模型中，男性所指对象的质量稍高。总的来说，这些发现表明性别偏见仍然深深植根于这些模型中，并强调需要开发既考虑语言特征又考虑文化背景的评估方法。

Title: RexDrug: Reliable Multi-Drug Combination Extraction through Reasoning-Enhanced LLMs

Authors: Zhijun Wang, Ling Luo, Dinghao Pan, Huan Zhuang, Lejing Yu, Yuanyuan Sun, Hongfei Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08166
Pdf URL: https://arxiv.org/pdf/2603.08166
Copy Paste: [[2603.08166]] RexDrug: Reliable Multi-Drug Combination Extraction through Reasoning-Enhanced LLMs(https://arxiv.org/abs/2603.08166)
Keywords: language model, llm, agent
Abstract: Automated Drug Combination Extraction (DCE) from large-scale biomedical literature is crucial for advancing precision medicine and pharmacological research. However, existing relation extraction methods primarily focus on binary interactions and struggle to model variable-length n-ary drug combinations, where complex compatibility logic and distributed evidence need to be considered. To address these limitations, we propose RexDrug, an end-to-end reasoning-enhanced relation extraction framework for n-ary drug combination extraction based on large language models. RexDrug adopts a two-stage training strategy. First, a multi-agent collaborative mechanism is utilized to automatically generate high-quality expert-like reasoning traces for supervised fine-tuning. Second, reinforcement learning with a multi-dimensional reward function specifically tailored for DCE is applied to further refine reasoning quality and extraction accuracy. Extensive experiments on the DrugComb dataset show that RexDrug consistently outperforms state-of-the-art baselines for n-ary extraction. Additional evaluation on the DDI13 corpus confirms its generalizability to binary drug-drug interaction tasks. Human expert assessment and automatic reasoning metrics further indicate that RexDrug produces coherent medical reasoning while accurately identifying complex therapeutic regimens. These results establish RexDrug as a scalable and reliable solution for complex biomedical relation extraction from unstructured text. The source code and data are available at this https URL.
摘要：从大规模生物医学文献中自动提取药物组合 (DCE) 对于推进精准医学和药理学研究至关重要。然而，现有的关系提取方法主要关注二元相互作用，并且难以对可变长度的 n 元药物组合进行建模，其中需要考虑复杂的相容性逻辑和分布式证据。为了解决这些限制，我们提出了 RexDrug，一种基于大型语言模型的端到端推理增强关系提取框架，用于 n 元药物组合提取。 RexDrug 采用两阶段训练策略。首先，利用多智能体协作机制自动生成高质量的专家式推理轨迹，以进行监督微调。其次，采用专为 DCE 定制的具有多维奖励函数的强化学习，进一步提高推理质量和提取准确性。 DrugComb 数据集上的大量实验表明，RexDrug 在 n 元提取方面始终优于最先进的基线。对 DDI13 语料库的额外评估证实了其对二元药物相互作用任务的通用性。人类专家评估和自动推理指标进一步表明，RexDrug 可以产生连贯的医学推理，同时准确识别复杂的治疗方案。这些结果使 RexDrug 成为从非结构化文本中提取复杂生物医学关系的可扩展且可靠的解决方案。源代码和数据可在此 https URL 获取

Title: Is continuous CoT better suited for multi-lingual reasoning?

Authors: Ali Hamza Bashir, Behzad Shomali, Markus Frey, Mehdi Ali, Rafet Sifa, David Berghaus
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.08177
Pdf URL: https://arxiv.org/pdf/2603.08177
Copy Paste: [[2603.08177]] Is continuous CoT better suited for multi-lingual reasoning?(https://arxiv.org/abs/2603.08177)
Keywords: chain-of-thought
Abstract: We investigate whether performing reasoning in a continuous latent space leads to more robust multilingual capabilities. We compare Continuous Chain-of-Thought (using the CODI framework) against standard supervised fine-tuning across five typologically diverse languages: English, Chinese, German, French, and Urdu. Our experiments on GSM8k and CommonsenseQA demonstrate that continuous reasoning significantly outperforms explicit reasoning on low-resource languages, particularly in zero-shot settings where the target language was not seen during training. Additionally, this approach achieves extreme efficiency, compressing reasoning traces by approximately $29\times$ to $50\times$. These findings indicate that continuous latent representations naturally exhibit greater language invariance, offering a scalable solution for cross-lingual reasoning.
摘要：我们研究在连续潜在空间中进行推理是否会带来更强大的多语言能力。我们将连续思想链（使用 CODI 框架）与五种类型不同的语言（英语、中文、德语、法语和乌尔都语）的标准监督微调进行比较。我们在 GSM8k 和 CommonsenseQA 上的实验表明，连续推理在低资源语言上的表现明显优于显式推理，特别是在训练期间看不到目标语言的零样本设置中。此外，这种方法实现了极高的效率，将推理轨迹压缩了大约 $29\times$ 到 $50\times$。这些发现表明，连续潜在表示自然地表现出更大的语言不变性，为跨语言推理提供了可扩展的解决方案。

Title: TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation

Authors: Toms Bergmanis, Martins Kronis, Ingus Jānis Pretkalniņš, Dāvis Nicmanis, Jeļizaveta Jeļinska, Roberts Rozis, Rinalds Vīksna, Mārcis Pinnis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08182
Pdf URL: https://arxiv.org/pdf/2603.08182
Copy Paste: [[2603.08182]] TildeOpen LLM: Leveraging Curriculum Learning to Achieve Equitable Language Representation(https://arxiv.org/abs/2603.08182)
Keywords: language model, llm
Abstract: Large language models often underperform in many European languages due to the dominance of English and a few high-resource languages in training data. This paper presents TildeOpen LLM, a 30-billion-parameter open-weight foundational model trained for 34 European languages to promote linguistic equity and improve performance for low-resource languages. To address the data imbalance, we combine dataset upsampling with a curriculum-based training schedule that alternates between uniform and natural language distributions. The resulting model performs favorably compared to other multilingual LLMs despite being trained with significantly fewer computing resources. Evaluation across multiple multilingual benchmarks shows that TildeOpen surpasses existing open-weight models in text generation and comprehension, particularly for Baltic, Finno-Ugric, and Slavic languages. Human evaluations confirm an up to tenfold reduction in linguistic errors relative to leading baselines. The model and associated resources are fully open-weight and publicly available at this http URL. These outcomes demonstrate that careful data curation and balanced training strategies can substantially enhance multilingual model quality without increasing model size or training volume.
摘要：由于英语和少数高资源语言在训练数据中占主导地位，大型语言模型在许多欧洲语言中通常表现不佳。本文介绍了 TildeOpen LLM，这是一个针对 34 种欧洲语言进行训练的 300 亿参数开放权重基础模型，旨在促进语言公平并提高低资源语言的性能。为了解决数据不平衡的问题，我们将数据集上采样与基于课程的训练计划相结合，该训练计划在统一语言分布和自然语言分布之间交替。尽管训练时使用的计算资源少得多，但与其他多语言法学硕士相比，所得模型的表现仍然良好。对多个多语言基准的评估表明，TildeOpen 在文本生成和理解方面超越了现有的开放权重模型，特别是对于波罗的海语言、芬兰-乌戈尔语和斯拉夫语言。人类评估证实，相对于领先基线，语言错误减少了多达十倍。该模型和相关资源是完全开放的，可通过此 http URL 公开获取。这些结果表明，仔细的数据管理和平衡的训练策略可以在不增加模型大小或训练量的情况下显着提高多语言模型质量。

Title: Sensivity of LLMs' Explanations to the Training Randomness:Context, Class & Task Dependencies

Authors: Romain Loncour, Jérémie Bogaert, François-Xavier Standaert
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08241
Pdf URL: https://arxiv.org/pdf/2603.08241
Copy Paste: [[2603.08241]] Sensivity of LLMs' Explanations to the Training Randomness:Context, Class & Task Dependencies(https://arxiv.org/abs/2603.08241)
Keywords: llm
Abstract: Transformer models are now a cornerstone in natural language processing. Yet, explaining their decisions remains a challenge. It was shown recently that the same model trained on the same data with different randomness can lead to very different explanations. In this paper, we investigate how the (syntactic) context, the classes to be learned and the tasks influence the sensitivity of these explanations to randomness. We show that they all have a statistically significant impact: smallest for the (syntactic) context, medium for the classes and largest for the tasks.
摘要：Transformer 模型现在是自然语言处理的基石。然而，解释他们的决定仍然是一个挑战。最近的研究表明，在具有不同随机性的相同数据上训练的相同模型可能会导致截然不同的解释。在本文中，我们研究了（句法）上下文、要学习的类别和任务如何影响这种解释对随机性的敏感性。我们证明它们都具有统计上的显着影响：对（句法）上下文影响最小，对类影响中等，对任务影响最大。

Title: Not All Queries Need Deep Thought: CoFiCot for Adaptive Coarse-to-fine Stateful Refinement

Authors: Dongxu Zhang, Hongqiang Lin, Yiding Sun, Pengyu Wang, Qirui Wang, Ning Yang, Jihua Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08251
Pdf URL: https://arxiv.org/pdf/2603.08251
Copy Paste: [[2603.08251]] Not All Queries Need Deep Thought: CoFiCot for Adaptive Coarse-to-fine Stateful Refinement(https://arxiv.org/abs/2603.08251)
Keywords: llm
Abstract: Scaling test-time computation enhances LLM reasoning ability but faces a uniform computation paradox. Allocating identical resources leads to over-correction on simple tasks and insufficient refinement on complex ones. To address this, we propose CoFiCot, a coarse-to-fine adaptive framework that dynamically tailors inference strategies to problem difficulty. Specifically, we implement a multi-metric classifier that triages queries by synthesizing semantic entropy, consensus reliability, and predicted reasoning depth. This enables a differentiated refinement stage that applies efficient aggregation for simple queries while routing complex ones to a context-aware correction loop. We formalize correction as a stateful sequential propagation process, where each repair is strictly conditioned on the verified history of prior rectifications. By integrating Process Reward Models (PRMs) within this state-dependent trajectory, CoFiCot effectively bridges the gap between granular error localization and global logical coherence, preventing the context fragmentation typical of stateless refinement methods.
摘要：扩展测试时间计算增强了 LLM 推理能力，但面临统一计算悖论。分配相同的资源会导致对简单任务的过度纠正和对复杂任务的细化不足。为了解决这个问题，我们提出了 CoFiCot，这是一种从粗到细的自适应框架，可以根据问题难度动态调整推理策略。具体来说，我们实现了一个多度量分类器，通过综合语义熵、共识可靠性和预测推理深度来对查询进行分类。这实现了差异化的细化阶段，该阶段对简单查询应用有效的聚合，同时将复杂的查询路由到上下文感知的校正循环。我们将纠正形式化为一个有状态的顺序传播过程，其中每次修复都严格以先前纠正的已验证历史记录为条件。通过将过程奖励模型 (PRM) 集成到这个状态相关的轨迹中，CoFiCot 有效地弥合了粒度错误定位和全局逻辑一致性之间的差距，防止了无状态细化方法中典型的上下文碎片。

Title: NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating

Authors: Tong Wu, Thanet Markchom, Huizhi Liang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08256
Pdf URL: https://arxiv.org/pdf/2603.08256
Copy Paste: [[2603.08256]] NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating(https://arxiv.org/abs/2603.08256)
Keywords: language model, llm, prompt
Abstract: Word sense plausibility rating requires predicting the human-perceived plausibility of a given word sense on a 1-5 scale in the context of short narrative stories containing ambiguous homonyms. This paper systematically compares three approaches: (1) embedding-based methods pairing sentence embeddings with standard regressors, (2) transformer fine-tuning with parameter-efficient adaptation, and (3) large language model (LLM) prompting with structured reasoning and explicit decision rules. The best-performing system employs a structured prompting strategy that decomposes evaluation into narrative components (precontext, target sentence, ending) and applies explicit decision rules for rating calibration. The analysis reveals that structured prompting with decision rules substantially outperforms both fine-tuned models and embedding-based approaches, and that prompt design matters more than model scale for this task. The code is publicly available at this https URL.
摘要：词义合理性评级需要在包含不明确同音异义词的短篇叙事故事的背景下，以 1--5 的等级来预测人类感知的给定词义的合理性。本文系统地比较了三种方法：（1）基于嵌入的方法，将句子嵌入与标准回归器配对，（2）具有参数高效适应的变压器微调，以及（3）具有结构化推理和显式决策规则的大型语言模型（LLM）提示。表现最好的系统采用结构化提示策略，将评估分解为叙述成分（上下文、目标句子、结尾），并应用明确的决策规则进行评级校准。分析表明，带有决策规则的结构化提示大大优于微调模型和基于嵌入的方法，并且对于此任务来说，提示设计比模型规模更重要。该代码可通过此 https URL 公开获取。

Title: How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms

Authors: JV Roig
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08274
Pdf URL: https://arxiv.org/pdf/2603.08274
Copy Paste: [[2603.08274]] How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms(https://arxiv.org/abs/2603.08274)
Keywords: language model, llm
Abstract: How much do large language models actually hallucinate when answering questions grounded in provided documents? Despite the critical importance of this question for enterprise AI deployments, reliable measurement has been hampered by benchmarks that rely on static datasets vulnerable to contamination, LLM-based judges with documented biases, or evaluation scales too small for statistical confidence. We address this gap using RIKER, a ground-truth-first evaluation methodology that enables deterministic scoring without human annotation. Across 35 open-weight models, three context lengths (32K, 128K, and 200K tokens), four temperature settings, and three hardware platforms (NVIDIA H200, AMD MI300X, and Intel Gaudi 3), we conducted over 172 billion tokens of evaluation - an order of magnitude beyond prior work. Our findings reveal that: (1) even the best-performing models fabricate answers at a non-trivial rate - 1.19% at best at 32K, with top-tier models at 5 - 7% - and fabrication rises steeply with context length, nearly tripling at 128K and exceeding 10% for all models at 200K; (2) model selection dominates all other factors, with overall accuracy spanning a 72-percentage-point range and model family predicting fabrication resistance better than model size; (3) temperature effects are nuanced - T=0.0 yields the best overall accuracy in roughly 60% of cases, but higher temperatures reduce fabrication for the majority of models and dramatically reduce coherence loss (infinite generation loops), which can reach 48x higher rates at T=0.0 versus T=1.0; (4) grounding ability and fabrication resistance are distinct capabilities - models that excel at finding facts may still fabricate facts that do not exist; and (5) results are consistent across hardware platforms, confirming that deployment decisions need not be hardware-dependent.
摘要：在回答基于所提供文档的问题时，大型语言模型实际上会产生多少幻觉？尽管这个问题对于企业人工智能部署至关重要，但可靠的测量受到了依赖于易受污染的静态数据集的基准、具有记录偏见的法学硕士法官或评估规模太小而无法获得统计信心的阻碍。我们使用 RIKER 来解决这一差距，这是一种以事实为先的评估方法，无需人工注释即可实现确定性评分。在 35 个开放权重模型、三种上下文长度（32K、128K 和 200K 令牌）、四种温度设置和三个硬件平台（NVIDIA H200、AMD MI300X 和 Intel Gaudi 3）中，我们进行了超过 1720 亿个令牌的评估，这个数量级超出了之前的工作。我们的研究结果表明：（1）即使是性能最好的模型也会以不平凡的速度制造答案 - 32K 时最多 1.19%，顶级模型为 5 - 7% - 并且制造量随着上下文长度的增加而急剧上升，在 128K 时几乎增加了两倍，在 200K 时所有模型的制造率都超过 10%； (2) 模型选择主导所有其他因素，总体精度跨越 72 个百分点，模型族预测制造阻力的效果优于模型尺寸； (3) 温度影响是微妙的 - T=0.0 在大约 60% 的情况下会产生最佳的总体精度，但较高的温度会减少大多数模型的制造，并显着减少相干性损失（无限生成循环），与 T=1.0 相比，T=0.0 时的速率可以提高 48 倍；（4）接地能力和抗捏造能力是不同的能力——擅长发现事实的模型仍然可能捏造不存在的事实； (5) 跨硬件平台的结果是一致的，证实部署决策不需要依赖于硬件。

Title: AdaCultureSafe: Adaptive Cultural Safety Grounded by Cultural Knowledge in Large Language Models

Authors: Hankun Kang, Di Lin, Zhirong Liao, Pengfei Bai, Xinyi Zeng, Jiawei Jiang, Yuanyuan Zhu, Tieyun Qian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08275
Pdf URL: https://arxiv.org/pdf/2603.08275
Copy Paste: [[2603.08275]] AdaCultureSafe: Adaptive Cultural Safety Grounded by Cultural Knowledge in Large Language Models(https://arxiv.org/abs/2603.08275)
Keywords: language model, llm
Abstract: With the widespread adoption of Large Language Models (LLMs), respecting indigenous cultures becomes essential for models' cultural safety and responsible global applications. Existing studies consider cultural safety and cultural knowledge separately and neglect that the former should be grounded in the latter. This severely prevents LLMs from yielding culture-specific respectful responses. Consequently, adaptive cultural safety remains a formidable task. In this work, we propose to jointly model cultural safety and knowledge. First and foremost, cultural-safety and knowledge-paired data serve as the key prerequisite to conduct this research. However, the cultural diversity across regions and the subtlety of cultural differences pose significant challenges to the creation of such paired evaluation data. To address this issue, we propose a novel framework that integrates curation of authoritative cultural knowledge descriptions, LLM-automated query generation, and heavy manual verification. Accordingly, we obtain a dataset named AdaCultureSafe containing 4.8K manually decomposed fine-grained cultural descriptions and the corresponding 48K manually verified safety- and knowledge-oriented queries. On the constructed dataset, we evaluate three families of popular LLMs on their cultural safety and knowledge proficiency, via which we make a critical discovery: no significant correlation exists between their cultural safety and knowledge proficiency. We then delve into the utility-related neuron activations within LLMs to investigate the potential cause of the absence of correlation, which can be attributed to the difference between the objectives of pre-training and post-alignment. We finally present a knowledge-grounded method, which significantly enhances cultural safety by enforcing the integration of knowledge into the LLM response generation process.
摘要：随着大型语言模型 (LLM) 的广泛采用，尊重本土文化对于模型的文化安全和负责任的全球应用至关重要。现有研究分别考虑文化安全和文化知识，忽视了前者应以后者为基础。这严重阻碍了法学硕士产生针对特定文化的尊重反应。因此，适应性文化安全仍然是一项艰巨的任务。在这项工作中，我们建议共同建立文化安全和知识模型。首先，文化安全和知识配对数据是开展这项研究的关键先决条件。然而，不同地区的文化多样性以及文化差异的微妙性给这种配对评价数据的创建带来了巨大的挑战。为了解决这个问题，我们提出了一个新颖的框架，集成了权威的文化知识描述管理、LLM 自动查询生成和大量的手动验证。因此，我们获得了一个名为 AdaCultureSafe 的数据集，其中包含 4.8K 手动分解的细粒度文化描述以及相应的 48K 手动验证的面向安全和知识的查询。根据构建的数据集，我们评估了三个受欢迎的法学硕士家庭的文化安全性和知识熟练程度，通过这些我们得出了一个重要的发现：他们的文化安全性和知识熟练程度之间不存在显着的相关性。然后，我们深入研究法学硕士内与效用相关的神经元激活，以调查缺乏相关性的潜在原因，这可以归因于训练前和对齐后目标的差异。最后，我们提出了一种基于知识的方法，通过强制将知识整合到法学硕士回复生成过程中，显着增强了文化安全。

Title: Evaluating LLM-Based Grant Proposal Review via Structured Perturbations

Authors: William Thorne, Joseph James, Yang Wang, Chenghua Lin, Diana Maynard
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.08281
Pdf URL: https://arxiv.org/pdf/2603.08281
Copy Paste: [[2603.08281]] Evaluating LLM-Based Grant Proposal Review via Structured Perturbations(https://arxiv.org/abs/2603.08281)
Keywords: llm
Abstract: As AI-assisted grant proposals outpace manual review capacity in a kind of "Malthusian trap" for the research ecosystem, this paper investigates the capabilities and limitations of LLM-based grant reviewing for high-stakes evaluation. Using six EPSRC proposals, we develop a perturbation-based framework probing LLM sensitivity across six quality axes: funding, timeline, competency, alignment, clarity, and impact. We compare three review architectures: single-pass review, section-by-section analysis, and a 'Council of Personas' ensemble emulating expert panels. The section-level approach significantly outperforms alternatives in both detection rate and scoring reliability, while the computationally expensive council method performs no better than baseline. Detection varies substantially by perturbation type, with alignment issues readily identified but clarity flaws largely missed by all systems. Human evaluation shows LLM feedback is largely valid but skewed toward compliance checking over holistic assessment. We conclude that current LLMs may provide supplementary value within EPSRC review but exhibit high variability and misaligned review priorities. We release our code and any non-protected data.
摘要：由于人工智能辅助的拨款申请超出了研究生态系统的“马尔萨斯陷阱”中的手动审查能力，本文研究了基于法学硕士的拨款审查在高风险评估中的能力和局限性。利用六项 EPSRC 提案，我们开发了一个基于扰动的框架，探讨法学硕士在六个质量轴上的敏感性：资金、时间表、能力、一致性、清晰度和影响力。我们比较了三种审查架构：单次审查、逐节分析和模拟专家小组的“角色委员会”整体。部分级方法在检测率和评分可靠性方面显着优于替代方法，而计算成本昂贵的理事会方法的性能并不比基线更好。检测因扰动类型的不同而有很大差异，对准问题很容易识别，但所有系统都基本上错过了清晰度缺陷。人工评估显示法学硕士的反馈在很大程度上是有效的，但偏向于合规性检查而不是整体评估。我们的结论是，当前的法学硕士可能在 EPSRC 审查中提供补充价值，但表现出高度的可变性和不一致的审查优先级。我们发布我们的代码和任何不受保护的数据。

Title: Using Multimodal and Language-Agnostic Sentence Embeddings for Abstractive Summarization

Authors: Chaimae Chellaf, Salima Mdhaffar, Yannick Estève, Stéphane Huet
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08282
Pdf URL: https://arxiv.org/pdf/2603.08282
Copy Paste: [[2603.08282]] Using Multimodal and Language-Agnostic Sentence Embeddings for Abstractive Summarization(https://arxiv.org/abs/2603.08282)
Keywords: hallucination
Abstract: Abstractive summarization aims to generate concise summaries by creating new sentences, allowing for flexible rephrasing. However, this approach can be vulnerable to inaccuracies, particularly 'hallucinations' where the model introduces non-existent information. In this paper, we leverage multimodal and multilingual sentence embeddings derived from pretrained models such as LaBSE, SONAR, and BGE-M3, and feed them into a modified BART-based French model. A Named Entity Injection mechanism that appends tokenized named entities to the decoder input is introduced in order to improve the factual consistency of the generated summary. Our novel framework, SBARThez, is applicable to both text and speech inputs and supports cross-lingual summarization; it shows competitive performance relative to token-level baselines, especially for low-resource languages, while generating more concise and abstract summaries.
摘要：抽象摘要旨在通过创建新句子来生成简洁的摘要，并允许灵活的改写。然而，这种方法可能容易出现错误，特别是模型引入不存在的信息的“幻觉”。在本文中，我们利用从 LaBSE、SONAR 和 BGE-M3 等预训练模型派生的多模态和多语言句子嵌入，并将它们输入到修改后的基于 BART 的法语模型中。引入了命名实体注入机制，将标记化的命名实体附加到解码器输入，以提高生成的摘要的事实一致性。我们的新颖框架SBARThez适用于文本和语音输入，并支持跨语言摘要；它显示了相对于令牌级基线的竞争性能，特别是对于低资源语言，同时生成更简洁和抽象的摘要。

Title: LAMUS: A Large-Scale Corpus for Legal Argument Mining from U.S. Caselaw using LLMs

Authors: Serene Wang, Lavanya Pobbathi, Haihua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08286
Pdf URL: https://arxiv.org/pdf/2603.08286
Copy Paste: [[2603.08286]] LAMUS: A Large-Scale Corpus for Legal Argument Mining from U.S. Caselaw using LLMs(https://arxiv.org/abs/2603.08286)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Legal argument mining aims to identify and classify the functional components of judicial reasoning, such as facts, issues, rules, analysis, and conclusions. Progress in this area is limited by the lack of large-scale, high-quality annotated datasets for U.S. caselaw, particularly at the state level. This paper introduces LAMUS, a sentence-level legal argument mining corpus constructed from U.S. Supreme Court decisions and Texas criminal appellate opinions. The dataset is created using a data-centric pipeline that combines large-scale case collection, LLM-based automatic annotation, and targeted human-in-the-loop quality refinement. We formulate legal argument mining as a six-class sentence classification task and evaluate multiple general-purpose and legal-domain language models under zero-shot, few-shot, and chain-of-thought prompting strategies, with LegalBERT as a supervised baseline. Results show that chain-of-thought prompting substantially improves LLM performance, while domain-specific models exhibit more stable zero-shot behavior. LLM-assisted verification corrects nearly 20% of annotation errors, improving label consistency. Human verification achieves Cohen's Kappa of 0.85, confirming annotation quality. LAMUS provides a scalable resource and empirical insights for future legal NLP research. All code and datasets can be accessed for reproducibility on GitHub at: this https URL
摘要：法律论证挖掘旨在识别和分类司法推理的功能组成部分，例如事实、问题、规则、分析和结论。由于美国判例法缺乏大规模、高质量的注释数据集，特别是在州一级，该领域的进展受到限制。本文介绍了 LAMUS，这是一个根据美国最高法院判决和德克萨斯州刑事上诉意见构建的句子级法律论证挖掘语料库。该数据集是使用以数据为中心的管道创建的，该管道结合了大规模案例收集、基于 LLM 的自动注释和有针对性的人机交互质量改进。我们将法律论据挖掘制定为六类句子分类任务，并以 LegalBERT 作为监督基线，在零样本、少样本和思维链提示策略下评估多个通用和法律领域语言模型。结果表明，思想链提示大大提高了法学硕士的表现，而特定领域的模型表现出更稳定的零样本行为。 LLM 辅助验证纠正了近 20% 的注释错误，提高了标签一致性。人工验证的 Cohen Kappa 达到 0.85，证实了注释质量。 LAMUS 为未来的法律 NLP 研究提供了可扩展的资源和实证见解。所有代码和数据集都可以在 GitHub 上访问以实现重现性：此 https URL
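
The agreement figure quoted in the LAMUS abstract (Cohen's Kappa of 0.85) is a chance-corrected agreement score. As a minimal sketch of how such a score is computed, the snippet below implements the standard two-annotator formula, kappa = (p_o - p_e) / (1 - p_e). The function name `cohens_kappa` and the toy labels are assumptions made for illustration; they are not taken from the paper or its data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Toy labels, invented for illustration (not LAMUS data).
annotator_1 = ["fact", "rule", "fact", "analysis", "rule", "fact"]
annotator_2 = ["fact", "rule", "issue", "analysis", "rule", "fact"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

On the toy labels the annotators agree on 5 of 6 items (p_o = 5/6) and chance agreement is p_e = 11/36, which gives kappa = 19/25 = 0.76.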

Title: SPD-RAG: Sub-Agent Per Document Retrieval-Augmented Generation

Authors: Yagiz Can Akay, Muhammed Yusuf Kartal, Esra Alparslan, Faruk Ortakoyluoglu, Arda Akpinar
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.08329
Pdf URL: https://arxiv.org/pdf/2603.08329
Copy Paste: [[2603.08329]] SPD-RAG: Sub-Agent Per Document Retrieval-Augmented Generation(https://arxiv.org/abs/2603.08329)
Keywords: language model, gpt, llm, retrieval-augmented generation, agent
Abstract: Answering complex, real-world queries often requires synthesizing facts scattered across vast document corpora. In these settings, standard retrieval-augmented generation (RAG) pipelines suffer from incomplete evidence coverage, while long-context large language models (LLMs) struggle to reason reliably over massive inputs. We introduce SPD-RAG, a hierarchical multi-agent framework for exhaustive cross-document question answering that decomposes the problem along the document axis. Each document is processed by a dedicated document-level agent operating only on its own content, enabling focused retrieval, while a coordinator dispatches tasks to relevant agents and aggregates their partial answers. Agent outputs are synthesized by merging partial answers through a token-bounded synthesis layer (which supports recursive map-reduce for massive corpora). This document-level specialization with centralized fusion improves scalability and answer quality in heterogeneous multidocument settings while yielding a modular, extensible retrieval pipeline. On the LOONG benchmark (EMNLP 2024) for long-context multi-document QA, SPD-RAG achieves an Avg Score of 58.1 (GPT-5 evaluation), outperforming Normal RAG (33.0) and Agentic RAG (32.8) while using only 38% of the API cost of a full-context baseline (68.0).
摘要：回答复杂的现实世界查询通常需要综合分散在大量文档语料库中的事实。在这些设置中，标准检索增强生成（RAG）管道存在证据覆盖不完整的问题，而长上下文大语言模型（LLM）则难以对大量输入进行可靠推理。我们引入了 SPD-RAG，这是一种分层多代理框架，用于详尽的跨文档问答，沿着文档轴分解问题。每个文档均由专门的文档级代理处理，该代理仅对其自己的内容进行操作，从而实现集中检索，而协调器则将任务分派给相关代理并汇总其部分答案。代理输出是通过通过令牌限制合成层（支持大规模语料库的递归映射缩减）合并部分答案来合成的。这种具有集中融合的文档级专业化提高了异构多文档设置中的可扩展性和答案质量，同时产生模块化、可扩展的检索管道。在长上下文多文档 QA 的 LOONG 基准 (EMNLP 2024) 上，SPD-RAG 的平均得分为 58.1（GPT-5 评估），优于 Normal RAG (33.0) 和 Agentic RAG (32.8)，同时仅使用全上下文基线 (68.0) 的 API 成本的 38%。
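
The SPD-RAG abstract describes merging per-document partial answers through a token-bounded synthesis layer with recursive map-reduce. The sketch below illustrates that general idea only; it is not the authors' implementation. The names `count_tokens`, `batch_by_budget`, `synthesize`, and `toy_merge` are hypothetical, whitespace splitting stands in for a real tokenizer, and `toy_merge` stands in for an LLM call that would fuse partial answers.

```python
def count_tokens(text):
    """Crude token count; a real system would use the model's tokenizer."""
    return len(text.split())


def batch_by_budget(answers, budget):
    """Greedily pack answers into batches whose token total fits the budget."""
    batches, current, used = [], [], 0
    for ans in answers:
        cost = count_tokens(ans)
        if current and used + cost > budget:
            batches.append(current)
            current, used = [], 0
        current.append(ans)
        used += cost
    if current:
        batches.append(current)
    return batches


def synthesize(answers, merge, budget):
    """Recursively merge partial answers until a single answer remains."""
    if not answers:
        raise ValueError("no partial answers to synthesize")
    while len(answers) > 1:
        batches = batch_by_budget(answers, budget)
        if len(batches) == len(answers):
            # No batch holds two answers, so further rounds cannot make
            # progress within the budget; merge everything once and stop.
            return merge(answers)
        # Map step: merge each batch; singletons pass through unchanged.
        answers = [merge(b) if len(b) > 1 else b[0] for b in batches]
    return answers[0]


def toy_merge(parts):
    """Hypothetical stand-in for an LLM synthesis call."""
    return " | ".join(parts)


partials = [
    "doc1: revenue rose",
    "doc2: revenue fell",
    "doc3: no data",
    "doc4: revenue flat",
]
final_answer = synthesize(partials, toy_merge, budget=6)
print(final_answer)
```

Each round strictly reduces the number of answers, so the loop terminates; the fallback branch covers the case where every merged answer already exceeds the budget on its own.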

Title: Do Language Models Know Theo Has a Wife? Investigating the Proviso Problem

Authors: Tara Azin, Daniel Dumitrescu, Diana Inkpen, Raj Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08358
Pdf URL: https://arxiv.org/pdf/2603.08358
Copy Paste: [[2603.08358]] Do Language Models Know Theo Has a Wife? Investigating the Proviso Problem(https://arxiv.org/abs/2603.08358)
Keywords: language model
Abstract: We investigate how language models handle the proviso problem, an unresolved issue in pragmatics where presuppositions in conditional sentences diverge between theoretical and human interpretations. We reformulate this phenomenon as a Natural Language Inference task and introduce a diagnostic dataset designed to probe presupposition projection in conditionals. We evaluate RoBERTa, DeBERTa, LLaMA, and Gemma using explainability analyses. The results show that models broadly align with human judgments but rely on shallow pattern matching rather than semantic or pragmatic reasoning. Our work provides the first computational evaluation framework for the proviso problem and highlights the need for diagnostic, multi-method approaches to assess pragmatic competence and context-dependent meaning in language models.
摘要：我们研究语言模型如何处理附带问题，这是一个语用学中尚未解决的问题，其中条件句子中的预设在理论解释和人类解释之间存在分歧。我们将这种现象重新表述为自然语言推理任务，并引入了一个旨在探测条件中的预设投影的诊断数据集。我们使用可解释性分析来评估 RoBERTa、DeBERTa、LLaMA 和 Gemma。结果表明，模型与人类的判断大致一致，但依赖于浅层模式匹配，而不是语义或语用推理。我们的工作为附带问题提供了第一个计算评估框架，并强调需要诊断性的多方法来评估语言模型中的语用能力和上下文相关的含义。

Title: Adaptive Loops and Memory in Transformers: Think Harder or Know More?

Authors: Markus Frey, Behzad Shomali, Ali Hamza Bashir, David Berghaus, Mehdi Ali
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08391
Pdf URL: https://arxiv.org/pdf/2603.08391
Copy Paste: [[2603.08391]] Adaptive Loops and Memory in Transformers: Think Harder or Know More?(https://arxiv.org/abs/2603.08391)
Keywords: language model, prompt, chain-of-thought
Abstract: Chain-of-thought (CoT) prompting enables reasoning in language models but requires explicit verbalization of intermediate steps. Looped transformers offer an alternative by iteratively refining representations within hidden states. This parameter efficiency comes at a cost, as looped models lack the storage capacity of deeper models which use unique weights per layer. In this work, we investigate transformer models that feature both adaptive per-layer looping, where each transformer block learns to iterate its hidden state via a learned halting mechanism, and gated memory banks that provide additional learned storage. We find that looping primarily benefits mathematical reasoning, while memory banks help recover performance on commonsense tasks compared to parameter- and FLOP-matched models. Combining both mechanisms yields a model that outperforms an iso-FLOP baseline (with three times the number of layers) on math benchmarks. Analysis of model internals reveals layer specialization: early layers learn to loop minimally and access memory sparingly, while later layers do both more heavily.
摘要：思想链 (CoT) 提示可以在语言模型中进行推理，但需要对中间步骤进行明确的语言表达。循环变压器通过迭代地细化隐藏状态中的表示提供了一种替代方案。这种参数效率是有代价的，因为循环模型缺乏更深层次模型的存储容量，而更深层次模型使用每层独特的权重。在这项工作中，我们研究了具有自适应每层循环功能的变压器模型，其中每个变压器块通过学习停止机制学习迭代其隐藏状态，以及提供额外学习存储的门控内存库。我们发现循环主要有利于数学推理，而与参数和 FLOP 匹配模型相比，内存库有助于恢复常识任务的性能。结合这两种机制产生的模型在数学基准上优于 iso-FLOP 基线（层数是其三倍）。对模型内部结构的分析揭示了层的专业化：早期的层学习最少的循环并少量地访问内存，而后面的层则更加频繁地执行这两个操作。

Title: COACH meets QUORUM: A Framework and Pipeline for Aligning User, Expert and Developer Perspectives in LLM-generated Health Counselling

Authors: Yee Man Ng, Bram van Dijk, Pieter Beynen, Otto Boekesteijn, Joris Jansen, Gerard van Oortmerssen, Max van Duijn, Marco Spruit
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08392
Pdf URL: https://arxiv.org/pdf/2603.08392
Copy Paste: [[2603.08392]] COACH meets QUORUM: A Framework and Pipeline for Aligning User, Expert and Developer Perspectives in LLM-generated Health Counselling(https://arxiv.org/abs/2603.08392)
Keywords: language model, llm, hallucination
Abstract: Systems that collect data on sleep, mood, and activities can provide valuable lifestyle counselling to populations affected by chronic disease and its consequences. Such systems are, however, challenging to develop; besides reliably extracting patterns from user-specific data, systems should also contextualise these patterns with validated medical knowledge to ensure the quality of counselling, and generate counselling that is relevant to a real user. We present QUORUM, a new evaluation framework that unifies these developer-, expert-, and user-centric perspectives, and show with a real case study that it meaningfully tracks convergence and divergence in stakeholder perspectives. We also present COACH, a Large Language Model-driven pipeline to generate personalised lifestyle counselling for our Healthy Chronos use case, a diary app for cancer patients and survivors. Applying our framework shows that overall, users, medical experts, and developers converge on the opinion that the generated counselling is relevant, of good quality, and reliable. However, stakeholders also diverge on the tone of the counselling, sensitivity to errors in pattern-extraction, and potential hallucinations. These findings highlight the importance of multi-stakeholder evaluation for consumer health language technologies and illustrate how a unified evaluation framework can support trustworthy, patient-centered NLP systems in real-world settings.
摘要：收集睡眠、情绪和活动数据的系统可以为受慢性病及其后果影响的人群提供有价值的生活方式咨询。然而，此类系统的开发具有挑战性；除了从特定于用户的数据中可靠地提取模式之外，系统还应该将这些模式与经过验证的医学知识结合起来，以确保咨询的质量，并生成与真实用户相关的咨询。我们提出了 QUORUM，一个新的评估框架，它统一了这些以开发人员、专家和用户为中心的观点，并通过真实的案例研究表明，它有意义地跟踪利益相关者观点的趋同和分歧。我们还推出了 COACH，这是一个大型语言模型驱动的管道，可为我们的 Healthy Chronos 用例（癌症患者和幸存者的日记应用程序）生成个性化的生活方式咨询。应用我们的框架表明，总体而言，用户、医学专家和开发人员一致认为生成的咨询是相关的、高质量的且可靠的。然而，利益相关者在咨询的基调、对模式提取错误的敏感性以及潜在的幻觉方面也存在分歧。这些发现强调了多利益相关者评估对消费者健康语言技术的重要性，并说明了统一的评估框架如何在现实环境中支持值得信赖、以患者为中心的 NLP 系统。

Title: Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective

Authors: Liyuan Mao, Le Yu, Jing Zhou, Chujie Zheng, Bowen Yu, Chang Gao, Shixuan Liu, An Yang, Weinan Zhang, JunYang Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.08398
Pdf URL: https://arxiv.org/pdf/2603.08398
Copy Paste: [[2603.08398]] Revealing Behavioral Plasticity in Large Language Models: A Token-Conditional Perspective(https://arxiv.org/abs/2603.08398)
Keywords: language model, llm
Abstract: In this work, we reveal that Large Language Models (LLMs) possess intrinsic behavioral plasticity, akin to chameleons adapting their coloration to environmental cues, that can be exposed through token-conditional generation and stabilized via reinforcement learning. Specifically, by conditioning generation on carefully selected token prefixes sampled from responses exhibiting desired behaviors, LLMs seamlessly adapt their behavioral modes at inference time (e.g., switching from step-by-step reasoning to direct answering) without retraining. Based on this insight, we propose Token-Conditioned Reinforcement Learning (ToCoRL), a principled framework that leverages RL to internalize this chameleon-like plasticity, transforming transient inference-time adaptations into stable and learnable behavioral patterns. ToCoRL guides exploration with token-conditional generation and keeps enhancing exploitation, enabling the emergence of appropriate behaviors. Extensive experiments show that ToCoRL enables precise behavioral control without capability degradation. Notably, we show that large reasoning models, while performing strongly on complex mathematics, can be effectively adapted to excel at factual question answering, a capability previously hindered by their step-by-step reasoning patterns.
摘要：在这项工作中，我们揭示了大型语言模型（LLM）具有内在的行为可塑性——类似于变色龙根据环境线索调整颜色——可以通过令牌条件生成来暴露并通过强化学习来稳定。具体来说，通过根据从表现出所需行为的响应中采样的精心选择的令牌前缀来调节生成，法学硕士可以在推理时无缝地调整其行为模式（例如，从逐步推理切换到直接回答），而无需重新训练。基于这一见解，我们提出了令牌条件强化学习（ToCoRL），这是一个原则框架，利用 RL 内化这种变色龙般的可塑性，将短暂的推理时间适应转化为稳定且可学习的行为模式。 ToCoRL 通过令牌条件生成来指导探索，并不断增强利用，从而出现适当的行为。大量实验表明，ToCoRL 能够实现精确的行为控制，而不会降低能力。值得注意的是，我们表明，大型推理模型虽然在复杂数学上表现强劲，但可以有效地适应事实问答，而这是以前因其逐步推理模式所阻碍的能力。

Title: Aligning to Illusions: Choice Blindness in Human and AI Feedback

Authors: Wenbin Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.08412
Pdf URL: https://arxiv.org/pdf/2603.08412
Copy Paste: [[2603.08412]] Aligning to Illusions: Choice Blindness in Human and AI Feedback(https://arxiv.org/abs/2603.08412)
Keywords: llm
Abstract: Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
摘要：来自人类反馈的强化学习 (RLHF) 假设注释者偏好反映稳定的内部状态。我们通过跨越偏好管道的三个实验来挑战这一点。在一项人类选择失明研究中，91% 的秘密交换偏好未被发现，将选择失明延伸到对不熟悉的文本进行第三人称评价比较。对 15 名 LLM 法官作为潜在的替代者进行测试，我们发现检测依赖于浅层文本匹配，而不是真正的自我监控：从上下文中删除先前的推理会导致盲目性从接近零激增到超过 50%，而明确的社会压力会导致近乎普遍的服从。在跨 86M 到 2B 参数的两种架构的剂量反应实验中，在奖励信号减半之前必须损坏六分之一到三分之一的标签，但标准成对精度实际上保持不变。 Best-of-N 评估证实这会导致下游策略退化：在 50% 的腐败情况下，奖励引导的选择不会比随机抽样产生任何改进，而代理模型报告单调递增的分数。总之，这些结果揭示了一个偏好构建问题：进入 RLHF 的信号是由启发上下文塑造的，而人类元认知、法学硕士自我监控和标准评估指标都无法检测到。

Title: One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States

Authors: Bo Jiang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.08429
Pdf URL: https://arxiv.org/pdf/2603.08429
Copy Paste: [[2603.08429]] One Model Is Enough: Native Retrieval Embeddings from LLM Agent Hidden States(https://arxiv.org/abs/2603.08429)
Keywords: llm, agent
Abstract: LLM agents that retrieve external knowledge typically generate a search query as text, then run a separate embedding model to encode it into a vector. This two-model pipeline adds infrastructure complexity and latency, yet is redundant: the LLM already encodes the full conversational context in its hidden states. We propose equipping LLM agents with native retrieval capability by adding a lightweight projection head that maps hidden states directly into the embedding space, eliminating the need for a separate embedding model. Trained with a combination of alignment, contrastive, and rank distillation losses, our method retains 97% of baseline retrieval quality while enabling the LLM agent to search with its own representations. Experiments on the QReCC conversational search benchmark show competitive Recall@10 and MRR@10 compared to the standard generate-then-encode pipeline, with systematic ablations confirming the contribution of each loss component.
摘要：检索外部知识的 LLM 代理通常会生成文本形式的搜索查询，然后运行单独的嵌入模型将其编码为向量。这种两种模型的管道增加了基础设施的复杂性和延迟，但却是多余的：LLM 已经在其隐藏状态中编码了完整的对话上下文。我们建议通过添加一个轻量级投影头来为 LLM 代理配备本机检索功能，该投影头将隐藏状态直接映射到嵌入空间，从而无需单独的嵌入模型。通过结合对齐、对比和排名蒸馏损失进行训练，我们的方法保留了 97% 的基线检索质量，同时使 LLM 代理能够使用自己的表示进行搜索。 QReCC 会话搜索基准测试表明，与标准生成然后编码管道相比，Recall@10 和 MRR@10 具有竞争力，并通过系统消融确认了每个损失分量的贡献。
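
To make the projection-head idea in this abstract concrete, here is a minimal sketch under stated assumptions: the head is taken to be a single linear map followed by L2 normalization, and the alignment term is taken to be one minus cosine similarity to a reference embedding. The abstract does not specify either choice, so both are assumptions, and the contrastive and rank distillation losses are not shown. The names `project` and `alignment_loss` and all numbers are hypothetical.

```python
import math


def project(hidden, weights):
    """Apply a linear head (one weight row per output dim), then L2-normalize."""
    out = [sum(w * h for w, h in zip(row, hidden)) for row in weights]
    norm = math.sqrt(sum(x * x for x in out)) or 1.0
    return [x / norm for x in out]


def alignment_loss(query_vec, target_vec):
    """One minus cosine similarity; query_vec is assumed to be unit-length."""
    t_norm = math.sqrt(sum(x * x for x in target_vec)) or 1.0
    cos = sum(q * t for q, t in zip(query_vec, target_vec)) / t_norm
    return 1.0 - cos


# Toy 4-dim hidden state and a 2x4 head, invented for illustration.
hidden_state = [1.0, 2.0, 0.0, -1.0]
head_weights = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
]
query_embedding = project(hidden_state, head_weights)

# Reference embedding a separate encoder might produce (assumed values).
reference_embedding = [1.0, 2.0]
loss = alignment_loss(query_embedding, reference_embedding)
print(loss)
```

In this toy setup the projected vector points in the same direction as the reference, so the loss is zero up to floating-point error; training would adjust `head_weights` to drive this loss down across many queries.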

Title: A Dataset for Probing Translationese Preferences in English-to-Swedish Translation

Authors: Jenny Kunz, Anja Jarochenko, Marcel Bollmann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08450
Pdf URL: https://arxiv.org/pdf/2603.08450
Copy Paste: [[2603.08450]] A Dataset for Probing Translationese Preferences in English-to-Swedish Translation(https://arxiv.org/abs/2603.08450)
Keywords: language model, llm
Abstract: Translations often carry traces of the source language, a phenomenon known as translationese. We introduce the first freely available English-to-Swedish dataset contrasting translationese sentences with idiomatic alternatives, designed to probe intrinsic preferences of language models. It includes error tags and descriptions of the problems in the original translations. In experiments evaluating smaller Swedish and multilingual LLMs with our dataset, we find that they often favor the translationese phrasing. Human alternatives are chosen more often when the English source sentence is omitted, indicating that exposure to the source biases models toward literal translations, although even without context models often prefer the translationese variant. Our dataset and findings provide a resource and benchmark for developing models that produce more natural, idiomatic output in non-English languages.
摘要：翻译常常带有源语言的痕迹，这种现象被称为翻译语。我们引入了第一个免费提供的英语到瑞典语数据集，将翻译句子与惯用替代方案进行对比，旨在探究语言模型的内在偏好。它包括错误标签和原始翻译中问题的描述。在使用我们的数据集评估规模较小的瑞典语和多语言法学硕士的实验中，我们发现他们通常更喜欢翻译短语。当省略英语源句子时，更频繁地选择人类替代方案，这表明接触源语句会使模型偏向直译，尽管即使没有上下文模型也通常更喜欢翻译变体。我们的数据集和研究结果为开发模型提供了资源和基准，以非英语语言生成更自然、更惯用的输出。

Title: Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA

Authors: Ummar Abbas, Mourad Ouzzani, Mohamed Y. Eltabakh, Omar Sinan, Gagan Bhatia, Hamdy Mubarak, Majd Hawasly, Mohammed Qusay Hashim, Kareem Darwish, Firoj Alam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.08501
Pdf URL: https://arxiv.org/pdf/2603.08501
Copy Paste: [[2603.08501]] Fanar-Sadiq: A Multi-Agent Architecture for Grounded Islamic QA(https://arxiv.org/abs/2603.08501)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Large language models (LLMs) can answer religious knowledge queries fluently, yet they often hallucinate and misattribute sources, which is especially consequential in Islamic settings where users expect grounding in canonical texts (Qur'an and Hadith) and jurisprudential (fiqh) nuance. Retrieval-augmented generation (RAG) reduces some of these limitations by grounding generation in external evidence. However, a single "retrieve-then-generate" pipeline is limited to deal with the diversity of Islamic this http URL may request verbatim scripture, fatwa-style guidance with citations or rule-constrained computations such as zakat and inheritance that require strict arithmetic and legal invariants. In this work, we present a bilingual (Arabic/English) multi-agent Islamic assistant, called Fanar-Sadiq, which is a core component of the Fanar AI platform. Fanar-Sadiq routes Islamic-related queries to specialized modules within an agentic, tool-using architecture. The system supports intent-aware routing, retrieval-grounded fiqh answers with deterministic citation normalization and verification traces, exact verse lookup with quotation validation, and deterministic calculators for Sunni zakat and inheritance with madhhab-sensitive branching. We evaluate the complete end-to-end system on public Islamic QA benchmarks and demonstrate effectiveness and efficiency. Our system is currently publicly and freely accessible through API and a Web application, and has been accessed approximately 1.9M times in less than a year.
摘要：大型语言模型（LLM）可以流利地回答宗教知识查询，但它们经常产生幻觉和错误归因来源，这在伊斯兰环境中尤其重要，因为用户期望以规范文本（古兰经和圣训）和法理学（fiqh）的细微差别为基础。检索增强生成（RAG）通过将生成基于外部证据来减少其中的一些限制。然而，单个“检索然后生成”管道仅限于处理伊斯兰的多样性，此http URL可能需要逐字经文、带有引用的追杀令式指导或规则约束的计算，例如需要严格算术和法律不变量的天课和继承。在这项工作中，我们提出了一种双语（阿拉伯语/英语）多智能体伊斯兰助手，称为 Fanar-Sadiq，它是 Fanar AI 平台的核心组件。 Fanar-Sadiq 将与伊斯兰相关的查询路由到代理的工具使用架构中的专用模块。该系统支持意图感知路由、具有确定性引文规范化和验证跟踪的基于检索的 fiqh 答案、具有引用验证的精确诗句查找以及逊尼派天课的确定性计算器和具有 Madhhab 敏感分支的继承。我们根据公共伊斯兰质量保证基准评估完整的端到端系统，并展示有效性和效率。我们的系统目前可通过 API 和 Web 应用程序公开、免费地访问，并且在不到一年的时间里被访问了约 190 万美元。