2026-03-06

Title: CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models

Authors: Zhehao Tan, Yihan Jiao, Dan Yang, Junjie Wang, Duolin Sun, Jie Feng, Xidong Wang, Lei Liu, Yue Shen, Jian Wang, Jinjie Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04406
Pdf URL: https://arxiv.org/pdf/2603.04406
Copy Paste: [[2603.04406]] CTRL-RAG: Contrastive Likelihood Reward Based Reinforcement Learning for Context-Faithful RAG Models(https://arxiv.org/abs/2603.04406)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation
Abstract: With the growing use of Retrieval-Augmented Generation (RAG), training large language models (LLMs) for context-sensitive reasoning and faithfulness is increasingly important. Existing RAG-oriented reinforcement learning (RL) methods rely on external rewards that often fail to evaluate document faithfulness, and may misjudge similar answers in open-domain settings. In addition, there is no RAG-based self-reward mechanism. Moreover, although such a mechanism could in principle estimate answer confidence given documents, the absence of objective feedback in a self-judgment can cause hallucination accumulation and eventual model collapse. To tackle these issues, we propose a novel "internal-external" hybrid reward framework centered on a Contrastive Likelihood Reward (CLR). CLR directly optimizes the log-likelihood gap between responses conditioned on prompts with and without supporting evidence. This encourages the model to extract relevant evidence and increases its confidence when grounded in a specific context. Experiments show that our method (used alone or combined with external correctness rewards) achieves strong performance on single-hop, multi-hop, vertical-domain, and faithfulness benchmarks. Our training code and models are coming soon.
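The log-likelihood gap described in the CTRL-RAG abstract can be sketched as follows. This is a minimal illustration, assuming per-token log-probabilities of the same response have already been computed under two prompts (with and without the supporting documents). The function name `contrastive_likelihood_reward`, the optional length normalization, and the toy numbers are assumptions for illustration, not the authors' implementation.

```python
from typing import Sequence


def contrastive_likelihood_reward(
    logp_with_docs: Sequence[float],
    logp_without_docs: Sequence[float],
    length_normalize: bool = True,
) -> float:
    """Log-likelihood gap of one response scored under two prompts.

    Both inputs hold per-token log-probs of the SAME response tokens:
    one conditioned on question + documents, one on the question alone.
    """
    if len(logp_with_docs) != len(logp_without_docs):
        raise ValueError("both sequences must score the same response tokens")
    if not logp_with_docs:
        raise ValueError("response must contain at least one token")
    gap = sum(logp_with_docs) - sum(logp_without_docs)
    # Length normalization is an assumption here, not taken from the abstract.
    return gap / len(logp_with_docs) if length_normalize else gap


# Toy per-token log-probs (made up for illustration).
grounded = [-0.2, -0.1, -0.3]
ungrounded = [-1.2, -0.9, -1.5]
reward = contrastive_likelihood_reward(grounded, ungrounded)
```

A positive value means the response is more likely when the evidence is in the prompt, which is the direction the abstract says CLR encourages.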

Title: Semantic Containment as a Fundamental Property of Emergent Misalignment

Authors: Rohan Saxena
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04407
Pdf URL: https://arxiv.org/pdf/2603.04407
Copy Paste: [[2603.04407]] Semantic Containment as a Fundamental Property of Emergent Misalignment(https://arxiv.org/abs/2603.04407)
Keywords: language model
Abstract: Fine-tuning language models on narrowly harmful data causes emergent misalignment (EM) -- behavioral failures extending far beyond training distributions. Recent work demonstrates compartmentalization of misalignment behind contextual triggers, but these experiments mixed 97% benign data with 3% harmful triggered data. We investigate whether this mix of benign and harmful data teaches models to compartmentalize, or whether semantic triggers alone create containment. We train three model families (Qwen 2.5 14B, Llama 3.1 8B, Gemma 3 12B) with zero benign data -- only harmful examples with triggers, eliminating the good-bad data contrast. We demonstrate that baseline EM rates of 9.5--23.5% drop to 0.0--1.0% when triggers are removed during inference, but recover to 12.2--22.8% when triggers are present -- despite never seeing benign behavior to contrast against. Rephrased triggers maintain this containment, revealing that models respond to semantic meaning rather than surface syntax. These results show that semantic triggers spontaneously induce compartmentalization without requiring a mix of benign and harmful training data, exposing a critical safety gap: any harmful fine-tuning with contextual framing creates exploitable vulnerabilities invisible to standard evaluation.

Title: Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World

Authors: Luzhou Peng, Zhengxin Yang, Honglu Ji, Yikang Yang, Fanda Fan, Wanling Gao, Jiayuan Ge, Yilin Han, Jianfeng Zhan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04408
Pdf URL: https://arxiv.org/pdf/2603.04408
Copy Paste: [[2603.04408]] Probing Memes in LLMs: A Paradigm for the Entangled Evaluation World(https://arxiv.org/abs/2603.04408)
Keywords: language model, llm
Abstract: Current evaluation paradigms for large language models (LLMs) characterize models and datasets separately, yielding coarse descriptions: items in datasets are treated as pre-labeled entries, and models are summarized by overall scores such as accuracy, together ignoring the diversity of population-level model behaviors across items with varying properties. To address this gap, this paper conceptualizes LLMs as composed of memes, a notion introduced by Dawkins as cultural genes that replicate knowledge and behavior. Building on this perspective, the Probing Memes paradigm reconceptualizes evaluation as an entangled world of models and data. It centers on a Perception Matrix that captures model-item interactions, enabling Probe Properties for characterizing items and Meme Scores for depicting model behavioral traits. Applied to 9 datasets and 4,507 LLMs, Probing Memes reveals hidden capability structures and quantifies phenomena invisible under traditional paradigms (e.g., elite models failing on problems that most models answer easily). It not only supports more informative and extensible benchmarks but also enables population-based evaluation of LLMs.

Title: Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

Authors: Nora Petrova, Andrew Gordon, Enzo Blindow
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2603.04409
Pdf URL: https://arxiv.org/pdf/2603.04409
Copy Paste: [[2603.04409]] Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework(https://arxiv.org/abs/2603.04409)
Keywords: language model, llm
Abstract: The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants that were stratified across 22 demographic groups, both in the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We use a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model, with post-stratification to census data, and our analysis reveals three key insights. (1) We establish a clear performance hierarchy where google/gemini-2.5-pro ranks first overall, with a 95.6% posterior probability of being the top-ranked model. (2) We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model's perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. (3) We quantify the vast difference in discriminative power across evaluation dimensions, with ambiguous qualities like "Trust, Ethics & Safety" showing a 65% tie rate, in stark contrast to the decisive 10% tie rate for "Overall Winner". Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.
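For readers unfamiliar with the Bradley-Terry-Davidson model named in the HUMAINE abstract, the sketch below shows the standard Davidson tie extension for a single pairwise comparison. It covers only the likelihood for one pair; the hierarchical Bayesian structure and census post-stratification mentioned in the abstract are not shown. The function name `btd_probabilities` and the parameter names are illustrative assumptions.

```python
import math
from typing import Tuple


def btd_probabilities(theta_i: float, theta_j: float, nu: float) -> Tuple[float, float, float]:
    """Return (P(i wins), P(tie), P(j wins)) under the Davidson tie model.

    theta_i, theta_j: log-strengths of the two models being compared.
    nu: non-negative tie parameter; nu = 0 recovers plain Bradley-Terry.
    """
    if nu < 0:
        raise ValueError("nu must be non-negative")
    pi_i, pi_j = math.exp(theta_i), math.exp(theta_j)
    tie_mass = nu * math.sqrt(pi_i * pi_j)
    total = pi_i + pi_j + tie_mass
    return pi_i / total, tie_mass / total, pi_j / total


# Two equally strong models with nu = 1: win, tie, and loss are equally likely.
p_win, p_tie, p_loss = btd_probabilities(0.0, 0.0, 1.0)
```

In this parameterization a larger `nu` puts more probability on ties, so evaluation dimensions with very different tie rates would correspond to very different tie parameters.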

Title: SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models

Authors: Omar Abdelnasser, Fatemah Alharbi, Khaled Khasawneh, Ihsen Alouani, Mohammed E. Fouda
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04410
Pdf URL: https://arxiv.org/pdf/2603.04410
Copy Paste: [[2603.04410]] SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models(https://arxiv.org/abs/2603.04410)
Keywords: language model, prompt
Abstract: Safety alignment in Language Models (LMs) is fundamental for trustworthy AI. However, while different stakeholders are trying to leverage Arabic Language Models (ALMs), systematic safety evaluation of ALMs remains largely underexplored, limiting their mainstream uptake. Existing safety benchmarks and safeguard models are predominantly English-centric, limiting their applicability to Arabic Natural Language Processing (NLP) systems and obscuring fine-grained, category-level safety vulnerabilities. This paper introduces SalamaBench, a unified benchmark for evaluating the safety of ALMs, comprising 8,170 prompts across 12 different categories aligned with the MLCommons Safety Hazard Taxonomy. Constructed by harmonizing heterogeneous datasets through a rigorous pipeline involving AI filtering and multi-stage human verification, SalamaBench enables standardized, category-aware safety evaluation. Using this benchmark, we evaluate five state-of-the-art ALMs, including Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, under multiple safeguard configurations, including individual guard models, majority-vote aggregation, and validation against human-annotated gold labels. Our results reveal substantial variation in safety alignment: while Fanar 2 achieves the lowest aggregate attack success rates, its robustness is uneven across specific harm domains. In contrast, Jais 2 consistently exhibits elevated vulnerability, indicating weaker intrinsic safety alignment. We further demonstrate that native ALMs perform substantially worse than dedicated safeguard models when acting as safety judges. Overall, our findings highlight the necessity of category-aware evaluation and specialized safeguard mechanisms for robust harm mitigation in ALMs.

Title: One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

Authors: Liming Lu, Kaixi Qiu, Jiayu Zhou, Jushi Kai, Haoyan Zhang, Huanyu Wang, Jingwen Leng, Ziwei He, Zhouhan Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.04411
Pdf URL: https://arxiv.org/pdf/2603.04411
Copy Paste: [[2603.04411]] One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache(https://arxiv.org/abs/2603.04411)
Keywords: language model, llm
Abstract: Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.

Title: Additive Multi-Step Markov Chains and the Curse of Dimensionality in Large Language Models

Authors: O.V. Usatenko, S.S. Melnyk, G.M. Pritula
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04412
Pdf URL: https://arxiv.org/pdf/2603.04412
Copy Paste: [[2603.04412]] Additive Multi-Step Markov Chains and the Curse of Dimensionality in Large Language Models(https://arxiv.org/abs/2603.04412)
Keywords: language model, llm
Abstract: Large-scale language models (LLMs) operate in extremely high-dimensional state spaces, where both token embeddings and their hidden representations create complex dependencies that are not easily reduced to classical Markov structures. In this paper, we explore a theoretically feasible approximation of LLM dynamics using N-order additive Markov chains. Such models allow the conditional probability of the next token to be decomposed into a superposition of contributions from multiple historical depths, reducing the combinatorial explosion typically associated with high-order Markov processes. The main result of the work is the establishment of a correspondence between an additive multi-step chain and a chain with a step-wise memory function. This equivalence allowed the introduction of the concept of information temperature not only for stepwise but also for additive N-order Markov chains.

Title: Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries

Authors: Natalie Perez, Sreyoshi Bhaduri, Aman Chadha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04413
Pdf URL: https://arxiv.org/pdf/2603.04413
Copy Paste: [[2603.04413]] Simulating Meaning, Nevermore! Introducing ICR: A Semiotic-Hermeneutic Metric for Evaluating Meaning in LLM Text Summaries(https://arxiv.org/abs/2603.04413)
Keywords: language model, llm
Abstract: Meaning in human language is relational, context-dependent, and emergent, arising from dynamic systems of signs rather than fixed word-concept mappings. In computational settings, this semiotic and interpretive complexity complicates the generation and evaluation of meaning. This article proposes an interdisciplinary framework for studying meaning in language generated by large language models (LLMs) by integrating semiotics and hermeneutics with qualitative research methods. We review prior scholarship on meaning and machines, examining how linguistic signs are transformed into vectorized representations in static and contextualized embedding models, and identify gaps between statistical approximation and human interpretive meaning. We then introduce the Inductive Conceptual Rating (ICR) metric, a qualitative evaluation approach grounded in inductive content analysis and reflexive thematic analysis, designed to assess semantic accuracy and meaning alignment in LLM outputs beyond lexical similarity metrics. We apply ICR in an empirical comparison of LLM-generated and human-generated thematic summaries across five datasets (N = 50 to 800). While LLMs achieve high linguistic similarity, they underperform on semantic accuracy, particularly in capturing contextually grounded meanings. Performance improves with larger datasets but remains variable across models, potentially reflecting differences in the frequency and coherence of recurring concepts and meanings. We conclude by arguing for evaluation frameworks that leverage systematic qualitative interpretation practices when assessing meaning in LLM-generated outputs from reference texts.

Title: The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning

Authors: Ruobing Zheng, Tianqi Li, Jianing Li, Qingpei Guo, Yi Yuan, Jingdong Chen
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2603.04415
Pdf URL: https://arxiv.org/pdf/2603.04415
Copy Paste: [[2603.04415]] The Thinking Boundary: Quantifying Reasoning Suitability of Multimodal Tasks via Dual Tuning(https://arxiv.org/abs/2603.04415)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: While reasoning-enhanced Large Language Models (LLMs) have demonstrated remarkable advances in complex tasks such as mathematics and coding, their effectiveness across universal multimodal scenarios remains uncertain. The trend of releasing parallel "Instruct" and "Thinking" models by leading developers serves merely as a resource-intensive workaround, stemming from the lack of a criterion for determining when reasoning is truly beneficial. In this paper, we propose Dual Tuning, a framework designed to assess whether reasoning yields positive gains for target tasks under given base models and datasets. By jointly fine-tuning on paired Chain-of-Thought (CoT) and Direct-Answer (DA) data under controlled prompts, we systematically quantify and compare the gains of both training modes using the proposed metrics, and establish the "Thinking Boundary" to evaluate the suitability of reasoning training across diverse multimodal tasks, including spatial, mathematical, and multi-disciplinary domains. We further explore the impact of reinforcement training and thinking patterns on reasoning suitability, and validate whether the "Thinking Boundary" can guide data refinement. Our findings challenge the "reasoning-for-all" paradigm, providing practical guidance for identifying appropriate data and training strategies, and motivating the development of resource-efficient, adaptive auto-think systems.

Title: Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction

Authors: Rabab Alkhalifa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04416
Pdf URL: https://arxiv.org/pdf/2603.04416
Copy Paste: [[2603.04416]] Optimizing What We Trust: Reliability-Guided QUBO Selection of Multi-Agent Weak Framing Signals for Arabic Sentiment Prediction(https://arxiv.org/abs/2603.04416)
Keywords: llm, agent
Abstract: Framing detection in Arabic social media is difficult due to interpretive ambiguity, cultural grounding, and limited reliable supervision. Existing LLM-based weak supervision methods typically rely on label aggregation, which is brittle when annotations are few and socially dependent. We propose a reliability-aware weak supervision framework that shifts the focus from label fusion to data curation. A small multi-agent LLM pipeline (two framers, a critic, and a discriminator) treats disagreement and reasoning quality as epistemic signals and produces instance-level reliability estimates. These estimates guide a QUBO-based subset selection procedure that enforces frame balance while reducing redundancy. Intrinsic diagnostics and an out-of-domain Arabic sentiment transfer test show that the selected subsets are more reliable and encode non-random, transferable structure, without degrading strong text-only baselines.
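To make the idea of QUBO-based subset selection in the abstract above concrete, here is a toy sketch. It assumes a simple objective that rewards per-instance reliability on the diagonal and penalizes pairwise similarity (redundancy) off the diagonal, solved by brute force over a handful of items. The objective, the helper names `build_selection_qubo` and `solve_qubo_bruteforce`, and all numbers are hypothetical; the frame-balance term mentioned in the abstract is omitted, and the paper's actual formulation and solver may differ.

```python
from itertools import product
from typing import List, Sequence, Tuple


def build_selection_qubo(
    reliability: Sequence[float],
    similarity: Sequence[Sequence[float]],
    redundancy_weight: float,
) -> List[List[float]]:
    """Upper-triangular QUBO matrix: reward reliability, penalize redundancy."""
    n = len(reliability)
    Q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        Q[i][i] = -reliability[i]  # selecting a reliable item lowers the energy
        for j in range(i + 1, n):
            Q[i][j] = redundancy_weight * similarity[i][j]  # co-selecting similar items costs energy
    return Q


def solve_qubo_bruteforce(Q: Sequence[Sequence[float]]) -> Tuple[Tuple[int, ...], float]:
    """Minimize x^T Q x over binary x by enumeration (only viable for tiny n)."""
    n = len(Q)
    best_x: Tuple[int, ...] = ()
    best_energy = float("inf")
    for x in product((0, 1), repeat=n):
        energy = sum(Q[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
        if energy < best_energy:
            best_x, best_energy = x, energy
    return best_x, best_energy


# Items 0 and 1 are near-duplicates; item 2 is less reliable but distinct.
reliability = [0.9, 0.8, 0.3]
similarity = [[0.0, 0.95, 0.1], [0.95, 0.0, 0.1], [0.1, 0.1, 0.0]]
selection, energy = solve_qubo_bruteforce(build_selection_qubo(reliability, similarity, 1.0))
```

With these toy numbers the minimizer keeps items 0 and 2 and drops the near-duplicate item 1, which illustrates how a redundancy penalty can override raw reliability.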

Title: Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge

Authors: Fiona Lau
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04417
Pdf URL: https://arxiv.org/pdf/2603.04417
Copy Paste: [[2603.04417]] Same Input, Different Scores: A Multi Model Study on the Inconsistency of LLM Judge(https://arxiv.org/abs/2603.04417)
Keywords: language model, gpt, llm, retrieval-augmented generation
Abstract: Large language models are increasingly used as automated evaluators in research and enterprise settings, a practice known as LLM-as-a-judge. While prior work has examined accuracy, bias, and alignment with human preferences, far less attention has been given to how consistently LLMs assign numerical scores, an important concern for many production workflows. This study systematically evaluates scoring stability across five commonly used models, GPT-4o, GPT-4o-mini, Gemini-2.5-Flash, Claude-Haiku-4.5, and Claude-Sonnet-4.5, two temperature settings, and real enterprise question-answer pairs drawn from a retrieval-augmented generation (RAG) system. We address three questions: how stable a model's scores are across repeated runs, how differently models score identical inputs, and how temperature affects scoring consistency. Temperature controls the determinism of an LLM's output. Despite expectations of stability at temperature=0, we observe substantial variability across models, with completeness scoring showing the largest fluctuations. Cross-model comparisons reveal systematic differences in strictness and interpretive style, leading to divergent ratings for the same answers. Lower temperatures improve stability for some models, notably GPT-4o and Gemini, but have limited or inconsistent effects for Anthropic models. These findings have important implications for enterprise pipelines that rely on LLM-generated scores for routing, triage, gating, or quality control. Identical inputs can receive different scores depending on model, family, or temperature, raising concerns around fairness, reproducibility, and operational reliability. Our results highlight the need for monitoring, robust parsing, and hybrid human-LLM evaluation strategies to ensure dependable use of LLM-as-a-judge in production environments.
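One simple way to quantify the run-to-run scoring stability discussed in the abstract above is to look at the spread of scores each item receives across repeated judge calls. The sketch below is an assumption about how such a check could be done, not the study's actual methodology; the function name `stability_report`, the chosen statistics, and the sample scores are illustrative.

```python
from statistics import mean, pstdev
from typing import Dict, List


def stability_report(scores_by_item: Dict[str, List[float]]) -> Dict[str, float]:
    """Summarize how much repeated judge scores vary for identical inputs."""
    spreads = [pstdev(runs) for runs in scores_by_item.values()]
    unstable = [len(set(runs)) > 1 for runs in scores_by_item.values()]
    return {
        "mean_std": mean(spreads),  # average per-item standard deviation
        "max_std": max(spreads),  # worst-case item
        "fraction_unstable": sum(unstable) / len(unstable),  # items with any disagreement
    }


# Made-up scores from three repeated runs on three question-answer pairs.
runs = {"qa_1": [4, 4, 4], "qa_2": [3, 5, 4], "qa_3": [2, 2, 3]}
report = stability_report(runs)
```

Running the same report per model and per temperature setting would give a side-by-side view of the three questions the abstract poses.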

Title: Context-Dependent Affordance Computation in Vision-Language Models

Authors: Murad Farzulla
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.04419
Pdf URL: https://arxiv.org/pdf/2603.04419
Copy Paste: [[2603.04419]] Context-Dependent Affordance Computation in Vision-Language Models(https://arxiv.org/abs/2603.04419)
Keywords: language model, agent
Abstract: We characterize the phenomenon of context-dependent affordance computation in vision-language models (VLMs). Through a large-scale computational study (n=3,213 scene-context pairs from COCO-2017) using Qwen-VL 30B and LLaVA-1.5-13B subject to systematic context priming across 7 agentic personas, we demonstrate massive affordance drift: mean Jaccard similarity between context conditions is 0.095 (95% CI: [0.093, 0.096], p < 0.0001), indicating that >90% of lexical scene description is context-dependent. Sentence-level cosine similarity confirms substantial drift at the semantic level (mean = 0.415, 58.5% context-dependent). Stochastic baseline experiments (2,384 inference runs across 4 temperatures and 5 seeds) confirm this drift reflects genuine context effects rather than generation noise: within-prime variance is substantially lower than cross-prime variance across all conditions. Tucker decomposition with bootstrap stability analysis (n=1,000 resamples) reveals stable orthogonal latent factors: a "Culinary Manifold" isolated to chef contexts and an "Access Axis" spanning child-mobility contrasts. These findings establish that VLMs compute affordances in a substantially context-dependent manner -- with the difference between lexical (90%) and semantic (58.5%) measures reflecting that surface vocabulary changes more than underlying meaning under context shifts -- and suggest a direction for robotics research: dynamic, query-dependent ontological projection (JIT Ontology) rather than static world modeling. We do not claim to establish processing order or architectural primacy; such claims require internal representational analysis beyond output behavior.
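The lexical drift figure in the abstract above follows from Jaccard similarity: a mean similarity of 0.095 leaves 1 - 0.095 = 0.905 of the vocabulary unshared, which is the ">90% context-dependent" claim. A minimal sketch of the measure, assuming whitespace tokenization and made-up descriptions (the paper's exact tokenization is not stated in the abstract):

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens (an assumed, simplistic tokenizer)."""
    return set(text.lower().split())


# Hypothetical affordance words for the same scene under two personas.
chef_view = tokens("cut chop knife stove")
child_view = tokens("reach climb stove toy")
similarity = jaccard(chef_view, child_view)  # 1 shared word out of 7 distinct words
context_dependent_share = 1 - 0.095  # arithmetic behind the abstract's ">90%" statement
```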

Title: Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?

Authors: Grace Chang Yuan, Xiaoman Zhang, Sung Eun Kim, Pranav Rajpurkar
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2603.04421
Pdf URL: https://arxiv.org/pdf/2603.04421
Copy Paste: [[2603.04421]] Do Mixed-Vendor Multi-Agent LLMs Improve Clinical Diagnosis?(https://arxiv.org/abs/2603.04421)
Keywords: language model, llm, agent
Abstract: Multi-agent large language model (LLM) systems have emerged as a promising approach for clinical diagnosis, leveraging collaboration among agents to refine medical reasoning. However, most existing frameworks rely on single-vendor teams (e.g., multiple agents from the same model family), which risk correlated failure modes that reinforce shared biases rather than correcting them. We investigate the impact of vendor diversity by comparing Single-LLM, Single-Vendor, and Mixed-Vendor Multi-Agent Conversation (MAC) frameworks. Using three doctor agents instantiated with o4-mini, Gemini-2.5-Pro, and Claude-4.5-Sonnet, we evaluate performance on RareBench and DiagnosisArena. Mixed-vendor configurations consistently outperform single-vendor counterparts, achieving state-of-the-art recall and accuracy. Overlap analysis reveals the underlying mechanism: mixed-vendor teams pool complementary inductive biases, surfacing correct diagnoses that individual models or homogeneous teams collectively miss. These results highlight vendor diversity as a key design principle for robust clinical diagnostic systems.

Title: Generating Realistic, Protocol-Compliant Maritime Radio Dialogues using Self-Instruct and Low-Rank Adaptation

Authors: Gürsel Akdeniz, Emin Cagatay Nakilcioglu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04423
Pdf URL: https://arxiv.org/pdf/2603.04423
Copy Paste: [[2603.04423]] Generating Realistic, Protocol-Compliant Maritime Radio Dialogues using Self-Instruct and Low-Rank Adaptation(https://arxiv.org/abs/2603.04423)
Keywords: hallucination
Abstract: VHF radio miscommunication remains a major safety risk in maritime operations, with human factors accounting for over 58% of recorded incidents in Europe between 2014 and 2023. Despite decades of operational use, VHF radio communications are still prone to noise, interference, linguistic variability, and the absence of real-time transcription, making procedural errors both frequent and difficult to correct. Developing AI-assisted systems to support real-time communication and decision-making requires a considerable amount of high-quality maritime data, yet operational, regulatory, and privacy constraints render such datasets scarce. This study introduces a compliance-aware Self-Instruct methodology for generating realistic maritime radio dialogues that conform to the IMO's Standard Marine Communication Phrases (SMCP). Our approach integrates a 26-filter verification pipeline directly into the iterative generation loop to enforce entity information accuracy, hallucination detection, SMCP-compliance, logical consistency, and linguistic diversity. We employ LoRA for parameter-efficient fine-tuning, reducing computational overhead during training and enabling efficient deployment of the resulting models on resource-constrained maritime systems. To assess dataset quality, we introduce a novel evaluation framework combining automated and expert assessments: Format Accuracy, Information Accuracy, Uniqueness, and Logical Coherence. Experiments using publicly available vessel, coastal and AIS datasets demonstrate that the approach produces synthetically diverse, procedurally compliant, and operationally realistic dialogues. Although downstream applications such as automatic speech recognition and natural language processing are reserved for future work, the released code, datasets, and verification tools provide a reproducible foundation for artificial intelligence-assisted maritime safety and other safety-critical domains.
摘要：VHF 无线电通讯错误仍然是海上作业的一个主要安全风险，2014 年至 2023 年间，欧洲记录的事件中人为因素占 58% 以上。尽管使用了数十年，VHF 无线电通信仍然容易受到噪音、干扰、语言变异和缺乏实时转录的影响，导致程序错误频繁发生且难以纠正。开发人工智能辅助系统来支持实时通信和决策需要大量高质量的海事数据，但操作、监管和隐私限制导致此类数据集稀缺。本研究介绍了一种合规意识自我指导方法，用于生成符合 IMO 的 SMCP 的真实海事无线电对话。我们的方法将 26 过滤器验证管道直接集成到迭代生成循环中，以增强实体信息准确性、幻觉检测、SMCP 合规性、逻辑一致性和语言多样性。我们采用 LORA 进行参数高效的微调，减少训练期间的计算开销，并能够在资源有限的海事系统上有效部署生成的模型。为了评估数据集质量，我们引入了一种新颖的评估框架，结合了自动化和专家评估：格式准确性、信息准确性、唯一性和逻辑一致性。使用公开的船舶、沿海和 AIS 数据集进行的实验表明，该方法可以产生综合多样化、程序合规且操作现实的对话。尽管自动语音识别和自然语言处理等下游应用保留给未来的工作，但发布的代码、数据集和验证工具为人工智能辅助的海事安全和其他安全关键领域提供了可重复的基础。

Title: What Is Missing: Interpretable Ratings for Large Language Model Outputs

Authors: Nicholas Stranges, Yimin Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04429
Pdf URL: https://arxiv.org/pdf/2603.04429
Copy Paste: [[2603.04429]] What Is Missing: Interpretable Ratings for Large Language Model Outputs(https://arxiv.org/abs/2603.04429)
Keywords: language model, llm
Abstract: Current Large Language Model (LLM) preference learning methods such as Proximal Policy Optimization and Direct Preference Optimization learn from direct rankings or numerical ratings of model outputs. These rankings are subjective, and a single numerical rating chosen directly by a judge is a poor proxy for the quality of natural language. We introduce the What Is Missing (WIM) rating system to produce rankings from natural-language feedback. WIM integrates into existing training pipelines, can be combined with other rating techniques, and can be used as input to any preference learning method without changing the learning algorithm. To compute a WIM rating, a human or LLM judge writes feedback describing what the model output is missing. We embed the output and the feedback with a sentence embedding model and compute the cosine similarity between the resulting vectors. We empirically observe that, compared to discrete numerical ratings, WIM yields fewer ties and larger rating deltas, which improves the availability of a learning signal in pairwise preference data. We use "interpretable" in the following limited sense: for each scalar rating, we can inspect the judge's missing-information text that produced it, enabling qualitative debugging of the preference labels.
摘要：当前的大语言模型（LLM）偏好学习方法，例如近端策略优化和直接偏好优化，是从模型输出的直接排名或数字评级中学习的，这些排名是主观的，由法官直接选择的单个数字评级不能很好地代表自然语言的质量，我们引入了What Is Missing（WIM）评级系统来根据自然语言反馈生成排名，WIM集成到现有的训练管道中，可以与其他评级技术相结合，并且可以在不改变的情况下用作任何偏好学习方法的输入学习算法，为了计算 WIM 评分，人类或法学硕士法官编写反馈来描述模型输出缺失的内容，我们用句子嵌入模型嵌入输出和反馈，并计算结果向量之间的余弦相似度，我们凭经验观察到，与离散数值评分相比，WIM 产生更少的联系和更大的评分增量，这提高了成对偏好数据中学习信号的可用性，我们在以下有限意义上使用可解释性：对于每个标量评分，我们可以检查法官生成的缺失信息文本，可以对偏好标签进行定性调试。
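
As a rough illustration of the similarity step described in the WIM abstract above, the following Python sketch embeds a model output and a judge's missing-information feedback and computes the cosine similarity between the two vectors. The `toy_embed` bag-of-words function is a hypothetical stand-in for a real sentence embedding model, and `wim_similarity` is an illustrative name, not the authors' API. The abstract does not state how the similarity is mapped to the final rating, so the sketch stops at the similarity itself.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for a sentence embedding model: word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty text has no direction; treat as no similarity
    return dot / (norm_a * norm_b)


def wim_similarity(output: str, feedback: str) -> float:
    """Similarity between a model output and the judge's 'what is missing' text."""
    return cosine_similarity(toy_embed(output), toy_embed(feedback))


if __name__ == "__main__":
    output = "the capital of france is paris"
    feedback = "the answer is missing the population of paris"
    print(wim_similarity(output, feedback))
```

Because the score is continuous, two outputs rarely receive exactly the same value, which is consistent with the abstract's observation that WIM yields fewer ties than discrete ratings.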

Title: A unified foundational framework for knowledge injection and evaluation of Large Language Models in Combustion Science

Authors: Zonglin Yang, Runze Mao, Tianhao Wu, Han Li, QingGuo Zhou, Zhi X. Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04452
Pdf URL: https://arxiv.org/pdf/2603.04452
Copy Paste: [[2603.04452]] A unified foundational framework for knowledge injection and evaluation of Large Language Models in Combustion Science(https://arxiv.org/abs/2603.04452)
Keywords: language model, llm, retrieval-augmented generation
Abstract: To advance foundation Large Language Models (LLMs) for combustion science, this study presents the first end-to-end framework for developing domain-specialized models for the combustion community. The framework comprises an AI-ready multimodal knowledge base at the 3.5 billion-token scale, extracted from over 200,000 peer-reviewed articles, 8,000 theses and dissertations, and approximately 400,000 lines of combustion CFD code; a rigorous and largely automated evaluation benchmark (CombustionQA, 436 questions across eight subfields); and a three-stage knowledge-injection pathway that progresses from lightweight retrieval-augmented generation (RAG) to knowledge-graph-enhanced retrieval and continued pretraining. We first quantitatively validate Stage 1 (naive RAG) and find a hard ceiling: standard RAG accuracy peaks at 60%, far surpassing zero-shot performance (23%) yet well below the theoretical upper bound (87%). We further demonstrate that this stage's performance is severely constrained by context contamination. Consequently, building a domain foundation model requires structured knowledge graphs and continued pretraining (Stages 2 and 3).
摘要：为了推进燃烧科学的基础大型语言模型 (LLM)，本研究提出了第一个用于为燃烧界开发领域专用模型的端到端框架。该框架包括一个 35 亿代币规模的人工智能多模态知识库，从超过 200,000 篇同行评审文章、8,000 篇论文和约 400,000 行燃烧 CFD 代码中提取； a rigorous and largely automated evaluation benchmark (CombustionQA, 436 questions across eight subfields);以及一个三阶段的知识注入路径，从轻量级检索增强生成（RAG）发展到知识图增强检索和持续预训练。我们首先定量验证第 1 阶段（朴素 RAG）并找到硬性上限：标准 RAG 准确度峰值为 60%，远远超过零样本性能 (23%)，但远低于理论上限 (87%)。我们进一步证明该阶段的性能受到上下文污染的严重限制。因此，构建领域基础模型需要结构化知识图和持续的预训练（第 2 阶段和第 3 阶段）。

Title: Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models

Authors: Wai Tuck Wong, Jun Sun, Arunesh Sinha
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.04453
Pdf URL: https://arxiv.org/pdf/2603.04453
Copy Paste: [[2603.04453]] Induced Numerical Instability: Hidden Costs in Multimodal Large Language Models(https://arxiv.org/abs/2603.04453)
Keywords: language model
Abstract: The use of multimodal large language models has become widespread, and as such the study of these models and their failure points has become of utmost importance. We study a novel mode of failure that causes degradation in performance indirectly by optimizing a loss term that seeks to maximize numerical instability in the inference stage of these models. We apply this loss term as the optimization target to construct images that, when used on multimodal large language models, cause significant degradation in the output. We validate our hypothesis on state-of-the-art large vision-language models (LLaVa-v1.5-7B, Idefics3-8B, SmolVLM-2B-Instruct) against standard datasets (Flickr30k, MMVet, TextVQA, VQAv2, POPE, COCO) and show that performance degrades significantly, even with a very small change to the input image, compared to baselines. Our results uncover a fundamentally different vector of performance degradation, highlighting a failure mode not captured by adversarial perturbations.
摘要：多模态大语言模型的使用已经变得广泛，因此对这些模型及其故障点的研究变得至关重要。我们研究了一种新的故障模式，它通过优化损失项来间接导致性能下降，该损失项旨在最大限度地提高这些模型推理阶段的数值不稳定性。我们应用这个损失项作为优化目标来构建图像，当在多模态大型语言模型上使用时，会导致输出显着下降。我们根据标准数据集（Flickr30k、MMVet、TextVQA、VQAv2、POPE、COCO）验证了我们对最先进的大型视觉语言模型（LLaVa-v1.5-7B、Idefics3-8B、SmolVLM-2B-Instruct）的假设，并表明，与基线相比，即使输入图像发生非常小的变化，性能也会显着下降。我们的结果揭示了一个根本不同的性能下降向量，突出了对抗性扰动未捕获的故障模式。

Title: Query Disambiguation via Answer-Free Context: Doubling Performance on Humanity's Last Exam

Authors: Michael Majurski, Cynthia Matuszek
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04454
Pdf URL: https://arxiv.org/pdf/2603.04454
Copy Paste: [[2603.04454]] Query Disambiguation via Answer-Free Context: Doubling Performance on Humanity's Last Exam(https://arxiv.org/abs/2603.04454)
Keywords: language model, gpt, prompt
Abstract: How carefully and unambiguously a question is phrased has a profound impact on the quality of the response, for Language Models (LMs) as well as people. While model capabilities continue to advance, the interplay between grounding context and query formulation remains under-explored. This work investigates how the quality of background grounding information in a model's context window affects accuracy. We find that combining well-grounded dynamic context construction (i.e., RAG) with query rewriting reduces question ambiguity, resulting in significant accuracy gains. Given a user question with associated answer-free grounding context, rewriting the question to reduce ambiguity produces benchmark improvements without changing the answer itself, even compared to prepending that context before the question. Using gpt-oss-20b to rewrite a subset of Humanity's Last Exam using answer-free grounding context improves gpt-5-mini accuracy from 0.14 to 0.37. We demonstrate that this accuracy improvement cannot be fully recovered just through prompting at inference time; rather, distinct rewriting and answering phases are required. Code and data are available at this https URL
摘要：对于语言模型 (LM) 以及人类来说，问题的措辞是否仔细和明确会对回答的质量产生深远的影响。尽管模型功能不断进步，但基础上下文和查询公式之间的相互作用仍未得到充分探索。这项工作研究了模型上下文窗口中的背景基础信息的质量如何影响准确性。我们发现，将基础良好的动态上下文构建（即 RAG）与查询重写相结合可以减少问题的歧义，从而显着提高准确性。给定一个具有关联的无答案基础上下文的用户问题，重写问题以减少歧义可以在不改变答案本身的情况下产生基准改进，即使与在问题之前添加该上下文相比也是如此。使用 gpt-oss-20b 使用无答案基础上下文重写人类最后考试的子集，将 gpt-5-mini 准确性从 0.14 提高到 0.37。我们证明，仅通过推理时的提示无法完全恢复这种准确性的提高；相反，需要不同的重写和回答阶段。代码和数据可在此 https URL 获取

Title: From Static Inference to Dynamic Interaction: Navigating the Landscape of Streaming Large Language Models

Authors: Junlong Tong, Zilong Wang, YuJie Ren, Peiran Yin, Hao Wu, Wei Zhang, Xiaoyu Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04592
Pdf URL: https://arxiv.org/pdf/2603.04592
Copy Paste: [[2603.04592]] From Static Inference to Dynamic Interaction: Navigating the Landscape of Streaming Large Language Models(https://arxiv.org/abs/2603.04592)
Keywords: language model, llm
Abstract: Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at this https URL.
摘要：标准大型语言模型 (LLM) 主要设计用于使用预定义输入进行静态推理，这限制了它们在动态实时场景中的适用性。为了解决这一差距，流媒体法学硕士范式应运而生。然而，流式LLM的现有定义仍然支离破碎，将流式生成、流式输入和交互式流式架构混为一谈，同时仍然缺乏系统的分类法。本文对流媒体法学硕士进行了全面的概述和分析。首先，我们基于数据流和动态交互建立了流式LLM的统一定义，以澄清现有的歧义。在此定义的基础上，我们提出了当前流媒体法学硕士的系统分类，并对其基本方法进行了深入讨论。此外，我们探索了流媒体法学硕士在现实场景中的应用，并概述了有前途的研究方向，以支持流媒体智能的不断进步。我们在此 https URL 维护一个不断更新的相关论文存储库。

Title: Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning

Authors: Lei Huang, Xiang Cheng, Chenxiao Zhao, Guobin Shen, Junjie Yang, Xiaocheng Feng, Yuxuan Gu, Xing Yu, Bing Qin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04597
Pdf URL: https://arxiv.org/pdf/2603.04597
Copy Paste: [[2603.04597]] Bootstrapping Exploration with Group-Level Natural Language Feedback in Reinforcement Learning(https://arxiv.org/abs/2603.04597)
Keywords: language model, llm
Abstract: Large language models (LLMs) typically receive diverse natural language (NL) feedback through interaction with the environment. However, current reinforcement learning (RL) algorithms rely solely on scalar rewards, leaving the rich information in NL feedback underutilized and leading to inefficient exploration. In this work, we propose GOLF, an RL framework that explicitly exploits group-level language feedback to guide targeted exploration through actionable refinements. GOLF aggregates two complementary feedback sources: (i) external critiques that pinpoint errors or propose targeted fixes, and (ii) intra-group attempts that supply alternative partial ideas and diverse failure patterns. These group-level feedbacks are aggregated to produce high-quality refinements, which are adaptively injected into training as off-policy scaffolds to provide targeted guidance in sparse-reward regions. Meanwhile, GOLF jointly optimizes generation and refinement within a unified RL loop, creating a virtuous cycle that continuously improves both capabilities. Experiments on both verifiable and non-verifiable benchmarks show that GOLF achieves superior performance and exploration efficiency, achieving 2.2× improvements in sample efficiency compared to RL methods trained solely on scalar rewards. Code is available at this https URL.
摘要：大型语言模型 (LLM) 通常通过与环境交互来接收不同的自然语言 (NL) 反馈。然而，当前的强化学习（RL）算法仅依赖于标量奖励，使得 NL 反馈中的丰富信息未得到充分利用，导致探索效率低下。在这项工作中，我们提出了 GOLF，这是一个 RL 框架，它明确利用群体级语言反馈，通过可操作的改进来指导有针对性的探索。 GOLF 聚合了两个互补的反馈源：(i) 查明错误或提出有针对性的修复方案的外部批评，以及 (ii) 提供替代的部分想法和不同失败模式的内部尝试。这些群体层面的反馈被汇总起来以产生高质量的改进，这些改进被自适应地注入到训练中作为离策略支架，以在稀疏奖励区域提供有针对性的指导。同时，GOLF 在统一的 RL 循环中联合优化生成和细化，形成不断提高两种能力的良性循环。在可验证和不可验证基准上的实验表明，GOLF 实现了卓越的性能和探索效率，与仅基于标量奖励训练的 RL 方法相比，样本效率提高了 2.2 倍。代码可从此 https URL 获取。

Title: Coordinated Semantic Alignment and Evidence Constraints for Retrieval-Augmented Generation with Large Language Models

Authors: Xin Chen, Saili Uday Gadgil, Jiarong Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04647
Pdf URL: https://arxiv.org/pdf/2603.04647
Copy Paste: [[2603.04647]] Coordinated Semantic Alignment and Evidence Constraints for Retrieval-Augmented Generation with Large Language Models(https://arxiv.org/abs/2603.04647)
Keywords: language model, retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval augmented generation mitigates limitations of large language models in factual consistency and knowledge updating by introducing external knowledge. However, practical applications still suffer from semantic misalignment between retrieved results and generation objectives, as well as insufficient evidence utilization. To address these challenges, this paper proposes a retrieval augmented generation method that integrates semantic alignment with evidence constraints through coordinated modeling of retrieval and generation stages. The method first represents the relevance between queries and candidate evidence within a unified semantic space. This ensures that retrieved results remain semantically consistent with generation goals and reduces interference from noisy evidence and semantic drift. On this basis, an explicit evidence constraint mechanism is introduced. Retrieved evidence is transformed from an implicit context into a core control factor in generation. This restricts the expression scope of generated content and strengthens dependence on evidence. By jointly modeling semantic consistency and evidence constraints within a unified framework, the proposed approach improves factual reliability and verifiability while preserving natural language fluency. Comparative results show stable improvements across multiple generation quality metrics. This confirms the effectiveness and necessity of coordinated semantic alignment and evidence constraint modeling in retrieval augmented generation tasks.
摘要：检索增强生成通过引入外部知识来减轻大型语言模型在事实一致性和知识更新方面的局限性。然而，实际应用仍然存在检索结果与生成目标之间语义不一致以及证据利用不足的问题。为了应对这些挑战，本文提出了一种检索增强生成方法，通过检索和生成阶段的协调建模将语义对齐与证据约束相结合。该方法首先表示统一语义空间内查询和候选证据之间的相关性。这确保检索结果在语义上与生成目标保持一致，并减少噪声证据和语义漂移的干扰。在此基础上，引入显式证据约束机制。检索到的证据从隐含的背景转化为生成的核心控制因素。这限制了生成内容的表达范围，强化了对证据的依赖。通过在统一框架内对语义一致性和证据约束进行联合建模，所提出的方法提高了事实可靠性和可验证性，同时保持自然语言的流畅性。比较结果显示多代质量指标的稳定改进。这证实了检索增强生成任务中协调语义对齐和证据约束建模的有效性和必要性。

Title: iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics

Authors: Preetam Prabhu Srikar Dammu, Arnav Palkhiwala, Tanya Roosta, Chirag Shah
Subjects: cs.CL, cs.IR, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2603.04656
Pdf URL: https://arxiv.org/pdf/2603.04656
Copy Paste: [[2603.04656]] iAgentBench: Benchmarking Sensemaking Capabilities of Information-Seeking Agents on High-Traffic Topics(https://arxiv.org/abs/2603.04656)
Keywords: llm, agent
Abstract: With the emergence of search-enabled generative QA systems, users are increasingly turning to tools that browse, aggregate, and reconcile evidence across multiple sources on their behalf. Yet many widely used QA benchmarks remain answerable by retrieving a single relevant passage, making them poorly suited for measuring cross-source sensemaking, such as integrating evidence, tracking causal links, and resolving dependencies across facets of a topic. We present iAgentBench, a dynamic ODQA benchmark that targets these higher-level information needs while keeping questions natural and grounded in realistic information-seeking behavior. iAgentBench draws seed topics from real-world attention signals and uses common user intent patterns to construct user-like questions whose answers require combining evidence from multiple sources, not just extracting a single snippet. Each instance is released with traceable evidence and auditable intermediate artifacts that support contamination checks and enable fine-grained diagnosis of failures in retrieval versus synthesis. Experiments across multiple LLMs show that retrieval improves accuracy, but retrieval alone does not reliably resolve these questions, underscoring the need to evaluate evidence use, not just evidence access.
摘要：随着支持搜索的生成问答系统的出现，用户越来越多地转向代表他们浏览、聚合和协调多个来源的证据的工具。然而，许多广泛使用的 QA 基准仍然通过检索单个相关段落来进行回答，这使得它们不太适合衡量跨源意义建构，例如整合证据、跟踪因果关系以及解决主题各个方面的依赖关系。我们推出了 iAgentBench，这是一个动态 ODQA 基准，它针对这些更高级别的信息需求，同时保持问题自然并基于现实的信息寻求行为。 iAgentBench 从现实世界的注意力信号中提取种子主题，并使用常见的用户意图模式来构建类似用户的问题，其答案需要结合多个来源的证据，而不仅仅是提取单个片段。每个实例都会发布可追溯的证据和可审计的中间工件，这些工件支持污染检查并能够对检索与合成中的失败进行细粒度诊断。多个法学硕士的实验表明，检索可以提高准确性，但仅靠检索并不能可靠地解决这些问题，这强调了评估证据使用而不仅仅是证据获取的必要性。

Title: Stan: An LLM-based thermodynamics course assistant

Authors: Eric M. Furst, Vasudevan Venkateshwaran
Subjects: cs.CL, cs.CY, physics.ed-ph
Abstract URL: https://arxiv.org/abs/2603.04657
Pdf URL: https://arxiv.org/pdf/2603.04657
Copy Paste: [[2603.04657]] Stan: An LLM-based thermodynamics course assistant(https://arxiv.org/abs/2603.04657)
Keywords: llm, chat, retrieval-augmented generation
Abstract: Discussions of AI in education focus predominantly on student-facing tools -- chatbots, tutors, and problem generators -- while the potential for the same infrastructure to support instructors remains largely unexplored. We describe Stan, a suite of tools for an undergraduate chemical engineering thermodynamics course built on a data pipeline that we develop and deploy in dual roles: serving students and supporting instructors from a shared foundation of lecture transcripts and a structured textbook index. On the student side, a retrieval-augmented generation (RAG) pipeline answers natural-language queries by extracting technical terms, matching them against the textbook index, and synthesizing grounded responses with specific chapter and page references. On the instructor side, the same transcript corpus is processed through structured analysis pipelines that produce per-lecture summaries, identify student questions and moments of confusion, and catalog the anecdotes and analogies used to motivate difficult material -- providing a searchable, semester-scale record of teaching that supports course reflection, reminders, and improvement. All components, including speech-to-text transcription, structured content extraction, and interactive query answering, run entirely on locally controlled hardware using open-weight models (Whisper large-v3, Llama 3.1 8B) with no dependence on cloud APIs, ensuring predictable costs, full data privacy, and reproducibility independent of third-party services. We describe the design, implementation, and practical failure modes encountered when deploying 7-8 billion parameter models for structured extraction over long lecture transcripts, including context truncation, bimodal output distributions, and schema drift, along with the mitigations that resolved them.
摘要：关于教育领域人工智能的讨论主要集中在面向学生的工具——聊天机器人、导师和问题生成器——而同样的基础设施支持教师的潜力在很大程度上仍未得到开发。我们描述了 Stan，这是一套用于本科生化学工程热力学课程的工具，它建立在我们开发和部署的数据管道上，具有双重作用：为学生提供服务，并通过共享的讲座成绩单和结构化教科书索引为教师提供支持。在学生方面，检索增强生成（RAG）管道通过提取技术术语、将其与教科书索引进行匹配以及根据特定章节和页面参考来综合接地响应来回答自然语言查询。在教师方面，相同的成绩单语料库通过结构化分析管道进行处理，生成每堂课的摘要，识别学生的问题和困惑时刻，并对用于激发困难材料的轶事和类比进行分类，提供可搜索的、学期规模的教学记录，支持课程反思、提醒和改进。所有组件（包括语音到文本转录、结构化内容提取和交互式查询应答）完全在使用开放权重模型（Whisper large-v3、Llama 3.1 8B）的本地控制硬件上运行，不依赖于云 API，从而确保可预测的成本、完整的数据隐私和独立于第三方服务的可重复性。我们描述了在部署 70 至 80 亿参数的模型以对长讲稿进行结构化提取时遇到的设计、实现和实际故障模式，包括上下文截断、双峰输出分布和模式漂移，以及解决这些问题的缓解措施。

Title: Optimizing Language Models for Crosslingual Knowledge Consistency

Authors: Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández, Arianna Bisazza
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04678
Pdf URL: https://arxiv.org/pdf/2603.04678
Copy Paste: [[2603.04678]] Optimizing Language Models for Crosslingual Knowledge Consistency(https://arxiv.org/abs/2603.04678)
Keywords: language model, llm
Abstract: Large language models are known to often exhibit inconsistent knowledge. This is particularly problematic in multilingual scenarios, where models are likely to be asked similar questions in different languages, and inconsistent responses can undermine their reliability. In this work, we show that this issue can be mitigated using reinforcement learning with a structured reward function, which leads to an optimal policy with consistent crosslingual responses. We introduce Direct Consistency Optimization (DCO), a DPO-inspired method that requires no explicit reward model and is derived directly from the LLM itself. Comprehensive experiments show that DCO significantly improves crosslingual consistency across diverse LLMs and outperforms existing methods when training with samples of multiple languages, while complementing DPO when gold labels are available. Extra experiments demonstrate the effectiveness of DCO in bilingual settings, significant out-of-domain generalizability, and controllable alignment via direction hyperparameters. Taken together, these results establish DCO as a robust and efficient solution for improving knowledge consistency across languages in multilingual LLMs. All code, training scripts, and evaluation benchmarks are released at this https URL.
摘要：众所周知，大型语言模型经常表现出不一致的知识。这在多语言场景中尤其成问题，模型可能会被用不同的语言询问类似的问题，而不一致的响应可能会破坏其可靠性。在这项工作中，我们表明可以使用具有结构化奖励函数的强化学习来缓解这个问题，从而产生具有一致的跨语言响应的最佳策略。我们引入直接一致性优化（DCO），这是一种受 DPO 启发的方法，不需要明确的奖励模型，并且直接源自 LLM 本身。综合实验表明，DCO 显着提高了不同法学硕士的跨语言一致性，并且在使用多种语言样本进行训练时优于现有方法，同时在黄金标签可用时补充了 DPO。额外的实验证明了 DCO 在双语环境中的有效性、显着的域外泛化性以及通过方向超参数的可控对齐。总而言之，这些结果使 DCO 成为一种强大而高效的解决方案，可提高多语言法学硕士中跨语言知识的一致性。所有代码、训练脚本和评估基准均在此 https URL 发布。

Title: Hate Speech Detection using Large Language Models with Data Augmentation and Feature Enhancement

Authors: Brian Jing Hong Nge, Stefan Su, Thanh Thi Nguyen, Campbell Wilson, Alexandra Phelan, Naomi Pfitzner
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04698
Pdf URL: https://arxiv.org/pdf/2603.04698
Copy Paste: [[2603.04698]] Hate Speech Detection using Large Language Models with Data Augmentation and Feature Enhancement(https://arxiv.org/abs/2603.04698)
Keywords: language model, gpt
Abstract: This paper evaluates data augmentation and feature enhancement techniques for hate speech detection, comparing traditional classifiers, e.g., Delta Term Frequency-Inverse Document Frequency (Delta TF-IDF), with transformer-based models (DistilBERT, RoBERTa, DeBERTa, Gemma-7B, gpt-oss-20b) across diverse datasets. It examines the impact of Synthetic Minority Over-sampling Technique (SMOTE), weighted loss determined by inverse class proportions, Part-of-Speech (POS) tagging, and text data augmentation on model performance. The open-source gpt-oss-20b consistently achieves the highest results. On the other hand, Delta TF-IDF responds strongly to data augmentation, reaching 98.2% accuracy on the Stormfront dataset. The study confirms that implicit hate speech is more difficult to detect than explicit hateful content and that enhancement effectiveness depends on dataset, model, and technique interaction. Our research informs the development of hate speech detection by highlighting how dataset properties, model architectures, and enhancement strategies interact, supporting more accurate and context-aware automated detection.
摘要：本文评估了用于仇恨语音检测的数据增强和特征增强技术，将传统分类器（例如 Delta TermFrequency-Inverse DocumentFrequency (Delta TF-IDF)）与基于 Transformer 的模型（DistilBERT、RoBERTa、DeBERTa、Gemma-7B、gpt-oss-20b）在不同数据集上进行比较。它研究了合成少数过采样技术 (SMOTE)、由逆类比例确定的加权损失、词性 (POS) 标记和文本数据增强对模型性能的影响。开源 gpt-oss-20b 始终获得最高结果。另一方面，Delta TF-IDF 对数据增强反应强烈，在 Stormfront 数据集上达到 98.2% 的准确率。该研究证实，隐性仇恨言论比显性仇恨内容更难检测，并且增强效果取决于数据集、模型和技术交互。我们的研究通过强调数据集属性、模型架构和增强策略如何相互作用，支持更准确和上下文感知的自动检测，为仇恨语音检测的发展提供信息。
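
The hate speech abstract above mentions a weighted loss determined by inverse class proportions. A minimal Python sketch of that idea follows, assuming the common normalization `weight = N / (K * n_c)` for `N` samples, `K` classes, and `n_c` samples in class `c`; the abstract does not give the exact formula, so this normalization, the function names, and the toy labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def inverse_proportion_weights(labels):
    """Weight each class by N / (K * n_c), so rarer classes weigh more (assumed formula)."""
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return {cls: total / (num_classes * n) for cls, n in counts.items()}


def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class); probs is a list of {class: probability}."""
    weight_sum = sum(weights[y] for y in labels)
    loss_sum = -sum(weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return loss_sum / weight_sum


if __name__ == "__main__":
    # Hypothetical imbalanced dataset: 2 hateful posts, 8 benign posts.
    labels = ["hate"] * 2 + ["ok"] * 8
    weights = inverse_proportion_weights(labels)
    print(weights)  # {'hate': 2.5, 'ok': 0.625}
```

With these weights each class contributes the same total weight to the loss (2 × 2.5 = 8 × 0.625 = 5), so errors on the minority class are not drowned out by the majority class.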

Title: Detection of Illicit Content on Online Marketplaces using Large Language Models

Authors: Quoc Khoa Tran, Thanh Thi Nguyen, Campbell Wilson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04707
Pdf URL: https://arxiv.org/pdf/2603.04707
Copy Paste: [[2603.04707]] Detection of Illicit Content on Online Marketplaces using Large Language Models(https://arxiv.org/abs/2603.04707)
Keywords: language model, llm
Abstract: Online marketplaces, while revolutionizing global commerce, have inadvertently facilitated the proliferation of illicit activities, including drug trafficking, counterfeit sales, and cybercrimes. Traditional content moderation methods such as manual reviews and rule-based automated systems struggle with scalability, dynamic obfuscation techniques, and multilingual content. Conventional machine learning models, though effective in simpler contexts, often falter when confronting the semantic complexities and linguistic nuances characteristic of illicit marketplace communications. This research investigates the efficacy of Large Language Models (LLMs), specifically Meta's Llama 3.2 and Google's Gemma 3, in detecting and classifying illicit online marketplace content using the multilingual DUTA10K dataset. Employing fine-tuning techniques such as Parameter-Efficient Fine-Tuning (PEFT) and quantization, these models were systematically benchmarked against a foundational transformer-based model (BERT) and traditional machine learning baselines (Support Vector Machines and Naive Bayes). Experimental results reveal a task-dependent advantage for LLMs. In binary classification (illicit vs. non-illicit), Llama 3.2 demonstrated performance comparable to traditional methods. However, for complex, imbalanced multi-class classification involving 40 specific illicit categories, Llama 3.2 significantly surpassed all baseline models. These findings offer substantial practical implications for enhancing online safety, equipping law enforcement agencies, e-commerce platforms, and cybersecurity specialists with more effective, scalable, and adaptive tools for illicit content detection and moderation.
摘要：在线市场在彻底改变全球商业的同时，也在无意中助长了非法活动的扩散，包括贩毒、假冒产品销售和网络犯罪。传统的内容审核方法（例如手动审核和基于规则的自动化系统）与可扩展性、动态混淆技术和多语言内容作斗争。传统的机器学习模型虽然在较简单的环境中有效，但在面对非法市场通信的语义复杂性和语言细微差别时往往会出现问题。这项研究调查了大型语言模型 (LLM)（特别是 Meta 的 Llama 3.2 和 Google 的 Gemma 3）在使用多语言 DUTA10K 数据集检测和分类非法在线市场内容方面的功效。这些模型采用参数高效微调 (PEFT) 和量化等微调技术，系统地针对基于变压器的基础模型 (BERT) 和传统机器学习基线（支持向量机和朴素贝叶斯）进行基准测试。实验结果揭示了法学硕士的任务依赖优势。在二元分类（非法与非非法）中，Llama 3.2 表现出与传统方法相当的性能。然而，对于涉及 40 个特定非法类别的复杂、不平衡的多类分类，Llama 3.2 显着超过了所有基线模型。这些发现对于增强在线安全、为执法机构、电子商务平台和网络安全专家提供更有效、可扩展和适应性强的非法内容检测和审核工具具有重大的实际意义。

Title: AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments

Authors: Kylie Zhang, Nimra Nadeem, Lucia Zheng, Dominik Stammbach, Peter Henderson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04718
Pdf URL: https://arxiv.org/pdf/2603.04718
Copy Paste: [[2603.04718]] AI-Assisted Moot Courts: Simulating Justice-Specific Questioning in Oral Arguments(https://arxiv.org/abs/2603.04718)
Keywords: prompt, agent
Abstract: In oral arguments, judges probe attorneys with questions about the factual record, legal claims, and the strength of their arguments. To prepare for this questioning, both law schools and practicing attorneys rely on moot courts: practice simulations of appellate hearings. Leveraging a dataset of U.S. Supreme Court oral argument transcripts, we examine whether AI models can effectively simulate justice-specific questioning for moot court-style training. Evaluating oral argument simulation is challenging because there is no single correct question for any given turn. Instead, effective questioning should reflect a combination of desirable qualities, such as anticipating substantive legal issues, detecting logical weaknesses, and maintaining an appropriately adversarial tone. We introduce a two-layer evaluation framework that assesses both the realism and pedagogical usefulness of simulated questions using complementary proxy metrics. We construct and evaluate both prompt-based and agentic oral argument simulators. We find that simulated questions are often perceived as realistic by human annotators and achieve high recall of ground truth substantive legal issues. However, models still face substantial shortcomings, including low diversity in question types and sycophancy. Importantly, these shortcomings would remain undetected under naive evaluation approaches.
摘要：在口头辩论中，法官会询问律师有关事实记录、法律主张及其论据强度的问题。为了准备这一质疑，法学院和执业律师都依赖模拟法庭：模拟上诉听证会。利用美国最高法院口头辩论笔录的数据集，我们研究了人工智能模型是否可以有效地模拟模拟法庭式训练的司法特定提问。评估口头辩论模拟具有挑战性，因为对于任何给定的回合都没有单一的正确问题。相反，有效的提问应该反映出一系列理想的品质，例如预测实质性法律问题、发现逻辑弱点以及保持适当的对抗语气。我们引入了一个两层评估框架，该框架使用互补的代理指标来评估模拟问题的现实性和教学实用性。我们构建并评估基于提示和代理的口头辩论模拟器。我们发现模拟问题通常被人类注释者认为是现实的，并且能够高度回忆真实的实质性法律问题。然而，模型仍然面临重大缺陷，包括问题类型多样性低和阿谀奉承等。重要的是，在简单的评估方法下，这些缺陷仍然未被发现。

Title: IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation

Authors: Bosi Wen, Yilin Niu, Cunxiang Wang, Xiaoying Ling, Ying Zhang, Pei Ke, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04738
Pdf URL: https://arxiv.org/pdf/2603.04738
Copy Paste: [[2603.04738]] IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation(https://arxiv.org/abs/2603.04738)
Keywords: language model, llm
Abstract: Instruction-following is a foundational capability of large language models (LLMs), with its improvement hinging on scalable and accurate feedback from judge models. However, the reliability of current judge models in instruction-following remains underexplored due to several deficiencies of existing meta-evaluation benchmarks, such as their insufficient data coverage and oversimplified pairwise evaluation paradigms that misalign with model optimization scenarios. To this end, we propose IF-RewardBench, a comprehensive meta-evaluation benchmark for instruction-following that covers diverse instruction and constraint types. For each instruction, we construct a preference graph containing all pairwise preferences among multiple responses based on instruction-following quality. This design enables a listwise evaluation paradigm that assesses the capabilities of judge models to rank multiple responses, which is essential in guiding model alignment. Extensive experiments on IF-RewardBench reveal significant deficiencies in current judge models and demonstrate that our benchmark achieves a stronger positive correlation with downstream task performance compared to existing benchmarks. Our code and data are available at this https URL.
摘要：指令跟踪是大型语言模型（LLM）的一项基本能力，其改进取决于判断模型的可扩展且准确的反馈。然而，由于现有元评估基准的一些缺陷，例如数据覆盖范围不足以及与模型优化场景不一致的过度简化的成对评估范式，当前指令跟踪判断模型的可靠性仍未得到充分探索。为此，我们提出了 IF-RewardBench，这是一个全面的指令跟踪元评估基准，涵盖了不同的指令和约束类型。对于每条指令，我们构建一个偏好图，其中包含基于指令遵循质量的多个响应之间的所有成对偏好。这种设计支持列表式评估范式，评估判断模型对多个响应进行排名的能力，这对于指导模型对齐至关重要。 IF-RewardBench 上的大量实验揭示了当前判断模型的显着缺陷，并证明与现有基准相比，我们的基准与下游任务绩效实现了更强的正相关性。我们的代码和数据可通过此 https URL 获取。
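
To make the preference-graph idea in the IF-RewardBench abstract above more concrete, here is a small Python sketch. It builds all pairwise preferences among several responses from hypothetical gold quality levels, then measures how often a judge's scores order each pair the same way. The quality levels, the `pairwise_agreement` metric, and all names are assumptions for illustration; the abstract does not describe the benchmark's actual graph construction or scoring metric.

```python
from itertools import combinations


def preference_edges(quality):
    """Return (better, worse) pairs for every pair of responses with different quality."""
    edges = []
    for a, b in combinations(quality, 2):
        if quality[a] > quality[b]:
            edges.append((a, b))
        elif quality[b] > quality[a]:
            edges.append((b, a))
        # Equal quality: no preference edge is added.
    return edges


def pairwise_agreement(edges, judge_scores):
    """Fraction of preference edges that the judge's scores order correctly."""
    if not edges:
        return 0.0
    agree = sum(1 for better, worse in edges
                if judge_scores[better] > judge_scores[worse])
    return agree / len(edges)


if __name__ == "__main__":
    # Hypothetical gold instruction-following quality for three responses.
    quality = {"r1": 3, "r2": 2, "r3": 1}
    # Hypothetical judge scores; the judge wrongly ranks r3 above r2.
    judge_scores = {"r1": 0.9, "r2": 0.2, "r3": 0.5}
    edges = preference_edges(quality)
    print(edges)
    print(pairwise_agreement(edges, judge_scores))
```

Scoring the whole list at once, rather than one isolated pair, is what lets such a check reflect a judge's ability to rank multiple responses.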

Title: Stacked from One: Multi-Scale Self-Injection for Context Window Extension

Authors: Wei Han, Pan Zhou, Shuicheng Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04759
Pdf URL: https://arxiv.org/pdf/2603.04759
Copy Paste: [[2603.04759]] Stacked from One: Multi-Scale Self-Injection for Context Window Extension(https://arxiv.org/abs/2603.04759)
Keywords: language model, llm
Abstract: The limited context window of contemporary large language models (LLMs) remains a primary bottleneck for their broader application across diverse domains. Although continual pre-training on long-context data offers a straightforward solution, it incurs prohibitive data acquisition and computational costs. To address this challenge, we propose SharedLLM, a novel framework based on multi-grained context compression and query-aware information acquisition. SharedLLM comprises two stacked short-context LLMs: a lower model serving as a compressor and an upper model acting as a decoder. The lower model compresses long inputs into compact, multi-grained representations, which are then forwarded to the upper model for context-aware processing. To maximize efficiency, this information transfer occurs exclusively at the lowest layers, bypassing lengthy forward passes and redundant cross-attention operations. This entire process, wherein the upper and lower models are derived from the same underlying LLM layers, is termed self-injection. To support this architecture, a specialized tree-based data structure enables the efficient encoding and query-aware retrieval of contextual information. Despite being trained on sequences of only 8K tokens, SharedLLM effectively generalizes to inputs exceeding 128K tokens. Across a comprehensive suite of long-context modeling and understanding benchmarks, SharedLLM achieves performance superior or comparable to strong baselines, striking an optimal balance between efficiency and accuracy. Furthermore, these design choices allow SharedLLM to substantially reduce the memory footprint and yield notable inference speedups (2× over streaming and 3× over encoder-decoder architectures).
摘要：当代大语言模型（LLM）有限的上下文窗口仍然是其在不同领域更广泛应用的主要瓶颈。尽管对长上下文数据进行持续预训练提供了一种简单的解决方案，但它会产生高昂的数据采集和计算成本。为了应对这一挑战，我们提出了 SharedLLM，一种基于多粒度上下文压缩和查询感知信息获取的新颖框架。 SharedLLM 包含两个堆叠的短上下文 LLM：充当压缩器的下层模型和充当解码器的上层模型。下层模型将长输入压缩为紧凑的多粒度表示，然后将其转发到上层模型进行上下文感知处理。为了最大限度地提高效率，这种信息传输仅发生在最低层，绕过了冗长的前向传递和冗余的交叉注意操作。整个过程（其中上层模型和下层模型源自相同的底层 LLM 层）被称为自注入。为了支持这种架构，专门的基于树的数据结构可以实现上下文信息的高效编码和查询感知检索。尽管只接受了 8K 标记序列的训练，SharedLLM 仍能有效地泛化到超过 128K 标记的输入。在一套全面的长上下文建模和理解基准中，SharedLLM 实现了优于或与强基线相当的性能，在效率和准确性之间取得了最佳平衡。此外，这些设计选择允许 SharedLLM 大幅减少内存占用并显着提高推理速度（比流式传输快 2 倍，比编码器-解码器架构快 3 倍）。

Title: TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings

Authors: Yebo Wu, Feng Liu, Ziwei Xie, Zhiyuan Liu, Changwang Zhang, Jun Wang, Li Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04772
Pdf URL: https://arxiv.org/pdf/2603.04772
Copy Paste: [[2603.04772]] TSEmbed: Unlocking Task Scaling in Universal Multimodal Embeddings(https://arxiv.org/abs/2603.04772)
Keywords: language model, llm
Abstract: Despite the exceptional reasoning capabilities of Multimodal Large Language Models (MLLMs), their adaptation into universal embedding models is significantly impeded by task conflict. To address this, we propose TSEmbed, a universal multimodal embedding framework that synergizes Mixture-of-Experts (MoE) with Low-Rank Adaptation (LoRA) to explicitly disentangle conflicting task objectives. Moreover, we introduce Expert-Aware Negative Sampling (EANS), a novel strategy that leverages expert routing distributions as an intrinsic proxy for semantic similarity. By dynamically prioritizing informative hard negatives that share expert activation patterns with the query, EANS effectively sharpens the model's discriminative power and refines embedding boundaries. To ensure training stability, we further devise a two-stage learning paradigm that solidifies expert specialization before optimizing representations via EANS. TSEmbed achieves state-of-the-art performance on both the Massive Multimodal Embedding Benchmark (MMEB) and real-world industrial production datasets, laying a foundation for task-level scaling in universal multimodal embeddings.
摘要：尽管多模态大语言模型（MLLM）具有出色的推理能力，但任务冲突严重阻碍了它们对通用嵌入模型的适应。为了解决这个问题，我们提出了 TSEmbed，这是一种通用的多模态嵌入框架，它将专家混合 (MoE) 与低秩适应 (LoRA) 相结合，以明确地解决相互冲突的任务目标。此外，我们引入了专家感知负采样（EANS），这是一种利用专家路由分布作为语义相似性的内在代理的新颖策略。通过动态地对与查询共享专家激活模式的信息性硬否定进行优先级排序，EANS 有效地增强了模型的判别能力并细化了嵌入边界。为了确保训练稳定性，我们进一步设计了一个两阶段学习范式，在通过 EANS 优化表示之前巩固专家专业化。 TSEmbed 在大规模多模态嵌入基准 (MMEB) 和现实世界工业生产数据集上实现了最先进的性能，为通用多模态嵌入中的任务级扩展奠定了基础。

Title: Attention's Gravitational Field: A Power-Law Interpretation of Positional Correlation

Authors: Edward Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04805
Pdf URL: https://arxiv.org/pdf/2603.04805
Copy Paste: [[2603.04805]] Attention's Gravitational Field: A Power-Law Interpretation of Positional Correlation(https://arxiv.org/abs/2603.04805)
Keywords: language model, llm
Abstract: This paper explores the underlying principles of positional relationships and encodings within Large Language Models (LLMs) and introduces the concept of the Attention Gravitational Field (AGF). By decoupling positional encodings from semantic embeddings, we optimize the model architecture and achieve superior accuracy compared to prevailing encoding methods. Furthermore, we provide an in-depth analysis of AGF, demonstrating its intrinsic consistency with learning and stability curves, as well as its empirical alignment with Newton's Law of Universal Gravitation. By offering a rigorous theoretical exploration of these phenomena, this work represents a significant step toward interpreting the Attention mechanism and unlocks new possibilities for future research in model optimization and interpretability.
摘要：本文探讨了大型语言模型 (LLM) 中位置关系和编码的基本原理，并介绍了注意力引力场 (AGF) 的概念。通过将位置编码与语义嵌入解耦，我们优化了模型架构，并与流行的编码方法相比实现了更高的准确性。此外，我们对 AGF 进行了深入分析，证明了其与学习曲线和稳定性曲线的内在一致性，以及其与牛顿万有引力定律的经验一致性。通过对这些现象进行严格的理论探索，这项工作代表了解释注意力机制的重要一步，并为模型优化和可解释性的未来研究开启了新的可能性。

Title: Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents

Authors: Natchanon Pollertlam, Witchayut Kornsuwannawit
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04814
Pdf URL: https://arxiv.org/pdf/2603.04814
Copy Paste: [[2603.04814]] Beyond the Context Window: A Cost-Performance Analysis of Fact-Based Memory vs. Long-Context LLMs for Persistent Agents(https://arxiv.org/abs/2603.04814)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Persistent conversational AI systems face a choice between passing full conversation histories to a long-context large language model (LLM) and maintaining a dedicated memory system that extracts and retrieves structured facts. We compare a fact-based memory system built on the Mem0 framework against long-context LLM inference on three memory-centric benchmarks - LongMemEval, LoCoMo, and PersonaMemv2 - and evaluate both architectures on accuracy and cumulative API cost. Long-context GPT-5-mini achieves higher factual recall on LongMemEval and LoCoMo, while the memory system is competitive on PersonaMemv2, where persona consistency depends on stable, factual attributes suited to flat-typed extraction. We construct a cost model that incorporates prompt caching and show that the two architectures have structurally different cost profiles: long-context inference incurs a per-turn charge that grows with context length even under caching, while the memory system's per-turn read cost remains roughly fixed after a one-time write phase. At a context length of 100k tokens, the memory system becomes cheaper after approximately ten interaction turns, with the break-even point decreasing as context length grows. These results characterize the accuracy-cost trade-off between the two approaches and provide a concrete criterion for selecting between them in production deployments.
摘要：持久性会话人工智能系统面临着一个选择：是将完整的会话历史传递给长上下文大语言模型（LLM），还是维护一个提取和检索结构化事实的专用记忆系统。我们将基于 Mem0 框架构建的基于事实的内存系统与三个以内存为中心的基准（LongMemEval、LoCoMo 和 PersonaMemv2）上的长上下文 LLM 推理进行比较，并评估这两种架构的准确性和累积 API 成本。长上下文 GPT-5-mini 在 LongMemEval 和 LoCoMo 上实现了更高的事实召回，而内存系统在 PersonaMemv2 上具有竞争力，其中角色一致性取决于适合平面类型提取的稳定的事实属性。我们构建了一个包含即时缓存的成本模型，并表明这两种架构具有结构上不同的成本概况：即使在缓存下，长上下文推理也会产生每回合费用，该费用会随着上下文长度而增长，而内存系统的每回合读取成本在一次性写入阶段后仍大致保持不变。在上下文长度为 100k 令牌时，内存系统在大约十次交互轮次后变得更便宜，并且随着上下文长度的增长，盈亏平衡点会降低。这些结果描述了两种方法之间的准确性与成本权衡，并为在生产部署中选择它们提供了具体标准。
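
The break-even reasoning in the abstract above can be illustrated with a toy cost model. This is only a sketch under stated assumptions: the function names and all prices are hypothetical, the long-context per-turn charge is held constant for a fixed context length, and none of the numbers come from the paper.

```python
from typing import Optional


def long_context_cumulative(turns: int, per_turn_cost: float) -> float:
    """Cumulative cost when every turn pays a context-dependent charge.

    Assumption: per_turn_cost is constant for a fixed context length.
    """
    return turns * per_turn_cost


def memory_cumulative(turns: int, write_cost: float, read_cost: float) -> float:
    """Cumulative cost of a one-time write phase plus a fixed per-turn read."""
    return write_cost + turns * read_cost


def break_even_turn(
    per_turn_cost: float,
    write_cost: float,
    read_cost: float,
    max_turns: int = 10_000,
) -> Optional[int]:
    """Return the first turn at which the memory system is strictly cheaper.

    Returns None if that never happens within max_turns.
    """
    for t in range(1, max_turns + 1):
        if memory_cumulative(t, write_cost, read_cost) < long_context_cumulative(
            t, per_turn_cost
        ):
            return t
    return None


# Hypothetical prices in arbitrary units, chosen only to exercise the code.
print(break_even_turn(per_turn_cost=10, write_cost=90, read_cost=1))
```

In this toy model a larger per-turn long-context charge moves the break-even point earlier, which matches the direction of the trend the abstract reports for growing context length.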

Title: Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses

Authors: Michael Hardy
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2603.04820
Pdf URL: https://arxiv.org/pdf/2603.04820
Copy Paste: [[2603.04820]] Autoscoring Anticlimax: A Meta-analytic Understanding of AI's Short-answer Shortcomings and Wording Weaknesses(https://arxiv.org/abs/2603.04820)
Keywords: llm
Abstract: Automated short-answer scoring lags other LLM applications. We meta-analyze 890 culminating results across a systematic review of LLM short-answer scoring studies, modeling the traditional effect size of Quadratic Weighted Kappa (QWK) with mixed effects metaregression. We quantitatively illustrate that the level of difficulty for human experts to perform the task of scoring written work of children has no observed statistical effect on LLM performance. Particularly, we show that some scoring tasks measured as the easiest by human scorers were the hardest for LLMs. Whether by poor implementation by thoughtful researchers or patterns traceable to autoregressive training, on average decoder-only architectures underperform encoders by 0.37--a substantial difference in agreement with humans. Additionally, we measure the contributions of various aspects of LLM technology on successful scoring such as tokenizer vocabulary size, which exhibits diminishing returns--potentially due to undertrained tokens. Findings argue for systems design which better anticipates known statistical shortcomings of autoregressive models. Finally, we provide additional experiments to illustrate wording and tokenization sensitivity and bias elicitation in high-stakes education contexts, where LLMs demonstrate racial discrimination. Code and data for this study are available.
摘要：自动简答评分落后于其他法学硕士申请。我们对 LLM 简答评分研究的系统回顾中的 890 个最终结果进行了荟萃分析，利用混合效应元回归对二次加权 Kappa (QWK) 的传统效应大小进行建模。我们定量地表明，人类专家执行儿童书面作业评分任务的难度水平对法学硕士的表现没有观察到统计影响。特别是，我们表明，一些被人类评分者认为是最简单的评分任务对于法学硕士来说却是最难的。无论是深思熟虑的研究人员实施不当，还是可追溯到自回归训练的模式，仅解码器架构的平均性能比编码器低 0.37，这与人类的一致性存在显着差异。此外，我们还衡量了 LLM 技术各个方面对成功评分的贡献，例如标记器词汇量大小，其表现出收益递减——可能是由于标记训练不足。研究结果表明系统设计可以更好地预测自回归模型的已知统计缺陷。最后，我们提供了额外的实验来说明高风险教育背景下的措辞和标记化敏感性以及偏见引发，其中法学硕士表现出种族歧视。本研究的代码和数据现已提供。
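
For readers unfamiliar with the effect size named in the abstract above, the following is a minimal sketch of Quadratic Weighted Kappa for two raters using integer labels 0..num_labels-1. It follows the standard definition (one minus the ratio of weighted observed disagreement to weighted chance-expected disagreement) and is not taken from the study's code. The function name and the convention for the degenerate case are assumptions.

```python
from typing import Sequence


def quadratic_weighted_kappa(
    a: Sequence[int], b: Sequence[int], num_labels: int
) -> float:
    """QWK between two raters whose labels are integers in [0, num_labels)."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    if num_labels < 2:
        raise ValueError("need at least two label categories")

    n = len(a)
    # Observed confusion matrix: observed[i][j] counts (rater a = i, rater b = j).
    observed = [[0] * num_labels for _ in range(num_labels)]
    for x, y in zip(a, b):
        observed[x][y] += 1

    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_labels)) for j in range(num_labels)]

    numerator = 0.0    # weighted observed disagreement
    denominator = 0.0  # weighted disagreement expected by chance
    for i in range(num_labels):
        for j in range(num_labels):
            weight = (i - j) ** 2 / (num_labels - 1) ** 2
            numerator += weight * observed[i][j]
            denominator += weight * hist_a[i] * hist_b[j] / n

    if denominator == 0.0:
        # Assumed convention: both raters used one identical label throughout.
        return 1.0
    return 1.0 - numerator / denominator


print(quadratic_weighted_kappa([0, 1, 2], [0, 1, 2], num_labels=3))
```

A value of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.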

Title: From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

Authors: Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng, Yanyan Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04828
Pdf URL: https://arxiv.org/pdf/2603.04828
Copy Paste: [[2603.04828]] From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models(https://arxiv.org/abs/2603.04828)
Keywords: language model, llm
Abstract: Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly focus on the likelihood-based statistical features or heuristic signals before and after fine-tuning, but the former are susceptible to word frequency bias in corpora, and the latter strongly depend on the similarity of fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses show gradient feature distribution differences, enabling practical and scalable pre-training data detection.
摘要：法学硕士的预训练数据检测对于解决版权问题和减轻基准污染至关重要。现有方法主要关注微调前后基于似然的统计特征或启发式信号，但前者容易受到语料库中词频偏差的影响，而后者强烈依赖于微调数据的相似性。从优化的角度来看，我们观察到，在训练过程中，样本从不熟悉到熟悉的转变方式反映了梯度行为的系统差异。熟悉的样本表现出较小的更新幅度、模型组件中不同的更新位置以及更敏锐地激活的神经元。基于这一见解，我们提出了 GDS，一种通过探测目标样本的梯度偏差分数来识别预训练数据的方法。具体来说，我们首先使用梯度分布来表示每个样本，该梯度分布捕获 FFN 和注意力模块中参数更新的幅度、位置和浓度，从而揭示成员数据和非成员数据之间的一致区别。然后将这些特征输入到轻量级分类器中以执行二元成员推理。对五个公共数据集的实验表明，GDS 实现了最先进的性能，在强基线上显着提高了跨数据集可转移性。进一步的可解释性分析显示梯度特征分布差异，从而实现实用且可扩展的预训练数据检测。

Title: SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts

Authors: Minduli Lasandi, Nevidu Jayatilleke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04854
Pdf URL: https://arxiv.org/pdf/2603.04854
Copy Paste: [[2603.04854]] SinhaLegal: A Benchmark Corpus for Information Extraction and Analysis in Sinhala Legislative Texts(https://arxiv.org/abs/2603.04854)
Keywords: language model
Abstract: SinhaLegal introduces a Sinhala legislative text corpus containing approximately 2 million words across 1,206 legal documents. The dataset includes two types of legal documents: 1,065 Acts dated from 1981 to 2014 and 141 Bills from 2010 to 2014, which were systematically collected from official sources. The texts were extracted using OCR with Google Document AI, followed by extensive post-processing and manual cleaning to ensure high-quality, machine-readable content, along with dedicated metadata files for each document. A comprehensive evaluation was conducted, including corpus statistics, lexical diversity, word frequency analysis, named entity recognition, and topic modelling, demonstrating the structured and domain-specific nature of the corpus. Additionally, perplexity analysis using both large and small language models was performed to assess how effectively language models respond to domain-specific texts. The SinhaLegal corpus represents a vital resource designed to support NLP tasks such as summarisation, information extraction, and analysis, thereby bridging a critical gap in Sinhala legal research.
摘要：SinhaLegal 引入了僧伽罗立法文本语料库，其中包含 1,206 份法律文件、约 200 万字。该数据集包括两类法律文件：1981年至2014年的1,065项法案和2010年至2014年的141项法案，这些都是从官方来源系统收集的。使用 OCR 和 Google Document AI 提取文本，然后进行广泛的后处理和手动清理，以确保高质量、机器可读的内容以及每个文档的专用元数据文件。进行了全面的评估，包括语料统计、词汇多样性、词频分析、命名实体识别和主题建模，展示了语料库的结构化和领域特定性。此外，还使用大型和小型语言模型进行困惑度分析，以评估语言模型如何有效地响应特定领域的文本。 SinhaLegal 语料库是一种重要资源，旨在支持摘要、信息提取和分析等 NLP 任务，从而弥补僧伽罗法律研究中的关键差距。

Title: HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents

Authors: Yilin Jiang, Fei Tan, Xuanyu Yin, Jing Leng, Aimin Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04855
Pdf URL: https://arxiv.org/pdf/2603.04855
Copy Paste: [[2603.04855]] HACHIMI: Scalable and Controllable Student Persona Generation via Orchestrated Agents(https://arxiv.org/abs/2603.04855)
Keywords: llm, prompt, agent
Abstract: Student Personas (SPs) are emerging as infrastructure for educational LLMs, yet prior work often relies on ad-hoc prompting or hand-crafted profiles with limited control over educational theory and population distributions. We formalize this as Theory-Aligned and Distribution-Controllable Persona Generation (TAD-PG) and introduce HACHIMI, a multi-agent Propose-Validate-Revise framework that generates theory-aligned, quota-controlled personas. HACHIMI factorizes each persona into a theory-anchored educational schema, enforces developmental and psychological constraints via a neuro-symbolic validator, and combines stratified sampling with semantic deduplication to reduce mode collapse. The resulting HACHIMI-1M corpus comprises 1 million personas for Grades 1-12. Intrinsic evaluation shows near-perfect schema validity, accurate quotas, and substantial diversity, while external evaluation instantiates personas as student agents answering CEPS and PISA 2022 surveys; across 16 cohorts, math and curiosity/growth constructs align strongly between humans and agents, whereas classroom-climate and well-being constructs are only moderately aligned, revealing a fidelity gradient. All personas are generated with Qwen2.5-72B, and HACHIMI provides a standardized synthetic student population for group-level benchmarking and social-science simulations. Resources available at this https URL
摘要：学生角色（SP）正在成为教育法学硕士的基础设施，但之前的工作往往依赖于临时提示或手工制作的档案，对教育理论和人口分布的控制有限。我们将其形式化为理论对齐和分布可控的角色生成（TAD-PG），并引入了 HACHIMI，这是一个多代理提议-验证-修改框架，可生成理论对齐、配额控制的角色。 HACHIMI 将每个角色分解为理论锚定的教育模式，通过神经符号验证器强制发展和心理约束，并将分层采样与语义重复数据删除相结合以减少模式崩溃。由此产生的 HACHIMI-1M 语料库包含 100 万个 1-12 年级的角色。内部评估显示近乎完美的模式有效性、准确的配额和实质性的多样性，而外部评估将角色实例化为回答 CEPS 和 PISA 2022 调查的学生代理；在 16 个队列中，人类和智能体之间的数学和好奇心/成长结构高度一致，而课堂氛围和幸福感结构则只有中等程度的一致性，揭示了保真度梯度。所有角色均使用 Qwen2.5-72B 生成，HACHIMI 为团体级别基准测试和社会科学模拟提供标准化的综合学生群体。此 https URL 提供可用资源

Title: FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications

Authors: Yunfan Zhang, Yijie Bei, Jetashree Ravi, Pawel Garbacki
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2603.04857
Pdf URL: https://arxiv.org/pdf/2603.04857
Copy Paste: [[2603.04857]] FireBench: Evaluating Instruction Following in Enterprise and API-Driven LLM Applications(https://arxiv.org/abs/2603.04857)
Keywords: llm, chat, agent
Abstract: Instruction following is critical for LLMs deployed in enterprise and API-driven settings, where strict adherence to output formats, content constraints, and procedural requirements is essential for enabling reliable LLM-assisted workflows. However, existing instruction following benchmarks predominantly evaluate natural language generation constraints that reflect the needs of chat assistants rather than enterprise users. To bridge this gap, we introduce FireBench, an LLM instruction following benchmark grounded in real-world enterprise and API usage patterns. FireBench evaluates six core capability dimensions across diverse applications including information extraction, customer support, and coding agents, comprising over 2,400 samples. We evaluate 11 LLMs and present key findings on their instruction following behavior in enterprise scenarios. We open-source FireBench at this http URL to help users assess model suitability, support model developers in diagnosing performance, and invite community contributions.
摘要：对于在企业和 API 驱动的环境中部署的法学硕士来说，遵循指令至关重要，在这些环境中，严格遵守输出格式、内容限制和程序要求对于实现可靠的法学硕士辅助工作流程至关重要。然而，现有的指令遵循基准主要评估自然语言生成约束，这些约束反映了聊天助理而不是企业用户的需求。为了弥补这一差距，我们引入了 FireBench，这是一种 LLM 指令，遵循基于现实企业和 API 使用模式的基准。 FireBench 评估不同应用程序的六个核心能力维度，包括信息提取、客户支持和编码代理，包含 2,400 多个样本。我们评估了 11 名法学硕士，并展示了他们在企业场景中遵循指令行为的主要发现。我们在此 http URL 上开源 FireBench，以帮助用户评估模型适用性、支持模型开发人员诊断性能并邀请社区贡献。

Title: Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models

Authors: Sean Lamont, Christian Walder, Paul Montague, Amir Dezfouli, Michael Norrish
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04893
Pdf URL: https://arxiv.org/pdf/2603.04893
Copy Paste: [[2603.04893]] Free Lunch for Pass@$k$? Low Cost Diverse Sampling for Diffusion Language Models(https://arxiv.org/abs/2603.04893)
Keywords: language model
Abstract: Diverse outputs in text generation are necessary for effective exploration in complex reasoning tasks, such as code generation and mathematical problem solving. Such Pass@$k$ problems benefit from distinct candidates covering the solution space. However, traditional sampling approaches often waste computational resources on repetitive failure modes. While Diffusion Language Models have emerged as a competitive alternative to the prevailing Autoregressive paradigm, they remain susceptible to this redundancy, with independent samples frequently collapsing into similar modes. To address this, we propose a training-free, low-cost intervention to enhance generative diversity in Diffusion Language Models. Our approach modifies intermediate samples in a batch sequentially, where each sample is repelled from the feature space of previous samples, actively penalising redundancy. Unlike prior methods that require retraining or beam search, our strategy incurs negligible computational overhead, while ensuring that each sample contributes a unique perspective to the batch. We evaluate our method on the HumanEval and GSM8K benchmarks using the LLaDA-8B-Instruct model. Our results demonstrate significantly improved diversity and Pass@$k$ performance across various temperature settings. As a simple modification to the sampling process, our method offers an immediate, low-cost improvement for current and future Diffusion Language Models in tasks that benefit from diverse solution search. We make our code available at this https URL.
摘要：文本生成中的多样化输出对于有效探索复杂的推理任务（例如代码生成和数学问题解决）是必要的。此类 Pass@$k$ 问题受益于涵盖解决方案空间的不同候选者。然而，传统的采样方法经常在重复的故障模式上浪费计算资源。虽然扩散语言模型已经成为流行的自回归范式的竞争替代品，但它们仍然容易受到这种冗余的影响，独立样本经常会陷入类似的模式。为了解决这个问题，我们提出了一种免费培训、低成本的干预措施，以增强扩散语言模型的生成多样性。我们的方法按顺序修改批次中的中间样本，其中每个样本都被先前样本的特征空间排斥，从而主动惩罚冗余。与之前需要重新训练或波束搜索的方法不同，我们的策略产生的计算开销可以忽略不计，同时确保每个样本为批次贡献独特的视角。我们使用 LLaDA-8B-Instruct 模型在 HumanEval 和 GSM8K 基准测试上评估我们的方法。我们的结果表明，在各种温度设置下，多样性和 Pass@$k$ 性能显着提高。作为对采样过程的简单修改，我们的方法为当前和未来的扩散语言模型在受益于多样化解决方案搜索的任务中提供了即时、低成本的改进。我们通过此 https URL 提供代码。
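
As background for the Pass@$k$ metric mentioned in the abstract above, here is a minimal sketch of the widely used unbiased estimator introduced alongside HumanEval: given n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). The abstract does not state which estimator the authors use, so treating this one as applicable is an assumption. The function name is hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement
    from n generated samples (c of them correct) is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical counts, for illustration only.
print(pass_at_k(n=4, c=1, k=2))
```

Because the estimator depends only on how many of the n samples are correct, near-duplicate failures lower c without adding coverage, which is why sample diversity matters for this metric.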

Title: Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research

Authors: Arina Kostina, Marios Dikaiakos, Alejandro Porcel, Tassos Stassopoulos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04897
Pdf URL: https://arxiv.org/pdf/2603.04897
Copy Paste: [[2603.04897]] Can LLMs Capture Expert Uncertainty? A Comparative Analysis of Value Alignment in Ethnographic Qualitative Research(https://arxiv.org/abs/2603.04897)
Keywords: language model, llm
Abstract: Qualitative analysis of open-ended interviews plays a central role in ethnographic and economic research by uncovering individuals' values, motivations, and culturally embedded financial behaviors. While large language models (LLMs) offer promising support for automating and enriching such interpretive work, their ability to produce nuanced, reliable interpretations under inherent task ambiguity remains unclear. In our work we evaluate LLMs on the task of identifying the top three human values expressed in long-form interviews based on the Schwartz Theory of Basic Values framework. We compare their outputs to expert annotations, analyzing both performance and uncertainty patterns relative to the experts. Results show that LLMs approach the human ceiling on set-based metrics (F1, Jaccard) but struggle to recover exact value rankings, as reflected in lower RBO scores. While the average Schwartz value distributions of most models closely match those of human analysts, their uncertainty structures across the Schwartz values diverge from expert uncertainty patterns. Among the evaluated models, Qwen performs closest to expert-level agreement and exhibits the strongest alignment with expert Schwartz value distributions. LLM ensemble methods yield consistent gains across metrics, with Majority Vote and Borda Count performing best. Notably, systematic overemphasis on certain Schwartz values, like Security, suggests both the potential of LLMs to provide complementary perspectives and the need to further investigate model-induced value biases. Overall, our findings highlight both the promise and the limitations of LLMs as collaborators in inherently ambiguous qualitative value analysis.
摘要：开放式访谈的定性分析通过揭示个人的价值观、动机和文化中嵌入的金融行为，在民族志和经济研究中发挥着核心作用。虽然大型语言模型（LLM）为自动化和丰富此类解释工作提供了有希望的支持，但它们在固有的任务模糊性下产生细致入微、可靠的解释的能力仍不清楚。在我们的工作中，我们评估法学硕士的任务是识别基于施瓦茨基本价值观框架的长篇访谈中表达的三大人类价值观。我们将他们的输出与专家注释进行比较，分析相对于专家的绩效和不确定性模式。结果表明，法学硕士在基于集合的指标（F1、Jaccard）上接近人类上限，但很难恢复准确的价值排名，这反映在较低的 RBO 分数上。虽然大多数模型的平均 Schwartz 值分布与人类分析师的分布非常匹配，但它们的 Schwartz 值的不确定性结构与专家的不确定性模式不同。在评估的模型中，Qwen 的表现最接近专家级一致性，并且与专家 Schwartz 值分布表现出最强的一致性。 LLM 集成方法在各个指标上产生一致的收益，其中多数投票和博尔达计数表现最好。值得注意的是，对某些施瓦茨价值观（例如安全性）的系统性过分强调表明法学硕士具有提供补充观点的潜力，并且需要进一步研究模型引起的价值偏差。总的来说，我们的研究结果强调了法学硕士作为合作者在本质上模糊的定性价值分析中的前景和局限性。
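
The Borda Count ensemble and the Jaccard metric named in the abstract above can be sketched as follows. This is an illustrative version only: the scoring rule (a list of length m awards m points to its first item down to 1 for its last), the alphabetical tie-break, and the example rankings are assumptions, not details from the paper.

```python
from collections import defaultdict
from typing import Iterable, List, Sequence


def borda_top_k(rankings: Iterable[Sequence[str]], k: int = 3) -> List[str]:
    """Aggregate several ranked lists into one top-k list by Borda count."""
    scores = defaultdict(int)
    for ranking in rankings:
        m = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += m - position  # first place earns m points
    # Assumed tie-break: higher score first, then alphabetical order.
    ordered = sorted(scores, key=lambda label: (-scores[label], label))
    return ordered[:k]


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Set overlap between two label lists, ignoring order."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


# Hypothetical top-3 outputs from three models for one interview.
model_rankings = [
    ["Security", "Benevolence", "Power"],
    ["Benevolence", "Security", "Tradition"],
    ["Security", "Tradition", "Benevolence"],
]
ensemble = borda_top_k(model_rankings, k=3)
print(ensemble, jaccard(ensemble, ["Security", "Benevolence", "Power"]))
```

Jaccard scores only the set of selected values, so an ensemble can score well on it while still ordering the values differently from the experts, which is the gap the abstract attributes to rank-sensitive metrics such as RBO.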

Title: AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection

Authors: Panagiotis Alexios Spanakis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04921
Pdf URL: https://arxiv.org/pdf/2603.04921
Copy Paste: [[2603.04921]] AILS-NTUA at SemEval-2026 Task 10: Agentic LLMs for Psycholinguistic Marker Extraction and Conspiracy Endorsement Detection(https://arxiv.org/abs/2603.04921)
Keywords: llm, chain-of-thought, agent
Abstract: This paper presents a novel agentic LLM pipeline for SemEval-2026 Task 10 that jointly extracts psycholinguistic conspiracy markers and detects conspiracy endorsement. Unlike traditional classifiers that conflate semantic reasoning with structural localization, our decoupled design isolates these challenges. For marker extraction, we propose Dynamic Discriminative Chain-of-Thought (DD-CoT) with deterministic anchoring to resolve semantic ambiguity and character-level brittleness. For conspiracy detection, an "Anti-Echo Chamber" architecture, consisting of an adversarial Parallel Council adjudicated by a Calibrated Judge, overcomes the "Reporter Trap," where models falsely penalize objective reporting. Achieving 0.24 Macro F1 (+100% over baseline) on S1 and 0.79 Macro F1 (+49%) on S2, with the S1 system ranking 3rd on the development leaderboard, our approach establishes a versatile paradigm for interpretable, psycholinguistically-grounded NLP.
摘要：本文为 SemEval-2026 任务 10 提出了一种新颖的代理 LLM 流程，该流程联合提取心理语言学阴谋标记并检测阴谋背书。与将语义推理与结构定位混为一谈的传统分类器不同，我们的解耦设计隔离了这些挑战。对于标记提取，我们提出了具有确定性锚定的动态判别思想链（DD-CoT），以解决语义模糊性和字符级脆弱性。对于阴谋侦查，“反回声室”架构由经过校准的法官裁决的对抗性并行委员会组成，克服了“记者陷阱”，即模型错误地惩罚客观报道。在 S1 上实现 0.24 Macro F1（超出基线+100%），在 S2 上实现 0.79 Macro F1（+49%），S1 系统在开发排行榜上排名第三，我们的方法为可解释的、以心理语言学为基础的 NLP 建立了一个通用范例。

Title: AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis

Authors: Stavros Gazetas, Giorgos Filandrianos, Maria Lymperaiou, Paraskevi Tzouveli, Athanasios Voulodimos, Giorgos Stamou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04933
Pdf URL: https://arxiv.org/pdf/2603.04933
Copy Paste: [[2603.04933]] AILS-NTUA at SemEval-2026 Task 3: Efficient Dimensional Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2603.04933)
Keywords: language model
Abstract: In this paper, we present the AILS-NTUA system for Track-A of SemEval-2026 Task 3 on Dimensional Aspect-Based Sentiment Analysis (DimABSA), which encompasses three complementary problems: Dimensional Aspect Sentiment Regression (DimASR), Dimensional Aspect Sentiment Triplet Extraction (DimASTE), and Dimensional Aspect Sentiment Quadruplet Prediction (DimASQP) within a multilingual and multi-domain framework. Our methodology combines fine-tuning of language-appropriate encoder backbones for continuous aspect-level sentiment prediction with language-specific instruction tuning of large language models using LoRA for structured triplet and quadruplet extraction. This unified yet task-adaptive design emphasizes parameter-efficient specialization across languages and domains, enabling reduced training and inference requirements while maintaining strong effectiveness. Empirical results demonstrate that the proposed models achieve competitive performance and consistently surpass the provided baselines across most evaluation settings.
摘要：在本文中，我们提出了基于维度方面情感分析（DimABSA）的 SemEval-2026 任务 3 的 Track-A 的 AILS-NTUA 系统，该系统包含三个互补问题：多语言中的维度方面情感回归（DimaSR）、维度方面情感三元组提取（DimASTE）和维度方面情感四元组预测（DimASQP）和多领域框架。我们的方法将用于连续方面级情感预测的适合语言的编码器主干的微调与使用 LoRA 进行结构化三元组和四元组提取的大型语言模型的特定于语言的指令调整相结合。这种统一且任务自适应的设计强调跨语言和领域的参数高效专业化，从而减少训练和推理要求，同时保持强大的有效性。实证结果表明，所提出的模型实现了有竞争力的性能，并且在大多数评估设置中始终超过了提供的基线。

Title: Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition

Authors: Mengze Hong, Yi Gu, Di Jiang, Hanlin Gu, Chen Jason Zhang, Lu Wang, Zhiyang Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04945
Pdf URL: https://arxiv.org/pdf/2603.04945
Copy Paste: [[2603.04945]] Federated Heterogeneous Language Model Optimization for Hybrid Automatic Speech Recognition(https://arxiv.org/abs/2603.04945)
Keywords: language model
Abstract: Training automatic speech recognition (ASR) models increasingly relies on decentralized federated learning to ensure data privacy and accessibility, producing multiple local models that require effective merging. In hybrid ASR systems, while acoustic models can be merged using established methods, the language model (LM) for rescoring the N-best speech recognition list faces challenges due to the heterogeneity of non-neural n-gram models and neural network models. This paper proposes a heterogeneous LM optimization task and introduces a match-and-merge paradigm with two algorithms: the Genetic Match-and-Merge Algorithm (GMMA), using genetic operations to evolve and pair LMs, and the Reinforced Match-and-Merge Algorithm (RMMA), leveraging reinforcement learning for efficient convergence. Experiments on seven OpenSLR datasets show RMMA achieves the lowest average Character Error Rate and better generalization than baselines, converging up to seven times faster than GMMA, highlighting the paradigm's potential for scalable, privacy-preserving ASR systems.
摘要：训练自动语音识别 (ASR) 模型越来越依赖去中心化联合学习来确保数据隐私和可访问性，从而产生需要有效合并的多个本地模型。在混合 ASR 系统中，虽然可以使用已建立的方法合并声学模型，但由于非神经 n-gram 模型和神经网络模型的异质性，用于重新评分 N 最佳语音识别列表的语言模型 (LM) 面临着挑战。本文提出了一种异构 LM 优化任务，并引入了具有两种算法的匹配和合并范例：遗传匹配和合并算法（GMMA），使用遗传操作来进化和配对 LM；以及强化匹配和合并算法（RMMA），利用强化学习实现高效收敛。在七个 OpenSLR 数据集上进行的实验表明，RMMA 实现了最低的平均字符错误率和比基线更好的泛化能力，收敛速度比 GMMA 快七倍，凸显了该范式在可扩展、保护隐私的 ASR 系统中的潜力。

Title: LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services

Authors: Jinwen Chen (1 and 2), Shuai Gong, Shiwen Zhang (1 and 2), Zheng Zhang, Yachao Zhao, Lingxiang Wang (1 and 2), Haibo Zhou, Yuan Zhan, Wei Lin, Hainan Zhang (1 and 2) ((1) Beijing Advanced Innovation Center for Future Blockchain and Privacy Computing, (2) School of Artificial Intelligence, Beihang University, China)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04946
Pdf URL: https://arxiv.org/pdf/2603.04946
Copy Paste: [[2603.04946]] LocalSUG: Geography-Aware LLM for Query Suggestion in Local-Life Services(https://arxiv.org/abs/2603.04946)
Keywords: llm
Abstract: In local-life service platforms, the query suggestion module plays a crucial role in enhancing user experience by generating candidate queries based on user input prefixes, thus reducing user effort and accelerating search. Traditional multi-stage cascading systems rely heavily on historical top queries, limiting their ability to address long-tail demand. While LLMs offer strong semantic generalization, deploying them in local-life services introduces three key challenges: lack of geographic grounding, exposure bias in preference optimization, and online inference latency. To address these issues, we propose LocalSUG, an LLM-based query suggestion framework tailored for local-life service platforms. First, we introduce a city-aware candidate mining strategy based on term co-occurrence to inject geographic grounding into generation. Second, we propose a beam-search-driven GRPO algorithm that aligns training with inference-time decoding, reducing exposure bias in autoregressive generation. A multi-objective reward mechanism further optimizes both relevance and business-oriented metrics. Finally, we develop quality-aware beam acceleration and vocabulary pruning techniques that significantly reduce online latency while preserving generation quality. Extensive offline evaluations and large-scale online A/B testing demonstrate that LocalSUG improves click-through rate (CTR) by +0.35% and reduces the low/no-result rate by 2.56%, validating its effectiveness in real-world deployment.
摘要：在本地生活服务平台中，查询建议模块通过根据用户输入的前缀生成候选查询，从而减少用户工作量并加速搜索，在增强用户体验方面发挥着至关重要的作用。传统的多级级联系统严重依赖历史热门查询，限制了其解决长尾需求的能力。虽然法学硕士提供了强大的语义泛化能力，但将它们部署在本地生活服务中会带来三个关键挑战：缺乏地理基础、偏好优化中的暴露偏差以及在线推理延迟。为了解决这些问题，我们提出了LocalSUG，一个基于LLM的查询建议框架，专为本地生活服务平台量身定制。首先，我们引入了一种基于术语共现的城市感知候选挖掘策略，为生成注入地理基础。其次，我们提出了一种波束搜索驱动的 GRPO 算法，该算法将训练与推理时间解码结合起来，减少自回归生成中的暴露偏差。多目标奖励机制进一步优化相关性和面向业务的指标。最后，我们开发了质量感知波束加速和词汇修剪技术，可显着减少在线延迟，同时保持生成质量。广泛的离线评估和大规模在线 A/B 测试表明，LocalSUG 将点击率 (CTR) 提高了 +0.35%，并将低结果/无结果率降低了 2.56%，验证了其在实际部署中的有效性。

Title: Replaying pre-training data improves fine-tuning

Authors: Suhas Kotha, Percy Liang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.04964
Pdf URL: https://arxiv.org/pdf/2603.04964
Copy Paste: [[2603.04964]] Replaying pre-training data improves fine-tuning(https://arxiv.org/abs/2603.04964)
Keywords: language model, agent
Abstract: To obtain a language model for a target domain (e.g. math), the current paradigm is to pre-train on a vast amount of generic web text and then fine-tune on the relatively limited amount of target data. Typically, generic data is only mixed in during fine-tuning to prevent catastrophic forgetting of the generic domain. We surprisingly find that replaying the generic data during fine-tuning can actually improve performance on the (less related) target task. Concretely, in a controlled pre-training environment with 4M target tokens, 4B total tokens, and 150M parameter models, generic replay increases target data efficiency by up to $1.87\times$ for fine-tuning and $2.06\times$ for mid-training. We further analyze data schedules that introduce target data during pre-training and find that replay helps more when there is less target data present in pre-training. We demonstrate the success of replay in practice for fine-tuning 8B parameter models, improving agentic web navigation success by $4.5\%$ and Basque question-answering accuracy by $2\%$.
摘要：为了获得目标领域（例如数学）的语言模型，当前的范例是对大量通用网络文本进行预训练，然后对相对有限的目标数据进行微调。通常，通用数据仅在微调期间混合，以防止通用域的灾难性遗忘。我们惊讶地发现，在微调期间重播通用数据实际上可以提高（不太相关的）目标任务的性能。具体来说，在具有 4M 目标标记、4B 总标记和 150M 参数模型的受控预训练环境中，通用重放可将目标数据效率提高高达 $1.87\times$（微调）和 $2.06\times$（中期训练）。我们进一步分析了在预训练期间引入目标数据的数据计划，发现当预训练中存在的目标数据较少时，重播更有帮助。我们在实践中展示了重播在微调 8B 参数模型方面的成功，将代理网络导航的成功率提高了 $4.5\%$，将巴斯克语问答的准确性提高了 $2\%$。
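Illustrative sketch (not from the paper): the abstract describes mixing generic pre-training data back into fine-tuning. One minimal way to picture such a replay mix is shown below, assuming a fixed replay fraction and uniform sampling of generic examples. The function name `mix_with_replay` and the `replay_fraction` parameter are hypothetical, and the paper's actual data schedules may differ.

```python
import random


def mix_with_replay(target_examples, generic_examples, replay_fraction, seed=0):
    """Return a shuffled fine-tuning set in which about `replay_fraction`
    of the examples are generic (replayed) data. Hypothetical helper."""
    if not 0.0 <= replay_fraction < 1.0:
        raise ValueError("replay_fraction must be in [0, 1)")
    rng = random.Random(seed)

    # Solve n_generic / (n_target + n_generic) = replay_fraction for n_generic.
    n_target = len(target_examples)
    n_generic = round(n_target * replay_fraction / (1.0 - replay_fraction))
    if n_generic > 0 and not generic_examples:
        raise ValueError("generic_examples is empty but replay was requested")

    # Sample generic examples uniformly with replacement (an assumption).
    replay = [rng.choice(generic_examples) for _ in range(n_generic)]

    mixed = list(target_examples) + replay
    rng.shuffle(mixed)
    return mixed
```

Setting `replay_fraction=0.0` recovers plain fine-tuning on target data only, which makes it easy to compare the two regimes under the same training loop.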

Title: When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger

Authors: Amirabbas Afzali, Myeongho Jeon, Maria Brbic
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04968
Pdf URL: https://arxiv.org/pdf/2603.04968
Copy Paste: [[2603.04968]] When Weak LLMs Speak with Confidence, Preference Alignment Gets Stronger(https://arxiv.org/abs/2603.04968)
Keywords: language model, llm
Abstract: Preference alignment is an essential step in adapting large language models (LLMs) to human values, but existing approaches typically depend on costly human annotations or large-scale API-based models. We explore whether a weak LLM can instead act as an effective annotator. We surprisingly find that selecting only a subset of a weak LLM's highly confident samples leads to substantially better performance than using full human annotations. Building on this insight, we propose Confidence-Weighted Preference Optimization (CW-PO), a general framework that re-weights training samples by a weak LLM's confidence and can be applied across different preference optimization objectives. Notably, the model aligned by CW-PO with just 20% of human annotations outperforms the model trained with 100% of annotations under standard DPO. These results suggest that weak LLMs, when paired with confidence weighting, can dramatically reduce the cost of preference alignment while even outperforming methods trained on fully human-labeled data.
摘要：偏好对齐是使大型语言模型 (LLM) 适应人类价值观的重要步骤，但现有方法通常依赖于昂贵的人工注释或基于 API 的大规模模型。我们探讨弱法学硕士是否可以充当有效的注释者。我们惊讶地发现，仅选择弱 LLM 的高度置信样本的子集会比使用完整的人工注释带来更好的性能。基于这一见解，我们提出了置信度加权偏好优化（CW-PO），这是一个通用框架，通过弱 LLM 的置信度重新加权训练样本，并且可以应用于不同的偏好优化目标。值得注意的是，由 CW-PO 与仅 20% 的人类注释对齐的模型优于在标准 DPO 下使用 100% 注释训练的模型。这些结果表明，弱法学硕士与置信权重结合使用时，可以显着降低偏好对齐的成本，甚至优于在完全人类标记的数据上训练的方法。
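Illustrative sketch (not from the paper): the abstract says CW-PO re-weights training samples by a weak LLM's confidence. The snippet below shows one way such a re-weighting could be attached to the standard DPO loss. The weight `abs(2p - 1)`, the normalization by the weight sum, and all function names are assumptions made for illustration; the abstract does not specify the exact weighting scheme.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, from sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    return softplus(-margin)  # equals -log(sigmoid(margin))


def confidence_weight(weak_prob_chosen):
    """Hypothetical weight in [0, 1]: 0 when the weak annotator is undecided
    (p = 0.5), 1 when it is certain (p = 0 or p = 1)."""
    return abs(2.0 * weak_prob_chosen - 1.0)


def confidence_weighted_loss(pairs, beta=0.1):
    """pairs: iterable of (policy_chosen, policy_rejected, ref_chosen,
    ref_rejected, weak_prob_chosen). Returns the weight-normalized loss."""
    weighted_sum = 0.0
    weight_total = 0.0
    for pc, pr, rc, rr, p in pairs:
        w = confidence_weight(p)
        weighted_sum += w * dpo_loss(pc, pr, rc, rr, beta)
        weight_total += w
    return weighted_sum / weight_total if weight_total > 0 else 0.0
```

Because the weight multiplies a per-pair loss, the same wrapper could in principle be placed around other preference objectives by swapping out `dpo_loss`.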

Title: VRM: Teaching Reward Models to Understand Authentic Human Preferences

Authors: Biao Liu, Ning Xu, Junming Yang, Hao Xu, Xin Geng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04974
Pdf URL: https://arxiv.org/pdf/2603.04974
Copy Paste: [[2603.04974]] VRM: Teaching Reward Models to Understand Authentic Human Preferences(https://arxiv.org/abs/2603.04974)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have achieved remarkable success across diverse natural language tasks, yet the reward models employed for aligning LLMs often encounter challenges of reward hacking, where the approaches predominantly rely on directly mapping prompt-response pairs to scalar scores, which may inadvertently capture spurious correlations rather than authentic human preferences. In contrast, human evaluation employs a sophisticated process that initially weighs the relative importance of multiple high-dimensional objectives according to the prompt context, subsequently evaluating response quality through low-dimensional semantic features such as logical coherence and contextual appropriateness. Motivated by this consideration, we propose VRM, i.e., Variational Reward Modeling, a novel framework that explicitly models the evaluation process of human preference judgments by incorporating both high-dimensional objective weights and low-dimensional semantic features as latent variables, which are inferred through variational inference techniques. Additionally, we provide a theoretical analysis showing that VRM can achieve a tighter generalization error bound compared to the traditional reward model. Extensive experiments on benchmark datasets demonstrate that VRM significantly outperforms existing methods in capturing authentic human preferences.
摘要：大型语言模型 (LLM) 在各种自然语言任务中取得了显着的成功，但用于调整 LLM 的奖励模型经常遇到奖励黑客的挑战，其中方法主要依赖于直接将提示响应对映射到标量分数，这可能会无意中捕获虚假相关性，而不是真实的人类偏好。相比之下，人类评估采用复杂的过程，首先根据提示上下文权衡多个高维目标的相对重要性，随后通过逻辑连贯性和上下文适当性等低维语义特征评估响应质量。出于这种考虑，我们提出了 VRM，即变分奖励模型，这是一种新颖的框架，通过将高维客观权重和低维语义特征作为潜在变量，通过变分推理技术来推断，显式地模拟人类偏好判断的评估过程。此外，我们提供的理论分析表明，与传统奖励模型相比，VRM 可以实现更严格的泛化误差界限。对基准数据集的大量实验表明，VRM 在捕捉真实的人类偏好方面明显优于现有方法。

Title: ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts

Authors: Trapoom Ukarapol, Nut Chukamphaeng, Kunat Pipatanakul, Pakhapoom Sarapat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04992
Pdf URL: https://arxiv.org/pdf/2603.04992
Copy Paste: [[2603.04992]] ThaiSafetyBench: Assessing Language Model Safety in Thai Cultural Contexts(https://arxiv.org/abs/2603.04992)
Keywords: language model, gpt, llm, prompt
Abstract: The safety evaluation of large language models (LLMs) remains largely centered on English, leaving non-English languages and culturally grounded risks underexplored. In this work, we investigate LLM safety in the context of the Thai language and culture and introduce ThaiSafetyBench, an open-source benchmark comprising 1,954 malicious prompts written in Thai. The dataset covers both general harmful prompts and attacks that are explicitly grounded in Thai cultural, social, and contextual nuances. Using ThaiSafetyBench, we evaluate 24 LLMs, with GPT-4.1 and Gemini-2.5-Pro serving as LLM-as-a-judge evaluators. Our results show that closed-source models generally demonstrate stronger safety performance than open-source counterparts, raising important concerns regarding the robustness of openly available models. Moreover, we observe a consistently higher Attack Success Rate (ASR) for Thai-specific, culturally contextualized attacks compared to general Thai-language attacks, highlighting a critical vulnerability in current safety alignment methods. To improve reproducibility and cost efficiency, we further fine-tune a DeBERTa-based harmful response classifier, which we name ThaiSafetyClassifier. The model achieves a weighted F1 score of 84.4%, matching GPT-4.1 judgments. We publicly release the fine-tuning weights and training scripts to support reproducibility. Finally, we introduce the ThaiSafetyBench leaderboard to provide continuously updated safety evaluations and encourage community participation. - ThaiSafetyBench HuggingFace Dataset: this https URL - ThaiSafetyBench Github: this https URL - ThaiSafetyClassifier HuggingFace Model: this https URL - ThaiSafetyBench Leaderboard: this https URL
摘要：大型语言模型（LLM）的安全性评估仍然主要以英语为中心，而对非英语语言和文化风险的探索不足。在这项工作中，我们研究了泰语和文化背景下的法学硕士安全性，并介绍了 ThaiSafetyBench，这是一个开源基准，包含 1,954 个用泰语编写的恶意提示。该数据集涵盖了明显基于泰国文化、社会和背景细微差别的一般有害提示和攻击。使用 ThaiSafetyBench，我们评估了 24 个法学硕士，其中 GPT-4.1 和 Gemini-2.5-Pro 作为法学硕士评审评估器。我们的结果表明，闭源模型通常表现出比开源模型更强的安全性能，这引起了人们对开放模型稳健性的严重担忧。此外，我们观察到，与一般泰语攻击相比，泰语特定的文化背景攻击的攻击成功率 (ASR) 始终较高，这凸显了当前安全调整方法中的一个关键漏洞。为了提高可重复性和成本效率，我们进一步微调基于 DeBERTa 的有害响应分类器，我们将其命名为 ThaiSafetyClassifier。该模型的加权 F1 得分为 84.4%，与 GPT-4.1 判断相匹配。我们公开发布微调权重和训练脚本以支持可重复性。最后，我们引入 ThaiSafetyBench 排行榜，提供持续更新的安全评估并鼓励社区参与。 - ThaiSafetyBench HuggingFace 数据集：此 https URL - ThaiSafetyBench Github：此 https URL - ThaiSafetyClassifier HuggingFace 模型：此 https URL - ThaiSafetyBench 排行榜：此 https URL

Title: HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation

Authors: Yifan Zhu, Guanting Chen, Bing Wei, Haoran Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04996
Pdf URL: https://arxiv.org/pdf/2603.04996
Copy Paste: [[2603.04996]] HiFlow: Hierarchical Feedback-Driven Optimization for Constrained Long-Form Text Generation(https://arxiv.org/abs/2603.04996)
Keywords: language model
Abstract: Large language models perform well in short text generation but still struggle with long text generation, particularly under complex constraints. Such tasks involve multiple tightly coupled objectives, including global structural consistency, local semantic coherence, and constraint feasibility, forming a challenging constrained optimization problem. Existing approaches mainly rely on static planning or offline supervision, limiting effective coordination between global and local objectives during generation. To address these challenges, we propose HiFlow, a hierarchical feedback-driven optimization framework for constrained long text generation. HiFlow formulates generation as a two-level optimization process, consisting of a planning layer for global structure and constraint modeling, and a generation layer for conditioned text generation. By incorporating constraint-aware plan screening and closed-loop feedback at both levels, HiFlow enables joint optimization of planning quality and generation behavior, progressively guiding the model toward high-quality, constraint-satisfying outputs. Experiments on multiple backbones confirm HiFlow's effectiveness over baseline methods.
摘要：大型语言模型在短文本生成方面表现良好，但在长文本生成方面仍然存在困难，特别是在复杂的约束下。此类任务涉及多个紧密耦合的目标，包括全局结构一致性、局部语义一致性和约束可行性，形成具有挑战性的约束优化问题。现有方法主要依赖静态规划或离线监督，限制了生成过程中全局目标和局部目标之间的有效协调。为了应对这些挑战，我们提出了 HiFlow，一种用于约束长文本生成的分层反馈驱动优化框架。 HiFlow 将生成制定为两级优化过程，包括用于全局结构和约束建模的规划层，以及用于条件文本生成的生成层。通过在两个层面上结合约束感知计划筛选和闭环反馈，HiFlow 能够联合优化规划质量和生成行为，逐步引导模型获得高质量、满足约束的输出。对多个主干网的实验证实了 HiFlow 相对于基线方法的有效性。

Title: NeuronMoE: Neuron-Guided Mixture-of-Experts for Efficient Multilingual LLM Extension

Authors: Rongzhi Li, Hitomi Yanaka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.05046
Pdf URL: https://arxiv.org/pdf/2603.05046
Copy Paste: [[2603.05046]] NeuronMoE: Neuron-Guided Mixture-of-Experts for Efficient Multilingual LLM Extension(https://arxiv.org/abs/2603.05046)
Keywords: language model, llm
Abstract: Extending large language models to low-resource languages is essential for global accessibility, but training separate models per language is prohibitively expensive. Mixture-of-Experts (MoE) architectures address this by adding sparse language-specific parameters, but determining how many experts each layer needs remains an open question. Current approaches allocate experts based on layer-level similarity, yet language processing exhibits fine-grained specialization at individual neurons. We propose $\textbf{NeuronMoE}$, a method that analyzes language-specific neurons across all transformer components to guide expert allocation per layer based on empirically measured cross-lingual neuron diversity. Applied to Llama-3.2-3B for low-resource languages (Greek, Turkish, and Hungarian), this approach achieves approximately 40% average parameter reduction while matching the performance of the LayerMoE baseline. We find that low-resource language experts independently develop neuron specialization patterns mirroring the high-resource language, which are concentrated in early and late layers. This reveals potential universal architectural principles in how multilingual models organize linguistic knowledge.
摘要：将大型语言模型扩展到资源匮乏的语言对于全球可访问性至关重要，但为每种语言训练单独的模型成本高昂。专家混合 (MoE) 架构通过添加稀疏的特定于语言的参数来解决这个问题，但确定每层需要多少专家仍然是一个悬而未决的问题。当前的方法根据层级相似性来分配专家，但语言处理在单个神经元上表现出细粒度的专业化。我们提出了 $\textbf{NeuronMoE}$，一种分析所有 Transformer 组件中的语言特定神经元的方法，以根据经验测量的跨语言神经元多样性指导每层的专家分配。该方法应用于低资源语言（希腊语、土耳其语和匈牙利语）的 Llama-3.2-3B，可实现约 40% 的平均参数减少，同时与 LayerMoE 基线的性能相匹配。我们发现低资源语言专家独立开发了反映高资源语言的神经元专业化模式，这些模式集中在早期和晚期层。这揭示了多语言模型如何组织语言知识的潜在通用架构原则。

Title: Measuring the Redundancy of Decoder Layers in SpeechLLMs

Authors: Adel Moumen, Guangzhi Sun, Philip C Woodland
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.05121
Pdf URL: https://arxiv.org/pdf/2603.05121
Copy Paste: [[2603.05121]] Measuring the Redundancy of Decoder Layers in SpeechLLMs(https://arxiv.org/abs/2603.05121)
Keywords: language model, llm
Abstract: Speech Large Language Models route speech encoder representations into an LLM decoder that typically accounts for over 90% of total parameters. We study how much of this decoder capacity is actually needed for speech tasks. Across two LLM families and three scales (1-8B), we show that decoder redundancy is largely inherited from the pretrained LLM: text and speech inputs yield similar redundant blocks. We then measure excess capacity by pruning decoder layers and analysing post-pruning healing to increase robustness. Our findings show that 7-8B models retain good ASR performance with only 60% of decoder layers, and the same trend extends to smaller scales with reduced pruning tolerance. We then generalise to speech translation, and show that the same blocks of layers are redundant across speech encoders, tasks and languages, indicating that a more global redundancy structure exists, enabling a single pruned, multi-task SpeechLLM backbone to be deployed.
摘要：语音大语言模型将语音编码器表示路由到 LLM 解码器中，该解码器通常占总参数的 90% 以上。我们研究语音任务实际需要多少解码器容量。在两个 LLM 系列和三个尺度 (1-8B) 中，我们表明解码器冗余很大程度上继承自预训练的 LLM：文本和语音输入产生类似的冗余块。然后，我们通过修剪解码器层并分析修剪后修复来测量过剩容量，以提高鲁棒性。我们的研究结果表明，7-8B 模型仅使用 60% 的解码器层就保留了良好的 ASR 性能，并且相同的趋势延伸到具有降低的剪枝容差的较小规模。然后，我们将其推广到语音翻译，并表明相同的层块在语音编码器、任务和语言之间是冗余的，这表明存在更全局的冗余结构，从而能够部署单个修剪和多任务的 SpeechLLM 主干。

Title: LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting

Authors: Yewen Li, Zhiyi Lyu, Peng Jiang, Qingpeng Cai, Fei Pan, Bo An, Peng Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.05134
Pdf URL: https://arxiv.org/pdf/2603.05134
Copy Paste: [[2603.05134]] LBM: Hierarchical Large Auto-Bidding Model via Reasoning and Acting(https://arxiv.org/abs/2603.05134)
Keywords: language model, llm, hallucination
Abstract: The growing scale of ad auctions on online advertising platforms has intensified competition, making manual bidding impractical and necessitating auto-bidding to help advertisers achieve their economic goals. Current auto-bidding methods have evolved to use offline reinforcement learning or generative methods to optimize bidding strategies, but they can sometimes behave counterintuitively due to the black-box training manner and limited mode coverage of datasets, leading to challenges in understanding task status and generalization in dynamic ad environments. Large language models (LLMs) offer a promising solution by leveraging prior human knowledge and reasoning abilities to improve auto-bidding performance. However, directly applying LLMs to auto-bidding faces difficulties due to the need for precise actions in competitive auctions and the lack of specialized auto-bidding knowledge, which can lead to hallucinations and suboptimal decisions. To address these challenges, we propose a hierarchical Large autoBidding Model (LBM) to leverage the reasoning capabilities of LLMs for developing a superior auto-bidding strategy. This includes a high-level LBM-Think model for reasoning and a low-level LBM-Act model for action generation. Specifically, we propose a dual embedding mechanism to efficiently fuse two modalities, including language and numerical inputs, for language-guided training of the LBM-Act; then, we propose an offline reinforcement fine-tuning technique termed GQPO for mitigating the LBM-Think's hallucinations and enhancing decision-making performance without the simulation or real-world rollouts that previous multi-turn LLM-based methods require. Experiments demonstrate the superiority of a generative backbone based on our LBM, especially in training efficiency and generalization ability.
摘要：在线广告平台上的广告拍卖规模不断扩大，竞争加剧，手动竞价变得不切实际，需要自动竞价来帮助广告商实现其经济目标。当前的自动出价方法已经发展到使用离线强化学习或生成方法来优化出价策略，但由于黑盒训练方式和数据集的有限模式覆盖，它们有时可能会表现出违反直觉的行为，从而导致在动态广告环境中理解任务状态和泛化方面面临挑战。大型语言模型 (LLM) 通过利用人类先验知识和推理能力来提高自动出价性能，从而提供了一种有前景的解决方案。然而，由于竞争性拍卖中需要精确的行动以及缺乏专门的自动投标知识，直接将法学硕士应用于自动投标面临着困难，这可能会导致幻觉和次优决策。为了应对这些挑战，我们提出了一种分层大型自动竞价模型（LBM），以利用法学硕士的推理能力来开发卓越的自动竞价策略。这包括用于推理的高级 LBM-Think 模型和用于生成动作的低级 LBM-Act 模型。具体来说，我们提出了一种双重嵌入机制，可以有效地融合两种模式，包括语言和数字输入，用于 LBM-Act 的语言引导训练；然后，我们提出了一种称为 GQPO 的离线强化微调技术，用于减轻 LLM-Think 的幻觉并提高决策性能，而无需像以前的基于多轮 LLM 的方法那样进行模拟或现实世界的推广。实验证明了基于 LBM 的生成骨干网的优越性，特别是在高效的训练方式和泛化能力方面。

Title: Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers

Authors: Ruichen Xu, Wenjing Yan, Ying-Jun Angela Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.05143
Pdf URL: https://arxiv.org/pdf/2603.05143
Copy Paste: [[2603.05143]] Feature Resemblance: On the Theoretical Understanding of Analogical Reasoning in Transformers(https://arxiv.org/abs/2603.05143)
Keywords: language model
Abstract: Understanding reasoning in large language models is complicated by evaluations that conflate multiple reasoning types. We isolate analogical reasoning (inferring shared properties between entities based on known similarities) and analyze its emergence in transformers. We theoretically prove three key results: (1) Joint training on similarity and attribution premises enables analogical reasoning through aligned representations; (2) Sequential training succeeds only when similarity structure is learned before specific attributes, revealing a necessary curriculum; (3) Two-hop reasoning ($a \to b, b \to c \implies a \to c$) reduces to analogical reasoning with identity bridges ($b = b$), which must appear explicitly in training data. These results reveal a unified mechanism: transformers encode entities with similar properties into similar representations, enabling property transfer through feature alignment. Experiments with architectures up to 1.5B parameters validate our theory and demonstrate how representational geometry shapes inductive reasoning capabilities.
摘要：由于混合了多种推理类型的评估，理解大型语言模型中的推理变得复杂。我们分离类比推理（根据已知的相似性推断实体之间的共享属性）并分析它在 Transformer 中的出现。我们从理论上证明了三个关键结果：（1）对相似性和归因前提的联合训练可以通过对齐表示进行类比推理；（2）只有在特定属性之前学习相似结构时，顺序训练才能成功，从而揭示必要的课程； (3) 两跳推理（$a \to b, b \to c \implies a \to c$）简化为具有恒等桥的类比推理（$b = b$），其必须明确出现在训练数据中。这些结果揭示了一个统一的机制：转换器将具有相似属性的实体编码为相似的表示，从而通过特征对齐实现属性转移。使用高达 1.5B 参数的架构进行的实验验证了我们的理论，并演示了表征几何如何塑造归纳推理能力。

Title: C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning

Authors: Avni Mittal, Rauno Arike
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.05167
Pdf URL: https://arxiv.org/pdf/2603.05167
Copy Paste: [[2603.05167]] C2-Faith: Benchmarking LLM Judges for Causal and Coverage Faithfulness in Chain-of-Thought Reasoning(https://arxiv.org/abs/2603.05167)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) are increasingly used as judges of chain-of-thought (CoT) reasoning, but it remains unclear whether they can reliably assess process faithfulness rather than just answer plausibility. We introduce C2-Faith, a benchmark built from PRM800K that targets two complementary dimensions of faithfulness: causality (does each step logically follow from prior context?) and coverage (are essential intermediate inferences present?). Using controlled perturbations, we create examples with known causal error positions by replacing a single step with an acausal variant, and with controlled coverage deletions at varying deletion rates (scored against reference labels). We evaluate three frontier judges under three tasks: binary causal detection, causal step localization, and coverage scoring. The results show that model rankings depend strongly on task framing, with no single judge dominating all settings; all judges exhibit a substantial gap between detecting an error and localizing it; and coverage judgments are systematically inflated for incomplete reasoning. These findings clarify when LLM judges are dependable and where they fail, and provide practical guidance for selecting judges in process-level evaluation.
摘要：大型语言模型（LLM）越来越多地被用作思想链（CoT）推理的判断者，但目前尚不清楚它们是否能够可靠地评估过程的真实性而不仅仅是回答合理性。我们引入 C2-Faith，这是一个基于 PRM800K 构建的基准，针对忠诚度的两个互补维度：因果关系（每个步骤是否在逻辑上遵循先前的上下文？）和覆盖范围（是否存在必要的中间推论？）。使用受控扰动，我们通过用非因果变量替换单个步骤来创建具有已知因果错误位置的示例，并以不同的删除率（根据参考标签评分）进行受控覆盖删除。我们在三个任务下评估三个前沿法官：二元因果检测、因果步骤定位和覆盖评分。结果表明，模型排名很大程度上取决于任务框架，没有一个评判主宰所有设置；所有法官在发现错误和定位错误之间都存在巨大差距；由于推理不完整，覆盖范围判断被系统性地夸大。这些发现阐明了法学硕士法官何时可靠以及何时失败，并为在过程级评估中选择法官提供了实用指导

Title: Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity

Authors: Di Zhang, Xun Wu, Shaohan Huang, Yudong Wang, Hanyong Shao, Yingbo Hao, Zewen Chi, Li Dong, Ting Song, Yan Xia, Zhifang Sui, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.05168
Pdf URL: https://arxiv.org/pdf/2603.05168
Copy Paste: [[2603.05168]] Sparse-BitNet: 1.58-bit LLMs are Naturally Friendly to Semi-Structured Sparsity(https://arxiv.org/abs/2603.05168)
Keywords: language model, llm
Abstract: Semi-structured N:M sparsity and low-bit quantization (e.g., 1.58-bit BitNet) are two promising approaches for improving the efficiency of large language models (LLMs), yet they have largely been studied in isolation. In this work, we investigate their interaction and show that 1.58-bit BitNet is naturally more compatible with N:M sparsity than full-precision models. To study this effect, we propose Sparse-BitNet, a unified framework that jointly applies 1.58-bit quantization and dynamic N:M sparsification while ensuring stable training for the first time. Across multiple model scales and training regimes (sparse pretraining and dense-to-sparse schedules), 1.58-bit BitNet consistently exhibits smaller performance degradation than full-precision baselines at the same sparsity levels and can tolerate higher structured sparsity before accuracy collapse. Moreover, using our custom sparse tensor core, Sparse-BitNet achieves substantial speedups in both training and inference, reaching up to 1.30X. These results highlight that combining extremely low-bit quantization with semi-structured N:M sparsity is a promising direction for efficient LLMs. Code is available at this https URL.
摘要：半结构化 N:M 稀疏性和低位量化（例如 1.58 位 BitNet）是提高大型语言模型 (LLM) 效率的两种有前途的方法，但它们在很大程度上是孤立研究的。在这项工作中，我们研究了它们的相互作用，并表明 1.58 位 BitNet 自然比全精度模型更兼容 N:M 稀疏性。为了研究这种效应，我们提出了 Sparse-BitNet，这是一个统一的框架，联合应用 1.58 位量化和动态 N:M 稀疏化，同时首次确保稳定的训练。在多个模型规模和训练方案（稀疏预训练和稠密到稀疏计划）中，1.58 位 BitNet 在相同稀疏度水平下始终表现出比全精度基线更小的性能下降，并且可以在精度崩溃之前容忍更高的结构化稀疏度。此外，使用我们定制的稀疏张量核心，Sparse-BitNet 在训练和推理方面都实现了大幅加速，高达 1.30 倍。这些结果强调，将极低比特量化与半结构化 N:M 稀疏性相结合是高效 LLM 的一个有前途的方向。代码可在此 https URL 获取
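Illustrative sketch (not from the paper): the two ingredients named in the abstract can be pictured on a flat list of weights. The code below builds a magnitude-based N:M mask (keep the N largest-magnitude weights in each group of M) and an absmean-style ternary quantizer with values in {-1, 0, +1}. How Sparse-BitNet actually combines and trains with these steps is not stated in the abstract, so the ordering in `sparse_ternary` (mask chosen on full-precision weights, then applied to the quantized values) and all function names are assumptions.

```python
def nm_mask(weights, n=2, m=4):
    """Return a 0/1 mask keeping the n largest-magnitude entries in each
    consecutive group of m weights (e.g. 2:4 sparsity)."""
    if len(weights) % m != 0:
        raise ValueError("len(weights) must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest |w| within this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def ternary_quantize(weights, eps=1e-8):
    """Absmean-style ternary quantization: scale by mean |w|, round, clip to [-1, 1]."""
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


def sparse_ternary(weights, n=2, m=4):
    """Hypothetical combination: apply the N:M mask to the ternary weights."""
    mask = nm_mask(weights, n, m)
    quantized, scale = ternary_quantize(weights)
    return [q * k for q, k in zip(quantized, mask)], scale
```

A dequantized weight is then approximately `value * scale`, and every group of `m` entries has at most `n` non-zeros, which is the pattern that semi-structured sparse hardware kernels expect.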

Title: Transducing Language Models

Authors: Vésteinn Snæbjarnarson, Samuel Kiegeland, Tianyu Liu, Reda Boumasmoud, Ryan Cotterell, Tim Vieira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.05193
Pdf URL: https://arxiv.org/pdf/2603.05193
Copy Paste: [[2603.05193]] Transducing Language Models(https://arxiv.org/abs/2603.05193)
Keywords: language model
Abstract: Modern language models define distributions over strings, but downstream tasks often require different output formats. For instance, a model that generates byte-pair strings does not directly produce word-level predictions, and a DNA model does not directly produce amino-acid sequences. In such cases, a deterministic string-to-string transformation can convert the model's output to the desired form. This is a familiar pattern in probability theory: applying a function $f$ to a random variable $X\sim p$ yields a transformed random variable $f(X)$ with an induced distribution. While such transformations are occasionally used in language modeling, prior work does not treat them as yielding new, fully functional language models. We formalize this perspective and introduce a general framework for language models derived from deterministic string-to-string transformations. We focus on transformations representable as finite-state transducers -- a commonly used state-machine abstraction for efficient string-to-string mappings. We develop algorithms that compose a language model with an FST to *marginalize* over source strings mapping to a given target, propagating probabilities through the transducer without altering model parameters and enabling *conditioning* on transformed outputs. We present an exact algorithm, an efficient approximation, and a theoretical analysis. We conduct experiments in three domains: converting language models from tokens to bytes, from tokens to words, and from DNA to amino acids. These experiments demonstrate inference-time adaptation of pretrained language models to match application-specific output requirements.
摘要：现代语言模型定义了字符串的分布，但下游任务通常需要不同的输出格式。例如，生成字节对字符串的模型不会直接生成字级预测，DNA 模型不会直接生成氨基酸序列。在这种情况下，确定性字符串到字符串转换可以将模型的输出转换为所需的形式。这是概率论中常见的模式：将函数 $f$ 应用于随机变量 $X\sim p$ 会产生具有诱导分布的变换随机变量 $f(X)$。虽然这种转换偶尔会在语言建模中使用，但之前的工作并没有将它们视为产生新的、功能齐全的语言模型。我们形式化了这一观点，并引入了从确定性字符串到字符串转换派生的语言模型的通用框架。我们专注于可表示为有限状态转换器的转换——一种用于高效字符串到字符串映射的常用状态机抽象。我们开发的算法使用 FST 组成语言模型，以对映射到给定目标的源字符串进行“边缘化”，通过转换器传播概率而不改变模型参数，并对转换后的输出启用“调节”。我们提出了精确的算法、有效的近似和理论分析。我们在三个领域进行实验：将语言模型从标记转换为字节、从标记转换为单词、从 DNA 转换为氨基酸。这些实验演示了预训练语言模型的推理时间适应，以满足特定于应用程序的输出要求。
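Illustrative sketch (not from the paper): the induced distribution of $f(X)$ mentioned in the abstract can be computed by brute force when the source distribution has small finite support. The code below sums source probabilities per target string (marginalization) and renormalizes over the sources that map to a given target (conditioning). It enumerates strings explicitly and does not use finite-state transducers, so it only illustrates the definitions; the function names are hypothetical.

```python
from collections import defaultdict


def pushforward(p, f):
    """Distribution of f(X) when X ~ p, where p maps source strings to probabilities."""
    q = defaultdict(float)
    for x, px in p.items():
        q[f(x)] += px  # marginalize over all sources that map to the same target
    return dict(q)


def condition_on_output(p, f, y):
    """Posterior over source strings x given that f(x) == y."""
    mass = {x: px for x, px in p.items() if f(x) == y}
    total = sum(mass.values())
    if total == 0:
        raise ValueError("target has zero probability under the pushforward")
    return {x: px / total for x, px in mass.items()}


# Toy example: token sequences (tuples) mapped to character strings.
token_lm = {("a", "b"): 0.5, ("ab",): 0.25, ("b",): 0.25}
to_chars = "".join
char_lm = pushforward(token_lm, to_chars)
```

In the toy example, two different tokenizations spell the same character string `"ab"`, so its probability under `char_lm` is the sum of both source probabilities.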

Title: Diffusion LLMs can think EoS-by-EoS

Authors: Sarah Breckner, Sebastian Schuster
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.05197
Pdf URL: https://arxiv.org/pdf/2603.05197
Copy Paste: [[2603.05197]] Diffusion LLMs can think EoS-by-EoS(https://arxiv.org/abs/2603.05197)
Keywords: llm, prompt
Abstract: Diffusion LLMs have been proposed as an alternative to autoregressive LLMs, excelling especially at complex reasoning tasks with interdependent sub-goals. Curiously, this is particularly true if the generation length, i.e., the number of tokens the model has to output, is set to a much higher value than is required for providing the correct answer to the task, and the model pads its answer with end-of-sequence (EoS) tokens. We hypothesize that diffusion models think EoS-by-EoS, that is, they use the representations of EoS tokens as a hidden scratchpad, which allows them to solve harder reasoning problems. We experiment with the diffusion models LLaDA1.5, LLaDA2.0-mini, and Dream-v0 on the tasks Addition, Entity Tracking, and Sudoku. In a controlled prompting experiment, we confirm that adding EoS tokens improves the LLMs' reasoning capabilities. To further verify whether they serve as space for hidden computations, we patch the hidden states of the EoS tokens with those of a counterfactual generation, which frequently changes the generated output to the counterfactual. The success of the causal intervention underscores that the EoS tokens, which one may expect to be devoid of meaning, carry information on the problem to solve. The behavioral experiments and the causal interventions indicate that diffusion LLMs can indeed think EoS-by-EoS.
摘要：扩散法学硕士已被提议作为自回归法学硕士的替代方案，尤其擅长具有相互依赖的子目标的复杂推理任务。奇怪的是，如果将生成长度（即模型必须输出的标记数量）设置为比为任务提供正确答案所需的值高得多的值，并且模型用序列结束（EoS）标记填充其答案，则尤其如此。我们假设扩散模型认为 EoS-by-EoS，也就是说，它们使用 EoS 代币的表示作为隐藏的暂存器，这使它们能够解决更难的推理问题。我们在加法、实体跟踪和数独任务上使用扩散模型 LLaDA1.5、LLaDA2.0-mini 和 Dream-v0 进行实验。在受控提示实验中，我们确认添加 EoS 代币可以提高 LLM 的推理能力。为了进一步验证它们是否充当隐藏计算的空间，我们将 EoS 令牌的隐藏状态与反事实生成的隐藏状态进行修补，这经常将生成的输出更改为反事实。因果干预的成功强调了 EoS 代币（人们可能认为它没有任何意义）携带了要解决的问题的信息。行为实验和因果干预表明，扩散法学硕士确实可以逐个 EoS 进行思考。

Title: Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding

Authors: Ofir Ben Shoham
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.05210
Pdf URL: https://arxiv.org/pdf/2603.05210
Copy Paste: [[2603.05210]] Balancing Coverage and Draft Latency in Vocabulary Trimming for Faster Speculative Decoding(https://arxiv.org/abs/2603.05210)
Keywords: language model
Abstract: Speculative decoding accelerates inference for Large Language Models by using a lightweight draft model to propose candidate tokens that are verified in parallel by a larger target model. Prior work shows that the draft model often dominates speculative decoding latency, since it generates tokens sequentially and incurs high cost from its language modeling head as vocabulary size grows. This exposes a fundamental trade-off in draft model design: larger vocabularies improve token coverage and agreement with the target model, but incur higher draft latency, while smaller vocabularies reduce latency at the risk of missing tokens required for accurate draft generation. We address this trade-off through vocabulary trimming for draft models, motivated by the observation that domain-specific workloads use only a small fraction of the full vocabulary. We cast draft vocabulary selection as a constrained optimization problem that balances token coverage and draft latency. Coverage is computed over assistant responses in the training data, while latency is estimated using architecture-aware FLOPs that capture the cost of the language modeling head as a function of vocabulary size. We optimize a utility function with a Tree-structured Parzen Estimator to efficiently explore the coverage-latency Pareto frontier under a minimum coverage constraint. Experiments show improved speculative decoding throughput while reducing draft vocabularies by up to 97% with high coverage. On domain-specific tasks, we achieve up to 16% latency reduction and 20% throughput improvement, and up to 6.7% throughput gains on diverse out-of-distribution tasks.
摘要：推测性解码通过使用轻量级草稿模型来提出由更大的目标模型并行验证的候选标记，从而加速大型语言模型的推理。先前的工作表明，草稿模型通常会主导推测性解码延迟，因为它会顺序生成令牌，并且随着词汇量的增长，其语言建模头会产生很高的成本。这暴露了草稿模型设计中的一个基本权衡：较大的词汇表可以提高令牌覆盖率以及与目标模型的一致性，但会导致较高的草稿延迟，而较小的词汇表可以减少延迟，但存在丢失准确草稿生成所需的令牌的风险。我们通过对草稿模型的词汇量修剪来解决这种权衡问题，其动机是观察到特定领域的工作负载仅使用完整词汇量的一小部分。我们将草稿词汇选择视为一个平衡令牌覆盖率和草稿延迟的约束优化问题。覆盖范围是根据训练数据中的助理响应计算的，而延迟是使用架构感知的 FLOP 来估计的，这些 FLOP 捕获语言建模头的成本作为词汇量的函数。我们使用树结构 Parzen 估计器优化效用函数，以在最小覆盖约束下有效地探索覆盖延迟帕累托前沿。实验表明，推测解码吞吐量得到提高，同时草稿词汇量减少高达 97%，且覆盖率高。在特定于域的任务上，我们实现了高达 16% 的延迟减少和 20% 的吞吐量提高，并且在各种分布外任务上实现了高达 6.7% 的吞吐量增益。
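Illustrative sketch (not from the paper): the coverage side of the trade-off described in the abstract can be measured directly from token counts. The code below ranks tokens by frequency, computes cumulative coverage, and returns the smallest frequency-ranked vocabulary that meets a minimum coverage. This greedy scan is a simplification swapped in for illustration; the paper optimizes a utility with a Tree-structured Parzen Estimator and a FLOPs-based latency estimate, neither of which is modeled here. The function names are hypothetical.

```python
from collections import Counter


def coverage_curve(token_ids):
    """Rank tokens by frequency and return (ranked_tokens, cumulative_coverage)."""
    counts = Counter(token_ids)
    total = sum(counts.values())
    ranked = [tok for tok, _ in counts.most_common()]
    covered = 0
    curve = []
    for tok in ranked:
        covered += counts[tok]
        curve.append(covered / total)  # fraction of occurrences kept so far
    return ranked, curve


def smallest_vocab_for_coverage(token_ids, min_coverage):
    """Smallest frequency-ranked vocabulary whose coverage is >= min_coverage."""
    ranked, curve = coverage_curve(token_ids)
    for k, cov in enumerate(curve, start=1):
        if cov >= min_coverage:
            return ranked[:k], cov
    # Unreachable coverage (or empty input): return everything that was seen.
    return ranked, (curve[-1] if curve else 0.0)
```

Since the cost of the language modeling head grows with vocabulary size, the length of the returned vocabulary can serve as a rough stand-in for draft latency when comparing candidate coverage thresholds.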

Title: VietJobs: A Vietnamese Job Advertisement Dataset

Authors: Hieu Pham Dinh, Hung Nguyen Huy, Mo El-Haj
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.05262
Pdf URL: https://arxiv.org/pdf/2603.05262
Copy Paste: [[2603.05262]] VietJobs: A Vietnamese Job Advertisement Dataset(https://arxiv.org/abs/2603.05262)
Keywords: language model, llm
Abstract: VietJobs is the first large-scale, publicly available corpus of Vietnamese job advertisements, comprising 48,092 postings and over 15 million words collected from all 34 provinces and municipalities across Vietnam. The dataset provides extensive linguistic and structured information, including job titles, categories, salaries, skills, and employment conditions, covering 16 occupational domains and multiple employment types (full-time, part-time, and internship). Designed to support research in natural language processing and labour market analytics, VietJobs captures substantial linguistic, regional, and socio-economic diversity. We benchmark several generative large language models (LLMs) on two core tasks: job category classification and salary estimation. Instruction-tuned models such as Qwen2.5-7B-Instruct and Llama-SEA-LION-v3-8B-IT demonstrate notable gains under few-shot and fine-tuned settings, while highlighting challenges in multilingual and Vietnamese-specific modelling for structured labour market prediction. VietJobs establishes a new benchmark for Vietnamese NLP and offers a valuable foundation for future research on recruitment language, socio-economic representation, and AI-driven labour market analysis. All code and resources are available at: this https URL.
摘要：VietJobs 是第一个大规模、公开的越南招聘广告语料库，包含从越南所有 34 个省市收集的 48,092 个帖子和超过 1500 万字。该数据集提供了广泛的语言和结构化信息，包括职称、类别、薪资、技能和就业条件，涵盖 16 个职业领域和多种就业类型（全职、兼职和实习）。 VietJobs 旨在支持自然语言处理和劳动力市场分析方面的研究，捕捉了大量的语言、区域和社会经济多样性。我们在两个核心任务上对几个生成式大语言模型（LLM）进行了基准测试：工作类别分类和薪资估算。 Qwen2.5-7B-Instruct 和 Llama-SEA-LION-v3-8B-IT 等指令调整模型在少量样本和微调设置下表现出显着的收益，同时凸显了结构化劳动力市场预测的多语言和越南特定建模所面临的挑战。 VietJobs 为越南 NLP 建立了新的基准，并为未来招聘语言、社会经济代表性和人工智能驱动的劳动力市场分析的研究提供了宝贵的基础。所有代码和资源均可在以下位置获取：此 https URL。

Title: Oral to Web: Digitizing 'Zero Resource' Languages of Bangladesh

Authors: Mohammad Mamun Or Rashid
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2603.05272
Pdf URL: https://arxiv.org/pdf/2603.05272
Copy Paste: [[2603.05272]] Oral to Web: Digitizing 'Zero Resource' Languages of Bangladesh(https://arxiv.org/abs/2603.05272)
Keywords: prompt
Abstract: We present the Multilingual Cloud Corpus, the first national-scale, parallel, multimodal linguistic dataset of Bangladesh's ethnic and indigenous languages. Despite being home to approximately 40 minority languages spanning four language families, Bangladesh has lacked a systematic, cross-family digital corpus for these predominantly oral, computationally "zero resource" varieties, 14 of which are classified as endangered. Our corpus comprises 85792 structured textual entries, each containing a Bengali stimulus text, an English translation, and an IPA transcription, together with approximately 107 hours of transcribed audio recordings, covering 42 language varieties from the Tibeto-Burman, Indo-European, Austro-Asiatic, and Dravidian families, plus two genetically unclassified languages. The data were collected through systematic fieldwork over 90 days across nine districts of Bangladesh, involving 16 data collectors, 77 speakers, and 43 validators, following a predefined elicitation template of 2224 unique items organized at three levels of linguistic granularity: isolated lexical items (475 words across 22 semantic domains), grammatical constructions (887 sentences across 21 categories including verbal conjugation paradigms), and directed speech (862 prompts across 46 conversational scenarios). Post-field processing included IPA transcription by 10 linguists with independent adjudication by 6 reviewers. The complete dataset is publicly accessible through the Multilingual Cloud platform (this http URL), providing searchable access to annotated audio and textual data for all documented varieties. We describe the corpus design, fieldwork methodology, dataset structure, and per-language coverage, and discuss implications for endangered language documentation, low-resource NLP, and digital preservation in linguistically diverse developing countries.
摘要：我们推出了多语言云语料库，这是第一个全国范围的、并行的、多模态的孟加拉国民族和土著语言的语言数据集。尽管孟加拉国拥有跨越四个语系的大约 40 种少数民族语言，但孟加拉国缺乏一个系统的、跨语系的数字语料库来存储这些主要是口头的、计算上“零资源”的语言品种，其中 14 种被列为濒危语言。我们的语料库包含 85792 个结构化文本条目，每个条目包含孟加拉语刺激文本、英语翻译和国际音标转录，以及大约 107 小时的转录录音，涵盖藏缅语、印欧语、奥亚语和德拉威语系的 42 种语言变体，以及两种基因上未分类的语言。这些数据是通过在孟加拉国九个地区进行系统性实地工作超过 90 天收集的，涉及 16 名数据收集者、77 名发言者和 43 名验证者，遵循预定义的启发模板，该模板包含按三个语言粒度级别组织的 2224 个独特项目：孤立的词汇项目（跨 22 个语义域的 475 个单词）、语法结构（跨 21 个类别的 887 个句子，包括言语变形范式）和定向语音（862 个提示）跨越 46 个对话场景）。现场后处理包括由 10 名语言学家进行 IPA 转录，并由 6 名审稿人进行独立裁决。完整的数据集可通过多语言云平台（此http URL）公开访问，提供对所有记录品种的带注释的音频和文本数据的可搜索访问。我们描述了语料库设计、实地调查方法、数据集结构和每种语言的覆盖范围，并讨论了对语言多样化的发展中国家的濒危语言文档、低资源 NLP 和数字保存的影响。
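Note: the entry structure and elicitation template described above can be sketched as a small record type. This is only an illustration: the class name `CorpusEntry`, the field names, and the level keys are hypothetical and are not taken from the Multilingual Cloud platform. The three counts are the ones stated in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusEntry:
    """One structured textual entry (hypothetical field names)."""
    language: str              # one of the 42 documented varieties
    level: str                 # key of TEMPLATE_ITEMS below
    bengali_stimulus: str      # Bengali stimulus text
    english_translation: str   # English translation
    ipa_transcription: str     # IPA transcription


# Elicitation template sizes as reported in the abstract.
TEMPLATE_ITEMS = {
    "lexical": 475,          # isolated words, 22 semantic domains
    "grammatical": 887,      # sentences, 21 categories
    "directed_speech": 862,  # prompts, 46 conversational scenarios
}

# The three levels add up to the stated 2224 unique items.
TOTAL_ITEMS = sum(TEMPLATE_ITEMS.values())
```

The per-level counts sum to the 2224 template items quoted in the abstract, which is a quick consistency check on the reported figures.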

Title: Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution

Authors: Qiao Jin, Yin Fang, Lauren He, Yifan Yang, Guangzhi Xiong, Zhizheng Wang, Nicholas Wan, Joey Chan, Donald C. Comeau, Robert Leaman, Charalampos S. Floudas, Aidong Zhang, Michael F. Chiang, Yifan Peng, Zhiyong Lu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.05308
Pdf URL: https://arxiv.org/pdf/2603.05308
Copy Paste: [[2603.05308]] Med-V1: Small Language Models for Zero-shot and Scalable Biomedical Evidence Attribution(https://arxiv.org/abs/2603.05308)
Keywords: language model, gpt, llm, hallucination
Abstract: Assessing whether an article supports an assertion is essential for hallucination detection and claim verification. While large language models (LLMs) have the potential to automate this task, achieving strong performance requires frontier models such as GPT-5 that are prohibitively expensive to deploy at scale. To efficiently perform biomedical evidence attribution, we present Med-V1, a family of small language models with only three billion parameters. Trained on high-quality synthetic data newly developed in this study, Med-V1 substantially outperforms (+27.0% to +71.3%) its base models on five biomedical benchmarks unified into a verification format. Despite its smaller size, Med-V1 performs comparably to frontier LLMs such as GPT-5 while also providing high-quality explanations for its predictions. We use Med-V1 to conduct a first-of-its-kind use case study that quantifies hallucinations in LLM-generated answers under different citation instructions. Results show that the format instruction strongly affects citation validity and hallucination, with GPT-5 generating more claims but exhibiting hallucination rates similar to GPT-4o. Additionally, we present a second use case showing that Med-V1 can automatically identify high-stakes evidence misattributions in clinical practice guidelines, revealing potentially negative public health impacts that are otherwise challenging to identify at scale. Overall, Med-V1 provides an efficient and accurate lightweight alternative to frontier LLMs for practical and real-world applications in biomedical evidence attribution and verification tasks. Med-V1 is available at this https URL.
摘要：评估一篇文章是否支持某个断言对于幻觉检测和声明验证至关重要。虽然大型语言模型 (LLM) 有潜力自动执行此任务，但要实现强大的性能需要 GPT-5 等前沿模型，而大规模部署的成本却极其昂贵。为了有效地执行生物医学证据归因，我们提出了 Med-V1，这是一个只有 30 亿个参数的小语言模型家族。经过本研究中新开发的高质量合成数据的训练，Med-V1 在统一为验证格式的五个生物医学基准上大大优于（+27.0% 至 +71.3%）其基本模型。尽管规模较小，Med-V1 的表现与 GPT-5 等前沿 LLM 相当，并且能为其预测提供高质量的解释。我们使用 Med-V1 进行了首个用例研究，该研究量化了 LLM 在不同引文说明下生成的答案中的幻觉。结果表明，格式指令强烈影响引文有效性和幻觉，GPT-5 产生更多的声明，但表现出与 GPT-4o 相似的幻觉率。此外，我们提出了第二个用例，表明 Med-V1 可以自动识别临床实践指南中的高风险证据错误归因，揭示潜在的负面公共健康影响，否则很难大规模识别。总体而言，Med-V1 为前沿 LLM 提供了一种高效、准确的轻量级替代方案，可用于生物医学证据归因和验证任务中的实际和现实应用。 Med-V1 可通过此 https URL 获取。

Title: PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration

Authors: Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.05314
Pdf URL: https://arxiv.org/pdf/2603.05314
Copy Paste: [[2603.05314]] PersianPunc: A Large-Scale Dataset and BERT-Based Approach for Persian Punctuation Restoration(https://arxiv.org/abs/2603.05314)
Keywords: language model
Abstract: Punctuation restoration is essential for improving the readability and downstream utility of automatic speech recognition (ASR) outputs, yet remains underexplored for Persian despite its importance. We introduce PersianPunc, a large-scale, high-quality dataset of 17 million samples for Persian punctuation restoration, constructed through systematic aggregation and filtering of existing textual resources. We formulate punctuation restoration as a token-level sequence labeling task and fine-tune ParsBERT to achieve strong performance. Through comparative evaluation, we demonstrate that while large language models can perform punctuation restoration, they suffer from critical limitations: over-correction tendencies that introduce undesired edits beyond punctuation insertion (particularly problematic for speech-to-text pipelines) and substantially higher computational requirements. Our lightweight BERT-based approach achieves a macro-averaged F1 score of 91.33% on our test set while maintaining efficiency suitable for real-time applications. We make our dataset (this https URL) and model (this https URL) publicly available to facilitate future research in Persian NLP and provide a scalable framework applicable to other morphologically rich, low-resource languages.
摘要：标点符号恢复对于提高自动语音识别 (ASR) 输出的可读性和下游实用性至关重要，但尽管波斯语很重要，但其研究仍然不足。我们引入了 PersianPunc，这是一个包含 1700 万个波斯语标点符号恢复样本的大规模、高质量数据集，是通过对现有文本资源进行系统聚合和过滤而构建的。我们将标点符号恢复制定为令牌级序列标记任务，并对 ParsBERT 进行微调以实现强大的性能。通过比较评估，我们证明，虽然大型语言模型可以执行标点符号恢复，但它们受到严重限制：过度校正倾向会引入标点符号插入之外的不需要的编辑（对于语音到文本管道尤其有问题）以及更高的计算要求。我们基于 BERT 的轻量级方法在我们的测试集上实现了 91.33% 的宏观平均 F1 分数，同时保持了适合实时应用程序的效率。我们公开我们的数据集（此 https URL）和模型（此 https URL），以促进波斯语 NLP 的未来研究，并提供适用于其他形态丰富、资源匮乏的语言的可扩展框架。
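Note: the token-level sequence labeling formulation mentioned above can be illustrated with a round trip between punctuated text and per-token labels. This is a minimal sketch and not the authors' preprocessing: the label set `PUNCT`, the `"O"` (no punctuation) tag, and both function names are assumptions made for illustration.

```python
# Hypothetical label set: Latin marks plus Persian comma, semicolon, question mark.
PUNCT = {".", "!", ":", "\u060c", "\u061b", "\u061f"}


def to_labels(text):
    """Split punctuated text into (tokens, labels).

    Each token is labeled with the mark that follows it, or "O" for none.
    """
    tokens, labels = [], []
    for word in text.split():
        mark = "O"
        if len(word) > 1 and word[-1] in PUNCT:
            word, mark = word[:-1], word[-1]
        tokens.append(word)
        labels.append(mark)
    return tokens, labels


def restore(tokens, labels):
    """Re-attach predicted marks to unpunctuated tokens."""
    return " ".join(t + ("" if lab == "O" else lab) for t, lab in zip(tokens, labels))
```

A model in this framing only predicts one label per token, so it cannot rewrite the words themselves. That is the property the abstract contrasts with the over-correction tendencies of large language models.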

Title: DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning

Authors: Mohammad Mahdi Moradi, Sudhir Mudur
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.05357
Pdf URL: https://arxiv.org/pdf/2603.05357
Copy Paste: [[2603.05357]] DiSCTT: Consensus-Guided Self-Curriculum for Efficient Test-Time Adaptation in Reasoning(https://arxiv.org/abs/2603.05357)
Keywords: language model
Abstract: Test-time adaptation offers a promising avenue for improving reasoning performance in large language models without additional supervision, but existing approaches often apply a uniform optimization objective across all inputs, leading to inefficient or unstable adaptation on heterogeneous reasoning problems. We propose DiSCTT, a difficulty-aware, consensus-guided self-curriculum framework that dynamically allocates test-time optimization strategies based on instance-level epistemic uncertainty estimated from agreement among sampled reasoning trajectories. Inputs with high consensus are consolidated via supervised fine-tuning using majority-agreed solutions as pseudo-labels, while low-consensus inputs are optimized via reinforcement learning with a consensus-regularized objective that encourages diversity under relevance constraints. Across a broad suite of mathematical and general reasoning benchmarks, DiSCTT consistently outperforms strong test-time adaptation baselines, achieving higher accuracy with reduced variance and substantially lower computation and wall-clock training times. These results demonstrate that explicitly accounting for instance difficulty and uncertainty enables more stable, efficient, and effective test-time adaptation for reasoning models.
摘要：测试时适应为在没有额外监督的情况下提高大型语言模型的推理性能提供了一种有前景的途径，但现有方法通常在所有输入上应用统一的优化目标，导致异构推理问题的适应效率低下或不稳定。我们提出了 DiSCTT，这是一种具有难度意识、以共识为导向的自学课程框架，它基于根据采样推理轨迹之间的一致性估计的实例级认知不确定性来动态分配测试时间优化策略。使用多数人同意的解决方案作为伪标签，通过监督微调来巩固具有高度共识的输入，而低共识输入则通过强化学习进行优化，其共识正则化目标鼓励相关性约束下的多样性。在一系列广泛的数学和一般推理基准中，DiSCTT 始终优于强大的测试时间适应基线，在减少方差的同时实现更高的准确度，并大幅缩短计算和挂钟训练时间。这些结果表明，明确考虑实例难度和不确定性可以使推理模型的测试时间适应更加稳定、高效和有效。
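Note: the routing step described above (high-consensus inputs go to supervised fine-tuning with a majority pseudo-label, low-consensus inputs go to reinforcement learning) can be sketched as follows. The abstract does not give the uncertainty estimator or the cut-off, so the majority-vote agreement ratio, the `threshold` value, and the function name are all assumptions.

```python
from collections import Counter
from typing import List, Optional, Tuple


def route_by_consensus(answers: List[str], threshold: float = 0.6) -> Tuple[str, Optional[str], float]:
    """Return (route, pseudo_label, agreement) for one test input.

    answers: final answers extracted from sampled reasoning trajectories.
    threshold: hypothetical agreement cut-off between the two branches.
    """
    if not answers:
        raise ValueError("need at least one sampled answer")
    label, count = Counter(answers).most_common(1)[0]
    agreement = count / len(answers)
    if agreement >= threshold:
        # High consensus: the majority answer serves as the pseudo-label.
        return "sft", label, agreement
    # Low consensus: no pseudo-label, the input goes to the RL branch.
    return "rl", None, agreement
```

For example, four samples with answers `["4", "4", "4", "5"]` have agreement 0.75 and would be routed to the supervised branch under this threshold.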

Title: Progressive Residual Warmup for Language Model Pretraining

Authors: Tianhao Chen, Xin Xu, Lu Yin, Hao Chen, Yang Wang, Shizhe Diao, Can Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.05369
Pdf URL: https://arxiv.org/pdf/2603.05369
Copy Paste: [[2603.05369]] Progressive Residual Warmup for Language Model Pretraining(https://arxiv.org/abs/2603.05369)
Keywords: language model
Abstract: Transformer architectures serve as the backbone for most modern Large Language Models, so their pretraining stability and convergence speed are of central concern. Motivated by the logical dependency of sequentially stacked layers, we propose Progressive Residual Warmup (ProRes) for language model pretraining. ProRes implements an "early layer learns first" philosophy by multiplying each layer's residual with a scalar that gradually warms up from 0 to 1, with deeper layers taking longer warmup steps. In this way, deeper layers wait for early layers to settle into a more stable regime before contributing to learning. We demonstrate the effectiveness of ProRes through pretraining experiments across various model scales, as well as normalization and initialization schemes. Comprehensive analysis shows that ProRes not only stabilizes pretraining but also introduces a unique optimization trajectory, leading to faster convergence, stronger generalization and better downstream performance. Our code is available at this https URL.
摘要：Transformer 架构是大多数现代大型语言模型的支柱，因此它们的预训练稳定性和收敛速度是核心问题。受顺序堆叠层的逻辑依赖性的启发，我们提出了用于语言模型预训练的渐进残差预热（ProRes）。 ProRes 通过将每个层的残差乘以从 0 逐渐预热到 1 的标量来实现“早期层首先学习”的理念，更深的层需要更长的预热步骤。通过这种方式，更深的层会等待早期的层进入更稳定的状态，然后再为学习做出贡献。我们通过各种模型尺度的预训练实验以及归一化和初始化方案证明了 ProRes 的有效性。综合分析表明，ProRes不仅稳定了预训练，还引入了独特的优化轨迹，从而导致更快的收敛、更强的泛化和更好的下游性能。我们的代码可以在这个 https URL 上找到。
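Note: the schedule described above (a per-layer scalar that warms up from 0 to 1, with deeper layers taking longer) can be sketched in a few lines. The abstract does not specify the shape of the warmup or how its length depends on depth, so the linear ramp and the parameters `base_steps` and `per_layer_steps` are assumptions, as is the reading of "residual" as the output of the residual branch.

```python
def prores_scale(step: int, layer: int, base_steps: int = 100, per_layer_steps: int = 50) -> float:
    """Residual multiplier in [0, 1] for a 0-indexed layer at a training step.

    Assumption: linear warmup whose length grows linearly with depth.
    """
    warmup = base_steps + per_layer_steps * layer
    return min(1.0, step / warmup)


def residual_update(x: float, branch_out: float, step: int, layer: int) -> float:
    """x + alpha * f(x): only the residual branch is scaled, the skip path is not."""
    return x + prores_scale(step, layer) * branch_out
```

With these hypothetical defaults, layer 0 reaches a multiplier of 1.0 at step 100 while layer 2 is still at 0.5, which matches the "early layer learns first" description.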

Title: An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs

Authors: Deshan Sumanathilaka, Nicholas Micallef, Julian Hough
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.05400
Pdf URL: https://arxiv.org/pdf/2603.05400
Copy Paste: [[2603.05400]] An Exploration-Analysis-Disambiguation Reasoning Framework for Word Sense Disambiguation with Low-Parameter LLMs(https://arxiv.org/abs/2603.05400)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Word Sense Disambiguation (WSD) remains a key challenge in Natural Language Processing (NLP), especially when dealing with rare or domain-specific senses that are often misinterpreted. While modern high-parameter Large Language Models (LLMs) such as GPT-4-Turbo have shown state-of-the-art WSD performance, their computational and energy demands limit scalability. This study investigates whether low-parameter LLMs (<4B parameters) can achieve comparable results through fine-tuning strategies that emphasize reasoning-driven sense identification. Using the FEWS dataset augmented with semi-automated, rationale-rich annotations, we fine-tune eight small-scale open-source LLMs (e.g. Gemma and Qwen). Our results reveal that Chain-of-Thought (CoT)-based reasoning combined with neighbour-word analysis achieves performance comparable to GPT-4-Turbo in zero-shot settings. Importantly, Gemma-3-4B and Qwen-3-4B models consistently outperform all medium-parameter baselines and state-of-the-art models on FEWS, with robust generalization to unseen senses. Furthermore, evaluation on the unseen "Fool Me If You Can" dataset confirms strong cross-domain adaptability without task-specific fine-tuning. This work demonstrates that with carefully crafted reasoning-centric fine-tuning, low-parameter LLMs can deliver accurate WSD while substantially reducing computational and energy demands.
摘要：词义消歧 (WSD) 仍然是自然语言处理 (NLP) 中的一个关键挑战，特别是在处理经常被误解的罕见或特定领域的词义时。虽然 GPT-4-Turbo 等现代高参数大型语言模型 (LLM) 已展现出最先进的 WSD 性能，但其计算和能源需求限制了可扩展性。本研究调查低参数 LLM（<4B 参数）是否可以通过强调推理驱动的词义识别的微调策略获得可比较的结果。使用添加了半自动化、原理丰富注释的 FEWS 数据集，我们对八个小型开源 LLM（例如 Gemma 和 Qwen）进行了微调。我们的结果表明，基于思想链 (CoT) 的推理与邻近词分析相结合，在零样本设置中实现了与 GPT-4-Turbo 相当的性能。重要的是，Gemma-3-4B 和 Qwen-3-4B 模型始终优于 FEWS 上的所有中等参数基线和最先进的模型，并对未见过的词义具有强大的泛化能力。此外，对未见的“Fool Me If You Can”数据集的评估证实了强大的跨域适应性，无需针对特定任务进行微调。这项工作表明，通过精心设计的以推理为中心的微调，低参数 LLM 可以提供准确的 WSD，同时大幅减少计算和能源需求。

Title: Ensembling Language Models with Sequential Monte Carlo

Authors: Robin Shing Moon Chan, Tianyu Liu, Samuel Kiegeland, Clemente Pasti, Jacob Hoover Vigly, Timothy J. O'Donnell, Ryan Cotterell, Tim Vieira
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.05432
Pdf URL: https://arxiv.org/pdf/2603.05432
Copy Paste: [[2603.05432]] Ensembling Language Models with Sequential Monte Carlo(https://arxiv.org/abs/2603.05432)
Keywords: language model, prompt
Abstract: Practitioners have access to an abundance of language models and prompting strategies for solving many language modeling tasks; yet prior work shows that modeling performance is highly sensitive to both choices. Classical machine learning ensembling techniques offer a principled approach: aggregate predictions from multiple sources to achieve better performance than any single one. However, applying ensembling to language models during decoding is challenging: naively aggregating next-token probabilities yields samples from a locally normalized, biased approximation of the generally intractable ensemble distribution over strings. In this work, we introduce a unified framework for composing $K$ language models into $f$-ensemble distributions for a wide range of functions $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$. To sample from these distributions, we propose a byte-level sequential Monte Carlo (SMC) algorithm that operates in a shared character space, enabling ensembles of models with mismatching vocabularies and consistent sampling in the limit. We evaluate a family of $f$-ensembles across prompt and model combinations for various structured text generation tasks, highlighting the benefits of alternative aggregation strategies over traditional probability averaging, and showing that better posterior approximations can yield better ensemble performance.
摘要：从业者可以获得大量的语言模型和解决许多语言建模任务的提示策略；然而之前的工作表明，建模性能对这两种选择都高度敏感。经典的机器学习集成技术提供了一种原则性的方法：聚合来自多个来源的预测，以实现比任何单一来源更好的性能。然而，在解码过程中将集成应用于语言模型具有挑战性：天真地聚合下一个标记的概率会从字符串上通常难以处理的集成分布的局部归一化、有偏差的近似中产生样本。在这项工作中，我们引入了一个统一的框架，用于将 $K$ 语言模型组合成 $f$-集成分布，适用于各种函数 $f\colon\mathbb{R}_{\geq 0}^{K}\to\mathbb{R}_{\geq 0}$。为了从这些分布中进行采样，我们提出了一种字节级顺序蒙特卡罗（SMC）算法，该算法在共享字符空间中运行，从而实现具有不匹配词汇表的模型集成和极限内的一致采样。我们针对各种结构化文本生成任务评估了一系列提示和模型组合的 $f$ 集成，强调了替代聚合策略相对于传统概率平均的优势，并表明更好的后验近似可以产生更好的集成性能。
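Note: the claim above that naively aggregating next-token probabilities gives a locally normalized, biased approximation of the ensemble distribution over strings can be checked on a toy case. The sketch below uses two made-up models over length-2 strings and takes the product as the aggregation function; the models, the vocabulary, and the function names are illustrative assumptions, and the paper's SMC sampler is not reproduced here.

```python
from itertools import product

VOCAB = ("a", "b")

# Two toy models over length-2 strings: P(first) and P(second | first).
# They agree on the first token but disagree strongly after "a".
P1 = {"first": {"a": 0.5, "b": 0.5},
      "second": {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.5, "b": 0.5}}}
P2 = {"first": {"a": 0.5, "b": 0.5},
      "second": {"a": {"a": 0.1, "b": 0.9}, "b": {"a": 0.5, "b": 0.5}}}


def string_prob(model, s):
    return model["first"][s[0]] * model["second"][s[0]][s[1]]


def normalize(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def global_product():
    """Product ensemble normalized once over whole strings (the target)."""
    return normalize({"".join(s): string_prob(P1, s) * string_prob(P2, s)
                      for s in product(VOCAB, repeat=2)})


def local_product():
    """Product of next-token probabilities, renormalized at every step."""
    first = normalize({t: P1["first"][t] * P2["first"][t] for t in VOCAB})
    out = {}
    for t in VOCAB:
        second = normalize({u: P1["second"][t][u] * P2["second"][t][u] for u in VOCAB})
        for u in VOCAB:
            out[t + u] = first[t] * second[u]
    return out
```

In this toy case the string-level product assigns total mass 0.045 / 0.17 (about 0.26) to strings starting with "a", because the two models disagree about what follows "a". The step-by-step version assigns 0.5, since it cannot see that future disagreement when choosing the first token.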

Title: FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling

Authors: Ted Zadouri, Markus Hoehnerbach, Jay Shah, Timmy Liu, Vijay Thakkar, Tri Dao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.05451
Pdf URL: https://arxiv.org/pdf/2603.05451
Copy Paste: [[2603.05451]] FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling(https://arxiv.org/abs/2603.05451)
Keywords: language model
Abstract: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. While FlashAttention-3 optimized attention for Hopper GPUs through asynchronous execution and warp specialization, it primarily targets the H100 architecture. The AI industry has rapidly transitioned to deploying Blackwell-based systems such as the B200 and GB200, which exhibit fundamentally different performance characteristics due to asymmetric hardware scaling: tensor core throughput doubles while other functional units (shared memory bandwidth, exponential units) scale more slowly or remain unchanged. We develop several techniques to address these shifting bottlenecks on Blackwell GPUs: (1) redesigned pipelines that exploit fully asynchronous MMA operations and larger tile sizes, (2) software-emulated exponential and conditional softmax rescaling that reduces non-matmul operations, and (3) leveraging tensor memory and the 2-CTA MMA mode to reduce shared memory traffic and atomic adds in the backward pass. We demonstrate that our method, FlashAttention-4, achieves up to 1.3$\times$ speedup over cuDNN 9.13 and 2.7$\times$ over Triton on B200 GPUs with BF16, reaching up to 1613 TFLOPs/s (71% utilization). Beyond algorithmic innovations, we implement FlashAttention-4 entirely in CuTe-DSL embedded in Python, achieving 20-30$\times$ faster compile times compared to traditional C++ template-based approaches while maintaining full expressivity.
摘要：注意力作为无处不在的 Transformer 架构的核心层，是大型语言模型和长上下文应用的瓶颈。虽然 FlashAttention-3 通过异步执行和扭曲专门化优化了 Hopper GPU 的注意力，但它主要针对 H100 架构。 AI 行业已迅速过渡到部署基于 Blackwell 的系统，例如 B200 和 GB200，由于不对称的硬件扩展，这些系统表现出根本不同的性能特征：张量核心吞吐量翻倍，而其他功能单元（共享内存带宽、指数单元）扩展速度更慢或保持不变。我们开发了几种技术来解决 Blackwell GPU 上的这些转移瓶颈：(1) 重新设计的管道，利用完全异步 MMA 操作和更大的图块大小；(2) 软件模拟的指数和条件 Softmax 重新缩放，减少非 matmul 操作；(3) 利用张量内存和 2-CTA MMA 模式来减少共享内存流量和后向传递中的原子添加。我们证明，我们的方法 FlashAttention-4 在使用 BF16 的 B200 GPU 上比 cuDNN 9.13 实现了高达 1.3$\times$ 的加速，比 Triton 实现了 2.7$\times$ 的加速，达到了 1613 TFLOPs/s（利用率为 71%）。除了算法创新之外，我们还完全在嵌入 Python 的 CuTe-DSL 中实现了 FlashAttention-4，与传统的基于 C++ 模板的方法相比，编译时间加快了 20-30 倍，同时保持了完整的表达能力。
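Note: one plausible reading of the "conditional softmax rescaling" mentioned above is an online softmax that skips the rescale of its running accumulators unless the running maximum grows by more than some threshold. The abstract does not state the actual rule, so the sketch below is an assumption-laden illustration in plain Python with scalar values; the threshold `tau` and the function name are hypothetical, and nothing here reflects the real kernel.

```python
import math


def attention_row(scores, values, block=2, tau=0.0):
    """Compute softmax(scores) . values block by block with a reference max.

    tau = 0 rescales whenever the block max exceeds the reference
    (standard online softmax); tau > 0 skips the rescale unless the max
    grows by more than tau. Returns (output, number_of_rescales).
    """
    m_ref = -math.inf  # reference max subtracted inside exp()
    denom = 0.0        # running softmax denominator
    acc = 0.0          # running weighted sum of values
    rescales = 0
    for i in range(0, len(scores), block):
        s_blk = scores[i:i + block]
        v_blk = values[i:i + block]
        blk_max = max(s_blk)
        if blk_max > m_ref + tau:
            factor = math.exp(m_ref - blk_max)  # 0.0 on the first block
            denom *= factor
            acc *= factor
            m_ref = blk_max
            rescales += 1
        for s, v in zip(s_blk, v_blk):
            p = math.exp(s - m_ref)  # bounded by exp(tau) when the rescale is skipped
            denom += p
            acc += p * v
    return acc / denom, rescales
```

Skipping a rescale does not change the result, because the numerator and the denominator share the same reference value and it cancels in the final division. The only cost is that intermediate exponentials may exceed 1, bounded by `exp(tau)`.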

Title: Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval

Authors: Artem Vazhentsev, Maria Marina, Daniil Moskovskiy, Sergey Pletenev, Mikhail Seleznyov, Mikhail Salnikov, Elena Tutubalina, Vasily Konovalov, Irina Nikishina, Alexander Panchenko, Viktor Moskvoretskii
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.05471
Pdf URL: https://arxiv.org/pdf/2603.05471
Copy Paste: [[2603.05471]] Leveraging LLM Parametric Knowledge for Fact Checking without Retrieval(https://arxiv.org/abs/2603.05471)
Keywords: language model, llm, agent
Abstract: Trustworthiness is a core research challenge for agentic AI systems built on Large Language Models (LLMs). To enhance trust, natural language claims from diverse sources, including human-written text, web content, and model outputs, are commonly checked for factuality by retrieving external knowledge and using an LLM to verify the faithfulness of claims to the retrieved evidence. As a result, such methods are constrained by retrieval errors and external data availability, while leaving the models' intrinsic fact-verification capabilities largely unused. We propose the task of fact-checking without retrieval, focusing on the verification of arbitrary natural language claims, independent of their source. To study this setting, we introduce a comprehensive evaluation framework focused on generalization, testing robustness to (i) long-tail knowledge, (ii) variation in claim sources, (iii) multilinguality, and (iv) long-form generation. Across 9 datasets, 18 methods and 3 models, our experiments indicate that logit-based approaches often underperform compared to those that leverage internal model representations. Building on this finding, we introduce INTRA, a method that exploits interactions between internal representations and achieves state-of-the-art performance with strong generalization. More broadly, our work establishes fact-checking without retrieval as a promising research direction that can complement retrieval-based frameworks, improve scalability, and enable the use of such systems as reward signals during training or as components integrated into the generation process.
摘要：可信度是基于大型语言模型 (LLM) 的代理人工智能系统的核心研究挑战。为了增强信任，通常通过检索外部知识并使用 LLM 验证声明对检索到的证据的忠实度，来检查来自不同来源（包括人工编写的文本、网络内容和模型输出）的自然语言声明的真实性。因此，此类方法受到检索错误和外部数据可用性的限制，同时使模型内在的事实验证能力基本上未被使用。我们提出了无需检索的事实检查任务，重点是验证任意自然语言声明，无论其来源如何。为了研究这种设置，我们引入了一个综合评估框架，重点关注泛化，测试对（i）长尾知识，（ii）声明来源的变化，（iii）多语言性和（iv）长格式生成的稳健性。在 9 个数据集、18 种方法和 3 个模型中，我们的实验表明，与利用内部模型表示的方法相比，基于 logit 的方法通常表现不佳。基于这一发现，我们引入了 INTRA，一种利用内部表示之间的交互并通过强泛化实现最先进性能的方法。更广泛地说，我们的工作将无需检索的事实检查作为一个有前途的研究方向，可以补充基于检索的框架，提高可扩展性，并允许使用此类系统作为训练期间的奖励信号或作为集成到生成过程中的组件。

Title: Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

Authors: Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.05488
Pdf URL: https://arxiv.org/pdf/2603.05488
Copy Paste: [[2603.05488]] Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought(https://arxiv.org/abs/2603.05488)
Keywords: gpt, chain-of-thought
Abstract: We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task difficulty-specific differences: The model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
摘要：我们提供了推理模型中表演性思维链（CoT）的证据，其中模型对其最终答案非常有信心，但继续生成令牌而不透露其内部信念。我们的分析比较了两个大型模型（DeepSeek-R1 671B 和 GPT-OSS 120B）的激活探测、早期强迫回答和 CoT 监视器，并发现任务难度特定的差异：模型的最终答案可以从 CoT 中比监视器能够说出的更早的激活中解码，特别是对于基于简单回忆的 MMLU 问题。我们将其与困难的多跳 GPQA-Diamond 问题中的真实推理进行对比。尽管如此，拐点（例如回溯、“啊哈”时刻）几乎只出现在调查显示出巨大信念转变的反应中，这表明这些行为追踪真正的不确定性，而不是习得的“推理剧场”。最后，探针引导的提前退出在 MMLU 上减少了高达 80% 的标记，在 GPQA-Diamond 上减少了 30%，且具有相似的精度，将注意力探测定位为检测执行推理和实现自适应计算的有效工具。
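Note: the probe-guided early exit mentioned above can be sketched as a stopping rule over per-step probe confidences. The abstract does not give the exit criterion, so the confidence `threshold`, the `patience` parameter, and both function names are assumptions; the probe itself is represented only by its output scores.

```python
def early_exit_step(probe_confidences, threshold=0.9, patience=1):
    """Index of the first CoT step at which to stop, or None if never.

    probe_confidences: per-step top probability a probe assigns to a final answer.
    patience: number of consecutive confident steps required (hypothetical knob).
    """
    streak = 0
    for i, conf in enumerate(probe_confidences):
        streak = streak + 1 if conf >= threshold else 0
        if streak >= patience:
            return i
    return None


def fraction_tokens_saved(step_token_counts, exit_step):
    """Share of CoT tokens avoided by stopping after `exit_step` (inclusive)."""
    if exit_step is None:
        return 0.0
    used = sum(step_token_counts[:exit_step + 1])
    return 1.0 - used / sum(step_token_counts)
```

Raising `patience` trades token savings for robustness against a single spuriously confident step. How the paper balances this is not stated in the abstract.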