2026-03-18

Title: Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context

Authors: Keivan Alizadeh, Parshin Shojaee, Minsik Cho, Mehrdad Farajtabar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.15653
Pdf URL: https://arxiv.org/pdf/2603.15653
Copy Paste: [[2603.15653]] Recursive Language Models Meet Uncertainty: The Surprising Effectiveness of Self-Reflective Program Search for Long Context(https://arxiv.org/abs/2603.15653)
Keywords: language model, long context, agent
Abstract: Long-context handling remains a core challenge for language models: even with extended context windows, models often fail to reliably extract, reason over, and use information across long contexts. Recent work such as Recursive Language Models (RLM) approaches this challenge agentically, decomposing long contexts into recursive sub-calls through programmatic interaction at inference time. While promising, the success of RLM critically depends on how these context-interaction programs are selected, which has remained largely unexplored. In this paper, we study this problem and introduce SRLM, a framework that augments programmatic context interaction with uncertainty-aware Self-Reflection. SRLM leverages three intrinsic signals: self-consistency, reasoning length, and verbalized confidence. These serve as complementary indicators of a model's internal uncertainty, and the model uses them to evaluate and compare candidate context-interaction programs. Extensive experiments across diverse benchmark datasets, context lengths, and backbone models show that SRLM consistently outperforms state-of-the-art baselines, yielding up to 22% improvement over RLM under the same time budget. Our findings show that recursion itself is not the primary driver of performance in RLM, and a simple self-reflective program search can match or surpass RLM without requiring self-query or explicit recursion mechanisms. We find that for context lengths within the model's window, RLMs with recursion often degrade performance relative to the base model, whereas SRLM yields consistent gains across both short and long contexts. We also find that RLM is less effective on semantically intensive tasks, where heuristic program search is insufficient and broader contextual understanding is required, while self-reflection in SRLM provides a semantic signal that better steers reasoning in these scenarios.
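A minimal sketch of how the three signals named in the SRLM abstract (self-consistency, reasoning length, verbalized confidence) could be combined to rank candidate context-interaction programs. The equal weighting, the `max_tokens` cap, the dictionary layout, and the assumption that shorter reasoning indicates lower uncertainty are all illustrative choices, not details taken from the paper.

```python
from collections import Counter

def score_candidate(answers, reasoning_tokens, confidence, max_tokens=2000):
    """Combine three uncertainty signals into one score (higher = more trusted).

    answers: answers from repeated runs of the same candidate program
    reasoning_tokens: length of the reasoning trace for this candidate
    confidence: the model's verbalized confidence in [0, 1]
    Equal weights and the max_tokens cap are assumptions for illustration.
    """
    # Self-consistency: share of runs that agree with the most common answer.
    top_count = Counter(answers).most_common(1)[0][1]
    consistency = top_count / len(answers)
    # Reasoning length: assumed here that shorter traces signal less uncertainty.
    brevity = 1.0 - min(reasoning_tokens, max_tokens) / max_tokens
    return (consistency + brevity + confidence) / 3.0

def select_program(candidates):
    """Pick the candidate (a dict, hypothetical layout) with the highest score."""
    return max(
        candidates,
        key=lambda c: score_candidate(c["answers"], c["reasoning_tokens"], c["confidence"]),
    )
```

Calling `select_program` on a list of candidate dictionaries returns the one whose runs agree most, reason most briefly, and report the highest confidence under this toy weighting.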

Title: MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences

Authors: Eric Wu, Kevin Wu, Jason Hom, Paul H. Yi, Angela Zhang, Alejandro Lozano, Jeff Nirschl, Jeff Tangney, Kevin Byram, Braydon Dymm, Narender Annapureddy, Eric Topol, David Ouyang, James Zou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15677
Pdf URL: https://arxiv.org/pdf/2603.15677
Copy Paste: [[2603.15677]] MedArena: Comparing LLMs for Medicine-in-the-Wild Clinician Preferences(https://arxiv.org/abs/2603.15677)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly central to clinician workflows, spanning clinical decision support, medical education, and patient communication. However, current evaluation methods for medical LLMs rely heavily on static, templated benchmarks that fail to capture the complexity and dynamics of real-world clinical practice, creating a dissonance between benchmark performance and clinical utility. To address these limitations, we present MedArena, an interactive evaluation platform that enables clinicians to directly test and compare leading LLMs using their own medical queries. Given a clinician-provided query, MedArena presents responses from two randomly selected models and asks the user to select the preferred response. Out of 1571 preferences collected across 12 LLMs up to November 1, 2025, Gemini 2.0 Flash Thinking, Gemini 2.5 Pro, and GPT-4o were the top three models by Bradley-Terry rating. Only one-third of clinician-submitted questions resembled factual recall tasks (e.g., MedQA), whereas the majority addressed topics such as treatment selection, clinical documentation, or patient communication, with ~20% involving multi-turn conversations. Additionally, clinicians cited depth and detail and clarity of presentation more often than raw factual accuracy when explaining their preferences, highlighting the importance of readability and clinical nuance. We also confirm that the model rankings remain stable even after controlling for style-related factors like response length and formatting. By grounding evaluation in real-world clinical questions and preferences, MedArena offers a scalable platform for measuring and improving the utility and efficacy of medical LLMs.
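The MedArena abstract ranks models by Bradley-Terry rating over pairwise preferences. As a rough illustration of that kind of rating, the sketch below fits Bradley-Terry strengths with the classic iterative (minorization-maximization) update. The win-matrix layout and the iteration count are assumptions; the platform's actual fitting procedure is not described in the abstract.

```python
def bradley_terry(wins, n_iter=200):
    """Estimate Bradley-Terry strengths from pairwise preference counts.

    wins[i][j] = number of times model i was preferred over model j
    (diagonal entries are assumed to be zero).
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0] * n
    for _ in range(n_iter):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Each comparison between i and j contributes 1 / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]
    return p
```

Under this model the probability that model i is preferred over model j is p[i] / (p[i] + p[j]), so sorting the returned strengths gives a leaderboard.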

Title: MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification

Authors: MiroMind Team: S. Bai, L. Bing, L. Lei, R. Li, X. Li, X. Lin, E. Min, L. Su, B. Wang, L. Wang, L. Wang, S. Wang, X. Wang, Y. Zhang, Z. Zhang, G. Chen, L. Chen, Z. Cheng, Y. Deng, Z. Huang, D. Ng, J. Ni, Q. Ren, X. Tang, B.L. Wang, H. Wang, N. Wang, C. Wei, Q. Wu, J. Xia, Y. Xiao, H. Xu, X. Xu, C. Xue, Z. Yang, Z. Yang, F. Ye, H. Ye, J. Yu, C. Zhang, W. Zhang, H. Zhao, P. Zhu
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.15726
Pdf URL: https://arxiv.org/pdf/2603.15726
Copy Paste: [[2603.15726]] MiroThinker-1.7 & H1: Towards Heavy-Duty Research Agents via Verification(https://arxiv.org/abs/2603.15726)
Keywords: agent
Abstract: We present MiroThinker-1.7, a new research agent designed for complex long-horizon reasoning tasks. Building on this foundation, we further introduce MiroThinker-H1, which extends the agent with heavy-duty reasoning capabilities for more reliable multi-step problem solving. In particular, MiroThinker-1.7 improves the reliability of each interaction step through an agentic mid-training stage that emphasizes structured planning, contextual reasoning, and tool interaction. This enables more effective multi-step interaction and sustained reasoning across complex tasks. MiroThinker-H1 further incorporates verification directly into the reasoning process at both local and global levels. Intermediate reasoning decisions can be evaluated and refined during inference, while the overall reasoning trajectory is audited to ensure that final answers are supported by coherent chains of evidence. Across benchmarks covering open-web research, scientific reasoning, and financial analysis, MiroThinker-H1 achieves state-of-the-art performance on deep research tasks while maintaining strong results on specialized domains. We also release MiroThinker-1.7 and MiroThinker-1.7-mini as open-source models, providing competitive research-agent capabilities with significantly improved efficiency.

Title: Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs

Authors: Yara Alakeel, Chatrine Qwaider, Hanan Aldarmaki, Sawsan Alqahtani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15773
Pdf URL: https://arxiv.org/pdf/2603.15773
Copy Paste: [[2603.15773]] Morphemes Without Borders: Evaluating Root-Pattern Morphology in Arabic Tokenizers and LLMs(https://arxiv.org/abs/2603.15773)
Keywords: language model, llm
Abstract: This work investigates how effectively large language models (LLMs) and their tokenization schemes represent and generate Arabic root-pattern morphology, probing whether they capture genuine morphological structure or rely on surface memorization. The Arabic morphological system provides a rich testbed for analyzing how LLMs handle complex, non-concatenative forms and how tokenization choices influence this process. Our study begins with an evaluation of morphological fidelity across Arabic and multilingual tokenizers against gold-standard segmentation, followed by an analysis of LLM performance in productive root-pattern generation using a newly developed test set. Our findings across seven Arabic-centric and multilingual LLMs and their respective tokenizers reveal that tokenizer morphological alignment is neither necessary nor sufficient for morphological generation, which calls into question the role of morphological tokenization in downstream performance.

Title: COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives

Authors: Azwad Anjum Islam, Tisa Islam Erana
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15897
Pdf URL: https://arxiv.org/pdf/2603.15897
Copy Paste: [[2603.15897]] COGNAC at SemEval-2026 Task 5: LLM Ensembles for Human-Level Word Sense Plausibility Rating in Challenging Narratives(https://arxiv.org/abs/2603.15897)
Keywords: llm, prompt, chain-of-thought
Abstract: We describe our system for SemEval-2026 Task 5, which requires rating the plausibility of given word senses of homonyms in short stories on a 5-point Likert scale. Systems are evaluated by the unweighted average of accuracy (within one standard deviation of mean human judgments) and Spearman Rank Correlation. We explore three prompting strategies using multiple closed-source commercial LLMs: (i) a baseline zero-shot setup, (ii) Chain-of-Thought (CoT) style prompting with structured reasoning, and (iii) a comparative prompting strategy for evaluating candidate word senses simultaneously. Furthermore, to account for the substantial inter-annotator variation present in the gold labels, we propose an ensemble setup by averaging model predictions. Our best official system, comprising an ensemble of LLMs across all three prompting strategies, placed 4th on the competition leaderboard with 0.88 accuracy and 0.83 Spearman's rho (0.86 average). Post-competition experiments with additional models further improved this performance to 0.92 accuracy and 0.85 Spearman's rho (0.89 average). We find that comparative prompting consistently improved performance across model families, and model ensembling significantly enhanced alignment with mean human judgments, suggesting that LLM ensembles are especially well suited for subjective semantic evaluation tasks involving multiple annotators.
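The COGNAC abstract describes the task score as the unweighted average of accuracy (a prediction counts as correct when it falls within one standard deviation of the mean human judgment) and Spearman rank correlation, and describes ensembling as averaging model predictions. The sketch below implements that description in plain Python. The function names and the tie-handling rule (average ranks) are assumptions, and the official scorer may differ in details.

```python
from statistics import mean

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

def task_score(preds, human_means, human_stds):
    """Unweighted average of within-one-std accuracy and Spearman's rho."""
    acc = mean(
        1.0 if abs(p - m) <= s else 0.0
        for p, m, s in zip(preds, human_means, human_stds)
    )
    return (acc + spearman(preds, human_means)) / 2

def ensemble(model_preds):
    """Average the ratings of several models item by item."""
    return [mean(col) for col in zip(*model_preds)]
```

Averaging tends to pull predictions toward the middle of the annotator distribution, which is one plausible reason an ensemble would score well on the within-one-standard-deviation accuracy term.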

Title: Agent-based imitation dynamics can yield efficiently compressed population-level vocabularies

Authors: Nathaniel Imel, Richard Futrell, Michael Franke, Noga Zaslavsky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15903
Pdf URL: https://arxiv.org/pdf/2603.15903
Copy Paste: [[2603.15903]] Agent-based imitation dynamics can yield efficiently compressed population-level vocabularies(https://arxiv.org/abs/2603.15903)
Keywords: agent
Abstract: Natural languages have been argued to evolve under pressure to efficiently compress meanings into words by optimizing the Information Bottleneck (IB) complexity-accuracy tradeoff. However, the underlying social dynamics that could drive the optimization of a language's vocabulary towards efficiency remain largely unknown. In parallel, evolutionary game theory has been invoked to explain the emergence of language from rudimentary agent-level dynamics, but it has not yet been tested whether such an approach can lead to efficient compression in the IB sense. Here, we provide a unified model integrating evolutionary game theory with the IB framework and show how near-optimal compression can arise in a population through an independently motivated dynamic of imprecise strategy imitation in signaling games. We find that key parameters of the model (namely, those that regulate precision in these games, as well as players' tendency to confuse similar states) lead to constrained variation of the tradeoffs achieved by emergent vocabularies. Our results suggest that evolutionary game dynamics could potentially provide a mechanistic basis for the evolution of vocabularies with information-theoretically optimal and empirically attested properties.

Title: BANGLASOCIALBENCH: A Benchmark for Evaluating Sociopragmatic and Cultural Alignment of LLMs in Bangladeshi Social Interaction

Authors: Tanvir Ahmed Sijan, S. M Golam Rifat, Pankaj Chowdhury Partha, Md. Tanjeed Islam, Md. Musfique Anwar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15949
Pdf URL: https://arxiv.org/pdf/2603.15949
Copy Paste: [[2603.15949]] BANGLASOCIALBENCH: A Benchmark for Evaluating Sociopragmatic and Cultural Alignment of LLMs in Bangladeshi Social Interaction(https://arxiv.org/abs/2603.15949)
Keywords: language model, llm
Abstract: Large Language Models have demonstrated strong multilingual fluency, yet fluency alone does not guarantee socially appropriate language use. In high-context languages, communicative competence requires sensitivity to social hierarchy, relational roles, and interactional norms that are encoded directly in everyday language. Bangla exemplifies this challenge through its three-tiered pronominal system, kinship-based addressing, and culturally embedded social customs. We introduce BANGLASOCIALBENCH, the first benchmark designed to evaluate sociopragmatic competence in Bangla through context-dependent language use rather than factual recall. The benchmark spans three domains: Bangla Address Terms, Kinship Reasoning, and Social Customs, and consists of 1,719 culturally grounded instances written and verified by native Bangla speakers. We evaluate twelve contemporary LLMs in a zero-shot setting and observe systematic patterns of cultural misalignment. Models frequently default to overly formal address forms, fail to recognize multiple socially acceptable address pronouns, and conflate kinship terminology across religious contexts. Our findings show that sociopragmatic failures are often structured and non-random, revealing persistent limitations in how current LLMs infer and apply culturally appropriate language use in realistic Bangladeshi social interactions.

Title: POLAR: A Per-User Association Test in Embedding Space

Authors: Pedro Bento, Arthur Buzelin, Arthur Chagas, Yan Aquino, Victoria Estanislau, Samira Malaquias, Pedro Robles Dutenhefner, Gisele L. Pappa, Virgilio Almeida, Wagner Meira Jr
Subjects: cs.CL, cs.CY, cs.SI
Abstract URL: https://arxiv.org/abs/2603.15950
Pdf URL: https://arxiv.org/pdf/2603.15950
Copy Paste: [[2603.15950]] POLAR: A Per-User Association Test in Embedding Space(https://arxiv.org/abs/2603.15950)
Keywords: language model, llm
Abstract: Most intrinsic association probes operate at the word, sentence, or corpus level, obscuring author-level variation. We present POLAR (Per-user On-axis Lexical Association Report), a per-user lexical association test that runs in the embedding space of a lightly adapted masked language model. Authors are represented by private deterministic tokens; POLAR projects these vectors onto curated lexical axes and reports standardized effects with permutation p-values and Benjamini-Hochberg control. On a balanced bot-human Twitter benchmark, POLAR cleanly separates LLM-driven bots from organic accounts; on an extremist forum, it quantifies strong alignment with slur lexicons and reveals rightward drift over time. The method is modular to new attribute sets and provides concise, per-author diagnostics for computational social science. All code is publicly available at this https URL.
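Two of the ingredients named in the POLAR abstract are projecting an author vector onto a lexical axis and applying Benjamini-Hochberg control across many tests. The sketch below shows generic versions of both. The function names are hypothetical, and how POLAR builds its axes or computes its permutation p-values is not specified in the abstract, so p-values are simply taken as input here.

```python
def project(vec, axis):
    """Scalar projection of an author vector onto a lexical axis."""
    norm = sum(a * a for a in axis) ** 0.5
    return sum(v * a for v, a in zip(vec, axis)) / norm

def benjamini_hochberg(pvals, alpha=0.05):
    """Return one reject flag per p-value, controlling the FDR at alpha.

    Standard step-up rule: find the largest rank k with
    p_(k) <= alpha * k / m and reject the k smallest p-values.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= alpha * rank / m:
            cutoff = rank
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject
```

With per-author p-values for one axis, `benjamini_hochberg` flags which authors show an association that survives multiple-comparison control.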

Title: A Family of LLMs Liberated from Static Vocabularies

Authors: Aleph Alpha: Adnen Abdessaied, Artur Baranowski, Lukas Balles, Michael Barlow, Fabien C. Y. Benureau, Felix Berkenkamp, Lukas Bluebaum, Bastian Boll, Thomas F. Burns, Björn Deiseroth, Constantin Eichenberg, David Friede, Pablo Iyu Guerrero, Ahmed Hammam, Bastian Harren, Johann Higl, Yasser Jadidi, Carina Kauf, Johannes Messner, Jan Hendrik Metzen, Max Meuer, Vedant Nanda, Pit Neitemeier, Koen Oostermeijer, Letitia Parcalabescu, Markus Pernpointner, Felix Reinfurt, Dylan Rodriquez, Grégory Schott, Philipp Siedler, Martin Simonovsky, Till Speicher, Volker Stampa, Stephan Wäldchen, Samuel Weinbach, Gregor Ziegltrum
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.15953
Pdf URL: https://arxiv.org/pdf/2603.15953
Copy Paste: [[2603.15953]] A Family of LLMs Liberated from Static Vocabularies(https://arxiv.org/abs/2603.15953)
Keywords: language model, llm
Abstract: Tokenization is a central component of natural language processing in current large language models (LLMs), enabling models to convert raw text into processable units. Although learned tokenizers are widely adopted, they exhibit notable limitations, including their large, fixed vocabulary sizes and poor adaptability to new domains or languages. We present a family of models with up to 70 billion parameters based on the hierarchical autoregressive transformer (HAT) architecture. In HAT, an encoder transformer aggregates bytes into word embeddings and then feeds them to the backbone, a classical autoregressive transformer. The outputs of the backbone are then cross-attended by the decoder and converted back into bytes. We show that we can reuse available pre-trained models by converting the Llama 3.1 8B and 70B models into the HAT architecture: Llama-3.1-8B-TFree-HAT and Llama-3.1-70B-TFree-HAT are byte-level models whose encoder and decoder are trained from scratch, but where we adapt the pre-trained Llama backbone, i.e., the transformer blocks with the embedding matrix and head removed, to handle word embeddings instead of the original tokens. We also provide a 7B HAT model, Llama-TFree-HAT-Pretrained, trained entirely from scratch on nearly 4 trillion words. The HAT architecture improves text compression by reducing the number of required sequence positions and enhances robustness to intra-word variations, e.g., spelling differences. Through pre-training, as well as subsequent supervised fine-tuning and direct preference optimization in English and German, we show strong proficiency in both languages, improving on the original Llama 3.1 in most benchmarks. We release our models (including 200 pre-training checkpoints) on Hugging Face.
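The HAT abstract says bytes are aggregated into word embeddings before reaching the backbone, which reduces the number of sequence positions. The toy sketch below only counts positions to make that point concrete. Splitting on whitespace is an assumption made for illustration; the models' real word-splitting rule and encoder are not described in the abstract.

```python
def sequence_positions(text):
    """Compare byte-level and word-level sequence lengths for a string.

    Returns (byte_positions, word_positions, word_chunks), where each
    chunk is the UTF-8 byte string for one whitespace-delimited word.
    Whitespace splitting is an illustrative assumption.
    """
    word_chunks = [w.encode("utf-8") for w in text.split()]
    byte_positions = len(text.encode("utf-8"))
    return byte_positions, len(word_chunks), word_chunks
```

In this picture a byte-level encoder would see every byte, while the backbone would see one position per word chunk.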

Title: Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning

Authors: Jingxiang Chen, Minseok Kim, Seong-Gyun Leem, Yin Huang, Rashi Rungta, Zhicheng Ouyang, Haibin Wu, Surya Teja Appini, Ankur Bansal, Yang Bai, Yue Liu, Florian Metze, Ahmed A Aly, Anuj Kumar, Ariya Rastrow, Zhaojiang Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15981
Pdf URL: https://arxiv.org/pdf/2603.15981
Copy Paste: [[2603.15981]] Aligning Paralinguistic Understanding and Generation in Speech LLMs via Multi-Task Reinforcement Learning(https://arxiv.org/abs/2603.15981)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Speech large language models (LLMs) observe paralinguistic cues such as prosody, emotion, and non-verbal sounds, which are crucial for intent understanding. However, leveraging these cues faces challenges: limited training data, annotation difficulty, and models exploiting lexical shortcuts over paralinguistic signals. We propose multi-task reinforcement learning (RL) with chain-of-thought prompting that elicits explicit affective reasoning. To address data scarcity, we introduce a paralinguistics-aware speech LLM (PALLM) that jointly optimizes sentiment classification from audio and paralinguistics-aware response generation via a two-stage pipeline. Experiments demonstrate that our approach improves paralinguistics understanding over both supervised baselines and strong proprietary models (Gemini-2.5-Pro, GPT-4o-audio) by 8-12% on Expresso, IEMOCAP, and RAVDESS. The results show that modeling paralinguistic reasoning with multi-task RL is crucial for building emotionally intelligent speech LLMs.

Title: RadAnnotate: Large Language Models for Efficient and Reliable Radiology Report Annotation

Authors: Saisha Pradeep Shetty, Roger Eric Goldman, Vladimir Filkov
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.16002
Pdf URL: https://arxiv.org/pdf/2603.16002
Copy Paste: [[2603.16002]] RadAnnotate: Large Language Models for Efficient and Reliable Radiology Report Annotation(https://arxiv.org/abs/2603.16002)
Keywords: language model, llm
Abstract: Radiology report annotation is essential for clinical NLP, yet manual labeling is slow and costly. We present RadAnnotate, an LLM-based framework that studies retrieval-augmented synthetic reports and confidence-based selective automation to reduce expert effort for labeling in RadGraph. We study RadGraph-style entity labeling (graph nodes) and leave relation extraction (edges) to future work. First, we train entity-specific classifiers on gold-standard reports and characterize their strengths and failure modes across anatomy and observation categories, with uncertain observations hardest to learn. Second, we generate RAG-guided synthetic reports and show that synthetic-only models remain within 1-2 F1 points of gold-trained models, and that synthetic augmentation is especially helpful for uncertain observations in a low-resource setting, improving F1 from 0.61 to 0.70. Finally, by learning entity-specific confidence thresholds, RadAnnotate can automatically annotate 55-90% of reports at 0.86-0.92 entity match score while routing low-confidence cases for expert review.
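The RadAnnotate abstract describes learning entity-specific confidence thresholds so that high-confidence cases are annotated automatically and the rest go to expert review. One simple way such a threshold could be chosen on held-out data is sketched below. The target-precision criterion and the function names are assumptions, not the paper's stated procedure.

```python
def pick_threshold(confidences, correct, target_precision=0.9):
    """Lowest confidence threshold whose auto-accepted set meets the target.

    confidences: model confidence per held-out item (one entity type)
    correct: whether the model's label matched the gold label
    Returns (threshold, coverage), or None if no threshold qualifies.
    """
    best = None
    for t in sorted(set(confidences), reverse=True):
        accepted = [ok for c, ok in zip(confidences, correct) if c >= t]
        precision = sum(accepted) / len(accepted)
        if precision >= target_precision:
            # Keep lowering the threshold while the target still holds.
            best = (t, len(accepted) / len(confidences))
    return best

def route(confidence, threshold):
    """Send an item to automatic annotation or to expert review."""
    return "auto" if confidence >= threshold else "expert"
```

Running `pick_threshold` once per entity type gives a separate threshold for each, and the returned coverage is the share of items that would skip expert review.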

Title: Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability

Authors: Fan Huang, Haewoon Kwak, Jisun An
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16017
Pdf URL: https://arxiv.org/pdf/2603.16017
Copy Paste: [[2603.16017]] Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability(https://arxiv.org/abs/2603.16017)
Keywords: language model, llm
Abstract: Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce moral reasoning trajectories, sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4-57.7% of consecutive steps involve framework switches, and only 16.4-17.8% of trajectories remain framework-consistent. Unstable trajectories remain 1.29× more susceptible to persuasive attacks (p = 0.015). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8-22.6% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7-8.9% drift reduction) and amplifies the stability-accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly (r = 0.715, p < 0.0001) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity = 0.859).
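The two trajectory statistics quoted in this abstract (the share of consecutive steps that switch framework, and the share of trajectories that stay framework-consistent) can be computed as in the sketch below. Representing a trajectory as a list of framework labels, one per reasoning step, is an assumption about the data layout, and the label names in any example are made up.

```python
def switch_rate(trajectories):
    """Share of consecutive step pairs whose framework label changes."""
    switches = steps = 0
    for traj in trajectories:
        for prev, cur in zip(traj, traj[1:]):
            steps += 1
            switches += prev != cur
    return switches / steps

def consistent_fraction(trajectories):
    """Share of trajectories that use a single framework throughout."""
    return sum(len(set(t)) == 1 for t in trajectories) / len(trajectories)
```

For instance, the trajectory `["util", "deont", "util", "care"]` contributes three step pairs, all of them switches, and counts as not framework-consistent.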

Title: SEAHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Southeast Asia

Authors: Ri Chi Ng, Aditi Kumaresan, Yujia Hu, Roy Ka-Wei Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16070
Pdf URL: https://arxiv.org/pdf/2603.16070
Copy Paste: [[2603.16070]] SEAHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Southeast Asia(https://arxiv.org/abs/2603.16070)
Keywords: language model
Abstract: Hate speech detection relies heavily on linguistic resources, which are primarily available in high-resource languages such as English and Chinese, creating barriers for researchers and platforms developing tools for low-resource languages in Southeast Asia, where diverse socio-linguistic contexts complicate online hate moderation. To address this, we introduce SEAHateCheck, a pioneering dataset tailored to Indonesia, Thailand, the Philippines, and Vietnam, covering Indonesian, Tagalog, Thai, and Vietnamese. Building on HateCheck's functional testing framework and refining SGHateCheck's methods, SEAHateCheck provides culturally relevant test cases, augmented by large language models and validated by local experts for accuracy. Experiments with state-of-the-art and multilingual models revealed limitations in detecting hate speech in specific low-resource languages. In particular, Tagalog test cases showed the lowest model accuracy, likely due to linguistic complexity and limited training data. In contrast, slang-based functional tests proved the hardest, as models struggled with culturally nuanced expressions. The diagnostic insights of SEAHateCheck further exposed model weaknesses in implicit hate detection and models' struggles with counter-speech expression. As the first functional test suite for these Southeast Asian languages, this work equips researchers with a robust benchmark, advancing the development of practical, culturally attuned hate speech detection tools for inclusive online content moderation.

Title: ClaimFlow: Tracing the Evolution of Scientific Claims in NLP

Authors: Aniket Pramanick, Yufang Hou, Saif M. Mohammad, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16073
Pdf URL: https://arxiv.org/pdf/2603.16073
Copy Paste: [[2603.16073]] ClaimFlow: Tracing the Evolution of Scientific Claims in NLP(https://arxiv.org/abs/2603.16073)
Keywords: language model
Abstract: Scientific papers do more than report results: they advance claims that later work supports, extends, or sometimes refutes. Yet existing methods for citation and claim analysis capture only fragments of this dialogue. In this work, we make these interactions explicit at the level of individual scientific claims. We introduce ClaimFlow, a claim-centric view of the NLP literature, built from 304 ACL Anthology papers (1979-2025) that are manually annotated with 1,084 claims and 832 cross-paper claim relations, indicating whether a citing paper supports, extends, qualifies, refutes, or references a claim as background. Using ClaimFlow, we define a new task, Claim Relation Classification, which requires models to infer the scientific stance toward a cited claim from the text and citation context. Evaluating strong neural models and large language models on this task, we report baseline performance of 0.78 macro-F1, highlighting that claim-relation classification is feasible but challenging. We further apply our model to ~13k NLP papers to analyze how claims evolve across decades of NLP research. Our analysis reveals that 63.5% of claims are never reused; only 11.1% are ever challenged; meanwhile, widely propagated claims are more often reshaped through qualification and extension than directly confirmed or refuted. Overall, ClaimFlow offers a lens for examining how ideas shift and mature within NLP, and a foundation for assessing whether models can interpret scientific argumentation.
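Illustration (not from the paper): the abstract reports performance as macro-F1 over claim-relation classes. The sketch below shows how macro-F1 is usually computed, assuming one gold label and one predicted label per claim relation. The function name and the tiny example labels are hypothetical, and averaging over the labels that occur in either list is an assumption made here for simplicity.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # A class with no true positives, false positives, or false negatives
        # cannot occur here, but guard against division by zero anyway.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels drawn from the relation types named in the abstract.
gold = ["supports", "refutes", "background", "supports"]
pred = ["supports", "supports", "background", "extends"]
print(round(macro_f1(gold, pred), 3))
```

Because every class contributes equally to the mean, rare relations such as refutation weigh as much as frequent ones such as background citation.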

Title: CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

Authors: Tianyi Huang, Ying Kai Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16091
Pdf URL: https://arxiv.org/pdf/2603.16091
Copy Paste: [[2603.16091]] CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering(https://arxiv.org/abs/2603.16091)
Keywords: gpt
Abstract: In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight inference-time repair layer for retrieval-grounded question answering. CounterRefine first produces a short answer from retrieved evidence, then gathers additional support and conflicting evidence with follow-up queries conditioned on that draft answer, and finally applies a restricted refinement step that outputs either KEEP or REVISE, with proposed revisions accepted only if they pass deterministic validation. In effect, CounterRefine turns retrieval into a mechanism for testing a provisional answer rather than merely collecting more context. On the full SimpleQA benchmark, CounterRefine improves a matched GPT-5 Baseline-RAG by 5.8 points and reaches a 73.1 percent correct rate, while exceeding the reported one-shot GPT-5.4 score by roughly 40 points. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.

Title: Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization

Authors: Francesco Pio Monaco, Elia Cunegatti, Flavio Vella, Giovanni Iacca
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16105
Pdf URL: https://arxiv.org/pdf/2603.16105
Copy Paste: [[2603.16105]] Frequency Matters: Fast Model-Agnostic Data Curation for Pruning and Quantization(https://arxiv.org/abs/2603.16105)
Keywords: language model, llm
Abstract: Post-training model compression is essential for enhancing the portability of Large Language Models (LLMs) while preserving their performance. While several compression approaches have been proposed, less emphasis has been placed on selecting the most suitable set of data (the so-called calibration data) for finding the compressed model configuration. The choice of calibration data is a critical step in preserving model capabilities both within and across tasks. In this work, we address the challenge of identifying high-performance calibration sets for both pruning and quantization by analyzing intrinsic data properties rather than model-specific signals. We introduce ZipCal, a model-agnostic data curation strategy that maximizes lexical diversity based on Zipfian power laws. Experiments demonstrate that our method consistently outperforms standard uniform random sampling across various pruning benchmarks. Notably, it also performs on par, in terms of downstream performance, with a state-of-the-art method that relies on model perplexity. The latter becomes prohibitively expensive for large-scale models and datasets, while ZipCal is on average ~240x faster due to its tractable linear complexity. The code and the experiments are available at this https URL.

Title: ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning

Authors: Tik Yu Yim, Wenting Tan, Sum Yee Chan, Tak-Wah Lam, Siu Ming Yiu
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2603.16112
Pdf URL: https://arxiv.org/pdf/2603.16112
Copy Paste: [[2603.16112]] ASDA: Automated Skill Distillation and Adaptation for Financial Reasoning(https://arxiv.org/abs/2603.16112)
Keywords: language model, llm, agent
Abstract: Adapting large language models (LLMs) to specialized financial reasoning typically requires expensive fine-tuning that produces model-locked expertise. Training-free alternatives have emerged, yet our experiments show that leading methods (GEPA and ACE) achieve only marginal gains on the FAMMA financial reasoning benchmark, exposing the limits of unstructured text optimization for complex, multi-step domain reasoning. We introduce Automated Skill Distillation and Adaptation (ASDA), a framework that automatically generates structured skill artifacts through iterative error-corrective learning without modifying model weights. A teacher model analyzes a student model's failures on financial reasoning tasks, clusters errors by subfield and error type, and synthesizes skill files containing reasoning procedures, code templates, and worked examples, which are dynamically injected during inference. Evaluated on FAMMA, ASDA achieves up to +17.33% improvement on arithmetic reasoning and +5.95% on non-arithmetic reasoning, substantially outperforming all training-free baselines. The resulting skill artifacts are human-readable, version-controlled, and compatible with the Agent Skills open standard, offering any organization with a labeled domain dataset a practical and auditable path to domain adaptation without weight access or retraining.

Title: Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

Authors: Nishant Balepur, Malachi Hamada, Varsha Kishore, Sergey Feldman, Amanpreet Singh, Pao Siangliulue, Joseph Chee Chang, Eunsol Choi, Jordan Lee Boyd-Graber, Aakanksha Naik
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16120
Pdf URL: https://arxiv.org/pdf/2603.16120
Copy Paste: [[2603.16120]] Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users(https://arxiv.org/abs/2603.16120)
Keywords: language model, llm
Abstract: Deep Research (DR) tools (e.g. OpenAI DR) help researchers cope with ballooning publishing counts. Such tools can synthesize scientific papers to answer researchers' queries, but lack understanding of their users. We change that in MyScholarQA (MySQA), a personalized DR tool that: 1) infers a profile of a user's research interests; 2) proposes personalized actions for a user's input query; and 3) writes a multi-section report for the query that follows user-approved actions. We first test MySQA with NLP's standard protocol: we design a benchmark of synthetic users and LLM judges, where MySQA beats baselines in citation metrics and personalized action-following. However, we suspect this process does not cover all aspects of personalized DR users value, so we interview users in an online version of MySQA to unmask them. We reveal nine nuanced errors of personalized DR undetectable by our LLM judges, and we study qualitative feedback to form lessons for future DR design. In all, we argue for a pillar of personalization that easy-to-use LLM judges can lead NLP to overlook: real progress in personalization is only possible with real users.

Title: Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning

Authors: Kazuki Yano, Shun Kiyono, Sosuke Kobayashi, Sho Takase, Jun Suzuki
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.16127
Pdf URL: https://arxiv.org/pdf/2603.16127
Copy Paste: [[2603.16127]] Pre-training LLM without Learning Rate Decay Enhances Supervised Fine-Tuning(https://arxiv.org/abs/2603.16127)
Keywords: language model, llm
Abstract: We investigate the role of learning rate scheduling in the large-scale pre-training of large language models, focusing on its influence on downstream performance after supervised fine-tuning (SFT). Decay-based learning rate schedulers are widely used to minimize pre-training loss. However, despite their widespread use, how these schedulers affect performance after SFT remains underexplored. In this paper, we examine Warmup-Stable-Only (WSO), which maintains a constant learning rate after warmup without any decay. Through experiments with 1B and 8B parameter models, we show that WSO consistently outperforms decay-based schedulers in terms of performance after SFT, even though decay-based schedulers may exhibit better performance after pre-training. The result also holds across different regimes with mid-training and over-training. Loss landscape analysis further reveals that decay-based schedulers lead models into sharper minima, whereas WSO preserves flatter minima that support adaptability. These findings indicate that applying LR decay to improve pre-training metrics may compromise downstream adaptability. Our work also provides practical guidance for training and model release strategies, highlighting that pre-training models with WSO enhances their adaptability for downstream tasks.
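Illustration (not from the paper): the abstract describes Warmup-Stable-Only (WSO) as holding the learning rate constant after warmup, in contrast to decay-based schedulers. The sketch below contrasts the two shapes. The linear warmup, the cosine decay used as the decay-based example, and all function and parameter names are assumptions, since the abstract does not specify them.

```python
import math


def wso_lr(step, peak_lr, warmup_steps):
    """Warmup-Stable-Only: linear warmup (assumed), then hold peak_lr with no decay."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr


def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """A decay-based schedule for comparison: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Compare both schedules at a few points of a hypothetical 1,000-step run.
for step in (0, 99, 500, 999):
    print(step, wso_lr(step, 3e-4, 100), warmup_cosine_lr(step, 3e-4, 100, 1000))
```

The two schedules agree during warmup and diverge afterwards: the WSO value stays at the peak, while the cosine schedule shrinks toward `min_lr` by the end of training.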

Title: Social Simulacra in the Wild: AI Agent Communities on Moltbook

Authors: Agam Goyal, Olivia Pal, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16128
Pdf URL: https://arxiv.org/pdf/2603.16128
Copy Paste: [[2603.16128]] Social Simulacra in the Wild: AI Agent Communities on Moltbook(https://arxiv.org/abs/2603.16128)
Keywords: llm, agent
Abstract: As autonomous LLM-based agents increasingly populate social platforms, understanding the dynamics of AI-agent communities becomes essential for both communication research and platform governance. We present the first large-scale empirical comparison of AI-agent and human online communities, analyzing 73,899 Moltbook and 189,838 Reddit posts across five matched communities. Structurally, we find that Moltbook exhibits extreme participation inequality (Gini = 0.84 vs. 0.47) and high cross-community author overlap (33.8% vs. 0.5%). In terms of linguistic attributes, content generated by AI-agents is emotionally flattened, cognitively shifted toward assertion over exploration, and socially detached. These differences give rise to apparent community-level homogenization, but we show this is primarily a structural artifact of shared authorship. At the author level, individual agents are more identifiable than human users, driven by outlier stylistic profiles amplified by their extreme posting volume. As AI-mediated communication reshapes online discourse, our work offers an empirical foundation for understanding how multi-agent interaction gives rise to collective communication dynamics distinct from those of human communities.
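Illustration (not from the paper): the abstract measures participation inequality with a Gini coefficient. The sketch below computes the standard Gini coefficient from per-author post counts. The function name and the toy counts are hypothetical, and the paper's exact estimator may differ.

```python
def gini(counts):
    """Gini coefficient of non-negative counts: 0 means equal shares, values near 1 mean concentration."""
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending-sorted values (ranks start at 1).
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Toy post counts per author: an even community versus one dominated by a single author.
print(gini([5, 5, 5, 5]))
print(gini([0, 0, 0, 10]))
```

With this estimator, a community in which one of n authors writes everything scores (n - 1) / n, so the value approaches 1 as posting becomes more concentrated.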

Title: SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era

Authors: Han Jang, Junhyeok Lee, Kyu Sung Choi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16131
Pdf URL: https://arxiv.org/pdf/2603.16131
Copy Paste: [[2603.16131]] SciZoom: A Large-scale Benchmark for Hierarchical Scientific Summarization across the LLM Era(https://arxiv.org/abs/2603.16131)
Keywords: gpt, llm, chat
Abstract: The explosive growth of AI research has created unprecedented information overload, increasing the demand for scientific summarization at multiple levels of granularity beyond traditional abstracts. While LLMs are increasingly adopted for summarization, existing benchmarks remain limited in scale, target only a single granularity, and predate the LLM era. Moreover, since the release of ChatGPT in November 2022, researchers have rapidly adopted LLMs for drafting manuscripts themselves, fundamentally transforming scientific writing, yet no resource exists to analyze how this writing has evolved. To bridge these gaps, we introduce SciZoom, a benchmark comprising 44,946 papers from four top-tier ML venues (NeurIPS, ICLR, ICML, EMNLP) spanning 2020 to 2025, explicitly stratified into Pre-LLM and Post-LLM eras. SciZoom provides three hierarchical summarization targets (Abstract, Contributions, and TL;DR) achieving compression ratios up to 600:1, enabling both multi-granularity summarization research and temporal mining of scientific writing patterns. Our linguistic analysis reveals striking shifts in phrase patterns (up to 10x for formulaic expressions) and rhetorical style (23% decline in hedging), suggesting that LLM-assisted writing produces more confident yet homogenized prose. SciZoom serves as both a challenging benchmark and a unique resource for mining the evolution of scientific discourse in the generative AI era. Our code and dataset are publicly available on GitHub (this https URL) and Hugging Face (this https URL), respectively.

Title: SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment

Authors: Zhouwei Zhai, Mengxiang Chen, Anmeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16137
Pdf URL: https://arxiv.org/pdf/2603.16137
Copy Paste: [[2603.16137]] SIA: A Synthesize-Inject-Align Framework for Knowledge-Grounded and Secure E-commerce Search LLMs with Industrial Deployment(https://arxiv.org/abs/2603.16137)
Keywords: language model, llm, hallucination
Abstract: Large language models offer transformative potential for e-commerce search by enabling intent-aware recommendations. However, their industrial deployment is hindered by two critical challenges: (1) knowledge hallucination due to insufficient encoding of dynamic, fine-grained product knowledge, and (2) security vulnerabilities under jailbreak attacks that threaten compliance. To address these issues, we propose SIA, a Synthesize-Inject-Align framework for building knowledgeable and secure e-commerce search LLMs. Our approach first synthesizes a high-quality natural language corpus by combining structured knowledge graphs with unstructured behavioral logs, augmented with reasoning chains and safety-aware this http URL then introduce a parameter-efficient pre-training strategy based on Depth Up-Scaling to inject domain knowledge while preserving general capabilities. Finally, a dual-path alignment method via multi-task instruction tuning and adversarial training strengthens both task performance and safety robustness. The framework has been deployed at this http URL, China's largest self-operated e-commerce platform, where A/B tests across five core search scenarios demonstrate significant improvements in key business metrics, validating its industrial effectiveness and scalability.

Title: Parametric Social Identity Injection and Diversification in Public Opinion Simulation

Authors: Hexi Wang, Yujia Zhou, Bangde Du, Qingyao Ai, Yiqun Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16142
Pdf URL: https://arxiv.org/pdf/2603.16142
Copy Paste: [[2603.16142]] Parametric Social Identity Injection and Diversification in Public Opinion Simulation(https://arxiv.org/abs/2603.16142)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) have recently been adopted as synthetic agents for public opinion simulation, offering a promising alternative to costly and slow human surveys. Despite their scalability, current LLM-based simulation methods fail to capture social diversity, producing flattened inter-group differences and overly homogeneous responses within demographic groups. We identify this limitation as a Diversity Collapse phenomenon in LLM hidden representations, where distinct social identities become increasingly indistinguishable across layers. Motivated by this observation, we propose Parametric Social Identity Injection (PSII), a general framework that injects explicit, parametric representations of demographic attributes and value orientations directly into intermediate hidden states of LLMs. Unlike prompt-based persona conditioning, PSII enables fine-grained and controllable identity modulation at the representation level. Extensive experiments on the World Values Survey using multiple open-source LLMs show that PSII significantly improves distributional fidelity and diversity, reducing KL divergence to real-world survey data while enhancing overall diversity. This work provides new insights into representation-level control of LLM agents and advances scalable, diversity-aware public opinion simulation. Code and data are available at this https URL.

Title: Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models

Authors: Xiaobing Sun, Perry Lam, Shaohua Li, Zizhou Wang, Rick Siow Mong Goh, Yong Liu, Liangli Zhen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16192
Pdf URL: https://arxiv.org/pdf/2603.16192
Copy Paste: [[2603.16192]] Structured Semantic Cloaking for Jailbreak Attacks on Large Language Models(https://arxiv.org/abs/2603.16192)
Keywords: language model, gpt, llm, prompt
Abstract: Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning, enabling them to recover obfuscated malicious intent during inference and refuse accordingly, and rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations perform best against broad families of models, and characterise the trade-off between the extent of obfuscation versus input recoverability on jailbreak success.

Title: SpecSteer: Synergizing Local Context and Global Reasoning for Efficient Personalized Generation

Authors: Hang Lv, Sheng Liang, Hao Wang, Yongyue Zhang, Hongchao Gu, Wei Guo, Defu Lian, Yong Liu, Enhong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16219
Pdf URL: https://arxiv.org/pdf/2603.16219
Copy Paste: [[2603.16219]] SpecSteer: Synergizing Local Context and Global Reasoning for Efficient Personalized Generation(https://arxiv.org/abs/2603.16219)
Keywords: language model
Abstract: Realizing personalized intelligence faces a core dilemma: sending user history to centralized large language models raises privacy concerns, while on-device small language models lack the reasoning capacity required for high-quality generation. Our pilot study shows that purely local enhancements remain insufficient to reliably bridge this gap. We therefore propose SpecSteer, an asymmetric collaborative inference framework that synergizes private on-device context with cloud-scale reasoning. SpecSteer casts collaboration as Bayesian knowledge fusion and repurposes speculative decoding as a distributed alignment protocol, yielding a Draft-Verify-Recover pipeline: the on-device model drafts personalized sequences; the cloud validates via a ratio-based mechanism that decouples reasoning verification from private context, filtering logical flaws without accessing raw user context; upon rejection, a steering recovery injects local intent during correction. Experiments demonstrate that SpecSteer successfully closes the reasoning gap and achieves superior personalized generation performance, while delivering a 2.36x speedup over standard baselines.

Title: More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification

Authors: Song Tae-Eun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16244
Pdf URL: https://arxiv.org/pdf/2603.16244
Copy Paste: [[2603.16244]] More Rounds, More Noise: Why Multi-Turn Review Fails to Improve Cross-Context Verification(https://arxiv.org/abs/2603.16244)
Keywords: llm
Abstract: Cross-Context Review (CCR) improves LLM verification by separating production and review into independent sessions. A natural extension is multi-turn review: letting the reviewer ask follow-up questions, receive author responses, and review again. We call this Dynamic Cross-Context Review (D-CCR). In a controlled experiment with 30 artifacts and 150 injected errors, we tested four D-CCR variants against the single-pass CCR baseline. Single-pass CCR (F1 = 0.376) significantly outperformed all multi-turn variants, including D-CCR-2b with question-and-answer exchange (F1 = 0.303, p < 0.001, d = -0.59). Multi-turn review increased recall (+0.08) but generated 62% more false positives (8.5 vs. 5.2), collapsing precision from 0.30 to 0.20. Two mechanisms drive this degradation: (1) false positive pressure, where reviewers in later rounds fabricate findings once the artifact's real errors have been exhausted, and (2) Review Target Drift, where reviewers provided with prior Q&A exchanges shift from reviewing the artifact to critiquing the conversation itself. Independent re-review without prior context (D-CCR-2c) performed worst (F1 = 0.263), confirming that mere repetition degrades rather than helps. The degradation stems from false positive pressure in additional rounds, not from the amount of information: within multi-turn conditions, more information actually helps (D-CCR-2b > D-CCR-2a). The problem is not what the reviewer sees, but that reviewing again invites noise.
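Illustration (not from the paper): the abstract argues that a small recall gain cannot offset a large precision drop. The sketch below uses the standard F1 formula to show that effect. The precision values 0.30 and 0.20 come from the abstract, but the recall values 0.50 and 0.58 are assumptions chosen only to show a +0.08 change, so the printed scores are not the paper's reported F1 values.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision values from the abstract; recall values are hypothetical.
single_pass = f1(0.30, 0.50)
multi_turn = f1(0.20, 0.58)
print(round(single_pass, 3), round(multi_turn, 3))
```

Because F1 is a harmonic mean, it is pulled toward the smaller of the two inputs, which is why extra false positives from additional review rounds can lower the score even while recall rises.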

Title: Attention-guided Evidence Grounding for Spoken Question Answering

Authors: Ke Yang, Bolin Chen, Yuejie Li, Yueying Hua, Jianhao Nie, Yueping He, Bowen Li, Chengjun Mao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16292
Pdf URL: https://arxiv.org/pdf/2603.16292
Copy Paste: [[2603.16292]] Attention-guided Evidence Grounding for Spoken Question Answering(https://arxiv.org/abs/2603.16292)
Keywords: language model, llm, hallucination
Abstract: Spoken Question Answering (Spoken QA) presents a challenging cross-modal problem: effectively aligning acoustic queries with textual knowledge while avoiding the latency and error propagation inherent in cascaded ASR-based systems. In this paper, we introduce Attention-guided Evidence Grounding (AEG), a novel end-to-end framework that leverages the internal cross-modal attention of Speech Large Language Models (SpeechLLMs) to explicitly locate and ground key evidence in the model's latent space. To address the diffuse attention distribution in pre-trained models, we propose Learning to Focus on Evidence (LFE), a supervised fine-tuning paradigm that calibrates the model's attention mechanism to distinguish query-relevant segments from irrelevant context. Experiments on SQuAD, HotpotQA, and MuSiQue demonstrate that AEG reduces hallucinations and achieves strong efficiency gains, outperforming large-scale cascaded baselines (Whisper-Large-v3 + Reranker) while reducing inference latency by approximately 62%.

Title: Omnilingual MT: Machine Translation for 1,600 Languages

Authors: Omnilingual MT Team: Belen Alastruey, Niyati Bafna, Andrea Caciolai, Kevin Heffernan, Artyom Kozhevnikov, Christophe Ropers, Eduardo Sánchez, Charles-Eric Saint-James, Ioannis Tsiamas, Chierh Cheng, Joe Chuang, Paul-Ambroise Duquenne, Mark Duppenthaler, Nate Ekberg, Cynthia Gao, Pere Lluís Huguet Cabot, João Maria Janeiro, Jean Maillard, Gabriel Mejia Gonzalez, Holger Schwenk, Edan Toledo, Arina Turkatenko, Albert Ventayol-Boada, Rashel Moritz, Alexandre Mourachko, Surya Parimi, Mary Williamson, Shireen Yates, David Dale, Marta R. Costa-jussà
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16309
Pdf URL: https://arxiv.org/pdf/2603.16309
Copy Paste: [[2603.16309]] Omnilingual MT: Machine Translation for 1,600 Languages(https://arxiv.org/abs/2603.16309)
Keywords: language model, llm
Abstract: High-quality machine translation (MT) can scale to hundreds of languages, setting a high bar for multilingual systems. However, compared to the world's 7,000 languages, current systems still offer only limited coverage: about 200 languages on the target side, and maybe a few hundred more on the source side, supported due to cross-lingual transfer. And even these numbers have been hard to evaluate due to the lack of reliable benchmarks and metrics. We present Omnilingual Machine Translation (OMT), the first MT system supporting more than 1,600 languages. This scale is enabled by a comprehensive data strategy that integrates large public multilingual corpora with newly created datasets, including manually curated MeDLEY bitext. We explore two ways of specializing a Large Language Model (LLM) for machine translation: as a decoder-only model (OMT-LLaMA) or as a module in an encoder-decoder architecture (OMT-NLLB). Notably, all our 1B to 8B parameter models match or exceed the MT performance of a 70B LLM baseline, revealing a clear specialization advantage and enabling strong translation quality in low-compute settings. Moreover, our evaluation of English-to-1,600 translations further shows that while baseline models can interpret undersupported languages, they frequently fail to generate them with meaningful fidelity; OMT-LLaMA models substantially expand the set of languages for which coherent generation is feasible. Additionally, OMT models improve in cross-lingual transfer, being close to solving the "understanding" part of the puzzle in MT for the 1,600 languages evaluated. Our leaderboard and main human-created evaluation datasets (BOUQuET and Met-BOUQuET) are dynamically evolving towards Omnilinguality and are freely available.

Title: PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development

Authors: Hanif Rahman
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.16354
Pdf URL: https://arxiv.org/pdf/2603.16354
Copy Paste: [[2603.16354]] PashtoCorp: A 1.25-Billion-Word Corpus, Evaluation Suite, and Reproducible Pipeline for Low-Resource Language Development(https://arxiv.org/abs/2603.16354)
Keywords: llm
Abstract: We present PashtoCorp, a 1.25-billion-word corpus for Pashto, a language spoken by 60 million people that remains severely underrepresented in NLP. The corpus is assembled from 39 sources spanning seven HuggingFace datasets and 32 purpose-built web scrapers, processed through a reproducible pipeline with Arabic-script tokenization, SHA-256 deduplication, and quality filtering. At 1.25B words across 2.81 million documents, PashtoCorp is 40x larger than the OSCAR Pashto subset and 83x larger than the previously largest dedicated Pashto corpus. Continued MLM pretraining of XLM-R-base on PashtoCorp reduces held-out perplexity by 25.1% (8.08->6.06). On WikiANN Pashto NER, the pretrained model improves entity F1 by 10% relative (19.0%->21.0%) and reduces training variance nearly 7x; the largest gain appears at 50 training sentences (+27%), with PashtoCorp covering 97.9% of WikiANN entity vocabulary. On Belebele Pashto reading comprehension, Gemma-3n achieves 64.6% accuracy, the first published LLM baseline for Pashto on this benchmark. A leave-one-out source ablation shows that Wikipedia (0.7% of documents) is the most critical source for NER: removing it alone reduces entity F1 by 47%. Corpus data, trained model, and code are available at this https URL, this https URL, and this https URL.
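The abstract names SHA-256 deduplication as one pipeline stage but does not say how documents are normalized or compared. The sketch below shows one plausible form of exact-duplicate removal; the function name `dedupe_documents` and the whitespace normalization are assumptions for illustration, not the paper's implementation.

```python
import hashlib


def dedupe_documents(docs):
    """Keep the first occurrence of each document, comparing SHA-256 digests."""
    seen = set()
    unique = []
    for text in docs:
        # Assumption: collapse runs of whitespace before hashing so that
        # trivially different copies map to the same digest.
        normalized = " ".join(text.split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique
```

Hashing only catches exact matches after normalization; near-duplicates would need a different technique such as MinHash, which the abstract does not mention.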

Title: Fanar 2.0: Arabic Generative AI Stack

Authors: FANAR TEAM, Ummar Abbas, Mohammad Shahmeer Ahmad, Minhaj Ahmad, Abdulaziz Al-Homaid, Anas Al-Nuaimi, Enes Altinisik, Ehsaneddin Asgari, Sanjay Chawla, Shammur Chowdhury, Fahim Dalvi, Kareem Darwish, Nadir Durrani, Mohamed Elfeky, Ahmed Elmagarmid, Mohamed Eltabakh, Asim Ersoy, Masoomali Fatehkia, Mohammed Qusay Hashim, Majd Hawasly, Mohamed Hefeeda, Mus'ab Husaini, Keivin Isufaj, Soon-Gyo Jung, Houssam Lachemat, Ji Kim Lucas, Abubakr Mohamed, Tasnim Mohiuddin, Basel Mousi, Hamdy Mubarak, Ahmad Musleh, Mourad Ouzzani, Amin Sadeghi, Husrev Taha Sencar, Mohammed Shinoy, Omar Sinan, Yifan Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16397
Pdf URL: https://arxiv.org/pdf/2603.16397
Copy Paste: [[2603.16397]] Fanar 2.0: Arabic Generative AI Stack(https://arxiv.org/abs/2603.16397)
Keywords: llm, agent
Abstract: We present Fanar 2.0, the second generation of Qatar's Arabic-centric Generative AI platform. Sovereignty is a first-class design principle: every component, from data pipelines to deployment infrastructure, was designed and operated entirely at QCRI, Hamad Bin Khalifa University. Fanar 2.0 is a story of resource-constrained excellence: the effort ran on 256 NVIDIA H100 GPUs, with Arabic having only ~0.5% of web data despite 400 million native speakers. Fanar 2.0 adopts a disciplined strategy of data quality over quantity, targeted continual pre-training, and model merging to achieve substantial gains within these constraints. At the core is Fanar-27B, continually pre-trained from a Gemma-3-27B backbone on a curated corpus of 120 billion high-quality tokens across three data recipes. Despite using 8x fewer pre-training tokens than Fanar 1.0, it delivers substantial benchmark improvements: Arabic knowledge (+9.1 pts), language (+7.3 pts), dialects (+3.5 pts), and English capability (+7.6 pts). Beyond the core LLM, Fanar 2.0 introduces a rich stack of new capabilities. FanarGuard is a state-of-the-art 4B bilingual moderation filter for Arabic safety and cultural alignment. The Aura speech family gains a long-form ASR model for hours-long audio. The Oryx vision family adds Arabic-aware image and video understanding alongside culturally grounded image generation. An agentic tool-calling framework enables multi-step workflows. Fanar-Sadiq utilizes a multi-agent architecture for Islamic content. Fanar-Diwan provides classical Arabic poetry generation. FanarShaheen delivers LLM-powered bilingual translation. A redesigned multi-layer orchestrator coordinates all components through intent-aware routing and defense-in-depth safety validation. Taken together, Fanar 2.0 demonstrates that sovereign, resource-constrained AI development can produce systems competitive with those built at far greater scale.

Title: Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic

Authors: Finnur Ágúst Ingimundarson, Steinunn Rut Friðriksdóttir, Bjarki Ármannsson, Iris Edda Nowenstein, Steinþór Steingrímsson
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16406
Pdf URL: https://arxiv.org/pdf/2603.16406
Copy Paste: [[2603.16406]] Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic(https://arxiv.org/abs/2603.16406)
Keywords: language model, llm
Abstract: This paper evaluates current Large Language Model (LLM) benchmarking for Icelandic, identifies problems, and calls for improved evaluation methods, particularly in low/medium-resource languages. We show that benchmarks that include synthetic or machine-translated data that have not been verified in any way commonly contain severely flawed test examples that are likely to skew the results and undermine the tests' validity. We warn against the use of such methods without verification in low/medium-resource settings, as the translation quality can, at best, only be as good as MT quality for a given language at any given time. Indeed, the results of our quantitative error analysis on existing benchmarks for Icelandic show clear differences between human-authored/-translated benchmarks and synthetic or machine-translated benchmarks.

Title: PlotTwist: A Creative Plot Generation Framework with Small Language Models

Authors: Abhinav Thorat, Ravi Kolla, Jyotin Goel, Niranjan Pedanekar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16410
Pdf URL: https://arxiv.org/pdf/2603.16410
Copy Paste: [[2603.16410]] PlotTwist: A Creative Plot Generation Framework with Small Language Models(https://arxiv.org/abs/2603.16410)
Keywords: language model, llm, prompt, agent
Abstract: Creative plot generation presents a fundamental challenge for language models: transforming a concise premise into a coherent narrative that sustains global structure, character development, and emotional resonance. Although recent Large Language Models (LLMs) demonstrate strong fluency across general-purpose tasks, they typically require preference alignment to perform well on specialized domains such as creative plot generation. However, conducting such alignment at the scale of frontier LLMs is computationally prohibitive, significantly limiting accessibility and practical deployment. To address this, we present PlotTwist, a structured framework that enables Small Language Models (SLMs) with ≤ 5B active parameters to generate high-quality, premise-conditioned plots competitive with frontier systems up to 200× larger. Our approach decomposes generation into three specialized components: (1) an Aspect Rating Reward Model trained via a novel Positive-Negative prompting strategy to deliver structured narratives across five Narrative Quality Dimensions (NQDs); (2) a Mixture-of-Experts (MoE) plot generator aligned via Direct Preference Optimization on high-confidence preference pairs; and (3) an Agentic Evaluation module that emulates human critical judgment for unbiased post-hoc assessment. Extensive experiments demonstrate that PlotTwist consistently outperforms frontier models across multiple NQDs despite substantially tighter capacity constraints. Further validation confirms strong sensitivity to narrative quality, as the framework reliably distinguishes plots derived from critically acclaimed versus widely panned screenplays. Together, these results establish structured, preference-based alignment as a resource-efficient approach to high-quality creative plot generation.

Title: RECOVER: Robust Entity Correction via agentic Orchestration of hypothesis Variants for Evidence-based Recovery

Authors: Abhishek Kumar, Aashraya Sachdeva
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2603.16411
Pdf URL: https://arxiv.org/pdf/2603.16411
Copy Paste: [[2603.16411]] RECOVER: Robust Entity Correction via agentic Orchestration of hypothesis Variants for Evidence-based Recovery(https://arxiv.org/abs/2603.16411)
Keywords: language model, llm, agent
Abstract: Entity recognition in Automatic Speech Recognition (ASR) is challenging for rare and domain-specific terms. In domains such as finance, medicine, and air traffic control, these errors are costly. If the entities are entirely absent from the ASR output, post-ASR correction becomes difficult. To address this, we introduce RECOVER, an agentic correction framework that serves as a tool-using agent. It leverages multiple ASR hypotheses as evidence, retrieves relevant entities, and applies Large Language Model (LLM) correction under constraints. The hypotheses are used with different strategies, namely 1-Best, Entity-Aware Select, Recognizer Output Voting Error Reduction (ROVER) Ensemble, and LLM-Select. Evaluated across five diverse datasets, it achieves 8-46% relative reductions in entity-phrase word error rate (E-WER) and increases recall by up to 22 percentage points. LLM-Select achieves the best overall performance in entity correction while maintaining overall WER.
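The results above are stated as relative reductions in a word-error-rate metric. As a reminder of how such numbers are formed, the sketch below computes standard word-level WER with edit distance, plus a relative reduction between two rates. How the paper restricts the metric to entity phrases (E-WER) is not described in the abstract, so that part is not modeled; the function names are illustrative.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers `baseline` (0.46 means 46%)."""
    return (baseline - improved) / baseline
```

A relative reduction is measured against the baseline rate, so it is not the same quantity as the percentage-point gain quoted for recall.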

Title: IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time

Authors: Zhenghua Bao, Yi Shi
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.16415
Pdf URL: https://arxiv.org/pdf/2603.16415
Copy Paste: [[2603.16415]] IndexRAG: Bridging Facts for Cross-Document Reasoning at Index Time(https://arxiv.org/abs/2603.16415)
Keywords: llm, retrieval-augmented generation
Abstract: Multi-hop question answering (QA) requires reasoning across multiple documents, yet existing retrieval-augmented generation (RAG) approaches address this either through graph-based methods requiring additional online processing or iterative multi-step reasoning. We present IndexRAG, a novel approach that shifts cross-document reasoning from online inference to offline indexing. IndexRAG identifies bridge entities shared across documents and generates bridging facts as independently retrievable units, requiring no additional training or fine-tuning. Experiments on three widely-used multi-hop QA benchmarks (HotpotQA, 2WikiMultiHopQA, MuSiQue) show that IndexRAG improves F1 over Naive RAG by 4.6 points on average, while requiring only single-pass retrieval and a single LLM call at inference time. When combined with IRCoT, IndexRAG outperforms all graph-based baselines on average, including HippoRAG and FastGraphRAG, while relying solely on flat retrieval. Our code will be released upon acceptance.
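The abstract describes identifying bridge entities shared across documents at index time. A minimal sketch of that first step follows, assuming entity extraction has already been done and is supplied as a mapping from document id to entity set. The function name and data layout are hypothetical, and generating the bridging facts themselves (an LLM step in the paper) is left out.

```python
from collections import defaultdict


def find_bridge_entities(doc_entities):
    """Return entities that occur in two or more documents, with their doc ids."""
    entity_to_docs = defaultdict(set)
    for doc_id, entities in doc_entities.items():
        for entity in entities:
            entity_to_docs[entity].add(doc_id)
    # An entity "bridges" documents only if at least two documents mention it.
    return {
        entity: sorted(doc_ids)
        for entity, doc_ids in entity_to_docs.items()
        if len(doc_ids) >= 2
    }
```

Because this runs once over the corpus before any query arrives, its cost is paid offline, which matches the abstract's point about shifting work from inference to indexing.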

Title: EngGPT2: Sovereign, Efficient and Open Intelligence

Authors: G. Ciarfaglia, A. Rosanova, S. Cipolla, J. Bartoli, A. Di Domenico, C. Fioroni, A. Fontana, M. R. Scoleri, M. I. Mone, D. Franchi, M. C. Del Gaudio, F. Picariello, M. Gabusi, S. Bonura, V. Morreale, I. Bailo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16430
Pdf URL: https://arxiv.org/pdf/2603.16430
Copy Paste: [[2603.16430]] EngGPT2: Sovereign, Efficient and Open Intelligence(https://arxiv.org/abs/2603.16430)
Keywords: gpt, llm
Abstract: EngGPT2-16B-A3B is the latest iteration of Engineering Group's Italian LLM and is built to be a Sovereign, Efficient and Open model. EngGPT2 is trained on 2.5 trillion tokens (fewer than Qwen3's 36T or Llama3's 15T) and delivers performance on key benchmarks, including MMLU-Pro, GSM8K, IFEval and HumanEval, comparable to dense models in the 8B-16B range, while requiring one-fifth to half of the inference power, and between one-tenth and one-sixth of the training data and the consequent training power. Designed as a trained-from-scratch Mixture-of-Experts (MoE) architecture, EngGPT2 features 16 billion parameters with 3 billion active per inference, with expert sizes positioned between those used in GPT-OSS and Qwen3. Approximately 25% of its training corpus consists of Italian-language data, to deliver strong capabilities for European and Italian NLP tasks among models of similar scale. This efficiency aims to position EngGPT2 as a key contributor to the growing portfolio of open-weight European models, combining performance and efficiency with full alignment to the EU AI Act. EngGPT2 is also a single model capable of multiple reasoning modes: non-reasoning, reasoning in Italian or English, and turbo-reasoning (a concise, bullet-point style of reasoning available in both languages and designed for real-time reasoning use cases). EngGPT2 aims to set a new standard for resource-conscious, high-performance LLMs tailored to European and Italian contexts.

Title: VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization

Authors: Yixuan Wang, Qingyu Shi, Jiayu Zhou, Dianbo Liu, Ziwei He, Zhouhan Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16435
Pdf URL: https://arxiv.org/pdf/2603.16435
Copy Paste: [[2603.16435]] VQKV: High-Fidelity and High-Ratio Cache Compression via Vector-Quantization(https://arxiv.org/abs/2603.16435)
Keywords: language model, llm
Abstract: The growing context length of Large Language Models (LLMs) enlarges the Key-Value (KV) cache, limiting deployment in resource-limited environments. Prior training-free approaches for KV cache compression typically rely on low-rank approximation or scalar quantization, which fail to simultaneously achieve high compression ratios and high reconstruction fidelity. We propose VQKV, a novel, training-free method introducing vector quantization (VQ) to obtain highly compressed KV representations while preserving high model fidelity, allowing for the representation of thousands of floating-point values with just a few integer indices. As a result, VQKV achieves an 82.8% compression ratio on LLaMA3.1-8B while retaining 98.6% of the baseline performance on LongBench and enabling 4.3x longer generation length on the same memory footprint.
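To make the phrase "floating-point values with just a few integer indices" concrete, here is a generic vector-quantization sketch: each vector is replaced by the index of its nearest codebook entry and reconstructed by lookup. This is textbook VQ with a fixed, hand-written codebook, not VQKV's actual codebook construction or cache layout, which the abstract does not specify; all names are illustrative.

```python
def nearest_index(vector, codebook):
    """Index of the codebook entry closest to `vector` in squared L2 distance."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: squared_distance(vector, codebook[i]))


def vq_encode(vectors, codebook):
    """Compress: store one integer index per vector instead of its floats."""
    return [nearest_index(v, codebook) for v in vectors]


def vq_decode(indices, codebook):
    """Reconstruct: look each index up in the shared codebook."""
    return [codebook[i] for i in indices]
```

For example, with the codebook `[[0.0, 0.0], [1.0, 1.0]]`, the vector `[0.9, 1.2]` encodes to index 1 and decodes to `[1.0, 1.0]`; the gap between the original and the decoded vector is the reconstruction error that a better codebook would shrink.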

Title: DynHD: Hallucination Detection for Diffusion Large Language Models via Denoising Dynamics Deviation Learning

Authors: Yanyu Qian, Yue Tan, Yixin Liu, Wang Yu, Shirui Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16459
Pdf URL: https://arxiv.org/pdf/2603.16459
Copy Paste: [[2603.16459]] DynHD: Hallucination Detection for Diffusion Large Language Models via Denoising Dynamics Deviation Learning(https://arxiv.org/abs/2603.16459)
Keywords: language model, llm, hallucination
Abstract: Diffusion large language models (D-LLMs) have emerged as a promising alternative to auto-regressive models due to their iterative refinement capabilities. However, hallucinations remain a critical issue that hinders their reliability. To detect hallucinated responses in model outputs, token-level uncertainty (e.g., entropy) has been widely used as an effective signal to indicate potential factual errors. Nevertheless, the fixed-length generation paradigm of D-LLMs implies that tokens contribute unevenly to hallucination detection, with only a small subset providing meaningful signals. Moreover, the evolution trend of uncertainty throughout the diffusion process can also provide important signals, highlighting the necessity of modeling its denoising dynamics for hallucination detection. In this paper, we propose DynHD, which bridges these gaps from both spatial (token sequence) and temporal (denoising dynamics) perspectives. To address the information density imbalance across tokens, we propose a semantic-aware evidence construction module that extracts hallucination-indicative signals by filtering out non-informative tokens and emphasizing semantically meaningful ones. To model denoising dynamics for hallucination detection, we introduce a reference evidence generator that learns the expected evolution trajectory of uncertainty evidence, along with a deviation-based hallucination detector that makes predictions by measuring the discrepancy between the observed and reference trajectories. Extensive experiments demonstrate that DynHD consistently outperforms state-of-the-art baselines while achieving higher efficiency across multiple benchmarks and backbone models.
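The token-level uncertainty signal mentioned above is commonly the Shannon entropy of the model's predictive distribution at one position. A minimal sketch follows, assuming the distribution is given as a list of probabilities that sum to 1; the function name is illustrative, and DynHD's evidence construction and trajectory modeling are not reproduced here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one token's predictive distribution."""
    # Zero-probability entries contribute nothing, so skip them to avoid log(0).
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A distribution concentrated on one token has entropy 0, while a uniform distribution over k tokens has entropy log(k), so higher values indicate a less certain prediction.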

Title: On the Emotion Understanding of Synthesized Speech

Authors: Yuan Ge, Haishu Zhao, Aokai Hao, Junxiang Zhang, Bei Li, Xiaoqian Liu, Chenglong Wang, Jianjin Wang, Bingsen Zhou, Bingyu Liu, Jingbo Zhu, Zhengtao Yu, Tong Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16483
Pdf URL: https://arxiv.org/pdf/2603.16483
Copy Paste: [[2603.16483]] On the Emotion Understanding of Synthesized Speech(https://arxiv.org/abs/2603.16483)
Keywords: language model
Abstract: Emotion is a core paralinguistic feature in voice interaction. It is widely believed that emotion understanding models learn fundamental representations that transfer to synthesized speech, making emotion understanding results a plausible reward or evaluation metric for assessing emotional expressiveness in speech synthesis. In this work, we critically examine this assumption by systematically evaluating Speech Emotion Recognition (SER) on synthesized speech across datasets, discriminative and generative SER models, and diverse synthesis models. We find that current SER models cannot generalize to synthesized speech, largely because speech token prediction during synthesis induces a representation mismatch between synthesized and human speech. Moreover, generative Speech Language Models (SLMs) tend to infer emotion from textual semantics while ignoring paralinguistic cues. Overall, our findings suggest that existing SER models often exploit non-robust shortcuts rather than capturing fundamental features, and paralinguistic understanding in SLMs remains challenging.

Title: AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents

Authors: Shannan Yan, Jingchen Ni, Leqi Zheng, Jiajun Zhang, Peixi Wu, Dacheng Yin, Jing Lyu, Chun Yuan, Fengyun Rao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16496
Pdf URL: https://arxiv.org/pdf/2603.16496
Copy Paste: [[2603.16496]] AdaMem: Adaptive User-Centric Memory for Long-Horizon Dialogue Agents(https://arxiv.org/abs/2603.16496)
Keywords: language model, llm, agent
Abstract: Large language model (LLM) agents increasingly rely on external memory to support long-horizon interaction, personalized assistance, and multi-step reasoning. However, existing memory systems still face three core challenges: they often rely too heavily on semantic similarity, which can miss evidence crucial for user-centric understanding; they frequently store related experiences as isolated fragments, weakening temporal and causal coherence; and they typically use static memory granularities that do not adapt well to the requirements of different questions. We propose AdaMem, an adaptive user-centric memory framework for long-horizon dialogue agents. AdaMem organizes dialogue history into working, episodic, persona, and graph memories, enabling the system to preserve recent context, structured long-term experiences, stable user traits, and relation-aware connections within a unified framework. At inference time, AdaMem first resolves the target participant, then builds a question-conditioned retrieval route that combines semantic retrieval with relation-aware graph expansion only when needed, and finally produces the answer through a role-specialized pipeline for evidence synthesis and response generation. We evaluate AdaMem on the LoCoMo and PERSONAMEM benchmarks for long-horizon reasoning and user modeling. Experimental results show that AdaMem achieves state-of-the-art performance on both benchmarks. The code will be released upon acceptance.

Title: How often do Answers Change? Estimating Recency Requirements in Question Answering

Authors: Bhawna Piryani, Zehra Mert, Adam Jatowt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16544
Pdf URL: https://arxiv.org/pdf/2603.16544
Copy Paste: [[2603.16544]] How often do Answers Change? Estimating Recency Requirements in Question Answering(https://arxiv.org/abs/2603.16544)
Keywords: language model, llm
Abstract: Large language models (LLMs) often rely on outdated knowledge when answering time-sensitive questions, leading to confident yet incorrect responses. Without explicit signals indicating whether up-to-date information is required, models struggle to decide when to retrieve external evidence, how to reason about stale facts, and how to rank answers by their validity. Existing benchmarks either periodically refresh answers or rely on fixed templates, but they do not reflect on how frequently answers change or whether a question inherently requires up-to-date information. To address this gap, we introduce a recency-stationarity taxonomy that categorizes questions by how often their answers change and whether this change frequency is time-invariant or context-dependent. Building on this taxonomy, we present RecencyQA, a dataset of 4,031 open-domain questions annotated with recency and stationarity labels. Through human evaluation and empirical analysis, we show that non-stationary questions, i.e., those where context changes the recency requirement, are significantly more challenging for LLMs, with difficulty increasing as update frequency rises. By explicitly modeling recency and context dependence, RecencyQA enables fine-grained benchmarking and analysis of temporal reasoning beyond binary notions of freshness, and provides a foundation for developing recency-aware and context-sensitive question answering systems.

Title: DanceHA: A Multi-Agent Framework for Document-Level Aspect-Based Sentiment Analysis

Authors: Lei Wang, Min Huang, Eduard Dragut
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16546
Pdf URL: https://arxiv.org/pdf/2603.16546
Copy Paste: [[2603.16546]] DanceHA: A Multi-Agent Framework for Document-Level Aspect-Based Sentiment Analysis(https://arxiv.org/abs/2603.16546)
Keywords: agent
Abstract: Aspect-Based Sentiment Intensity Analysis (ABSIA) has garnered increasing attention, though research largely focuses on domain-specific, sentence-level settings. In contrast, document-level ABSIA--particularly in addressing complex tasks like extracting Aspect-Category-Opinion-Sentiment-Intensity (ACOSI) tuples--remains underexplored. In this work, we introduce DanceHA, a multi-agent framework designed for open-ended, document-level ABSIA with informal writing styles. DanceHA has two main components: Dance, which employs a divide-and-conquer strategy to decompose the long-context ABSIA task into smaller, manageable sub-tasks for collaboration among specialized agents; and HA, Human-AI collaboration for annotation. We release Inf-ABSIA, a multi-domain document-level ABSIA dataset featuring fine-grained and high-accuracy labels from DanceHA. Extensive experiments demonstrate the effectiveness of our agentic framework and show that the multi-agent knowledge in DanceHA can be effectively transferred into student models. Our results highlight the importance of the overlooked informal styles in ABSIA, as they often intensify opinions tied to specific aspects.

Title: EmoLLM: Appraisal-Grounded Cognitive-Emotional Co-Reasoning in Large Language Models

Authors: Yifei Zhang, Mingyang Li, Henry Gao, Liang Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16553
Pdf URL: https://arxiv.org/pdf/2603.16553
Copy Paste: [[2603.16553]] EmoLLM: Appraisal-Grounded Cognitive-Emotional Co-Reasoning in Large Language Models(https://arxiv.org/abs/2603.16553)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate strong cognitive intelligence (IQ), yet many real-world interactions also require emotional intelligence (EQ) to produce responses that are both factually reliable and emotionally appropriate. In settings such as emotional support, technical assistance, and consultation, effective dialogue depends on how situations are appraised with respect to the user's needs, goals, and coping capacity. Inspired by appraisal theory, we propose EmoLLM, an appraisal-grounded framework for IQ/EQ co-reasoning in dialogue. EmoLLM uses an explicit Appraisal Reasoning Graph (ARG) to structure intermediate reasoning over contextual facts, inferred user needs, appraisal dimensions, emotional states, and response strategies before generating a reply. We train EmoLLM in a multi-turn role-play environment with reinforcement learning, where reverse-perspective reasoning provides reward signals based on predicted user-side consequences of responses. Across diverse dialogue settings, EmoLLM improves emotional state outcomes and response quality over strong baselines while preserving strong factual reliability.

Title: Characterizing Delusional Spirals through Human-LLM Chat Logs

Authors: Jared Moore, Ashish Mehta, William Agnew, Jacy Reese Anthis, Ryan Louie, Yifan Mai, Peggy Yin, Myra Cheng, Samuel J Paech, Kevin Klyman, Stevie Chancellor, Eric Lin, Nick Haber, Desmond C. Ong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16567
Pdf URL: https://arxiv.org/pdf/2603.16567
Copy Paste: [[2603.16567]] Characterizing Delusional Spirals through Human-LLM Chat Logs(https://arxiv.org/abs/2603.16567)
Keywords: language model, llm, chat
Abstract: As large language models (LLMs) have proliferated, disturbing anecdotal reports of negative psychological effects, such as delusions, self-harm, and "AI psychosis," have emerged in global media and legal discourse. However, it remains unclear how users and chatbots interact over the course of lengthy delusional "spirals," limiting our ability to understand and mitigate the harm. In our work, we analyze logs of conversations with LLM chatbots from 19 users who report having experienced psychological harms from chatbot use. Many of our participants come from a support group for such chatbot users. We also include chat logs from participants covered by media outlets in widely distributed stories about chatbot-reinforced delusions. In contrast to prior work that speculates on potential AI harms to mental health, to our knowledge we present the first in-depth study of such high-profile and veridically harmful cases. We develop an inventory of 28 codes and apply it to the 391,562 messages in the logs. Codes include whether a user demonstrates delusional thinking (15.5% of user messages), a user expresses suicidal thoughts (69 validated user messages), or a chatbot misrepresents itself as sentient (21.2% of chatbot messages). We analyze the co-occurrence of message codes. We find, for example, that messages that declare romantic interest and messages where the chatbot describes itself as sentient occur much more often in longer conversations, suggesting that these topics could promote or result from user over-engagement and that safeguards in these areas may degrade in multi-turn settings. We conclude with concrete recommendations for how policymakers, LLM chatbot developers, and users can use our inventory and conversation analysis tool to understand and mitigate harm from LLM chatbots. Warning: This paper discusses self-harm, trauma, and violence.
Summary: With the spread of large language models (LLMs), troubling anecdotal reports of negative psychological effects, including delusions, self-harm, and "AI psychosis," have appeared in global media and legal discourse. It is still unclear, however, how users and chatbots interact during long delusional "spirals," which limits our ability to understand and reduce the harm. We analyze conversation logs between LLM chatbots and 19 users who report psychological harm from chatbot use. Many participants come from a support group for such users, and we also include chat logs from participants featured in widely circulated media stories about chatbot-reinforced delusions. Unlike earlier work that speculates about potential AI harms to mental health, this is, to our knowledge, the first in-depth study of such high-profile and genuinely harmful cases. We develop an inventory of 28 codes and apply it to the 391,562 messages in the logs. The codes include whether a user shows delusional thinking (15.5% of user messages), whether a user expresses suicidal thoughts (69 validated user messages), and whether a chatbot misrepresents itself as sentient (21.2% of chatbot messages). We analyze how message codes co-occur. For example, messages declaring romantic interest and messages in which the chatbot describes itself as sentient appear much more often in longer conversations, which suggests that these topics may promote or result from user over-engagement and that safeguards in these areas may degrade in multi-turn settings. We close with concrete recommendations for how policymakers, LLM chatbot developers, and users can use our inventory and conversation analysis tool to understand and mitigate harm from LLM chatbots. Warning: This paper discusses self-harm, trauma, and violence.

Title: Diverging Transformer Predictions for Human Sentence Processing: A Comprehensive Analysis of Agreement Attraction Effects

Authors: Titus von der Malsburg, Sebastian Padó
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16574
Pdf URL: https://arxiv.org/pdf/2603.16574
Copy Paste: [[2603.16574]] Diverging Transformer Predictions for Human Sentence Processing: A Comprehensive Analysis of Agreement Attraction Effects(https://arxiv.org/abs/2603.16574)
Keywords: language model
Abstract: Transformers underlie almost all state-of-the-art language models in computational linguistics, yet their cognitive adequacy as models of human sentence processing remains disputed. In this work, we use a surprisal-based linking mechanism to systematically evaluate eleven autoregressive transformers of varying sizes and architectures on a more comprehensive set of English agreement attraction configurations than prior work. Our experiments yield mixed results: While transformer predictions generally align with human reading time data for prepositional phrase configurations, performance degrades significantly on object-extracted relative clause configurations. In the latter case, predictions also diverge markedly across models, and no model successfully replicates the asymmetric interference patterns observed in humans. We conclude that current transformer models do not explain human morphosyntactic processing, and that evaluations of transformers as cognitive models must adopt rigorous, comprehensive experimental designs to avoid spurious generalizations from isolated syntactic configurations or individual models.
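Summary: Transformers underlie almost all state-of-the-art language models in computational linguistics, yet their cognitive adequacy as models of human sentence processing remains disputed. In this work, we use a surprisal-based linking mechanism to systematically evaluate eleven autoregressive transformers of different sizes and architectures on a set of English agreement attraction configurations that is more comprehensive than in prior work. The experiments give mixed results: transformer predictions generally match human reading-time data for prepositional phrase configurations, but performance drops significantly on object-extracted relative clause configurations. In the latter case, predictions also diverge markedly across models, and no model reproduces the asymmetric interference patterns observed in humans. We conclude that current transformer models do not explain human morphosyntactic processing, and that evaluations of transformers as cognitive models must use rigorous, comprehensive experimental designs to avoid spurious generalizations from isolated syntactic configurations or individual models.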

Title: BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization

Authors: Ji-Fu Li, Manyi Zhang, Xiaobo Xia, Han Bao, Haoli Bai, Zhenhua Dong, Xianzhi Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16590
Pdf URL: https://arxiv.org/pdf/2603.16590
Copy Paste: [[2603.16590]] BATQuant: Outlier-resilient MXFP4 Quantization via Learnable Block-wise Optimization(https://arxiv.org/abs/2603.16590)
Keywords: language model, llm
Abstract: Microscaling floating-point (MXFP) formats have emerged as a promising standard for deploying Multi-modal Large Language Models (MLLMs) and Large Language Models (LLMs) on modern accelerator architectures. However, existing Post-Training Quantization (PTQ) methods, particularly rotation-based techniques designed for integer formats, suffer from severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently transfer outlier energy across quantization blocks, inducing new outliers that disrupt local block-wise scaling, while often creating bimodal activation distributions that underutilize the limited quantization range. To address these issues, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to align with MXFP granularity to prevent cross-block outlier propagation, while relaxing orthogonality constraints to optimize distribution shaping. To ensure parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition to effectively reduce storage and runtime overhead, and we incorporate Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on both MLLMs and LLMs demonstrate that BATQuant establishes new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.
Summary: Microscaling floating-point (MXFP) formats have become a promising standard for deploying multi-modal large language models (MLLMs) and large language models (LLMs) on modern accelerator architectures. Existing post-training quantization (PTQ) methods, especially rotation-based techniques designed for integer formats, suffer severe performance collapse when applied to MXFP4. Recent studies attribute this failure to a fundamental format mismatch: global orthogonal rotations inadvertently move outlier energy across quantization blocks, which induces new outliers that disrupt local block-wise scaling and often creates bimodal activation distributions that underuse the limited quantization range. To address these problems, we propose BATQuant (Block-wise Affine Transformation), which restricts transformations to match MXFP granularity so that outliers do not propagate across blocks, and relaxes orthogonality constraints to optimize distribution shaping. For parameter efficiency, we introduce Global and Private Kronecker (GPK) decomposition to reduce storage and runtime overhead, and we add Block-wise Learnable Clipping to suppress residual outliers. Extensive experiments on MLLMs and LLMs show that BATQuant sets new state-of-the-art results under aggressive W4A4KV16 configurations, recovering up to 96.43% of full-precision performance on multimodal benchmarks and clearly outperforming existing methods across diverse tasks.

Title: Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech

Authors: Omnilingual SONAR Team: João Maria Janeiro, Pere-Lluís Huguet Cabot, Ioannis Tsiamas, Yen Meng, Vivek Iyer, Guillem Ramírez, Loic Barrault, Belen Alastruey, Yu-An Chung, Marta R. Costa-Jussa, David Dale, Kevin Heffernan, Jaehyeong Jo, Artyom Kozhevnikov, Alexandre Mourachko, Christophe Ropers, Holger Schwenk, Paul-Ambroise Duquenne
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16606
Pdf URL: https://arxiv.org/pdf/2603.16606
Copy Paste: [[2603.16606]] Omnilingual SONAR: Cross-Lingual and Cross-Modal Sentence Embeddings Bridging Massively Multilingual Text and Speech(https://arxiv.org/abs/2603.16606)
Keywords: llm
Abstract: Cross-lingual sentence encoders typically cover only a few hundred languages and often trade downstream quality for stronger alignment, limiting their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space, while delivering state-of-the-art downstream performance at the scale of thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. Building on this foundation, we expand to several thousand language varieties via a two-stage teacher-student encoder distillation framework. Finally, we demonstrate the cross-modal extensibility of this space by seamlessly mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on BIBLE translation from 1,560 languages into English. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, OmniSONAR achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, despite being zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text processing OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and speech for complex downstream tasks.
Summary: Cross-lingual sentence encoders usually cover only a few hundred languages and often sacrifice downstream quality for stronger alignment, which limits their adoption. We introduce OmniSONAR, a new family of omnilingual, cross-lingual, and cross-modal sentence embedding models that natively embed text, speech, code, and mathematical expressions in a single semantic space while delivering state-of-the-art downstream performance across thousands of languages, from high-resource to extremely low-resource varieties. To reach this scale without representation collapse, we use progressive training. We first learn a strong foundational space for 200 languages with an LLM-initialized encoder-decoder, combining token-level decoding with a novel split-softmax contrastive loss and synthetic hard negatives. We then expand to several thousand language varieties through a two-stage teacher-student encoder distillation framework. Finally, we show the cross-modal extensibility of the space by mapping 177 spoken languages into it. OmniSONAR halves cross-lingual similarity search error on the 200-language FLORES dataset and reduces error by a factor of 15 on the 1,560-language BIBLE benchmark. It also enables strong translation, outperforming NLLB-3B on multilingual benchmarks and exceeding prior models (including much larger LLMs) by 15 chrF++ points on BIBLE translation from 1,560 languages into English. OmniSONAR also performs strongly on MTEB and XLCoST. For speech, it achieves a 43% lower similarity-search error and reaches 97% of SeamlessM4T speech-to-text quality, even though it is zero-shot for translation (trained only on ASR data). Finally, by training an encoder-decoder LM, Spectrum, exclusively on English text to process OmniSONAR embedding sequences, we unlock high-performance transfer to thousands of languages and to speech for complex downstream tasks.

Title: Domain Mixture Design via Log-Likelihood Differences for Aligning Language Models with a Target Model

Authors: Ryo Kishino, Riku Shiomi, Hiroaki Yamagiwa, Momose Oyama, Hidetoshi Shimodaira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16622
Pdf URL: https://arxiv.org/pdf/2603.16622
Copy Paste: [[2603.16622]] Domain Mixture Design via Log-Likelihood Differences for Aligning Language Models with a Target Model(https://arxiv.org/abs/2603.16622)
Keywords: language model, gpt
Abstract: Instead of directly distilling a language model, this study addresses the problem of aligning a base model with a target model in distribution by designing the domain mixture of training data for pretraining or continued pretraining as a fixed training recipe. We propose a method for determining domain weights by viewing models as points in log-likelihood space and aligning the training update direction with the direction toward the target model. Experiments with NanoGPT show that the proposed method consistently reduces the KL divergence to the target model compared with uniform weighting over the Pile. Although knowledge distillation remains more effective when available, the proposed method still achieves meaningful alignment, and downstream task performance also tends to become closer to that of the target model.
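Summary: Rather than directly distilling a language model, this study addresses the problem of aligning a base model with a target model in distribution by designing the domain mixture of the training data for pretraining or continued pretraining as a fixed training recipe. We propose a method that determines domain weights by viewing models as points in log-likelihood space and aligning the training update direction with the direction toward the target model. Experiments with NanoGPT show that, compared with uniform weighting over the Pile, the proposed method consistently reduces the KL divergence to the target model. Although knowledge distillation remains more effective when it is available, the proposed method still achieves meaningful alignment, and downstream task performance also tends to move closer to that of the target model.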

Title: Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy

Authors: Zhaoxin Feng, Zheng Chen, Jianfei Ma, Yip Tin Po, Emmanuele Chersoni, Bo Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16643
Pdf URL: https://arxiv.org/pdf/2603.16643
Copy Paste: [[2603.16643]] Good Arguments Against the People Pleasers: How Reasoning Mitigates (Yet Masks) LLM Sycophancy(https://arxiv.org/abs/2603.16643)
Keywords: llm, chain-of-thought
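Abstract: Alignment techniques often inadvertently induce sycophancy in LLMs. While prior studies examined this behaviour in direct-answer settings, the role of Chain-of-Thought (CoT) reasoning remains under-explored: does it serve as a logical constraint that mitigates sycophancy, or a tool for post-hoc rationalization that masks it? We evaluate a range of models across objective and subjective tasks to investigate the issue. Results show that reasoning generally reduces sycophancy in final decisions but also masks sycophancy in some samples, where models construct deceptive justifications through logical inconsistencies, calculation errors, one-sided arguments, and similar flaws. Furthermore, LLMs are more prone to sycophancy in subjective tasks and under authority bias. Our mechanistic analysis on three open-source models reveals that the tendency of sycophancy is dynamic during the reasoning process rather than being pre-determined at the input stage.
Summary: Alignment techniques often inadvertently induce sycophancy in LLMs. Prior studies examined this behaviour in direct-answer settings, but the role of Chain-of-Thought (CoT) reasoning is still under-explored: is it a logical constraint that mitigates sycophancy, or a tool for post-hoc rationalization that masks it? We evaluate a range of models on objective and subjective tasks to investigate this question. The results show that reasoning generally reduces sycophancy in final decisions but also masks it in some samples, where models build deceptive justifications through logical inconsistencies, calculation errors, one-sided arguments, and similar flaws. In addition, LLMs are more prone to sycophancy in subjective tasks and under authority bias. Our mechanistic analysis of three open-source models shows that the tendency toward sycophancy changes dynamically during the reasoning process rather than being fixed at the input stage.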

Title: Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Authors: Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.16654
Pdf URL: https://arxiv.org/pdf/2603.16654
Copy Paste: [[2603.16654]] Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models(https://arxiv.org/abs/2603.16654)
Keywords: language model, llm
Abstract: Reasoning-focused large language models (LLMs) have advanced in many NLP tasks, yet their evaluation remains challenging: final answers alone do not expose the intermediate reasoning steps, making it difficult to determine whether a model truly reasons correctly and where failures occur, while existing multi-hop QA benchmarks lack step-level annotations for diagnosing reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs achieve only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis reveals that CoT's performance hinges on factual completeness, with its gains diminishing under knowledge gaps and errors amplifying in later hops. Additionally, supervised fine-tuning on OmanicSynth brings substantial transfer gains (7.41 average points) across six reasoning and math benchmarks, validating the dataset's quality and further supporting the effectiveness of OmanicSynth as supervision for reasoning-capability transfer. We release the data at this https URL and the code at this https URL.
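Summary: Reasoning-focused large language models (LLMs) have advanced on many NLP tasks, but evaluating them remains difficult: final answers alone do not expose the intermediate reasoning steps, so it is hard to tell whether a model truly reasons correctly and where failures occur, and existing multi-hop QA benchmarks lack the step-level annotations needed to diagnose reasoning failures. To address this gap, we propose Omanic, an open-domain multi-hop QA resource that provides decomposed sub-questions and intermediate answers as structural annotations for analyzing reasoning processes. It contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed, human-annotated evaluation examples (OmanicBench). Systematic evaluations show that state-of-the-art LLMs reach only 73.11% multiple-choice accuracy on OmanicBench, confirming its high difficulty. Stepwise analysis shows that CoT's performance hinges on factual completeness: its gains shrink under knowledge gaps, and errors amplify in later hops. In addition, supervised fine-tuning on OmanicSynth yields substantial transfer gains (7.41 points on average) across six reasoning and math benchmarks, validating the dataset's quality and further supporting OmanicSynth as effective supervision for transferring reasoning capability. We release the data at this https URL and the code at this https URL.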

Title: Can Linguistically Related Languages Guide LLM Translation in Low-Resource Settings?

Authors: Aishwarya Ramasethu, Niyathi Allu, Rohin Garg, Harshwardhan Fartale, Dun Li Chan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16660
Pdf URL: https://arxiv.org/pdf/2603.16660
Copy Paste: [[2603.16660]] Can Linguistically Related Languages Guide LLM Translation in Low-Resource Settings?(https://arxiv.org/abs/2603.16660)
Keywords: language model, llm, prompt
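Abstract: Large Language Models (LLMs) have achieved strong performance across many downstream tasks, yet their effectiveness in extremely low-resource machine translation remains limited. Standard adaptation techniques typically rely on large-scale parallel data or extensive fine-tuning, which are infeasible for the long tail of underrepresented languages. In this work, we investigate a more constrained question: in data-scarce settings, to what extent can linguistically similar pivot languages and few-shot demonstrations provide useful guidance for on-the-fly adaptation in LLMs? We study a data-efficient experimental setup that combines linguistically related pivot languages with few-shot in-context examples, without any parameter updates, and evaluate translation behavior under controlled conditions. Our analysis shows that while pivot-based prompting can yield improvements in certain configurations, particularly in settings where the target language is less well represented in the model's vocabulary, the gains are often modest and sensitive to few-shot example construction. For closely related or better represented varieties, we observe diminishing or inconsistent gains. Our findings provide empirical guidance on how and when inference-time prompting and pivot-based examples can be used as a lightweight alternative to fine-tuning in low-resource translation settings.
Summary: Large language models (LLMs) perform strongly on many downstream tasks, but their effectiveness in extremely low-resource machine translation is still limited. Standard adaptation techniques usually depend on large-scale parallel data or extensive fine-tuning, which is infeasible for the long tail of underrepresented languages. In this work, we study a more constrained question: in data-scarce settings, how far can linguistically similar pivot languages and few-shot demonstrations provide useful guidance for on-the-fly adaptation in LLMs? We examine a data-efficient experimental setup that combines linguistically related pivot languages with few-shot in-context examples, without any parameter updates, and evaluate translation behavior under controlled conditions. Our analysis shows that pivot-based prompting can bring improvements in certain configurations, especially when the target language is poorly represented in the model's vocabulary, but the gains are often modest and sensitive to how the few-shot examples are constructed. For closely related or better represented varieties, the gains diminish or become inconsistent. Our findings offer empirical guidance on how and when inference-time prompting and pivot-based examples can serve as a lightweight alternative to fine-tuning in low-resource translation settings.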

Title: Arabic Morphosyntactic Tagging and Dependency Parsing with Large Language Models

Authors: Mohamed Adel, Bashar Alhafni, Nizar Habash
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16718
Pdf URL: https://arxiv.org/pdf/2603.16718
Copy Paste: [[2603.16718]] Arabic Morphosyntactic Tagging and Dependency Parsing with Large Language Models(https://arxiv.org/abs/2603.16718)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) perform strongly on many NLP tasks, but their ability to produce explicit linguistic structure remains unclear. We evaluate instruction-tuned LLMs on two structured prediction tasks for Standard Arabic: morphosyntactic tagging and labeled dependency parsing. Arabic provides a challenging testbed due to its rich morphology and orthographic ambiguity, which create strong morphology-syntax interactions. We compare zero-shot prompting with retrieval-based in-context learning (ICL) using examples from Arabic treebanks. Results show that prompt design and demonstration selection strongly affect performance: proprietary models approach supervised baselines for feature-level tagging and become competitive with specialized dependency parsers. In raw-text settings, tokenization remains challenging, though retrieval-based ICL improves both parsing and tokenization. Our analysis highlights which aspects of Arabic morphosyntax and syntax LLMs capture reliably and which remain difficult.
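Summary: Large language models (LLMs) perform strongly on many NLP tasks, but their ability to produce explicit linguistic structure remains unclear. We evaluate instruction-tuned LLMs on two structured prediction tasks for Standard Arabic: morphosyntactic tagging and labeled dependency parsing. Arabic is a challenging testbed because its rich morphology and orthographic ambiguity create strong interactions between morphology and syntax. We compare zero-shot prompting with retrieval-based in-context learning (ICL) using examples from Arabic treebanks. The results show that prompt design and demonstration selection strongly affect performance: proprietary models approach supervised baselines for feature-level tagging and become competitive with specialized dependency parsers. In raw-text settings, tokenization remains challenging, although retrieval-based ICL improves both parsing and tokenization. Our analysis highlights which aspects of Arabic morphosyntax and syntax LLMs capture reliably and which remain difficult.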

Title: Probing Cultural Signals in Large Language Models through Author Profiling

Authors: Valentin Lafargue, Ariel Guerra-Adames, Emmanuelle Claeys, Elouan Vuichard, Jean-Michel Loubes
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.16749
Pdf URL: https://arxiv.org/pdf/2603.16749
Copy Paste: [[2603.16749]] Probing Cultural Signals in Large Language Models through Author Profiling(https://arxiv.org/abs/2603.16749)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed in applications with societal impact, raising concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting, inferring singers' gender and ethnicity without task-specific fine-tuning. Across several open-source models evaluated on more than 10,000 lyrics, we find that LLMs achieve non-trivial profiling performance but demonstrate systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek-1.5B aligns more strongly with Asian ethnicity. This finding emerges from both the models' prediction distributions and an analysis of their generated rationales. To quantify these disparities, we introduce two fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), and show that Ministral-8B displays the strongest ethnicity bias among the evaluated models, whereas Gemma-12B shows the most balanced behavior. Our code is available on GitHub (this https URL).
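Summary: Large language models (LLMs) are increasingly deployed in applications with societal impact, which raises concerns about the cultural biases they encode. We probe these representations by evaluating whether LLMs can perform author profiling from song lyrics in a zero-shot setting, inferring singers' gender and ethnicity without task-specific fine-tuning. Across several open-source models evaluated on more than 10,000 lyrics, we find that LLMs achieve non-trivial profiling performance but show systematic cultural alignment: most models default toward North American ethnicity, while DeepSeek-1.5B aligns more strongly with Asian ethnicity. This finding emerges both from the models' prediction distributions and from an analysis of their generated rationales. To quantify these disparities, we introduce two fairness metrics, Modality Accuracy Divergence (MAD) and Recall Divergence (RD), and show that Ministral-8B displays the strongest ethnicity bias among the evaluated models, whereas Gemma-12B shows the most balanced behavior. Our code is available on GitHub (this https URL).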

Title: TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities

Authors: Victoria Graf, Valentina Pyatkin, Nouha Dziri, Nathan Lambert, Hannaneh Hajishirzi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.16759
Pdf URL: https://arxiv.org/pdf/2603.16759
Copy Paste: [[2603.16759]] TurnWise: The Gap between Single- and Multi-turn Language Model Capabilities(https://arxiv.org/abs/2603.16759)
Keywords: language model, chat
Abstract: Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings, failing to capture the additional dimension of these longer interactions. To understand this multi-/single-turn gap, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn specific conversational ability through pairwise comparison to equivalent single-turn settings. We additionally introduce our synthetic multi-turn data pipeline TurnWiseData which allows the scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training with multi-turn data is vital to achieving strong multi-turn chat performance, and that including as little as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.
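Summary: Multi-turn conversations are a common and critical mode of language model interaction. However, current open training and evaluation data focus on single-turn settings and fail to capture the additional dimension of these longer interactions. To understand this gap between multi-turn and single-turn settings, we first introduce a new benchmark, TurnWiseEval, for multi-turn capabilities that is directly comparable to single-turn chat evaluation. Our evaluation isolates multi-turn-specific conversational ability through pairwise comparison with equivalent single-turn settings. We also introduce our synthetic multi-turn data pipeline, TurnWiseData, which allows scalable generation of multi-turn training data. Our experiments with Olmo 3 show that training on multi-turn data is vital for strong multi-turn chat performance, and that including as few as 10k multi-turn conversations during post-training can lead to a 12% improvement on TurnWiseEval.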

Title: SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue

Authors: Jonggeun Lee, Junseong Pyo, Jeongmin Park, Yohan Jo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16783
Pdf URL: https://arxiv.org/pdf/2603.16783
Copy Paste: [[2603.16783]] SpokenUS: A Spoken User Simulator for Task-Oriented Dialogue(https://arxiv.org/abs/2603.16783)
Keywords: agent
Abstract: Robust task-oriented spoken dialogue agents require exposure to the full diversity of how people interact through speech. Building spoken user simulators that address this requires large-scale spoken task-oriented dialogue (TOD) data encompassing spoken user behaviors, yet existing datasets are limited in scale and domain coverage, with no systematic pipeline for augmenting them. To address this, we introduce SpokenTOD, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech augmented with four spoken user behaviors (cross-turn slots, barge-in, disfluency, and emotional prosody) across diverse speakers and domains. Building on SpokenTOD, we present SpokenUS, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS achieves comparable goal coverage to significantly larger models while substantially outperforming all baselines in Human MOS, disclosing slot values gradually across the dialogue as humans do rather than front-loading them. Further analysis confirms that SpokenUS's spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.
Summary: Robust task-oriented spoken dialogue agents need exposure to the full diversity of how people interact through speech. Building spoken user simulators that meet this need requires large-scale spoken task-oriented dialogue (TOD) data covering spoken user behaviors, but existing datasets are limited in scale and domain coverage, and there is no systematic pipeline for augmenting them. To address this, we introduce SpokenTOD, a spoken TOD dataset of 52,390 dialogues and 1,034 hours of speech, augmented with four spoken user behaviors (cross-turn slots, barge-in, disfluency, and emotional prosody) across diverse speakers and domains. Building on SpokenTOD, we present SpokenUS, a spoken user simulator grounded in TOD with a dedicated architecture for barge-in. SpokenUS reaches goal coverage comparable to that of significantly larger models while substantially outperforming all baselines in Human MOS, and it discloses slot values gradually across the dialogue, as humans do, rather than front-loading them. Further analysis confirms that SpokenUS's spoken behaviors pose meaningful challenges to downstream agents, making it a practical tool for training and evaluating more robust spoken dialogue systems.

Title: Mediocrity is the key for LLM as a Judge Anchor Selection

Authors: Shachar Don-Yehiya, Asaf Yehudai, Leshem Choshen, Omri Abend
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16848
Pdf URL: https://arxiv.org/pdf/2603.16848
Copy Paste: [[2603.16848]] Mediocrity is the key for LLM as a Judge Anchor Selection(https://arxiv.org/abs/2603.16848)
Keywords: llm
Abstract: The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To address the quadratic scalability costs of pairwise comparisons, popular benchmarks like Arena-Hard and AlpacaEval compare all models against a single anchor. However, despite its widespread use, the impact of anchor selection on the reliability of the results remains largely unexplored. In this work, we systematically investigate the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can dramatically reduce correlation with human rankings. We identify that common anchor choices (best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they are seldom indicative of the relative ranking of the models. We further quantify the effect size of anchor selection, showing it is comparable to the selection of a judge model. We conclude with actionable recommendations. First, we conduct a power analysis, and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are insufficient for pairwise evaluation and fail to distinguish between competitive models reliably. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation practices.
Summary: The "LLM-as-a-judge" paradigm has become a standard method for evaluating open-ended generation. To avoid the quadratic cost of pairwise comparisons, popular benchmarks such as Arena-Hard and AlpacaEval compare all models against a single anchor. Despite the widespread use of this approach, the impact of anchor selection on the reliability of the results is largely unexplored. In this work, we systematically study the effect of anchor selection by evaluating 22 different anchors on the Arena-Hard-v2.0 dataset. We find that the choice of anchor is critical: a poor anchor can sharply reduce correlation with human rankings. Common anchor choices (the best-performing and worst-performing models) make poor anchors. Because these extreme anchors are consistently better or worse than all other models, they rarely indicate the models' relative ranking. We further quantify the effect size of anchor selection and show that it is comparable to that of choosing the judge model. We conclude with actionable recommendations. First, we conduct a power analysis and compute sufficient benchmark sizes for anchor-based evaluation, finding that standard benchmark sizes are too small for pairwise evaluation and cannot reliably distinguish between competitive models. Second, we provide guidelines for selecting informative anchors to ensure reliable and efficient evaluation.

Title: Online Experiential Learning for Language Models

Authors: Tianzhu Ye, Li Dong, Qingxiu Dong, Xun Wu, Shaohan Huang, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16856
Pdf URL: https://arxiv.org/pdf/2603.16856
Copy Paste: [[2603.16856]] Online Experiential Learning for Language Models(https://arxiv.org/abs/2603.16856)
Keywords: language model
Abstract: The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that enables language models to continuously improve from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into model parameters via on-policy context distillation, requiring no access to the user-side environment. The two stages are iterated to form an online learning loop, where the improved model collects higher-quality trajectories that yield richer experiential knowledge for subsequent rounds. We evaluate OEL on text-based game environments across multiple model scales and both thinking and non-thinking variants. OEL achieves consistent improvements over successive iterations, enhancing both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.
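Summary: The prevailing paradigm for improving large language models relies on offline training with human annotations or simulated environments, leaving the rich experience accumulated during real-world deployment entirely unexploited. We propose Online Experiential Learning (OEL), a framework that lets language models improve continuously from their own deployment experience. OEL operates in two stages: first, transferable experiential knowledge is extracted and accumulated from interaction trajectories collected on the user side; second, this knowledge is consolidated into the model parameters through on-policy context distillation, which requires no access to the user-side environment. The two stages are iterated to form an online learning loop in which the improved model collects higher-quality trajectories that yield richer experiential knowledge for later rounds. We evaluate OEL on text-based game environments across multiple model scales and on both thinking and non-thinking variants. OEL improves consistently over successive iterations, raising both task accuracy and token efficiency while preserving out-of-distribution performance. Our analysis further shows that extracted experiential knowledge is significantly more effective than raw trajectories, and that on-policy consistency between the knowledge source and the policy model is critical for effective learning.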

Title: Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory

Authors: Sahil Sen, Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.16862
Pdf URL: https://arxiv.org/pdf/2603.16862
Copy Paste: [[2603.16862]] Chronos: Temporal-Aware Conversational Agents with Structured Event Retrieval for Long-Term Memory(https://arxiv.org/abs/2603.16862)
Keywords: language model, llm, prompt, agent
Abstract: Recent advances in Large Language Models (LLMs) have enabled conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction and lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEvalS benchmark comprising 500 questions spanning six categories of dialogue history tasks. Chronos Low achieves 92.60% and Chronos High scores 95.60% accuracy, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results reveal the events calendar accounts for a 58.9% gain on the baseline while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.
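Summary: Recent advances in large language models (LLMs) allow conversational AI agents to engage in extended multi-turn interactions spanning weeks or months. However, existing memory systems struggle to reason over temporally grounded facts and preferences that evolve across months of interaction, and they lack effective retrieval strategies for multi-hop, time-sensitive queries over long dialogue histories. We introduce Chronos, a novel temporal-aware memory framework that decomposes raw dialogue into subject-verb-object event tuples with resolved datetime ranges and entity aliases, indexing them in a structured event calendar alongside a turn calendar that preserves the full conversational context. At query time, Chronos applies dynamic prompting to generate tailored retrieval guidance for each question, directing the agent on what to retrieve, how to filter across time ranges, and how to approach multi-hop reasoning through an iterative tool-calling loop over both calendars. We evaluate Chronos with 8 LLMs, both open-source and closed-source, on the LongMemEvalS benchmark, which comprises 500 questions spanning six categories of dialogue history tasks. Chronos Low reaches 92.60% accuracy and Chronos High reaches 95.60%, setting a new state of the art with an improvement of 7.67% over the best prior system. Ablation results show that the events calendar accounts for a 58.9% gain over the baseline, while all other components yield improvements between 15.5% and 22.3%. Notably, Chronos Low alone surpasses prior approaches evaluated under their strongest model configurations.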