2026-03-13

Title: Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple

Authors: Amirhossein Bozorgkhoo, Igor Molybog
Subjects: cs.CL, cs.IT, cs.LG
Abstract URL: https://arxiv.org/abs/2603.11053
Pdf URL: https://arxiv.org/pdf/2603.11053
Copy Paste: [[2603.11053]] Speculative Decoding Scaling Laws (SDSL): Throughput Optimization Made Simple(https://arxiv.org/abs/2603.11053)
Keywords: language model, llm
Abstract: Speculative decoding is a technique that uses multiple language models to accelerate inference. Previous works have used an experimental approach to optimize the throughput of the inference pipeline, which involves LLM training and can be costly. This study of speculative decoding proposes a theory that analytically connects the key hyperparameters of pre-trained LLMs to the throughput efficiency of a downstream SD-based inference system. The theory allows the prediction of throughput-optimal hyperparameters for the components of an inference system before their pre-training.
摘要：推测性解码是一种使用多种语言模型来加速推理的技术。以前的工作使用实验方法来优化推理管道的吞吐量，这涉及 LLM 培训并且成本高昂。这项推测解码的研究提出了一种理论，该理论通过分析将预训练的 LLM 的关键超参数与下游基于 SD 的推理系统的吞吐量效率联系起来。该理论允许在预训练之前预测推理系统组件的吞吐量最佳超参数。
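For orientation, the throughput trade-off in speculative decoding can be sketched with the standard simplified model from earlier speculative decoding literature. This is not the SDSL theory itself, which the abstract does not spell out. The sketch assumes each drafted token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft-model pass costs `cost_ratio` times a target-model pass. All function and parameter names are illustrative.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model pass under i.i.d. acceptance."""
    if alpha >= 1.0:
        return float(gamma + 1)  # every draft token accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Expected wall-time improvement over plain autoregressive decoding.

    One step costs gamma draft passes plus one target pass, measured in
    units of a single target pass (assumed cost model).
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * cost_ratio + 1.0)


def best_gamma(alpha: float, cost_ratio: float, max_gamma: int = 32) -> int:
    """Draft length in 1..max_gamma that maximizes the modeled speedup."""
    return max(range(1, max_gamma + 1), key=lambda g: speedup(alpha, g, cost_ratio))


if __name__ == "__main__":
    for a in (0.6, 0.8, 0.9):
        g = best_gamma(a, cost_ratio=0.05)
        print(f"alpha={a}: best gamma={g}, speedup={speedup(a, g, 0.05):.2f}")
```

Under this model, a higher acceptance rate or a cheaper draft model favors longer drafts, which is the kind of hyperparameter dependence the abstract says the theory predicts analytically.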

Title: Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation

Authors: Jingtao Wang, Yucong Wang, Jun Ding, Rui Cai, Xun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.11067
Pdf URL: https://arxiv.org/pdf/2603.11067
Copy Paste: [[2603.11067]] Summarize Before You Speak with ARACH: A Training-Free Inference-Time Plug-In for Enhancing LLMs via Global Attention Reallocation(https://arxiv.org/abs/2603.11067)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) achieve remarkable performance, yet further gains often require costly training. This has motivated growing interest in post-training techniques, especially training-free approaches that improve models at inference time without updating weights. Most training-free methods treat the model as a black box and improve outputs via input/output-level interventions, such as prompt design and test-time scaling through repeated sampling, reranking/verification, or search. In contrast, they rarely offer a plug-and-play mechanism to intervene in a model's internal computation. We propose ARACH (Attention Reallocation via an Adaptive Context Hub), a training-free inference-time plug-in that augments LLMs with an adaptive context hub to aggregate context and reallocate attention. Extensive experiments across multiple language modeling tasks show consistent improvements with modest inference overhead and no parameter updates. Attention analyses further suggest that ARACH mitigates the attention sink phenomenon. These results indicate that engineering a model's internal computation offers a distinct inference-time strategy, fundamentally different from both prompt-based test-time methods and training-based post-training approaches.
摘要：大型语言模型 (LLM) 取得了显着的性能，但进一步的提升通常需要昂贵的培训。这激发了人们对训练后技术越来越感兴趣，尤其是无需更新权重即可在推理时改进模型的免训练方法。大多数免训练方法将模型视为黑匣子，并通过输入/输出级干预来提高输出，例如通过重复采样、重新排序/验证或搜索进行提示设计和测试时间缩放。相反，它们很少提供即插即用机制来干预模型的内部计算。我们提出了 ARACH（通过自适应上下文中心进行注意力重新分配），这是一个免训练的推理时间插件，它通过自适应上下文中心来增强法学硕士，以聚合上下文并重新分配注意力。跨多种语言建模任务的广泛实验表明，在适度的推理开销和无参数更新的情况下取得了一致的改进。注意力分析进一步表明 ARACH 减轻了注意力下沉现象。这些结果表明，设计模型的内部计算提供了一种独特的推理时间策略，与基于提示的测试时间方法和基于训练的后训练方法有着根本的不同。

Title: DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning

Authors: Hanxu Hu, Yuxuan Wang, Maggie Huan, Jannis Vamvas, Yinya Huang, Zhijiang Guo, Rico Sennrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11193
Pdf URL: https://arxiv.org/pdf/2603.11193
Copy Paste: [[2603.11193]] DeReason: A Difficulty-Aware Curriculum Improves Decoupled SFT-then-RL Training for General Reasoning(https://arxiv.org/abs/2603.11193)
Keywords: language model, llm
Abstract: Reinforcement learning with Verifiable Rewards (RLVR) has emerged as a powerful paradigm for eliciting reasoning capabilities in large language models, particularly in mathematics and coding. While recent efforts have extended this paradigm to broader general scientific (STEM) domains, the complex interplay between supervised fine-tuning (SFT) and RL in these contexts remains underexplored. In this paper, we conduct controlled experiments revealing a critical challenge: for general STEM domains, RL applied directly to base models is highly sample-inefficient and is consistently surpassed by supervised fine-tuning (SFT) on moderate-quality responses. Yet sequential SFT followed by RL can further improve performance, suggesting that the two stages play complementary roles, and that how training data is allocated between them matters. Therefore, we propose DeReason, a difficulty-based data decoupling strategy for general reasoning. DeReason partitions training data by reasoning intensity estimated via LLM-based scoring into reasoning-intensive and non-reasoning-intensive subsets. It allocates broad-coverage, non-reasoning-intensive problems to SFT to establish foundational domain knowledge, and reserves a focused subset of difficult problems for RL to cultivate complex reasoning. We demonstrate that this principled decoupling yields better performance than randomly splitting the data for sequential SFT and RL. Extensive experiments on general STEM and mathematical benchmarks demonstrate that our decoupled curriculum training significantly outperforms SFT-only, RL-only, and random-split baselines. Our work provides a systematic study of the interplay between SFT and RL for general reasoning, offering a highly effective and generalized post-training recipe.
摘要：带可验证奖励的强化学习（RLVR）已成为在大型语言模型（尤其是数学和编码领域）中激发推理能力的强大范例。虽然最近的努力已将这种范式扩展到更广泛的普通科学 (STEM) 领域，但在这些背景下监督微调 (SFT) 和强化学习之间的复杂相互作用仍未得到充分探索。在本文中，我们进行的对照实验揭示了一个关键挑战：对于一般的 STEM 领域，直接应用于基础模型的 RL 样本效率非常低，并且始终被中等质量响应的监督微调 (SFT) 所超越。然而，顺序 SFT 和 RL 可以进一步提高性能，这表明这两个阶段发挥着互补作用，并且训练数据在它们之间的分配方式很重要。因此，我们提出了 DeReason，一种用于一般推理的基于难度的数据解耦策略。 DeReason 通过基于 LLM 的评分估计的推理强度将训练数据划分为推理密集型和非推理密集型子集。它将覆盖面广、非推理密集的问题分配给 SFT 来建立基础领域知识，并为 RL 保留困难问题的重点子集来培养复杂推理。我们证明，这种原则上的解耦比随机分割数据以进行顺序 SFT 和 RL 能产生更好的性能。对一般 STEM 和数学基准的大量实验表明，我们的解耦课程训练明显优于仅 SFT、仅 RL 和随机分割基线。我们的工作对 SFT 和 RL 之间的相互作用进行了系统研究，以进行一般推理，并提供了一种高效且通用的训练后方法。

Title: MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries

Authors: Riccardo Campi, Nicolò Oreste Pinciroli Vago, Mathyas Giudici, Marco Brambilla, Piero Fraternali
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.11223
Pdf URL: https://arxiv.org/pdf/2603.11223
Copy Paste: [[2603.11223]] MDER-DR: Multi-Hop Question Answering with Entity-Centric Summaries(https://arxiv.org/abs/2603.11223)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) over Knowledge Graphs (KGs) suffers from the fact that indexing approaches may lose important contextual nuance when text is reduced to triples, thereby degrading performance in downstream Question-Answering (QA) tasks, particularly for multi-hop QA, which requires composing answers from multiple entities, facts, or relations. We propose a domain-agnostic, KG-based QA framework that covers both the indexing and retrieval/inference phases. A new indexing approach called Map-Disambiguate-Enrich-Reduce (MDER) generates context-derived triple descriptions and subsequently integrates them with entity-level summaries, thus avoiding the need for explicit traversal of edges in the graph during the QA retrieval phase. Complementing this, we introduce Decompose-Resolve (DR), a retrieval mechanism that decomposes user queries into resolvable triples and grounds them in the KG via iterative reasoning. Together, MDER and DR form an LLM-driven QA pipeline that is robust to sparse, incomplete, and complex relational data. Experiments show that on standard and domain-specific benchmarks, MDER-DR achieves substantial improvements over standard RAG baselines (up to 66%), while maintaining cross-lingual robustness. Our code is available at this https URL.
摘要：知识图谱 (KG) 上的检索增强生成 (RAG) 面临这样的问题：当文本缩减为三元组时，索引方法可能会丢失重要的上下文细微差别，从而降低下游问答 (QA) 任务的性能，特别是对于多跳 QA，这需要从多个实体、事实或关系组成答案。我们提出了一个与领域无关、基于知识图谱的 QA 框架，涵盖索引和检索/推理阶段。一种名为 Map-Disambiguate-Enrich-Reduce (MDER) 的新索引方法生成上下文派生的三元组描述，然后将它们与实体级摘要集成，从而避免在 QA 检索阶段显式遍历图中的边。作为补充，我们引入了分解解析（DR），这是一种检索机制，可将用户查询分解为可解析的三元组，并通过迭代推理将它们置于知识图谱中。 MDER 和 DR 共同构成了 LLM 驱动的 QA 管道，该管道对于稀疏、不完整和复杂的关系数据具有鲁棒性。实验表明，在标准和特定领域的基准测试上，MDER-DR 比标准 RAG 基准实现了实质性改进（高达 66%），同时保持了跨语言稳健性。我们的代码可以在这个 https URL 上找到。

Title: Markovian Generation Chains in Large Language Models

Authors: Mingmeng Geng, Amr Mohamed, Guokan Shang, Michalis Vazirgiannis, Thierry Poibeau
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.11228
Pdf URL: https://arxiv.org/pdf/2603.11228
Copy Paste: [[2603.11228]] Markovian Generation Chains in Large Language Models(https://arxiv.org/abs/2603.11228)
Keywords: language model, llm, prompt, agent
Abstract: The widespread use of large language models (LLMs) raises an important question: how do texts evolve when they are repeatedly processed by LLMs? In this paper, we define this iterative inference process as Markovian generation chains, where each step takes a specific prompt template and the previous output as input, without including any prior memory. In iterative rephrasing and round-trip translation experiments, the output either converges to a small recurrent set or continues to produce novel sentences over a finite horizon. Through sentence-level Markov chain modeling and analysis of simulated data, we show that the iterative process can either increase or reduce sentence diversity depending on factors such as the temperature parameter and the initial input sentence. These results offer valuable insights into the dynamics of iterative LLM inference and their implications for multi-agent LLM systems.
摘要：大型语言模型 (LLM) 的广泛使用提出了一个重要问题：当文本被 LLM 重复处理时，文本会如何演变？在本文中，我们将这种迭代推理过程定义为马尔可夫生成链，其中每个步骤都采用特定的提示模板和先前的输出作为输入，而不包含任何先前的记忆。在迭代改写和往返翻译实验中，输出要么收敛到一个小的循环集，要么在有限的范围内继续产生新的句子。通过句子级马尔可夫链建模和模拟数据分析，我们表明迭代过程可以根据温度参数和初始输入句子等因素增加或减少句子多样性。这些结果为迭代 LLM 推理的动态及其对多代理 LLM 系统的影响提供了有价值的见解。
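A toy simulation can make the "converges to a small recurrent set" behavior concrete. The sketch below treats each sentence as a state and samples the next sentence using only the current one (the Markov property described in the abstract). The three-state transition table is invented for illustration and does not come from the paper; in a real chain, each transition would be an LLM call with a fixed prompt template.

```python
import random


def run_chain(transitions, start, steps, seed=0):
    """Simulate a sentence-level Markov chain.

    transitions maps a state to {next_state: probability}. Each step
    depends only on the current state, with no memory of earlier ones.
    """
    rng = random.Random(seed)
    state, path = start, [start]
    for _ in range(steps):
        next_states, weights = zip(*transitions[state].items())
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path


# Hypothetical toy chain: "A" is transient, {"B", "C"} is a recurrent set.
toy = {
    "A": {"B": 0.5, "C": 0.5},
    "B": {"C": 1.0},
    "C": {"B": 1.0},
}

path = run_chain(toy, start="A", steps=20)
distinct_late = set(path[-10:])  # states still visited late in the chain

if __name__ == "__main__":
    print(" -> ".join(path))
    print("late-stage states:", sorted(distinct_late))
```

Counting distinct states over a sliding window, as `distinct_late` does, is one simple way to tell a chain that has collapsed into a recurrent set from one that keeps producing novel sentences.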

Title: Artificial Intelligence for Sentiment Analysis of Persian Poetry

Authors: Arash Zargar, Abolfazl Moshiri, Mitra Shafaei, Shabnam Rahimi-Golkhandan, Mohamad Tavakoli-Targhi, Farzad Khalvati
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.11254
Pdf URL: https://arxiv.org/pdf/2603.11254
Copy Paste: [[2603.11254]] Artificial Intelligence for Sentiment Analysis of Persian Poetry(https://arxiv.org/abs/2603.11254)
Keywords: language model, gpt, llm
Abstract: Recent advancements in Artificial Intelligence (AI) have led to the development of large language models (LLMs) that are capable of understanding, analysing, and creating textual data. These language models open a significant opportunity for analyzing literature and, more specifically, poetry. In the present work, we employ multiple Bidirectional Encoder Representations from Transformers (BERT) and Generative Pre-trained Transformer (GPT) based language models to analyze the works of two prominent Persian poets: Jalal al-Din Muhammad Rumi (Rumi) and Parvin E'tesami. The main objective of this research is to investigate the capability of modern language models to grasp the complexities of Persian poetry and to explore potential correlations between the poems' sentiment and their meters. Our findings indicate that the GPT4o language model can reliably be used in the analysis of Persian poetry. Furthermore, the results of our sentiment analysis revealed that, in general, Rumi's poems express happier sentiments than Parvin E'tesami's poems. In addition, comparing the utilization of poetic meters highlighted the superiority of Rumi's poems in using meters to express a wider variety of sentiments. These findings are significant as they confirm that LLMs can be effectively applied in conducting computer-based semantic studies, where human interpretations are not required, thereby significantly reducing potential biases in the analysis.
摘要：人工智能 (AI) 的最新进展促进了大型语言模型 (LLM) 的发展，这些模型能够理解、分析和创建文本数据。这些语言模型为分析文学，尤其是诗歌提供了重要的机会。在目前的工作中，我们采用来自 Transformer (BERT) 和基于生成式预训练 Transformer (GPT) 的语言模型的多个双向编码器表示来分析两位著名波斯诗人的作品：Jalal al-Din Muhammad Rumi (Rumi) 和 Parvin E'tesami。本研究的主要目的是调查现代语言模型掌握波斯诗歌复杂性的能力，并探索诗歌情感与其韵律之间的潜在关联。我们在这项研究中的发现表明 GPT4o 语言模型可以可靠地用于波斯诗歌的分析。此外，我们的情感分析结果表明，总的来说，鲁米的诗歌比帕尔文·埃泰萨米的诗歌表达了更快乐的情感。此外，比较诗歌韵律的运用，凸显了鲁米诗歌在使用韵律表达更广泛的情感方面的优越性。这些发现意义重大，因为它们证实法学硕士可以有效地应用于进行基于计算机的语义研究，不需要人工解释，从而显着减少分析中的潜在偏差。

Title: ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions

Authors: Monica Munnangi, Saiph Savage
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11281
Pdf URL: https://arxiv.org/pdf/2603.11281
Copy Paste: [[2603.11281]] ThReadMed-QA: A Multi-Turn Medical Dialogue Benchmark from Real Patient Questions(https://arxiv.org/abs/2603.11281)
Keywords: gpt, llm, prompt
Abstract: Medical question-answering benchmarks predominantly evaluate single-turn exchanges, failing to capture the iterative, clarification-seeking nature of real patient consultations. We introduce ThReadMed-QA, a benchmark of 2,437 fully-answered patient-physician conversation threads extracted from r/AskDocs, comprising 8,204 question-answer pairs across up to 9 turns. Unlike prior work relying on simulated dialogues, adversarial prompts, or exam-style questions, ThReadMed-QA captures authentic patient follow-up questions and verified physician responses, reflecting how patients naturally seek medical information online. We evaluate five state-of-the-art LLMs -- GPT-5, GPT-4o, Claude Haiku, Gemini 2.5 Flash, and Llama 3.3 70B -- on a stratified test split of 238 conversations (948 QA pairs) using a calibrated LLM-as-a-judge rubric grounded in physician ground truth. Even the strongest model, GPT-5, achieves only 41.2% fully-correct responses. All five models degrade significantly from turn 0 to turn 2 (p < 0.001), with wrong-answer rates roughly tripling by the third turn. We identify a fundamental tension between single-turn capability and multi-turn reliability: models with the strongest initial performance (GPT-5: 75.2; Claude Haiku: 72.3 out of 100) exhibit the steepest declines by turn 2 (dropping 16.2 and 25.0 points respectively), while weaker models plateau or marginally improve. We introduce two metrics to quantify multi-turn failure modes: Conversational Consistency Score (CCS) and Error Propagation Rate (EPR). CCS reveals that nearly one in three Claude Haiku conversations swings between a fully correct and a completely wrong response within the same thread. EPR shows that a single wrong turn raises the probability of a subsequent wrong turn by 1.9-6.1x across all models.
摘要：医疗问答基准主要评估单轮交流，未能捕捉真实患者咨询的迭代、寻求澄清的本质。我们引入了 ThReadMed-QA，这是从 r/AskDocs 中提取的 2,437 个完全回答的患者与医生对话线程的基准，其中包含最多 9 个轮次的 8,204 个问答对。与之前依赖模拟对话、对抗性提示或考试式问题的工作不同，ThReadMed-QA 捕获真实的患者后续问题和经过验证的医生反应，反映患者如何自然地在线寻求医疗信息。我们使用基于医生事实真相的校准LLM作为法官评分标准，在238个对话（948个QA对）的分层测试中评估了五种最先进的LLM——GPT-5、GPT-4o、Claude Haiku、Gemini 2.5 Flash和Llama 3.3 70B。即使是最强的模型 GPT-5，也只能达到 41.2% 的完全正确响应。从第 0 回合到第 2 回合，所有五个模型的性能均显着下降 (p < 0.001)，到第三回合时，错误答案率大约增加了两倍。我们发现单圈能力和多圈可靠性之间存在根本性的矛盾：初始性能最强的模型（GPT-5：75.2；Claude Haiku：72.3（满分 100））在第 2 圈表现出最大幅度的下降（分别下降 16.2 和 25.0 点），而较弱的模型则趋于稳定或略有改善。我们引入了两个指标来量化多轮故障模式：对话一致性得分（CCS）和错误传播率（EPR）。 CCS 表明，在同一条线索中，近三分之一的克劳德俳句对话在完全正确和完全错误的反应之间摇摆不定。 EPR 表明，在所有模型中，一次错误转弯会使后续错误转弯的概率提高 1.9-6.1 倍。

Title: Temporal Text Classification with Large Language Models

Authors: Nishat Raihan, Marcos Zampieri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11295
Pdf URL: https://arxiv.org/pdf/2603.11295
Copy Paste: [[2603.11295]] Temporal Text Classification with Large Language Models(https://arxiv.org/abs/2603.11295)
Keywords: language model, gpt, llm, prompt
Abstract: Languages change over time. Computational models can be trained to recognize such changes, enabling them to estimate the publication date of texts. Despite recent advancements in Large Language Models (LLMs), their performance on automatic dating of texts, also known as Temporal Text Classification (TTC), has not been explored. This study provides the first systematic evaluation of leading proprietary (Claude 3.5, GPT-4o, Gemini 1.5) and open-source (LLaMA 3.2, Gemma 2, Mistral, Nemotron 4) LLMs on TTC using three historical corpora, two in English and one in Portuguese. We test zero-shot prompting, few-shot prompting, and fine-tuning settings. Our results indicate that proprietary models perform well, especially with few-shot prompting. They also indicate that fine-tuning substantially improves open-source models, but that these models still fail to match the performance delivered by proprietary LLMs.
摘要：语言随着时间的推移而变化。可以训练计算模型来识别此类变化，从而使它们能够估计文本的出版日期。尽管大型语言模型（LLM）最近取得了进展，但它们在文本自动约会（也称为时间文本分类（TTC））方面的性能尚未得到探索。本研究使用三个历史语料库（两个英语和一个葡萄牙语）对 TTC 上领先的专有（Claude 3.5、GPT-4o、Gemini 1.5）和开源（LLaMA 3.2、Gemma 2、Mistral、Nemotron 4）法学硕士进行首次系统评估。我们测试零样本和少样本提示以及微调设置。我们的结果表明，专有模型表现良好，尤其是在少量提示的情况下。他们还指出，微调极大地改进了开源模型，但它们仍然无法与专有法学硕士提供的性能相匹配。

Title: Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning

Authors: Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.11394
Pdf URL: https://arxiv.org/pdf/2603.11394
Copy Paste: [[2603.11394]] Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning(https://arxiv.org/abs/2603.11394)
Keywords: language model, llm, chat
Abstract: Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a "stick-or-switch" evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently degrade performance when compared to single-shot baselines. Notably, models frequently abandon initial correct diagnoses and safe abstentions to align with incorrect user suggestions. Additionally, several models exhibit blind switching, failing to distinguish between signal and incorrect suggestions.
摘要：患者和临床医生越来越多地使用由大型语言模型 (LLM) 支持的聊天机器人进行医疗保健查询。虽然最先进的法学硕士在静态诊断推理基准上表现出高性能，但它们在多轮对话中的功效（更好地反映了现实世界的使用情况）尚未得到充分研究。在本文中，我们评估了三个临床数据集中的 17 名法学硕士，以研究将决策空间划分为多个更简单的对话轮次如何影响他们的诊断推理。具体来说，我们开发了一个“坚持或转换”评估框架来衡量对话中的模型信念（即捍卫正确的诊断或安全放弃不正确的建议）和灵活性（即在引入正确的建议时识别它）。我们的实验揭示了对话税，与单次基线相比，多轮交互会持续降低性能。值得注意的是，模型经常放弃最初的正确诊断和安全弃权，以与错误的用户建议保持一致。此外，一些模型表现出盲目切换，无法区分信号和错误建议。

Title: Algorithmic Consequences of Particle Filters for Sentence Processing: Amplified Garden-Paths and Digging-In Effects

Authors: Amani Maina-Kilaas, Roger Levy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11412
Pdf URL: https://arxiv.org/pdf/2603.11412
Copy Paste: [[2603.11412]] Algorithmic Consequences of Particle Filters for Sentence Processing: Amplified Garden-Paths and Digging-In Effects(https://arxiv.org/abs/2603.11412)
Keywords: language model, llm
Abstract: Under surprisal theory, linguistic representations affect processing difficulty only through the bottleneck of surprisal. Our best estimates of surprisal come from large language models, which have no explicit representation of structural ambiguity. While LLM surprisal robustly predicts reading times across languages, it systematically underpredicts difficulty when structural expectations are violated -- suggesting that representations of ambiguity are causally implicated in sentence processing. Particle filter models offer an alternative where structural hypotheses are explicitly represented as a finite set of particles. We prove several algorithmic consequences of particle filter models, including the amplification of garden-path effects. Most critically, we demonstrate that resampling, a common practice with these models, inherently produces real-time digging-in effects -- where disambiguation difficulty increases with ambiguous region length. Digging-in magnitude scales inversely with particle count: fully parallel models predict no such effect.
摘要：在惊喜理论下，语言表征仅通过惊喜瓶颈影响处理难度。我们对惊讶的最佳估计来自大型语言模型，这些模型没有结构歧义的明确表示。虽然法学硕士意外地稳健地预测了跨语言的阅读时间，但当结构期望被违反时，它系统地低估了难度——这表明歧义性的表示与句子处理有因果关系。粒子滤波器模型提供了一种替代方案，其中结构假设被明确表示为一组有限的粒子。我们证明了粒子滤波器模型的几种算法后果，包括花园路径效应的放大。最关键的是，我们证明了重采样（这些模型的常见做法）本质上会产生实时挖掘效应——消歧难度随着模糊区域长度的增加而增加。挖掘的幅度与粒子数成反比：完全并行的模型预测不会有这样的影响。

Title: MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models

Authors: Michiko Yoshitake, Yuta Suzuki, Ryo Igarashi, Yoshitaka Ushiku, Keisuke Nagato
Subjects: cs.CL, cond-mat.mtrl-sci
Abstract URL: https://arxiv.org/abs/2603.11414
Pdf URL: https://arxiv.org/pdf/2603.11414
Copy Paste: [[2603.11414]] MaterialFigBENCH: benchmark dataset with figures for evaluating college-level materials science problem-solving abilities of multimodal large language models(https://arxiv.org/abs/2603.11414)
Keywords: language model, gpt, llm, chat
Abstract: We present MaterialFigBench, a benchmark dataset designed to evaluate the ability of multimodal large language models (LLMs) to solve university-level materials science problems that require accurate interpretation of figures. Unlike existing benchmarks that primarily rely on textual representations, MaterialFigBench focuses on problems in which figures such as phase diagrams, stress-strain curves, Arrhenius plots, diffraction patterns, and microstructural schematics are indispensable for deriving correct answers. The dataset consists of 137 free-response problems adapted from standard materials science textbooks, covering a broad range of topics including crystal structures, mechanical properties, diffusion, phase diagrams, phase transformations, and electronic properties of materials. To address unavoidable ambiguity in reading numerical values from images, expert-defined answer ranges are provided where appropriate. We evaluate several state-of-the-art multimodal LLMs, including ChatGPT and GPT models accessed via OpenAI APIs, and analyze their performance across problem categories and model versions. The results reveal that, although overall accuracy improves with model updates, current LLMs still struggle with genuine visual understanding and quantitative interpretation of materials science figures. In many cases, correct answers are obtained by relying on memorized domain knowledge rather than by reading the provided images. MaterialFigBench highlights persistent weaknesses in visual reasoning, numerical precision, and significant-digit handling, while also identifying problem types where performance has improved. This benchmark provides a systematic and domain-specific foundation for advancing multimodal reasoning capabilities in materials science and for guiding the development of future LLMs with stronger figure-based understanding.
摘要：我们推出了 MaterialFigBench，这是一个基准数据集，旨在评估多模态大语言模型 (LLM) 解决需要准确解释图形的大学水平材料科学问题的能力。与主要依赖于文本表示的现有基准不同，MaterialFigBench 重点关注相图、应力应变曲线、阿累尼乌斯图、衍射图案和微观结构示意图等图形对于得出正确答案必不可少的问题。该数据集包含 137 个改编自标准材料科学教科书的自由回答问题，涵盖广泛的主题，包括材料的晶体结构、机械性能、扩散、相图、相变和电子性能。为了解决从图像读取数值时不可避免的歧义，在适当的情况下提供了专家定义的答案范围。我们评估了几个最先进的多模式法学硕士，包括 ChatGPT 和通过 OpenAI API 访问的 GPT 模型，并分析它们在问题类别和模型版本上的表现。结果表明，尽管整体准确性随着模型更新而提高，但目前的法学硕士在真正的视觉理解和材料科学图形的定量解释方面仍然存在困难。在许多情况下，正确答案是通过依赖记忆的领域知识而不是通过阅读提供的图像来获得的。 MaterialFigBench 强调了视觉推理、数值精度和有效数字处理方面持续存在的弱点，同时还识别了性能有所改善的问题类型。该基准为推进材料科学中的多模态推理能力提供了系统性和特定领域的基础，并通过更强的基于图形的理解来指导未来法学硕士的发展。

Title: BLooP: Zero-Shot Abstractive Summarization using Large Language Models with Bigram Lookahead Promotion

Authors: Varun Iyer, Cornelia Caragea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11415
Pdf URL: https://arxiv.org/pdf/2603.11415
Copy Paste: [[2603.11415]] BLooP: Zero-Shot Abstractive Summarization using Large Language Models with Bigram Lookahead Promotion(https://arxiv.org/abs/2603.11415)
Keywords: language model, llm
Abstract: Abstractive summarization requires models to generate summaries that convey information in the source document. While large language models can generate summaries without fine-tuning, they often miss key details and include extraneous information. We propose BLooP (Bigram Lookahead Promotion), a simple training-free decoding intervention that encourages large language models (LLMs) to generate tokens that form bigrams from the source document. BLooP operates through a hash table lookup at each decoding step, requiring no training, fine-tuning, or model modification. We demonstrate improvements in ROUGE and BARTScore for Llama-3.1-8B-Instruct, Mistral-Nemo-Instruct-2407, and Gemma-2-9b-it on CNN/DM, CCSum, Multi-News, and SciTLDR. Human evaluation shows that BLooP significantly improves faithfulness without reducing readability. We make the code available at this https URL.
摘要：抽象摘要需要模型生成在源文档中传达信息的摘要。虽然大型语言模型无需微调即可生成摘要，但它们经常会错过关键细节并包含无关信息。我们提出了 BLoP（Bigram Lookahead Promotion），这是一种简单的免训练解码干预，鼓励大型语言模型（LLM）生成从源文档形成双字母组的标记。 BLoOP 在每个解码步骤中通过哈希表查找进行操作，无需训练、微调或模型修改。我们在 CNN/DM、CCSum、Multi-News 和 SciTLDR 上展示了 Llama-3.1-8B-Instruct、Mistral-Nemo-Instruct-2407 和 Gemma-2-9b-it 的 ROUGE 和 BARTScore 的改进。人类评估表明，BLooP 在不降低可读性的情况下显着提高了忠实度。我们在此 https URL 提供代码
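The decoding intervention described above can be illustrated with a minimal sketch. It assumes a whitespace tokenizer, a plain dictionary of next-token scores standing in for model logits, and a fixed additive `bonus`. The abstract only states that BLooP uses a hash table lookup to encourage tokens that form source bigrams, so the additive form, the bonus value, and all names here are assumptions rather than the authors' implementation.

```python
from collections import defaultdict


def build_bigram_table(source_tokens):
    """Hash table mapping each source token to the set of tokens that follow it."""
    table = defaultdict(set)
    for prev_tok, next_tok in zip(source_tokens, source_tokens[1:]):
        table[prev_tok].add(next_tok)
    return table


def promote(logits, prev_token, table, bonus=2.0):
    """Add a bonus to candidates that would complete a bigram seen in the source.

    logits: {token: score} for the next position (stand-in for model logits).
    prev_token: the most recently generated token.
    bonus: hypothetical additive boost; the real weighting is not given here.
    """
    followers = table.get(prev_token, set())
    return {
        tok: (score + bonus if tok in followers else score)
        for tok, score in logits.items()
    }


source = "the cat sat on the mat".split()
table = build_bigram_table(source)

step_logits = {"cat": 1.0, "dog": 1.5, "mat": 0.2}
boosted = promote(step_logits, prev_token="the", table=table)
choice = max(boosted, key=boosted.get)

if __name__ == "__main__":
    print(boosted, "->", choice)
```

Without the boost, greedy decoding would pick "dog", which never follows "the" in the source. With the boost, the source-supported continuation "cat" wins. The lookup is a single dictionary access per step, which matches the low overhead the abstract describes.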

Title: LLM-Assisted Causal Structure Disambiguation and Factor Extraction for Legal Judgment Prediction

Authors: Yuzhi Liang, Lixiang Ma, Xinrong Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11446
Pdf URL: https://arxiv.org/pdf/2603.11446
Copy Paste: [[2603.11446]] LLM-Assisted Causal Structure Disambiguation and Factor Extraction for Legal Judgment Prediction(https://arxiv.org/abs/2603.11446)
Keywords: language model, llm
Abstract: Mainstream methods for Legal Judgment Prediction (LJP) based on Pre-trained Language Models (PLMs) heavily rely on the statistical correlation between case facts and judgment results. This paradigm lacks explicit modeling of legal constituent elements and underlying causal logic, making models prone to learning spurious correlations and suffering from poor robustness. While introducing causal inference can mitigate this issue, existing causal LJP methods face two critical bottlenecks in real-world legal texts: inaccurate legal factor extraction with severe noise, and significant uncertainty in causal structure discovery due to Markov equivalence under sparse features. To address these challenges, we propose an enhanced causal inference framework that integrates Large Language Model (LLM) priors with statistical causal discovery. First, we design a coarse-to-fine hybrid extraction mechanism combining statistical sampling and LLM semantic reasoning to accurately identify and purify standard legal constituent elements. Second, to resolve structural uncertainty, we introduce an LLM-assisted causal structure disambiguation mechanism. By utilizing the LLM as a constrained prior knowledge base, we conduct probabilistic evaluation and pruning on ambiguous causal directions to generate legally compliant candidate causal graphs. Finally, a causal-aware judgment prediction model is constructed by explicitly constraining text attention intensity via the generated causal graphs. Extensive experiments on multiple benchmark datasets, including LEVEN, QA, and CAIL, demonstrate that our proposed method significantly outperforms state-of-the-art baselines in both predictive accuracy and robustness, particularly in distinguishing confusing charges.
摘要：基于预训练语言模型（PLM）的主流法律判决预测（LJP）方法严重依赖案件事实与判决结果之间的统计相关性。这种范式缺乏对法律构成要素和底层因果逻辑的明确建模，使得模型容易学习虚假相关性并且鲁棒性较差。虽然引入因果推理可以缓解这个问题，但现有的因果 LJP 方法在现实世界的法律文本中面临着两个关键瓶颈：法律因素提取不准确，噪声严重，以及稀疏特征下马尔可夫等价导致因果结构发现的显着不确定性。为了应对这些挑战，我们提出了一个增强的因果推理框架，它将大型语言模型（LLM）先验与统计因果发现相结合。首先，我们设计了一种结合统计采样和LLM语义推理的从粗到细的混合提取机制，以准确识别和纯化标准法律构成要素。其次，为了解决结构不确定性，我们引入了法学硕士辅助的因果结构消歧机制。通过利用法学硕士作为受约束的先验知识库，我们对不明确的因果方向进行概率评估和修剪，以生成合法合规的候选因果图。最后，通过生成的因果图显式约束文本注意力强度，构建因果感知判断预测模型。对多个基准数据集（包括 LEVEN、QA 和 CAIL）的广泛实验表明，我们提出的方法在预测准确性和鲁棒性方面显着优于最先进的基线，特别是在区分令人困惑的费用方面。

Title: Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs

Authors: Kunfeng Chen, Qihuang Zhong, Juhua Liu, Bo Du, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11495
Pdf URL: https://arxiv.org/pdf/2603.11495
Copy Paste: [[2603.11495]] Try, Check and Retry: A Divide-and-Conquer Framework for Boosting Long-context Tool-Calling Performance of LLMs(https://arxiv.org/abs/2603.11495)
Keywords: language model, llm
Abstract: Tool-calling empowers Large Language Models (LLMs) to interact with external environments. However, current methods often struggle to handle massive and noisy candidate tools in long-context tool-calling tasks, limiting their real-world application. To this end, we propose Tool-DC, a Divide-and-Conquer framework for boosting tool-calling performance of LLMs. The core of Tool-DC is to reduce the reasoning difficulty and make full use of self-reflection ability of LLMs via a "Try-Check-Retry" paradigm. Specifically, Tool-DC involves two variants: 1) the training-free Tool-DC (TF), which is plug-and-play and flexible; 2) the training-based Tool-DC (TB), which is more inference-efficient. Extensive experiments show that both Tool-DC methods outperform their counterparts by a clear margin. Tool-DC (TF) brings up to +25.10% average gains against the baseline on BFCL and ACEBench benchmarks, while Tool-DC (TB) enables Qwen2.5-7B to achieve comparable or even better performance than proprietary LLMs, e.g., OpenAI o3 and Claude-Haiku-4.5.
摘要：工具调用使大型语言模型 (LLM) 能够与外部环境交互。然而，当前的方法通常难以在长上下文工具调用任务中处理大量且嘈杂的候选工具，从而限制了它们的实际应用。为此，我们提出了 Tool-DC，一种分而治之的框架，用于提高法学硕士的工具调用性能。 Tool-DC的核心是通过“Try-Check-Retry”范式降低推理难度，充分利用法学硕士的自我反思能力。具体来说，Tool-DC涉及两个变体：1）免训练的Tool-DC（TF），即插即用，灵活； 2）基于训练的Tool-DC（TB），推理效率更高。大量实验表明，两种 Tool-DC 方法的性能均明显优于同类方法。 Tool-DC (TF) 与 BFCL 和 ACEBench 基准测试相比平均增益高达 +25.10%，而 Tool-DC (TB) 使 Qwen2.5-7B 能够实现与专有 LLM（例如 OpenAI o3 和 Claude-Haiku-4.5）相当甚至更好的性能。

Title: Tiny Aya: Bridging Scale and Multilingual Depth

Authors: Alejandro R. Salamanca, Diana Abagyan, Daniel D'souza, Ammar Khairi, David Mora, Saurabh Dash, Viraat Aryabumi, Sara Rajaee, Mehrnaz Mofakhami, Ananya Sahu, Thomas Euyang, Brittawnya Prince, Madeline Smith, Hangyu Lin, Acyr Locatelli, Sara Hooker, Tom Kocmi, Aidan Gomez, Ivan Zhang, Phil Blunsom, Nick Frosst, Joelle Pineau, Beyza Ermis, Ahmet Üstün, Julia Kreutzer, Marzieh Fadaee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11510
Pdf URL: https://arxiv.org/pdf/2603.11510
Copy Paste: [[2603.11510]] Tiny Aya: Bridging Scale and Multilingual Depth(https://arxiv.org/abs/2603.11510)
Keywords: language model
Abstract: Tiny Aya redefines what a small multilingual language model can achieve. Trained on 70 languages and refined through region-aware post-training, it delivers state-of-the-art translation quality, strong multilingual understanding, and high-quality target-language generation, all with just 3.35B parameters. The release includes a pretrained foundation model, a globally balanced instruction-tuned variant, and three region-specialized models targeting languages from Africa, South Asia, Europe, Asia-Pacific, and West Asia. This report details the training strategy, data composition, and comprehensive evaluation framework behind Tiny Aya, and presents an alternative scaling path for multilingual AI: one centered on efficiency, balanced performance across languages, and practical deployment.

Title: Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale

Authors: Sanchit Pandey (BITS Pilani, Hyderabad, India)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11513
Pdf URL: https://arxiv.org/pdf/2603.11513
Copy Paste: [[2603.11513]] Can Small Language Models Use What They Retrieve? An Empirical Study of Retrieval Utilization Across Model Scale(https://arxiv.org/abs/2603.11513)
Keywords: language model, llm, prompt, retrieval augmented generation
Abstract: Retrieval-augmented generation (RAG) is widely deployed to improve factual accuracy in language models, yet it remains unclear whether smaller models (7B parameters or fewer) can effectively utilize retrieved information. To investigate this question, we evaluate five model sizes from 360M to 8B across three architecture families (SmolLM2, Qwen2.5, and Llama 3.1) under four retrieval conditions: no retrieval, BM25, dense retrieval using E5 large v2, and oracle retrieval, where the retrieved passage is guaranteed to contain the answer. We introduce a parametric knowledge split that separates questions a model can already answer from those that require external knowledge, which allows us to isolate utilization failure from retrieval quality failure. We find three main results. First, even with oracle retrieval, models of size 7B or smaller fail to extract the correct answer 85 to 100 percent of the time on questions they cannot answer alone, which indicates a fundamental utilization bottleneck. Second, adding retrieval context destroys 42 to 100 percent of answers the model previously knew, suggesting a distraction effect driven by the presence of context rather than its quality. Third, an error analysis of 2588 oracle failures shows that the dominant failure mode is irrelevant generation, where the model ignores the provided context entirely. These patterns hold across multiple prompt templates and retrieval methods. The results indicate that for models below 7B parameters, the main limitation of RAG is context utilization rather than retrieval quality, and that deploying RAG at this scale can lead to a net negative trade-off under standard evaluation conditions.
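
Sketch: the parametric knowledge split in the abstract above could be computed roughly as follows. This is a minimal illustration under stated assumptions, not the author's code: the `Result` record, its field names, and `split_rates` are hypothetical, and the abstract does not say how correctness is judged.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """Outcome for one question (hypothetical record layout)."""
    closed_book_correct: bool   # answered correctly with no retrieval
    with_context_correct: bool  # answered correctly with a retrieved passage


def split_rates(results):
    """Return (utilization_failure_rate, destruction_rate).

    utilization_failure_rate: share of initially unknown questions that are
        still answered incorrectly once context is provided.
    destruction_rate: share of initially known questions that are answered
        incorrectly once context is provided.
    """
    known = [r for r in results if r.closed_book_correct]
    unknown = [r for r in results if not r.closed_book_correct]

    def failure_share(group):
        if not group:
            return None  # undefined when the group is empty
        return sum(not r.with_context_correct for r in group) / len(group)

    return failure_share(unknown), failure_share(known)


# Toy data, invented purely for illustration.
results = [
    Result(True, True),
    Result(True, False),
    Result(False, False),
    Result(False, False),
    Result(False, True),
    Result(False, False),
]
utilization_failure, destruction = split_rates(results)
print(utilization_failure, destruction)
```

Splitting on closed-book correctness first is what lets the two rates be read separately: the first measures failure to use context, the second measures damage done by context.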

Title: One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries

Authors: Mayank Saini, Arit Kumar Bishwas
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.11545
Pdf URL: https://arxiv.org/pdf/2603.11545
Copy Paste: [[2603.11545]] One Supervisor, Many Modalities: Adaptive Tool Orchestration for Autonomous Queries(https://arxiv.org/abs/2603.11545)
Keywords: llm, agent
Abstract: We present an agentic AI framework for autonomous multimodal query processing that coordinates specialized tools across text, image, audio, video, and document modalities. A central Supervisor dynamically decomposes user queries, delegates subtasks to modality-appropriate tools (e.g., object detection, OCR, speech transcription), and synthesizes results through adaptive routing strategies rather than predetermined decision trees. For text-only queries, the framework uses learned routing via RouteLLM, while non-text paths use SLM-assisted modality decomposition. Evaluated on 2,847 queries across 15 task categories, our framework achieves a 72% reduction in time-to-accurate-answer, an 85% reduction in conversational rework, and a 67% cost reduction compared to the matched hierarchical baseline while maintaining accuracy parity. These results demonstrate that intelligent centralized orchestration fundamentally improves multimodal AI deployment economics.

Title: Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries

Authors: Zhenxu Tian, Yi Su, Juntao Li, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11564
Pdf URL: https://arxiv.org/pdf/2603.11564
Copy Paste: [[2603.11564]] Where Matters More Than What: Decoding-aligned KV Cache Compression via Position-aware Pseudo Queries(https://arxiv.org/abs/2603.11564)
Keywords: language model, llm, long context, prompt
Abstract: The Key-Value (KV) cache is crucial for efficient Large Language Model (LLM) inference, but excessively long contexts drastically increase the KV cache memory footprint. Existing KV cache compression methods typically rely on input-side attention patterns within a prompt observation window to estimate token importance during the prefill stage. They fail to preserve critical tokens for future generation since these assessments are not derived from the decoding process. Intuitively, an effective observation window should mirror the decoding-stage queries to accurately reflect which tokens the generation process will attend to. However, ground-truth decoding queries are inherently unavailable during inference. For constructing pseudo queries to approximate them, we find that positional information plays a more critical role than semantic content. Motivated by this insight, we propose decoding-aligned KV cache compression via position-aware pseudo queries (DapQ), a novel and lightweight eviction framework that leverages position-aware pseudo queries to simulate the output tokens, thereby establishing an effective observation window for importance assessment. It aligns closely with the actual generation context and enables precise token eviction. Extensive evaluations across multiple benchmarks and LLMs demonstrate that DapQ achieves superior performance, particularly under strict memory constraints (e.g., nearly lossless performance of up to 99.5% on NIAH with a 3% KV cache budget).

Title: UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization

Authors: Ofir Marom
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.11583
Pdf URL: https://arxiv.org/pdf/2603.11583
Copy Paste: [[2603.11583]] UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Optimization(https://arxiv.org/abs/2603.11583)
Keywords: language model, gpt, llm, prompt
Abstract: The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.
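
Sketch: the decision rule behind "find the answer that maximises expected utility" can be written out directly. In the paper the LLM is instructed to reason toward this target inside the prompt; the code below only illustrates the underlying arithmetic. The outcome names, utilities, and probabilities are hypothetical values chosen for the example, not figures from the paper.

```python
def expected_utility(outcome_probs, utility):
    """Sum of P(outcome | answer) * U(outcome) over all outcomes."""
    return sum(p * utility[outcome] for outcome, p in outcome_probs.items())


def best_answer(candidates, utility):
    """Pick the candidate answer with the highest expected utility."""
    return max(candidates, key=lambda a: expected_utility(candidates[a], utility))


# Hypothetical utilities for the outcome of a movie recommendation.
utility = {"liked": 1.0, "neutral": 0.2, "disliked": -0.5}

# Hypothetical conditional distributions P(outcome | recommended movie).
candidates = {
    "movie_a": {"liked": 0.6, "neutral": 0.3, "disliked": 0.1},
    "movie_b": {"liked": 0.7, "neutral": 0.0, "disliked": 0.3},
}

choice = best_answer(candidates, utility)
print(choice)
```

Here `movie_b` has the higher chance of being liked, but its larger risk of being disliked gives it a lower expected utility, which is the kind of multi-objective trade-off a natural-language prompt can leave ambiguous.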

Title: Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese

Authors: Masataka Kawai, Singo Sakashita, Shumpei Ishikawa, Shogo Watanabe, Anna Matsuoka, Mikio Sakurai, Yasuto Fujimoto, Yoshiyuki Takahara, Atsushi Ohara, Hirohiko Miyake, Genichiro Ishii
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.11597
Pdf URL: https://arxiv.org/pdf/2603.11597
Copy Paste: [[2603.11597]] Performance Evaluation of Open-Source Large Language Models for Assisting Pathology Report Writing in Japanese(https://arxiv.org/abs/2603.11597)
Keywords: language model, llm
Abstract: The performance of large language models (LLMs) for supporting pathology report writing in Japanese remains unexplored. We evaluated seven open-source LLMs from three perspectives: (A) generation and information extraction of pathology diagnosis text following predefined formats, (B) correction of typographical errors in Japanese pathology reports, and (C) subjective evaluation of model-generated explanatory text by pathologists and clinicians. Thinking models and medical-specialized models showed advantages in structured reporting tasks that required reasoning and in typo correction. In contrast, preferences for explanatory outputs varied substantially across raters. Although the utility of LLMs differed by task, our findings suggest that open-source LLMs can be useful for assisting Japanese pathology report writing in limited but clinically relevant scenarios.

Title: QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate

Authors: Jihao Zhao, Daixuan Li, Pengfei Li, Shuaishuai Zu, Biao Qin, Hongyan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11650
Pdf URL: https://arxiv.org/pdf/2603.11650
Copy Paste: [[2603.11650]] QChunker: Learning Question-Aware Text Chunking for Domain RAG via Multi-Agent Debate(https://arxiv.org/abs/2603.11650)
Keywords: language model, retrieval-augmented generation, agent
Abstract: The effectiveness upper bound of retrieval-augmented generation (RAG) is fundamentally constrained by the semantic integrity and information granularity of text chunks in its knowledge base. To address these challenges, this paper proposes QChunker, which restructures the RAG paradigm from retrieval-augmentation to understanding-retrieval-augmentation. Firstly, QChunker models the text chunking as a composite task of text segmentation and knowledge completion to ensure the logical coherence and integrity of text chunks. Drawing inspiration from Hal Gregersen's "Questions Are the Answer" theory, we design a multi-agent debate framework comprising four specialized components: a question outline generator, text segmenter, integrity reviewer, and knowledge completer. This framework operates on the principle that questions serve as catalysts for profound insights. Through this pipeline, we successfully construct a high-quality dataset of 45K entries and transfer this capability to small language models. Additionally, to handle long evaluation chains and low efficiency in existing chunking evaluation methods, which overly rely on downstream QA tasks, we introduce a novel direct evaluation metric, ChunkScore. Both theoretical and experimental validations demonstrate that ChunkScore can directly and efficiently discriminate the quality of text chunks. Furthermore, during the text segmentation phase, we utilize document outlines for multi-path sampling to generate multiple candidate chunks and select the optimal solution employing ChunkScore. Extensive experimental results across four heterogeneous domains exhibit that QChunker effectively resolves aforementioned issues by providing RAG with more logically coherent and information-rich text chunks.

Title: Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge

Authors: Junjie Wu, Xuan Kan, Zihao He, Shunwen Tan, Bo Pan, Kaitai Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11665
Pdf URL: https://arxiv.org/pdf/2603.11665
Copy Paste: [[2603.11665]] Multi-Task Reinforcement Learning for Enhanced Multimodal LLM-as-a-Judge(https://arxiv.org/abs/2603.11665)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have been widely adopted as judges (MLLM-as-a-Judge) due to their strong alignment with human judgment across various visual tasks. However, most existing judge models are optimized for single-task scenarios and struggle to generalize to diverse contexts, which is a critical requirement for reliable evaluation. To address this limitation, we propose Multi-Task Reinforcement Learning for MLLM-as-a-Judge (MT-RL-Judge), a framework that jointly optimizes the judge model across multiple tasks, leveraging the generalization capabilities of RL. Experimental results demonstrate that MT-RL-Judge outperforms several strong baselines in both judgment consistency and correlation with human preferences. Furthermore, our approach exhibits robust generalization on out-of-distribution tasks, further validating its effectiveness.

Title: In the LLM era, Word Sense Induction remains unsolved

Authors: Anna Mosolova, Marie Candito, Carlos Ramisch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11686
Pdf URL: https://arxiv.org/pdf/2603.11686
Copy Paste: [[2603.11686]] In the LLM era, Word Sense Induction remains unsolved(https://arxiv.org/abs/2603.11686)
Keywords: llm
Abstract: In the absence of sense-annotated data, word sense induction (WSI) is a compelling alternative to word sense disambiguation, particularly in low-resource or domain-specific settings. In this paper, we emphasize methodological problems in current WSI evaluation. We propose an evaluation on a SemCor-derived dataset, respecting the original corpus polysemy and frequency distributions. We assess pre-trained embeddings and clustering algorithms across parts of speech, and propose and evaluate an LLM-based WSI method for English. We evaluate data augmentation sources (LLM-generated, corpus and lexicon), and semi-supervised scenarios using Wiktionary for data augmentation, must-link constraints, and number of clusters per lemma. We find that no unsupervised method (whether ours or previous) surpasses the strong "one cluster per lemma" heuristic (1cpl). We also show that (i) results and best systems may vary across POS, (ii) LLMs have trouble performing this task, (iii) data augmentation is beneficial and (iv) capitalizing on Wiktionary does help. It surpasses the previous SOTA system on our test set by 3.3%. WSI is not solved, and calls for a better articulation of lexicons and LLMs' lexical semantics capabilities.

Title: SemBench: A Universal Semantic Framework for LLM Evaluation

Authors: Mikel Zubillaga, Naiara Perez, Oscar Sainz, German Rigau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.11687
Pdf URL: https://arxiv.org/pdf/2603.11687
Copy Paste: [[2603.11687]] SemBench: A Universal Semantic Framework for LLM Evaluation(https://arxiv.org/abs/2603.11687)
Keywords: language model, llm
Abstract: Recent progress in Natural Language Processing (NLP) has been driven by the emergence of Large Language Models (LLMs), which exhibit remarkable generative and reasoning capabilities. However, despite their success, evaluating the true semantic understanding of these models remains a persistent challenge. Traditional benchmarks such as Word-in-Context (WiC) effectively probe this capability, but their creation is resource-intensive and often limited to high-resource languages. In this paper, we introduce SemBench, a framework for automatically generating synthetic benchmarks that assess the semantic competence of LLMs using only dictionary sense definitions and a sentence encoder. This approach eliminates the need for curated example sentences, making it both scalable and language-independent. We evaluate SemBench in three languages (English, Spanish, and Basque) spanning different levels of linguistic resources, and across a wide range of LLMs. Our results show that rankings derived from SemBench strongly correlate with those obtained from standard WiC datasets. Furthermore, our analysis demonstrates that only a small number of examples is required to achieve stable and meaningful rankings. Overall, SemBench provides a lightweight, adaptable, and data-efficient framework for cross-lingual evaluation of semantic understanding in LLMs.

Title: Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information

Authors: Konstantin Krestnikov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.11749
Pdf URL: https://arxiv.org/pdf/2603.11749
Copy Paste: [[2603.11749]] Compression Favors Consistency, Not Truth: When and Why Language Models Prefer Correct Information(https://arxiv.org/abs/2603.11749)
Keywords: language model, gpt
Abstract: Why do language models sometimes prefer correct statements even when trained on mixed-quality data? We introduce the Compression-Consistency Principle: next-token prediction favors hypotheses that allow shorter and more internally consistent descriptions of the training data. Truth bias emerges only when false alternatives are structurally harder to compress. We test this using small GPT-2-style character-level transformers (3.5M to 86M parameters) on synthetic math corpora with controlled mixtures of correct and incorrect rules. In the random-error setting, models strongly prefer correct completions in paired evaluation: 83.1% accuracy at balanced data and 67.0% even when correct rules appear in only 10% of the corpus. Replacing random errors with a coherent but mathematically incorrect rule system largely eliminates the preference (near-chance accuracy). In a more natural-language-like synthetic world, the effect is weaker but still present (57.7%). Additional experiments show that embedding verification steps can restore preference for correctness even at small scale, while increasing the number of consistent rules produces a graded improvement in accuracy. Our results suggest that what appears as a "truth bias" is largely a side effect of compression pressure and preference for internal consistency, rather than an intrinsic drive toward truth. Full code and data are available at this https URL.
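
Sketch: one plausible reading of the "paired evaluation" above is to score a correct and an incorrect completion of the same prompt and count how often the correct one receives the higher log-likelihood. The abstract does not spell out the scoring rule, so the log-likelihood comparison, the function name, and the toy scores below are all assumptions.

```python
def paired_accuracy(pairs):
    """Share of pairs where the correct completion has the higher log-likelihood.

    Each pair is (loglik_correct, loglik_incorrect) for the same prompt.
    """
    wins = sum(correct_ll > incorrect_ll for correct_ll, incorrect_ll in pairs)
    return wins / len(pairs)


# Toy log-likelihoods, invented purely for illustration.
pairs = [(-3.1, -4.0), (-2.5, -2.0), (-1.0, -5.2), (-6.0, -6.5)]
accuracy = paired_accuracy(pairs)
print(accuracy)
```

Under this reading, 50% is chance level, which is why the abstract describes the coherent-but-wrong rule system as producing "near-chance accuracy".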

Title: Legal-DC: Benchmarking Retrieval-Augmented Generation for Legal Documents

Authors: Yaocong Li, Qiang Lan, Leihan Zhang, Le Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11772
Pdf URL: https://arxiv.org/pdf/2603.11772
Copy Paste: [[2603.11772]] Legal-DC: Benchmarking Retrieval-Augmented Generation for Legal Documents(https://arxiv.org/abs/2603.11772)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a promising technology for legal document consultation, yet its application in Chinese legal scenarios faces two key limitations: existing benchmarks lack specialized support for joint retriever-generator evaluation, and mainstream RAG systems often fail to accommodate the structured nature of legal provisions. To address these gaps, this study makes the following contributions. First, we constructed the Legal-DC benchmark dataset, comprising 480 legal documents (covering areas such as market regulation and contract management) and 2,475 refined question-answer pairs, each annotated with clause-level references, filling the gap for specialized evaluation resources in Chinese legal RAG. Second, we propose the LegRAG framework, which integrates legal adaptive indexing (clause-boundary segmentation) with a dual-path self-reflection mechanism to ensure clause integrity while enhancing answer accuracy. Third, we introduce automated evaluation methods for large language models to meet the high-reliability demands of legal retrieval scenarios. LegRAG outperforms existing state-of-the-art methods by 1.3% to 5.6% across key evaluation metrics. This research provides a specialized benchmark, practical framework, and empirical insights to advance the development of Chinese legal RAG systems. Our code and data are available at this https URL.

Title: Large Language Models for Biomedical Article Classification

Authors: Jakub Proboszcz, Paweł Cichosz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11780
Pdf URL: https://arxiv.org/pdf/2603.11780
Copy Paste: [[2603.11780]] Large Language Models for Biomedical Article Classification(https://arxiv.org/abs/2603.11780)
Keywords: language model, prompt
Abstract: This work presents a systematic and in-depth investigation of the utility of large language models as text classifiers for biomedical article classification. The study uses several small and mid-size open-source models, as well as selected closed-source ones, and is more comprehensive than most prior work with respect to the scope of evaluated configurations: different types of prompts, output processing methods for generating both class and class probability predictions, as well as few-shot example counts and selection methods. The performance of the most successful configurations is compared to that of conventional classification algorithms. The average PR AUC obtained over 15 challenging datasets (above 0.4 for zero-shot prompting and nearly 0.5 for few-shot prompting) comes close to that of the naïve Bayes classifier (0.5), the random forest algorithm (0.5 with default settings or 0.55 with hyperparameter tuning), and fine-tuned transformer models (0.5). These results confirm the utility of large language models as text classifiers for non-trivial domains and provide practical recommendations of the most promising setups, including in particular using output token probabilities for class probability prediction.
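
Sketch: "using output token probabilities for class probability prediction" is commonly done by reading the log-probabilities the model assigns to each class-label token and renormalizing over the label set. The version below assumes single-token labels and a hypothetical `label_logprobs` mapping; it is not taken from the paper, and real APIs expose log-probabilities in different ways.

```python
import math


def class_probabilities(label_logprobs):
    """Renormalize log-probabilities of label tokens into class probabilities."""
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(label_logprobs.values())
    weights = {label: math.exp(lp - top) for label, lp in label_logprobs.items()}
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}


# Hypothetical log-probabilities of the first output token for two labels.
# The remaining probability mass (0.2) sits on tokens outside the label set.
probs = class_probabilities({"yes": math.log(0.6), "no": math.log(0.2)})
print(probs)
```

A graded score like this, rather than a hard label parsed from generated text, is what makes threshold-free metrics such as PR AUC computable.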

Title: DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining

Authors: Yutong Yan, Raphael Tang, Zhenyu Gao, Wenxi Jiang, Yao Lu
Subjects: cs.CL, q-fin.GN
Abstract URL: https://arxiv.org/abs/2603.11838
Pdf URL: https://arxiv.org/pdf/2603.11838
Copy Paste: [[2603.11838]] DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining(https://arxiv.org/abs/2603.11838)
Keywords: language model, gpt
Abstract: In financial backtesting, large language models pretrained on internet-scale data risk introducing lookahead bias that undermines their forecasting validity, as they may have already seen the true outcome during training. To address this, we present DatedGPT, a family of twelve 1.3B-parameter language models, each trained from scratch on approximately 100 billion tokens of temporally partitioned data with strict annual cutoffs spanning 2013 to 2024. We further enhance each model with instruction fine-tuning on both general-domain and finance-specific datasets curated to respect the same temporal boundaries. Perplexity-based probing confirms that each model's knowledge is effectively bounded by its data cutoff year, while evaluation on standard benchmarks shows competitive performance with existing models of similar scale. We provide an interactive web demo that allows users to query and compare responses from models across different cutoff years.

Title: Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language

Authors: Remigiusz Kinas, Paweł Kiszczak, Sergio P. Perez, Krzysztof Ociepa, Łukasz Flis, Krzysztof Wróbel, Adrian Gwoździej
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.11881
Pdf URL: https://arxiv.org/pdf/2603.11881
Copy Paste: [[2603.11881]] Bielik-Minitron-7B: Compressing Large Language Models via Structured Pruning and Knowledge Distillation for the Polish Language(https://arxiv.org/abs/2603.11881)
Keywords: language model
Abstract: This report details the creation of Bielik-Minitron-7B, a compressed 7.35B parameter version of the Bielik-11B-v3.0 model, specifically optimized for European languages. By leveraging a two-stage compression methodology inspired by the NVIDIA Minitron approach, we combined structured hybrid pruning and knowledge distillation to reduce the model's parameter count by 33.4%, from 11.04B to 7.35B. We utilized the NVIDIA Model Optimizer for structural pruning and the NVIDIA NeMo Framework for logit-based distillation for quality recovery. Following distillation, the model underwent a rigorous alignment pipeline consisting of Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO-P), and Reinforcement Learning (GRPO). Our final model successfully recovered approximately 90% of the baseline model's performance while providing up to 50% inference speedup. This approach demonstrates an efficient pathway to create language models for less-represented languages, preserving the original model quality while reducing inference deployment costs.
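
Sketch: the stated 33.4% reduction can be checked from the two parameter counts given in the abstract. Only the arithmetic is shown; the variable names are illustrative.

```python
original_params = 11.04e9    # Bielik-11B-v3.0, as stated in the abstract
compressed_params = 7.35e9   # Bielik-Minitron-7B, as stated in the abstract

# Relative reduction in parameter count.
reduction = (original_params - compressed_params) / original_params
print(f"{reduction:.1%}")
```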

Title: CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?

Authors: Ruirui Chen, Weifeng Jiang, Chengwei Qin, Cheston Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11915
Pdf URL: https://arxiv.org/pdf/2603.11915
Copy Paste: [[2603.11915]] CoMMET: To What Extent Can LLMs Perform Theory of Mind Tasks?(https://arxiv.org/abs/2603.11915)
Keywords: language model, llm
Abstract: Theory of Mind (ToM), the ability to reason about the mental states of oneself and others, is a cornerstone of human social intelligence. As Large Language Models (LLMs) become ubiquitous in real-world applications, validating their capacity for this level of social reasoning is essential for effective and natural interactions. However, existing benchmarks for assessing ToM in LLMs are limited; most rely solely on text inputs and focus narrowly on belief-related tasks. In this paper, we propose a new multimodal benchmark dataset, CoMMET, a Comprehensive Mental states and Moral Evaluation Task inspired by the Theory of Mind Booklet Task. CoMMET expands the scope of evaluation by covering a broader range of mental states and introducing multi-turn testing. To the best of our knowledge, this is the first multimodal dataset to evaluate ToM in a multi-turn conversational setting. Through a comprehensive assessment of LLMs across different families and sizes, we analyze the strengths and limitations of current models and identify directions for future improvement. Our work offers a deeper understanding of the social cognitive capabilities of modern LLMs.

Title: PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents

Authors: Minjia Wang, Yunfeng Wang, Xiao Ma, Dexin Lv, Qifan Guo, Lynn Zheng, Benliang Wang, Lei Wang, Jiannan Li, Yongwei Xing, David Xu, Zheng Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11955
Pdf URL: https://arxiv.org/pdf/2603.11955
Copy Paste: [[2603.11955]] PersonaTrace: Synthesizing Realistic Digital Footprints with LLM Agents(https://arxiv.org/abs/2603.11955)
Keywords: language model, llm, agent
Abstract: Digital footprints (records of individuals' interactions with digital systems) are essential for studying behavior, developing personalized applications, and training machine learning models. However, research in this area is often hindered by the scarcity of diverse and accessible data. To address this limitation, we propose a novel method for synthesizing realistic digital footprints using large language model (LLM) agents. Starting from a structured user profile, our approach generates diverse and plausible sequences of user events, ultimately producing corresponding digital artifacts such as emails, messages, calendar entries, and reminders. Intrinsic evaluation results demonstrate that the generated dataset is more diverse and realistic than existing baselines. Moreover, models fine-tuned on our synthetic data outperform those trained on other synthetic datasets when evaluated on real-world out-of-distribution tasks.
摘要：数字足迹（个人与数字系统交互的记录）对于研究行为、开发个性化应用程序和训练机器学习模型至关重要。然而，该领域的研究常常因缺乏多样化和可访问的数据而受到阻碍。为了解决这个限制，我们提出了一种使用大语言模型（LLM）代理合成真实数字足迹的新方法。从结构化的用户配置文件开始，我们的方法生成多样化且合理的用户事件序列，最终生成相应的数字工件，例如电子邮件、消息、日历条目、提醒等。内在评估结果表明，生成的数据集比现有基线更加多样化和真实。此外，在现实世界的分布外任务上进行评估时，根据我们的合成数据进行微调的模型优于在其他合成数据集上训练的模型。

Title: CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading

Authors: Pranav Raikote, Korbinian Randl, Ioanna Miliou, Athanasios Lakes, Panagiotis Papapetrou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.11957
Pdf URL: https://arxiv.org/pdf/2603.11957
Copy Paste: [[2603.11957]] CHiL(L)Grader: Calibrated Human-in-the-Loop Short-Answer Grading(https://arxiv.org/abs/2603.11957)
Keywords: language model
Abstract: Scaling educational assessment with large language models requires not just accuracy, but the ability to recognize when predictions are trustworthy. Instruction-tuned models tend to be overconfident, and their reliability deteriorates as curricula evolve, making fully autonomous deployment unsafe in high-stakes settings. We introduce CHiL(L)Grader, the first automated grading framework that incorporates calibrated confidence estimation into a human-in-the-loop workflow. Using post-hoc temperature scaling, confidence-based selective prediction, and continual learning, CHiL(L)Grader automates only high-confidence predictions while routing uncertain cases to human graders, and adapts to evolving rubrics and unseen questions. Across three short-answer grading datasets, CHiL(L)Grader automatically scores 35-65% of responses at expert-level quality (QWK >= 0.80). A QWK gap of 0.347 between accepted and rejected predictions confirms the effectiveness of the confidence-based routing. Each correction cycle strengthens the model's grading capability as it learns from teacher feedback. These results show that uncertainty quantification is key for reliable AI-assisted grading.
摘要：使用大型语言模型扩展教育评估不仅需要准确性，还需要识别预测何时值得信赖的能力。经过指令调整的模型往往过于自信，并且其可靠性随着课程的发展而恶化，使得完全自主的部署在高风险环境中变得不安全。我们推出 CHiL(L)Grader，这是第一个自动评分框架，它将校准的置信度估计纳入人机交互工作流程中。使用事后温度缩放、基于置信度的选择性预测和持续学习，CHiL(L)Grader 仅自动执行高置信度预测，同时将不确定的案例发送给人工评分者，并适应不断变化的规则和看不见的问题。在三个简答评分数据集中，CHiL(L)Grader 自动对 35-65% 的回答进行专家级质量评分 (QWK >= 0.80)。接受和拒绝的预测之间的 QWK 差距为 0.347，证实了基于置信度的路由的有效性。每个校正周期都会增强模型的评分能力，因为它会从教师的反馈中学习。这些结果表明，不确定性量化是可靠的人工智能辅助分级的关键。
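
Example (illustrative, not from the paper): the abstract combines post-hoc temperature scaling with confidence-based routing. The sketch below shows one plausible way these two pieces could fit together. The function names, the grid-search fit, and the 0.8 threshold are assumptions made for illustration; the paper's actual procedure may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 flattens the distribution)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels):
    """Assumed fitting method: grid search over T in [0.5, 5.0] on held-out data."""
    candidates = [0.5 + 0.1 * i for i in range(46)]
    return min(candidates, key=lambda t: nll(logit_rows, labels, t))

def route(logits, temperature, threshold=0.8):
    """Return (predicted_class, confidence, decision).

    decision is 'auto' when calibrated confidence reaches the (hypothetical)
    threshold, otherwise 'human' to send the case to a human grader.
    """
    probs = softmax(logits, temperature)
    confidence = max(probs)
    return probs.index(confidence), confidence, (
        "auto" if confidence >= threshold else "human")
```

With an overconfident model, the fitted temperature comes out above 1, which lowers the reported confidence and pushes borderline cases to the human grader.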

Title: BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs

Authors: Ilias Aarab
Subjects: cs.CL, cs.AI, cs.LG, stat.ML
Abstract URL: https://arxiv.org/abs/2603.11991
Pdf URL: https://arxiv.org/pdf/2603.11991
Copy Paste: [[2603.11991]] BTZSC: A Benchmark for Zero-Shot Text Classification Across Cross-Encoders, Embedding Models, Rerankers and LLMs(https://arxiv.org/abs/2603.11991)
Keywords: language model, llm
Abstract: Zero-shot text classification (ZSC) offers the promise of eliminating costly task-specific annotation by matching texts directly to human-readable label descriptions. While early approaches have predominantly relied on cross-encoder models fine-tuned for natural language inference (NLI), recent advances in text-embedding models, rerankers, and instruction-tuned large language models (LLMs) have challenged the dominance of NLI-based architectures. Yet, systematically comparing these diverse approaches remains difficult. Existing evaluations, such as MTEB, often incorporate labeled examples through supervised probes or fine-tuning, leaving genuine zero-shot capabilities underexplored. To address this, we introduce BTZSC, a comprehensive benchmark of 22 public datasets spanning sentiment, topic, intent, and emotion classification, capturing diverse domains, class cardinalities, and document lengths. Leveraging BTZSC, we conduct a systematic comparison across four major model families, NLI cross-encoders, embedding models, rerankers and instruction-tuned LLMs, encompassing 38 public and custom checkpoints. Our results show that: (i) modern rerankers, exemplified by Qwen3-Reranker-8B, set a new state-of-the-art with macro F1 = 0.72; (ii) strong embedding models such as GTE-large-en-v1.5 substantially close the accuracy gap while offering the best trade-off between accuracy and latency; (iii) instruction-tuned LLMs at 4--12B parameters achieve competitive performance (macro F1 up to 0.67), excelling particularly on topic classification but trailing specialized rerankers; (iv) NLI cross-encoders plateau even as backbone size increases; and (v) scaling primarily benefits rerankers and LLMs over embedding models. BTZSC and accompanying evaluation code are publicly released to support fair and reproducible progress in zero-shot text understanding.
摘要：零样本文本分类 (ZSC) 通过将文本直接与人类可读的标签描述进行匹配，有望消除成本高昂的特定任务注释。虽然早期的方法主要依赖于针对自然语言推理 (NLI) 进行微调的跨编码器模型，但文本嵌入模型、重新排序器和指令调整大语言模型 (LLM) 的最新进展挑战了基于 NLI 的架构的主导地位。然而，系统地比较这些不同的方法仍然很困难。现有的评估（例如 MTEB）通常通过监督探测或微调来纳入标记示例，从而导致真正的零样本能力尚未得到充分开发。为了解决这个问题，我们引入了 BTZSC，这是一个包含 22 个公共数据集的综合基准，涵盖情感、主题、意图和情感分类，捕获不同的领域、类基数和文档长度。利用 BTZSC，我们对四个主要模型系列、NLI 交叉编码器、嵌入模型、重新排序器和指令调整的 LLM 进行了系统比较，涵盖 38 个公共和自定义检查点。我们的结果表明：（i）现代重新排序器，以 Qwen3-Reranker-8B 为例，设置了宏 F1 = 0.72 的新的最先进技术； (ii) 强大的嵌入模型（例如 GTE-large-en-v1.5）大大缩小了精度差距，同时提供了精度和延迟之间的最佳权衡； (iii) 指令调整的法学硕士在 4--12B 参数上实现了有竞争力的表现（宏观 F1 高达 0.67），尤其在主题分类方面表现出色，但落后于专门的重新排序者； (iv) 即使主干尺寸增加，NLI 交叉编码器也处于稳定状态； (v) 与嵌入模型相比，扩展主要有利于重新排名者和法学硕士。公开发布 BTZSC 和随附的评估代码，以支持零样本文本理解的公平且可重复的进展。
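
Example (illustrative): the results above are reported as macro F1, which averages per-class F1 scores with equal weight so that rare classes count as much as frequent ones. A minimal sketch of that metric follows; the function name and the convention of scoring a class with no support and no predictions as 0 are choices made here, not details from the paper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For instance, with true labels `["a", "a", "b", "b"]` and predictions `["a", "b", "b", "b"]`, class `a` scores 2/3 and class `b` scores 4/5, giving a macro F1 of 11/15.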

Title: Translationese as a Rational Response to Translation Task Difficulty

Authors: Maria Kunilovskaya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.12050
Pdf URL: https://arxiv.org/pdf/2603.12050
Copy Paste: [[2603.12050]] Translationese as a Rational Response to Translation Task Difficulty(https://arxiv.org/abs/2603.12050)
Keywords: llm
Abstract: Translations systematically diverge from texts originally produced in the target language, a phenomenon widely referred to as translationese. Translationese has been attributed to production tendencies (e.g. interference, simplification), socio-cultural variables, and language-pair effects, yet a unified explanatory account is still lacking. We propose that translationese reflects cognitive load inherent in the translation task itself. We test whether observable translationese can be predicted from quantifiable measures of translation task difficulty. Translationese is operationalised as a segment-level translatedness score produced by an automatic classifier. Translation task difficulty is conceptualised as comprising source-text and cross-lingual transfer components, operationalised mainly through information-theoretic metrics based on LLM surprisal, complemented by established syntactic and semantic alternatives. We use a bidirectional English-German corpus comprising written and spoken subcorpora. Results indicate that translationese can be partly explained by translation task difficulty, especially in English-to-German. For most experiments, cross-lingual transfer difficulty contributes more than source-text complexity. Information-theoretic indicators match or outperform traditional features in written mode, but offer no advantage in spoken mode. Source-text syntactic complexity and translation-solution entropy emerged as the strongest predictors of translationese across language pairs and modes.
摘要：翻译系统地偏离最初用目标语言生成的文本，这种现象被广泛称为翻译语。翻译语被归因于生产倾向（例如干扰、简化）、社会文化变量和语言对效应，但仍然缺乏统一的解释。我们认为翻译语反映了翻译任务本身固有的认知负荷。我们测试是否可以通过可量化的翻译任务难度测量来预测可观察的翻译语。 Translationese 被操作为由自动分类器生成的片段级翻译分数。翻译任务难度被概念化为包括源文本和跨语言转换组件，主要通过基于 LLM 惊喜的信息论度量来操作，并辅以已建立的句法和语义替代方案。我们使用包含书面和口语子语料库的双向英语-德语语料库。结果表明，翻译语的部分原因在于翻译任务的难度，尤其是英语到德语的翻译任务难度。对于大多数实验来说，跨语言迁移难度比源文本复杂性更重要。信息论指标在书面模式下与传统特征相匹配或优于传统特征，但在口语模式下没有优势。源文本句法复杂性和翻译解熵成为跨语言对和模式的翻译语的最强预测因子。
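
Example (illustrative): the information-theoretic difficulty metrics above are based on LLM surprisal, the negative log-probability a model assigns to a token in context. The sketch below computes per-token surprisal in bits and averages it over a segment. It assumes the token probabilities have already been obtained from some language model; the exact segment-level aggregation used in the paper is not stated in the abstract.

```python
import math

def surprisal_bits(prob):
    """Surprisal of one token in bits: -log2 p(token | context)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(prob)

def mean_surprisal(token_probs):
    """Average surprisal over a segment (an assumed aggregation choice)."""
    if not token_probs:
        raise ValueError("segment must contain at least one token")
    return sum(surprisal_bits(p) for p in token_probs) / len(token_probs)
```

A token predicted with probability 0.5 carries 1 bit of surprisal and one predicted with probability 0.25 carries 2 bits, so a segment made of those two tokens averages 1.5 bits.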

Title: To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times

Authors: Thomas Hikaru Clark, Carlos Arriaga, Javier Conde, Gonzalo Martínez, Pedro Reviriego
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.12105
Pdf URL: https://arxiv.org/pdf/2603.12105
Copy Paste: [[2603.12105]] To Words and Beyond: Probing Large Language Models for Sentence-Level Psycholinguistic Norms of Memorability and Reading Times(https://arxiv.org/abs/2603.12105)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have recently been shown to produce estimates of psycholinguistic norms, such as valence, arousal, or concreteness, for words and multiword expressions, that correlate with human judgments. These estimates are obtained by prompting an LLM, in zero-shot fashion, with a question similar to those used in human studies. Meanwhile, for other norms such as lexical decision time or age of acquisition, LLMs require supervised fine-tuning to obtain results that align with ground-truth values. In this paper, we extend this approach to the previously unstudied features of sentence memorability and reading times, which involve the relationship between multiple words in a sentence-level context. Our results show that via fine-tuning, models can provide estimates that correlate with human-derived norms and exceed the predictive power of interpretable baseline predictors, demonstrating that LLMs contain useful information about sentence-level features. At the same time, our results show very mixed zero-shot and few-shot performance, providing further evidence that care is needed when using LLM-prompting as a proxy for human cognitive measures.
摘要：最近，大型语言模型（LLM）已被证明可以对与人类判断相关的单词和多词表达的心理语言学规范进行估计，例如效价、唤醒度或具体性。这些估计是通过以零样本的方式向法学硕士提出类似于人类研究中使用的问题来获得的。同时，对于其他规范，例如词汇决策时间或习得年龄，法学硕士需要监督微调以获得与真实值一致的结果。在本文中，我们将这种方法扩展到以前未研究的句子记忆性和阅读时间的特征，其中涉及句子级上下文中多个单词之间的关系。我们的结果表明，通过微调，模型可以提供与人类衍生规范相关的估计值，并超过可解释基线预测器的预测能力，证明 LLM 包含有关句子级特征的有用信息。与此同时，我们的结果显示了非常混合的零样本和少样本性能，进一步证明在使用 LLM 提示作为人类认知测量的代理时需要小心。

Title: SommBench: Assessing Sommelier Expertise of Language Models

Authors: William Brach, Tomas Bedej, Jacob Nielsen, Jacob Pichna, Juraj Bedej, Eemeli Saarensilta, Julie Dupouy, Gianluca Barmina, Andrea Blasi Núñez, Peter Schneider-Kamp, Kristian Košťál, Michal Ries, Lukas Galke Poech
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.12117
Pdf URL: https://arxiv.org/pdf/2603.12117
Copy Paste: [[2603.12117]] SommBench: Assessing Sommelier Expertise of Language Models(https://arxiv.org/abs/2603.12117)
Keywords: language model, gpt
Abstract: With the rapid advances of large language models, it becomes increasingly important to systematically evaluate their multilingual and multicultural capabilities. Previous cultural evaluation benchmarks focus mainly on basic cultural knowledge that can be encoded in linguistic form. Here, we propose SommBench, a multilingual benchmark to assess sommelier expertise, a domain deeply grounded in the senses of smell and taste. While language models learn about sensory properties exclusively through textual descriptions, SommBench tests whether this textual grounding is sufficient to emulate expert-level sensory judgment. SommBench comprises three main tasks: Wine Theory Question Answering (WTQA), Wine Feature Completion (WFC), and Food-Wine Pairing (FWP). SommBench is available in multiple languages: English, Slovak, Swedish, Finnish, German, Danish, Italian, and Spanish. This helps separate a language model's wine expertise from its language skills. The benchmark datasets were developed in close collaboration with a professional sommelier and native speakers of the respective languages, resulting in 1,024 wine theory questions, 1,000 wine feature-completion examples, and 1,000 food-wine pairing examples. We provide results for the most popular language models, including closed-weights models such as Gemini 2.5, and open-weights models, such as GPT-OSS and Qwen 3. Our results show that the most capable models perform well on wine theory question answering (up to 97% correct with a closed-weights model), yet feature completion (peaking at 65%) and food-wine pairing (MCC ranging between 0 and 0.39) turn out to be more challenging. These results position SommBench as an interesting and challenging benchmark for evaluating the sommelier expertise of language models. The benchmark is publicly available at this https URL.
摘要：随着大型语言模型的快速发展，系统地评估其多语言和多文化能力变得越来越重要。以往的文化评估基准主要关注能够以语言形式编码的基础文化知识。在这里，我们提出了 SommBench，这是一个评估侍酒师专业知识的多语言基准，这是一个深深植根于嗅觉和味觉的领域。虽然语言模型仅通过文本描述来了解感官属性，但 SommBench 测试这种文本基础是否足以模拟专家级的感官判断。 SommBench 包括三个主要任务：葡萄酒理论问答 (WTQA)、葡萄酒特征完成 (WFC) 和食物与葡萄酒搭配 (FWP)。 SommBench 有多种语言版本：英语、斯洛伐克语、瑞典语、芬兰语、德语、丹麦语、意大利语和西班牙语。这有助于将语言模型的葡萄酒专业知识与其语言技能区分开来。基准数据集是与专业侍酒师和各自语言的母语人士密切合作开发的，产生了 1,024 个葡萄酒理论问答题、1,000 个葡萄酒功能完成示例和 1,000 个食物与葡萄酒搭配示例。我们提供了最流行的语言模型的结果，包括 Gemini 2.5 等封闭权重模型和 GPT-OSS 和 Qwen 3 等开放权重模型。我们的结果表明，最有能力的模型在葡萄酒理论问答方面表现良好（封闭权重模型的正确率高达 97%），但功能完成（峰值为 65%）和食物与葡萄酒配对展示（MCC 范围在 0 到 0.39 之间）结果更差。具有挑战性。这些结果使 SommBench 成为评估侍酒师语言模型专业知识的有趣且具有挑战性的基准。该基准可通过此 https URL 公开获取。
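
Example (illustrative): the food-wine pairing results are reported as MCC (Matthews correlation coefficient), where 0 corresponds to chance-level agreement and 1 to perfect agreement. The sketch below computes the binary form of the metric. It assumes pairing judgments are encoded as 0/1 labels, which the abstract does not state, and returns 0 when the denominator vanishes, a common convention.

```python
import math

def mcc(y_true, y_pred):
    """Binary Matthews correlation coefficient for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: a degenerate confusion matrix yields 0 rather than an error.
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

Unlike plain accuracy, a model that always predicts the same class gets an MCC of 0, which makes the metric informative when good and bad pairings are imbalanced.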

Title: Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions

Authors: Tae-Eun Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.12123
Pdf URL: https://arxiv.org/pdf/2603.12123
Copy Paste: [[2603.12123]] Cross-Context Review: Improving LLM Output Quality by Separating Production and Review Sessions(https://arxiv.org/abs/2603.12123)
Keywords: language model, llm, agent
Abstract: Large language models struggle to catch errors in their own outputs when the review happens in the same session that produced them. This paper introduces Cross-Context Review (CCR), a straightforward method where the review is conducted in a fresh session with no access to the production conversation history. We ran a controlled experiment: 30 artifacts (code, technical documents, presentation scripts) with 150 injected errors, tested under four review conditions -- same-session Self-Review (SR), repeated Self-Review (SR2), context-aware Subagent Review (SA), and Cross-Context Review (CCR). Over 360 reviews, CCR reached an F1 of 28.6%, outperforming SR (24.6%, p=0.008, d=0.52), SR2 (21.7%, p<0.001, d=0.72), and SA (23.8%, p=0.004, d=0.57). The SR2 result matters most for interpretation: reviewing twice in the same session did not beat reviewing once (p=0.11), which rules out repetition as an explanation for CCR's advantage. The benefit comes from context separation itself. CCR works with any model, needs no infrastructure, and costs only one extra session.
摘要：当审查发生在生成它们的同一会话中时，大型语言模型很难捕获自己输出中的错误。本文介绍了跨上下文审查 (CCR)，这是一种直接的方法，在新会话中进行审查，无法访问生产对话历史记录。我们进行了一项对照实验：30 个工件（代码、技术文档、演示脚本），其中包含 150 个注入错误，在四种审查条件下进行测试：同一会话自我审查 (SR)、重复自我审查 (SR2)、上下文感知子代理审查 (SA) 和跨上下文审查 (CCR)。经过 360 条评论，CCR 的 F1 达到 28.6%，优于 SR（24.6%，p=0.008，d=0.52）、SR2（21.7%，p<0.001，d=0.72）和 SA（23.8%，p=0.004，d=0.57）。 SR2 结果对于解释来说最重要：在同一会话中复习两次并不比复习一次更好 (p=0.11)，这排除了重复作为 CCR 优势的解释。好处来自上下文分离本身。 CCR 适用于任何模型，不需要基础设施，并且只需要一次额外的会话。

Title: LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation

Authors: Feiyu Duan, Xuanjing Huang, Zhongyu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.12152
Pdf URL: https://arxiv.org/pdf/2603.12152
Copy Paste: [[2603.12152]] LifeSim: Long-Horizon User Life Simulator for Personalized Assistant Evaluation(https://arxiv.org/abs/2603.12152)
Keywords: language model, llm
Abstract: The rapid advancement of large language models (LLMs) has accelerated progress toward universal AI assistants. However, existing benchmarks for personalized assistants remain misaligned with real-world user-assistant interactions, failing to capture the complexity of external contexts and users' cognitive states. To bridge this gap, we propose LifeSim, a user simulator that models user cognition through the Belief-Desire-Intention (BDI) model within physical environments to generate coherent life trajectories, and simulates intention-driven interactive user behaviors. Based on LifeSim, we introduce LifeSim-Eval, a comprehensive benchmark for multi-scenario, long-horizon personalized assistance. LifeSim-Eval covers 8 life domains and 1,200 diverse scenarios, and adopts a multi-turn interactive method to assess models' abilities to complete explicit and implicit intentions, recover user profiles, and produce high-quality responses. Under both single-scenario and long-horizon settings, our experiments reveal that current LLMs face significant limitations in handling implicit intention and long-term user preference modeling.
摘要：大语言模型（LLM）的快速发展加速了通用人工智能助手的发展。然而，个性化助理的现有基准仍然与现实世界的用户助理交互不一致，无法捕捉外部环境和用户认知状态的复杂性。为了弥补这一差距，我们提出了 LifeSim，这是一种用户模拟器，它通过物理环境中的信念-愿望-意图 (BDI) 模型对用户认知进行建模，以生成连贯的生活轨迹，并模拟意图驱动的用户交互行为。在LifeSim的基础上，我们推出了多场景、长视野个性化援助的综合基准LifeSim-Eval。 LifeSim-Eval覆盖8个生活领域、1200个不同场景，采用多轮交互方法评估模型完成显性和隐性意图、恢复用户画像和产生高质量响应的能力。在单一场景和长期设置下，我们的实验表明，当前的法学硕士在处理隐含意图和长期用户偏好建模方面面临着重大限制。

Title: QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions

Authors: Jiayin Lei, Ming Ma, Yunxi Duan, Chenxi Li, Tianming Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.12165
Pdf URL: https://arxiv.org/pdf/2603.12165
Copy Paste: [[2603.12165]] QAQ: Bidirectional Semantic Coherence for Selecting High-Quality Synthetic Code Instructions(https://arxiv.org/abs/2603.12165)
Keywords: llm, hallucination
Abstract: Synthetic data has become essential for training code generation models, yet it introduces significant noise and hallucinations that are difficult to detect with current metrics. Existing data selection methods like Instruction-Following Difficulty (IFD) typically assess how hard it is for a model to generate an answer given a query ($A|Q$). However, this metric is ambiguous on noisy synthetic data, where a low probability cannot distinguish between intrinsic task complexity and model-generated hallucinations. Here, we propose QAQ, a novel data selection framework that evaluates data quality from the reverse direction: how well can the answer predict the query ($Q|A$)? We define Reverse Mutual Information (RMI) to quantify the information gain about the query conditioned on the answer. Our analyses reveal that both extremes of RMI signal quality issues: low RMI indicates semantic misalignment, while excessively high RMI may indicate defect patterns that LLMs easily recognize. Furthermore, we introduce a selection strategy based on the disagreement between strong and weak models to identify samples that are valid yet challenging. Experiments on the WarriorCoder dataset demonstrate that selecting just 25% of data using stratified RMI achieves comparable performance to full-data training, significantly outperforming existing data selection methods. Our approach highlights the importance of bidirectional semantic coherence in synthetic data curation, offering a scalable pathway to reduce computational costs without sacrificing model capability.
摘要：合成数据对于训练代码生成模型至关重要，但它引入了当前指标难以检测到的显着噪声和幻觉。指令遵循难度 (IFD) 等现有数据选择方法通常评估模型在给定查询 ($A|Q$) 的情况下生成答案的难度。然而，这个指标在嘈杂的合成数据上是不明确的，其中低概率可以区分内在的任务复杂性和模型生成的幻觉。在这里，我们提出了 QAQ，一种新颖的数据选择框架，它从相反的方向评估数据质量：答案能够如何很好地预测查询（$Q|A$）？我们定义反向互信息（RMI）来量化以答案为条件的查询的信息增益。我们的分析揭示了 RMI 信号质量问题的两个极端：低 RMI 表示语义错位，而过高的 RMI 可能包含法学硕士很容易识别的缺陷模式。此外，我们引入了一种基于强模型和弱模型之间差异的选择策略，以识别有效但具有挑战性的样本。 WarriorCoder 数据集上的实验表明，使用分层 RMI 仅选择 25% 的数据即可实现与全数据训练相当的性能，显着优于现有的数据选择方法。我们的方法强调了合成数据管理中双向语义一致性的重要性，提供了一种可扩展的途径，可以在不牺牲模型能力的情况下降低计算成本。
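
Example (illustrative, not the paper's definition): the abstract does not give the RMI formula. One natural reading of "information gain about the query conditioned on the answer" is the pointwise quantity log p(Q|A) - log p(Q), averaged over query tokens. The sketch below adopts that assumed form and pairs it with a simple filter that drops both extremes, echoing the observation that very low and very high RMI both signal problems. The quantile cut-offs and function names are hypothetical.

```python
def rmi(logp_query_given_answer, logp_query):
    """Assumed RMI: per-token mean of log p(Q|A) - log p(Q).

    Both arguments are lists of per-token log-probabilities of the same query,
    scored with and without the answer in context.
    """
    if not logp_query or len(logp_query_given_answer) != len(logp_query):
        raise ValueError("inputs must be non-empty and of equal length")
    gain = sum(logp_query_given_answer) - sum(logp_query)
    return gain / len(logp_query)

def select_mid_band(scores, low_q=0.25, high_q=0.75):
    """Return indices of samples whose score lies between two quantiles."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    lo = int(len(order) * low_q)
    hi = int(len(order) * high_q)
    return sorted(order[lo:hi])
```

The paper's stratified selection and its strong-versus-weak model disagreement strategy are more involved than this middle-band filter, which is only meant to make the "both extremes are suspicious" idea concrete.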

Title: Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections

Authors: Łukasz Borchmann, Jordy Van Landeghem, Michał Turski, Shreyansh Padarha, Ryan Othniel Kearns, Adam Mahdi, Niels Rogge, Clémentine Fourrier, Siwei Han, Huaxiu Yao, Artemis Llabrés, Yiming Xu, Dimosthenis Karatzas, Hao Zhang, Anupam Datta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.12180
Pdf URL: https://arxiv.org/pdf/2603.12180
Copy Paste: [[2603.12180]] Strategic Navigation or Stochastic Search? How Agents and Humans Reason Over Document Collections(https://arxiv.org/abs/2603.12180)
Keywords: agent
Abstract: Multimodal agents offer a promising path to automating complex document-intensive workflows. Yet, a critical question remains: do these agents demonstrate genuine strategic reasoning, or merely stochastic trial-and-error search? To address this, we introduce MADQA, a benchmark of 2,250 human-authored questions grounded in 800 heterogeneous PDF documents. Guided by Classical Test Theory, we design it to maximize discriminative power across varying levels of agentic abilities. To evaluate agentic behaviour, we introduce a novel evaluation protocol measuring the accuracy-effort trade-off. Using this framework, we show that while the best agents can match human searchers in raw accuracy, they succeed on largely different questions and rely on brute-force search to compensate for weak strategic planning. They fail to close the nearly 20% gap to oracle performance, persisting in unproductive loops. We release the dataset and evaluation harness to help facilitate the transition from brute-force retrieval to calibrated, efficient reasoning.
摘要：多模式代理为自动化复杂的文档密集型工作流程提供了一条有前途的途径。然而，一个关键问题仍然存在：这些智能体是否表现出真正的战略推理，或者仅仅是随机的试错搜索？为了解决这个问题，我们引入了 MADQA，这是一个基于 800 个异构 PDF 文档的 2,250 个人工编写问题的基准。在经典测试理论的指导下，我们设计它是为了最大限度地提高不同水平的代理能力的辨别能力。为了评估代理行为，我们引入了一种新颖的评估协议来衡量准确性与努力的权衡。使用这个框架，我们表明，虽然最好的智能体可以在原始准确度上与人类搜索者相匹配，但他们在很大程度上不同的问题上取得了成功，并依靠强力搜索来弥补薄弱的战略规划。他们未能缩小与预言机性能近 20% 的差距，持续处于非生产性循环中。我们发布数据集和评估工具，以帮助促进从强力检索到校准、高效推理的转变。

Title: Long-Context Encoder Models for Polish Language Understanding

Authors: Sławomir Dadas, Rafał Poświata, Marek Kozłowski, Małgorzata Grębowiec, Michał Perełkiewicz, Paweł Klimiuk, Przemysław Boruta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.12191
Pdf URL: https://arxiv.org/pdf/2603.12191
Copy Paste: [[2603.12191]] Long-Context Encoder Models for Polish Language Understanding(https://arxiv.org/abs/2603.12191)
Keywords: language model, llm
Abstract: While decoder-only Large Language Models (LLMs) have recently dominated the NLP landscape, encoder-only architectures remain a cost-effective and parameter-efficient standard for discriminative tasks. However, classic encoders like BERT are limited by a short context window, which is insufficient for processing long documents. In this paper, we address this limitation for Polish by introducing a high-quality Polish model capable of processing sequences of up to 8192 tokens. The model was developed by employing a two-stage training procedure that involves positional embedding adaptation and full parameter continuous pre-training. Furthermore, we propose compressed model variants trained via knowledge distillation. The models were evaluated on 25 tasks, including the KLEJ benchmark, a newly introduced financial task suite (FinBench), and other classification and regression tasks, specifically those requiring long-document understanding. The results demonstrate that our model achieves the best average performance among Polish and multilingual models, significantly outperforming competitive solutions in long-context tasks while maintaining comparable quality on short texts.
摘要：虽然仅解码器的大型语言模型 (LLM) 最近在 NLP 领域占据主导地位，但仅编码器的架构仍然是判别任务的经济高效且参数高效的标准。然而，像 BERT 这样的经典编码器受到短上下文窗口的限制，不足以处理长文档。在本文中，我们通过引入能够处理最多 8192 个标记的序列的高质量 Polish 模型来解决 Polish 的这一限制。该模型是通过采用两阶段训练程序开发的，其中包括位置嵌入自适应和全参数连续预训练。此外，我们提出通过知识蒸馏训练的压缩模型变体。这些模型在 25 项任务上进行了评估，包括 KLEJ 基准、新推出的金融任务套件 (FinBench) 以及其他分类和回归任务，特别是那些需要理解长文档的任务。结果表明，我们的模型在波兰语和多语言模型中实现了最佳平均性能，在长上下文任务中显着优于竞争解决方案，同时在短文本上保持了可比的质量。

Title: IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Authors: Yushi Bai, Qian Dong, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.12201
Pdf URL: https://arxiv.org/pdf/2603.12201
Copy Paste: [[2603.12201]] IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse(https://arxiv.org/abs/2603.12201)
Keywords: language model, agent
Abstract: Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
摘要：长上下文代理工作流程已成为大型语言模型的定义用例，使得注意力效率对于推理速度和服务成本都至关重要。稀疏注意力有效地解决了这一挑战，而 DeepSeek 稀疏注意力 (DSA) 是一种具有代表性的生产级解决方案：轻量级闪电索引器为每个查询选择前 k 个最相关的标记，将核心注意力从 $O(L^2)$ 减少到 $O(Lk)$。然而，索引器本身保留了 $O(L^2)$ 复杂性，并且必须在每一层独立运行，尽管事实上生成的 top-k 选择在连续层之间高度相似。我们提出了 IndexCache，它通过将层划分为一小组运行自己的索引器的完整层和仅重用最近的完整层的 top-k 索引的大多数共享层来利用这种跨层冗余。我们提出了两种互补的方法来确定和优化此配置。免训练 IndexCache 应用贪婪搜索算法，通过直接最小化校准集上的语言建模损失来选择保留索引器的层，无需权重更新。训练感知 IndexCache 引入了多层蒸馏损失，可以根据其所服务的所有层的平均注意力分布来训练每个保留的索引器，甚至使简单的交错模式也能匹配全索引器的准确性。 30B DSA 模型上的实验结果表明，IndexCache 可以消除 75% 的索引器计算，而质量下降可以忽略不计，与标准 DSA 相比，实现高达 1.82$\times$ 预填充加速和 1.48$\times$ 解码加速。我们对生产规模 GLM-5 模型的初步实验进一步证实了这些积极结果（图 1）。
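
Example (illustrative): the Full/Shared layer split described above can be sketched as a loop in which only Full layers run an indexer and every Shared layer reuses the most recently computed top-k indices. The sketch assumes "nearest Full layer" means the nearest preceding one, that the first layer is Full, and that indexer scores are plain lists. The pattern string and function names are hypothetical.

```python
def topk_indices(scores, k):
    """Indices of the k highest scores, returned in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def run_layers(pattern, indexer_scores, k):
    """Walk the layers of a model described by a pattern of 'F' and 'S'.

    'F' (Full) layers compute their own top-k from indexer_scores[layer];
    'S' (Shared) layers reuse the cached indices of the preceding Full layer.
    Returns (per-layer selected indices, number of indexer calls).
    """
    if not pattern or pattern[0] != "F":
        raise ValueError("the first layer must be a Full layer")
    selected, cached, indexer_calls = [], None, 0
    for layer, kind in enumerate(pattern):
        if kind == "F":
            cached = topk_indices(indexer_scores[layer], k)
            indexer_calls += 1
        selected.append(cached)
    return selected, indexer_calls
```

With a pattern such as `"FSSS"`, only one of four layers runs its indexer, which corresponds to the 75% reduction in indexer computation quoted in the abstract; which layers to keep as Full is what the paper's greedy search or distillation procedure decides.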

Title: CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks

Authors: Alexandre Le Mercier, Thomas Demeester, Chris Develder
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.12206
Pdf URL: https://arxiv.org/pdf/2603.12206
Copy Paste: [[2603.12206]] CLASP: Defending Hybrid Large Language Models Against Hidden State Poisoning Attacks(https://arxiv.org/abs/2603.12206)
Keywords: language model, llm
Abstract: State space models (SSMs) like Mamba have gained significant traction as efficient alternatives to Transformers, achieving linear complexity while maintaining competitive performance. However, Hidden State Poisoning Attacks (HiSPAs), a recently discovered vulnerability that corrupts SSM memory through adversarial strings, pose a critical threat to these architectures and their hybrid variants. Framing the HiSPA mitigation task as a binary classification problem at the token level, we introduce the CLASP model to defend against this threat. CLASP exploits distinct patterns in Mamba's block output embeddings (BOEs) and uses an XGBoost classifier to identify malicious tokens with minimal computational overhead. We consider a realistic scenario in which both SSMs and HiSPAs are likely to be used: an LLM screening résumés to identify the best candidates for a role. Evaluated on a corpus of 2,483 résumés totaling 9.5M tokens with controlled injections, CLASP achieves a 95.9% token-level F1 score and a 99.3% document-level F1 score on malicious token detection. Crucially, the model generalizes to unseen attack patterns: under leave-one-out cross-validation, performance remains high (96.9% document-level F1), while under clustered cross-validation with structurally novel triggers, it maintains useful detection capability (91.6% average document-level F1). Operating independently of any downstream model, CLASP processes 1,032 tokens per second with under 4GB VRAM consumption, potentially making it suitable for real-world deployment as a lightweight front-line defense for SSM-based and hybrid architectures. All code and detailed results are available at this https URL.
摘要：像 Mamba 这样的状态空间模型 (SSM) 作为 Transformer 的有效替代品已经获得了巨大的关注，在保持竞争性能的同时实现了线性复杂性。然而，隐藏状态中毒攻击 (HiSPA) 是最近发现的一个漏洞，它通过对抗性字符串破坏 SSM 内存，对这些架构及其混合变体构成了严重威胁。将 HiSPA 缓解任务构建为令牌级别的二元分类问题，我们引入 CLASP 模型来防御这种威胁。 CLASP 利用 Mamba 块输出嵌入 (BOE) 中的不同模式，并使用 XGBoost 分类器以最小的计算开销识别恶意令牌。我们考虑一个可能同时使用 SSM 和 HiSPA 的现实场景：法学硕士筛选简历，以确定职位的最佳候选人。通过对 2,483 份简历（总计 950 万个令牌）的语料库进行评估，CLASP 在恶意令牌检测方面实现了 95.9% 的令牌级 F1 得分和 99.3% 的文档级 F1 得分。至关重要的是，该模型可推广到看不见的攻击模式：在留一法交叉验证下，性能仍然很高（96.9% 文档级 F1），而在具有结构新颖的触发器的集群交叉验证下，它保持有用的检测能力（平均文档级 F1 为 91.6%）。 CLASP 独立于任何下游模型运行，每秒处理 1,032 个令牌，消耗量低于 4GB VRAM，这可能使其适合实际部署，作为基于 SSM 和混合架构的轻量级前线防御。所有代码和详细结果都可以在此 https URL 中找到。

Title: Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration

Authors: Priyanka Kargupta, Shuhaib Mehri, Dilek Hakkani-Tur, Jiawei Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.12226
Pdf URL: https://arxiv.org/pdf/2603.12226
Copy Paste: [[2603.12226]] Sparking Scientific Creativity via LLM-Driven Interdisciplinary Inspiration(https://arxiv.org/abs/2603.12226)
Keywords: language model, llm
Abstract: Despite interdisciplinary research leading to larger and longer-term impact, most work remains confined to single-domain academic silos. Recent AI-based approaches to scientific discovery show promise for interdisciplinary research, but many prioritize rapidly designing experiments and solutions, bypassing the exploratory, collaborative reasoning processes that drive creative interdisciplinary breakthroughs. As a result, prior efforts largely prioritize automating scientific discovery rather than augmenting the reasoning processes that underlie scientific disruption. We present Idea-Catalyst, a novel framework that systematically identifies interdisciplinary insights to support creative reasoning in both humans and large language models. Starting from an abstract research goal, Idea-Catalyst is designed to assist the brainstorming stage, explicitly avoiding premature anchoring on specific solutions. The framework embodies key metacognitive features of interdisciplinary reasoning: (a) defining and assessing research goals, (b) awareness of a domain's opportunities and unresolved challenges, and (c) strategic exploration of interdisciplinary ideas based on impact potential. Concretely, Idea-Catalyst decomposes an abstract goal (e.g., improving human-AI collaboration) into core target-domain research questions that guide the analysis of progress and open challenges within that domain. These challenges are reformulated as domain-agnostic conceptual problems, enabling retrieval from external disciplines (e.g., Psychology, Sociology) that address analogous issues. By synthesizing and recontextualizing insights from these domains back into the target domain, Idea-Catalyst ranks source domains by their interdisciplinary potential. Empirically, this targeted integration improves average novelty by 21% and insightfulness by 16%, while remaining grounded in the original research problem.
摘要：尽管跨学科研究产生了更大、更长期的影响，但大多数工作仍然局限于单一领域的学术孤岛。最近基于人工智能的科学发现方法显示出跨学科研究的前景，但许多人优先考虑快速设计实验和解决方案，绕过了推动创造性跨学科突破的探索性协作推理过程。因此，之前的努力主要优先考虑自动化科学发现，而不是增强科学颠覆背后的推理过程。我们提出了 Idea-Catalyst，这是一个新颖的框架，它系统地识别跨学科见解，以支持人类和大型语言模型的创造性推理。从抽象的研究目标出发，Idea-Catalyst 旨在协助头脑风暴阶段，明确避免过早锚定具体解决方案。该框架体现了跨学科推理的关键元认知特征：（a）定义和评估研究目标，（b）对领域机遇和未解决挑战的认识，以及（c）基于影响潜力对跨学科思想的战略探索。具体来说，Idea-Catalyst 将抽象目标（例如，改善人类与人工智能的协作）分解为核心目标领域研究问题，指导对该领域内的进展和开放挑战的分析。这些挑战被重新表述为与领域无关的概念问题，从而能够从解决类似问题的外部学科（例如心理学、社会学）中检索。通过将这些领域的见解综合并重新关联到目标领域，Idea-Catalyst 根据源领域的跨学科潜力对源领域进行排名。根据经验，这种有针对性的整合将平均新颖性提高了 21%，洞察力提高了 16%，同时仍然立足于原始研究问题。