2026-03-17

Title: Slang Context-based Inference Enhancement via Greedy Search-Guided Chain-of-Thought Prompting

Authors: Jinghan Cao, Qingyang Ren, Xiangyun Chen, Xinjin Li, Haoxiang Gao, Yu Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.13230
Pdf URL: https://arxiv.org/pdf/2603.13230
Copy Paste: [[2603.13230]] Slang Context-based Inference Enhancement via Greedy Search-Guided Chain-of-Thought Prompting(https://arxiv.org/abs/2603.13230)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Slang interpretation has been a challenging downstream task for Large Language Models (LLMs) as the expressions are inherently embedded in contextual, cultural, and linguistic frameworks. In the absence of domain-specific training data, it is difficult for LLMs to accurately interpret slang meaning based on lexical information. This paper attempts to investigate the challenges of slang inference using large LLMs and presents a greedy search-guided chain-of-thought framework for slang interpretation. Through our experiments, we conclude that the model size and temperature settings have limited impact on inference accuracy. Transformer-based models with larger active parameters do not generate higher accuracy than smaller models. Based on the results of the above empirical study, we integrate greedy search algorithms with chain-of-thought prompting for small language models to build a framework that improves the accuracy of slang interpretation. The experimental results indicate that our proposed framework demonstrates improved accuracy in slang meaning interpretation. These findings contribute to the understanding of context dependency in language models and provide a practical solution for enhancing slang comprehension through a structured reasoning prompting framework.

Title: Steering at the Source: Style Modulation Heads for Robust Persona Control

Authors: Yoshihiro Izawa, Gouki Minegishi, Koshi Eguchi, Sosuke Hosokawa, Kenjiro Taura
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.13249
Pdf URL: https://arxiv.org/pdf/2603.13249
Copy Paste: [[2603.13249]] Steering at the Source: Style Modulation Heads for Robust Persona Control(https://arxiv.org/abs/2603.13249)
Keywords: language model, llm
Abstract: Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. While effectively controlling target traits (e.g., persona), coherency degradation remains a major obstacle to safety and practical deployment. We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise. In this work, we identify a sparse subset of attention heads (only three heads) that independently govern persona and style formation, which we term Style Modulation Heads. Specifically, these heads can be localized via geometric analysis of internal representations, combining layer-wise cosine similarity and head-wise contribution scores. We demonstrate that intervention targeting only these specific heads achieves robust behavioral control while significantly mitigating the coherency degradation observed in residual stream steering. More broadly, our findings show that precise, component-level localization enables safer and more precise model control.

Title: Training-Free Agentic AI: Probabilistic Control and Coordination in Multi-Agent LLM Systems

Authors: Mohammad Parsa Hosseini, Ankit Shah, Saiyra Qureshi, Alex Huang, Connie Miao, Wei Wei
Subjects: cs.CL, cs.AI, cs.ET, cs.MA
Abstract URL: https://arxiv.org/abs/2603.13256
Pdf URL: https://arxiv.org/pdf/2603.13256
Copy Paste: [[2603.13256]] Training-Free Agentic AI: Probabilistic Control and Coordination in Multi-Agent LLM Systems(https://arxiv.org/abs/2603.13256)
Keywords: language model, llm, agent
Abstract: Multi-agent large language model (LLM) systems enable complex, long-horizon reasoning by composing specialized agents, but practical deployment remains hindered by inefficient routing, noisy feedback, and high interaction cost. We introduce REDEREF, a lightweight and training-free controller for multi-agent LLM collaboration that improves routing efficiency during recursive delegation. REDEREF integrates (i) belief-guided delegation via Thompson sampling to prioritize agents with historically positive marginal contributions, (ii) reflection-driven re-routing using a calibrated LLM or programmatic judge, (iii) evidence-based selection rather than output averaging, and (iv) memory-aware priors to reduce cold-start inefficiency. Across multi-agent split-knowledge tasks, we show that while recursive retry alone saturates task success, belief-guided routing reduces token usage by 28%, agent calls by 17%, and time-to-success by 19% compared to random recursive delegation, and adapts gracefully under agent or judge degradation. These results demonstrate that simple, interpretable probabilistic control can meaningfully improve the efficiency and robustness of multi-agent LLM systems without training or fine-tuning.
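The belief-guided delegation in item (i) can be illustrated with a minimal Beta-Bernoulli Thompson sampling router. This is a sketch under stated assumptions, not REDEREF's implementation: the class name `BetaRouter`, the agent names, the uniform prior, and the simulated success rates are all hypothetical, and the judge's verdict is reduced to a boolean.

```python
import random


class BetaRouter:
    """Beta-Bernoulli Thompson sampling over a fixed set of agents (illustrative)."""

    def __init__(self, agents, prior=(1.0, 1.0), seed=0):
        self.rng = random.Random(seed)
        # One [alpha, beta] pair per agent; a memory-aware prior could be passed in here.
        self.params = {a: list(prior) for a in agents}

    def choose(self):
        # Draw a plausible success rate for each agent and delegate to the largest draw.
        draws = {a: self.rng.betavariate(al, be) for a, (al, be) in self.params.items()}
        return max(draws, key=draws.get)

    def update(self, agent, helped):
        # A judge's True/False verdict increments alpha (success) or beta (failure).
        self.params[agent][0 if helped else 1] += 1.0

    def mean(self, agent):
        al, be = self.params[agent]
        return al / (al + be)


def simulate(true_rates, rounds=500, seed=0):
    """Route `rounds` tasks against hypothetical per-agent success rates."""
    router = BetaRouter(true_rates, seed=seed)
    env = random.Random(seed + 1)
    counts = {a: 0 for a in true_rates}
    for _ in range(rounds):
        agent = router.choose()
        counts[agent] += 1
        router.update(agent, env.random() < true_rates[agent])
    return counts
```

Calling `simulate({"retriever": 0.8, "coder": 0.3, "planner": 0.5})` should concentrate calls on the agent with the highest assumed success rate, which is the mechanism by which this kind of routing can save calls relative to random delegation.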

Title: How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing

Authors: Javier Marín
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13259
Pdf URL: https://arxiv.org/pdf/2603.13259
Copy Paste: [[2603.13259]] How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing(https://arxiv.org/abs/2603.13259)
Keywords: language model
Abstract: When a language model is fed a wrong answer, what happens inside the network? Current understanding treats truthfulness as a static property of individual-layer representations -- a direction to be probed, a feature to be extracted. Less is known about the dynamics: how internal representations diverge across the full depth of the network when the model processes correct versus incorrect continuations. We introduce forced-completion probing, a method that presents identical queries with known correct and incorrect single-token continuations and tracks five geometric measurements across every layer of four decoder-only models (1.5B-13B parameters). We report three findings. First, correct and incorrect paths diverge through rotation, not rescaling: displacement vectors maintain near-identical magnitudes while their angular separation increases, meaning factual selection is encoded in direction on an approximate hypersphere. Second, the model does not passively fail on incorrect input -- it actively suppresses the correct answer, driving internal probability away from the right token. Third, both phenomena are entirely absent below a parameter threshold and emerge at 1.6B, suggesting a phase transition in factual processing capability. These results show that factual constraint processing has a specific geometric character -- rotational, not scalar; active, not passive -- that is invisible to methods based on single-layer probes or magnitude comparisons.
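The "rotation, not rescaling" finding can be made concrete with two per-layer measurements: the ratio of displacement magnitudes and the angle between the displacements. The sketch below is illustrative only; the abstract does not list its five geometric measurements, so the function names and the toy 2-D vectors here are assumptions.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def angle_deg(u, v):
    # Angle between two vectors, clamped to guard against rounding outside [-1, 1].
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


def divergence_profile(correct, incorrect):
    """Per layer: (magnitude ratio, angular separation in degrees)."""
    return [(norm(c) / norm(i), angle_deg(c, i)) for c, i in zip(correct, incorrect)]


# Toy displacements: the incorrect path keeps unit length but rotates further each layer.
correct_path = [[1.0, 0.0]] * 3
incorrect_path = [
    [math.cos(math.radians(t)), math.sin(math.radians(t))] for t in (0.0, 30.0, 60.0)
]
profile = divergence_profile(correct_path, incorrect_path)
```

In this toy profile the magnitude ratio stays at 1 while the angle grows layer by layer, which is the pattern a magnitude-only comparison would miss.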

Title: Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation

Authors: Minsang Kim, Seung Jun Baek
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13260
Pdf URL: https://arxiv.org/pdf/2603.13260
Copy Paste: [[2603.13260]] Explain in Your Own Words: Improving Reasoning via Token-Selective Dual Knowledge Distillation(https://arxiv.org/abs/2603.13260)
Keywords: chain-of-thought
Abstract: Knowledge Distillation (KD) can transfer the reasoning abilities of large models to smaller ones, which can reduce the costs to generate Chain-of-Thoughts for reasoning tasks. KD methods typically ask the student to mimic the teacher's distribution over the entire output. However, a student with limited capacity can be overwhelmed by such extensive supervision causing a distribution mismatch, especially in complex reasoning tasks. We propose Token-Selective Dual Knowledge Distillation (TSD-KD), a framework for student-centric distillation. TSD-KD focuses on distilling important tokens for reasoning and encourages the student to explain reasoning in its own words. TSD-KD combines indirect and direct distillation. Indirect distillation uses a weak form of feedback based on preference ranking. The student proposes candidate responses generated on its own; the teacher re-ranks those candidates as indirect feedback without enforcing its entire distribution. Direct distillation uses distribution matching; however, it selectively distills tokens based on the relative confidence between teacher and student. Finally, we add entropy regularization to maintain the student's confidence during distillation. Overall, our method provides the student with targeted and indirect feedback to support its own reasoning process and to facilitate self-improvement. The experiments show the state-of-the-art performance of TSD-KD on 10 challenging reasoning benchmarks, outperforming the baseline and runner-up in accuracy by up to 54.4% and 40.3%, respectively. Notably, a student trained by TSD-KD even outperformed its own teacher model in four cases by up to 20.3%. The source code is available at this https URL.

Title: Design and evaluation of an agentic workflow for crisis-related synthetic tweet datasets

Authors: Roben Delos Reyes, Timothy Douglas, Asanobu Kitamoto
Subjects: cs.CL, cs.LG, cs.MA, cs.SI
Abstract URL: https://arxiv.org/abs/2603.13625
Pdf URL: https://arxiv.org/pdf/2603.13625
Copy Paste: [[2603.13625]] Design and evaluation of an agentic workflow for crisis-related synthetic tweet datasets(https://arxiv.org/abs/2603.13625)
Keywords: agent
Abstract: Twitter (now X) has become an important source of social media data for situational awareness during crises. Crisis informatics research has widely used tweets from Twitter to develop and evaluate artificial intelligence (AI) systems for various crisis-relevant tasks, such as extracting locations and estimating damage levels from tweets to support damage assessment. However, recent changes in Twitter's data access policies have made it increasingly difficult to curate real-world tweet datasets related to crises. Moreover, existing curated tweet datasets are limited to past crisis events in specific contexts and are costly to annotate at scale. These limitations constrain the development and evaluation of AI systems used in crisis informatics. To address these limitations, we introduce an agentic workflow for generating crisis-related synthetic tweet datasets. The workflow iteratively generates synthetic tweets conditioned on prespecified target characteristics, evaluates them using predefined compliance checks, and incorporates structured feedback to refine them in subsequent iterations. As a case study, we apply the workflow to generate synthetic tweet datasets relevant to post-earthquake damage assessment. We show that the workflow can generate synthetic tweets that capture their target labels for location and damage level. We further demonstrate that the resulting synthetic tweet datasets can be used to evaluate AI systems on damage assessment tasks like geolocalization and damage level prediction. Our results indicate that the workflow offers a flexible and scalable alternative to real-world tweet data curation, enabling the systematic generation of synthetic social media data across diverse crisis events, societal contexts, and crisis informatics applications.

Title: Widespread Gender and Pronoun Bias in Moral Judgments Across LLMs

Authors: Gustavo Lúcius Fernandes, Jeiverson C. V. M. Santos, Pedro O. S. Vaz-de-Melo
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.13636
Pdf URL: https://arxiv.org/pdf/2603.13636
Copy Paste: [[2603.13636]] Widespread Gender and Pronoun Bias in Moral Judgments Across LLMs(https://arxiv.org/abs/2603.13636)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly used to assess moral or ethical statements, yet their judgments may reflect social and linguistic biases. This work presents a controlled, sentence-level study of how grammatical person, number, and gender markers influence LLM moral classifications of fairness. Starting from 550 balanced base sentences from the ETHICS dataset, we generated 26 counterfactual variants per item, systematically varying pronouns and demographic markers to yield 14,850 semantically equivalent sentences. We evaluated six model families (Grok, GPT, LLaMA, Gemma, DeepSeek, and Mistral), and measured fairness judgments and inter-group disparities using Statistical Parity Difference (SPD). Results show statistically significant biases: sentences written in the singular form and third person are more often judged as "fair", while those in the second person are penalized. Gender markers produce the strongest effects, with non-binary subjects consistently favored and male subjects disfavored. We conjecture that these patterns reflect distributional and alignment biases learned during training, emphasizing the need for targeted fairness interventions in moral LLM applications.
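For readers unfamiliar with the metric, Statistical Parity Difference is conventionally the gap between two groups' rates of the favorable outcome, here the rate of "fair" judgments. The sketch below uses that standard definition; the group labels and the toy judgments are made up for illustration, and the paper's exact grouping may differ. The sentence count also appears consistent with 550 base sentences plus 26 variants each, which is an inference from the figures in the abstract.

```python
def fair_rate(judgments):
    """Fraction of sentences judged "fair" (1 = fair, 0 = unfair)."""
    return sum(judgments) / len(judgments)


def spd(group_a, group_b):
    """Statistical Parity Difference: P(fair | group_a) - P(fair | group_b)."""
    return fair_rate(group_a) - fair_rate(group_b)


# Toy judgments (hypothetical): third-person vs. second-person variants.
third_person = [1, 1, 1, 0]
second_person = [1, 0, 0, 0]
gap = spd(third_person, second_person)  # 0.75 - 0.25 = 0.5

# Dataset size check: each base sentence plus its 26 counterfactual variants.
total_sentences = 550 * (1 + 26)  # 14850
```

An SPD of 0 means both groups are judged "fair" at the same rate; a positive value favors the first group.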

Title: Benchmarking Large Language Models on Reference Extraction and Parsing in the Social Sciences and Humanities

Authors: Yurui Zhu, Giovanni Colavizza, Matteo Romanello
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.13651
Pdf URL: https://arxiv.org/pdf/2603.13651
Copy Paste: [[2603.13651]] Benchmarking Large Language Models on Reference Extraction and Parsing in the Social Sciences and Humanities(https://arxiv.org/abs/2603.13651)
Keywords: language model, llm
Abstract: Bibliographic reference extraction and parsing are foundational for citation indexing, linking, and downstream scholarly knowledge-graph construction. However, most established evaluations focus on clean, English, end-of-document bibliographies, and therefore underrepresent the Social Sciences and Humanities (SSH), where citations are frequently multilingual, embedded in footnotes, abbreviated, and shaped by heterogeneous historical conventions. We present a unified benchmark that targets these SSH-realistic conditions across three complementary datasets: CEX (English journal articles spanning multiple disciplines), EXCITE (German/English documents with end-section, footnote-only, and mixed regimes), and LinkedBooks (humanities references with strong stylistic variation and multilinguality). We evaluate three tasks of increasing difficulty -- reference extraction, reference parsing, and end-to-end document parsing -- under a schema-constrained setup that enables direct comparison between a strong supervised pipeline baseline (GROBID) and contemporary LLMs (DeepSeek-V3.1, Mistral-Small-3.2-24B, Gemma-3-27B-it, and Qwen3-VL (4B-32B variants)). Across datasets, extraction largely saturates beyond a moderate capability threshold, while parsing and end-to-end parsing remain the primary bottlenecks due to structured-output brittleness under noisy layouts. We further show that lightweight LoRA adaptation yields consistent gains -- especially on SSH-heavy benchmarks -- and that segmentation/pipelining can substantially improve robustness. Finally, we argue for hybrid deployment via routing: leveraging GROBID for well-structured, in-distribution PDFs while escalating multilingual and footnote-heavy documents to task-adapted LLMs.

Title: Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation

Authors: Hanwen Shen, Ting Ying, Jiajie Lu, Shanshan Wang
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.13683
Pdf URL: https://arxiv.org/pdf/2603.13683
Copy Paste: [[2603.13683]] Preconditioned Test-Time Adaptation for Out-of-Distribution Debiasing in Narrative Generation(https://arxiv.org/abs/2603.13683)
Keywords: llm, prompt
Abstract: Although debiased LLMs perform well on known bias patterns, they often fail to generalize to unfamiliar bias prompts, producing toxic outputs. We first validate that such high-bias prompts constitute a distribution shift via OOD detection, and show static models degrade under this shift. To adapt on-the-fly, we propose CAP-TTA, a test-time adaptation framework that performs context-aware LoRA updates only when the bias-risk trigger exceeds a threshold, using a precomputed diagonal preconditioner for fast and stable updates. Across toxic-prompt settings and benchmarks, CAP-TTA reduces bias (confirmed by human evaluation) while achieving much lower update latency than AdamW/SGD; it also mitigates catastrophic forgetting by significantly improving narrative fluency over SOTA debiasing baseline while maintaining comparable debiasing effectiveness.
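The trigger-plus-preconditioner idea can be sketched as a gated, diagonally scaled gradient step. This is a minimal illustration, not CAP-TTA itself: the function name `triggered_step`, the threshold and learning-rate values, and the way the preconditioner enters the update (elementwise division) are assumptions, since the abstract does not specify how the preconditioner is computed or applied.

```python
def triggered_step(params, grads, precond, risk, threshold=0.5, lr=0.1, eps=1e-8):
    """Apply one preconditioned update only if the bias-risk score exceeds the threshold.

    params, grads, precond: equal-length lists of floats (stand-ins for adapter weights,
    their gradients, and a precomputed diagonal preconditioner).
    """
    if risk <= threshold:
        # Below the trigger: leave the adapter untouched, so no update cost is paid.
        return list(params)
    # Above the trigger: scale each coordinate's step by its preconditioner entry.
    return [p - lr * g / (d + eps) for p, g, d in zip(params, grads, precond)]


weights = [1.0, 1.0]
grads = [0.5, 2.0]
precond = [0.5, 2.0]  # hypothetical per-coordinate scales

unchanged = triggered_step(weights, grads, precond, risk=0.2)
updated = triggered_step(weights, grads, precond, risk=0.9)
```

Because the preconditioner is fixed ahead of time, the update needs no optimizer state at test time, which is one plausible reason such a step could be faster than running AdamW.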

Title: QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models

Authors: Yao Wu, Kangping Yin, Liang Dong, Zhenxin Ma, Shuting Xu, Xuehai Wang, Yuxuan Jiang, Tingting Yu, Yunqing Hong, Jiayi Liu, Rianzhe Huang, Shuxin Zhao, Haiping Hu, Wen Shang, Jian Xu, Guanjun Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13691
Pdf URL: https://arxiv.org/pdf/2603.13691
Copy Paste: [[2603.13691]] QuarkMedBench: A Real-World Scenario Driven Benchmark for Evaluating Large Language Models(https://arxiv.org/abs/2603.13691)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) excel on standardized medical exams, high scores often fail to translate to high-quality responses for real-world medical queries. Current evaluations rely heavily on multiple-choice questions, failing to capture the unstructured, ambiguous, and long-tail complexities inherent in genuine user inquiries. To bridge this gap, we introduce QuarkMedBench, an ecologically valid benchmark tailored for real-world medical LLM assessment. We compiled a massive dataset spanning Clinical Care, Wellness Health, and Professional Inquiry, comprising 20,821 single-turn queries and 3,853 multi-turn sessions. To objectively evaluate open-ended answers, we propose an automated scoring framework that integrates multi-model consensus with evidence-based retrieval to dynamically generate 220,617 fine-grained scoring rubrics (~9.8 per query). During evaluation, hierarchical weighting and safety constraints structurally quantify medical accuracy, key-point coverage, and risk interception, effectively mitigating the high costs and subjectivity of human grading. Experimental results demonstrate that the generated rubrics achieve a 91.8% concordance rate with clinical expert blind audits, establishing highly dependable medical reliability. Crucially, baseline evaluations on this benchmark reveal significant performance disparities among state-of-the-art models when navigating real-world clinical nuances, highlighting the limitations of conventional exam-based metrics. Ultimately, QuarkMedBench establishes a rigorous, reproducible yardstick for measuring LLM performance on complex health issues, while its framework inherently supports dynamic knowledge updates to prevent benchmark obsolescence.

Title: Repetition Without Exclusivity: Scale Sensitivity of Referential Mechanisms in Child-Scale Language Models

Authors: Jon-Paul Cacioli
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13696
Pdf URL: https://arxiv.org/pdf/2603.13696
Copy Paste: [[2603.13696]] Repetition Without Exclusivity: Scale Sensitivity of Referential Mechanisms in Child-Scale Language Models(https://arxiv.org/abs/2603.13696)
Keywords: language model, gpt
Abstract: We present the first systematic evaluation of mutual exclusivity (ME) -- the bias to map novel words to novel referents -- in text-only language models trained on child-directed speech. We operationalise ME as referential suppression: when a familiar object is relabelled in a two-referent discourse context, ME predicts decreased probability of the labelled noun at a subsequent completion position. Three pilot findings motivate a pre-registered scale-sensitivity experiment: (1) a masked language model (BabyBERTa) is entirely insensitive to multi-sentence referential context; (2) autoregressive models show robust repetition priming -- the opposite of ME -- when familiar nouns are re-labelled; and (3) a novel context-dependence diagnostic reveals that apparent ME-like patterns with nonce tokens are fully explained by embedding similarity, not referential disambiguation. In the confirmatory experiment, we train 45 GPT-2-architecture models (2.9M, 8.9M, and 33.5M parameters; 5, 10, and 20 epochs on AO-CHILDES; 5 seeds each) and evaluate on a pre-registered ME battery. Anti-ME repetition priming is significant in all 9 cells (85-100% of items; all p < 2.4 x 10^-13). Priming attenuates with improved language modelling (Spearman rho = -0.533, p = 0.0002) but never crosses zero across a 3.8x perplexity range. The context-dependence diagnostic replicates in all 9 cells, and dose-response priming increases with repetitions in 8/9 cells (all trend p < 0.002). These findings indicate that distributional learning on child-directed speech produces repetition-based reference tracking rather than lexical exclusivity. We connect this to the grounded cognition literature and argue that referential grounding may be a necessary ingredient for ME -- an empirical claim about required input structure, not a nativist one.

Title: Can We Trust LLMs on Memristors? Diving into Reasoning Ability under Non-Ideality

Authors: Taiqiang Wu, Yuxin Cheng, Chenchen Ding, Runming Yang, Xincheng Feng, Wenyong Zhou, Zhengwu Liu, Ngai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.13725
Pdf URL: https://arxiv.org/pdf/2603.13725
Copy Paste: [[2603.13725]] Can We Trust LLMs on Memristors? Diving into Reasoning Ability under Non-Ideality(https://arxiv.org/abs/2603.13725)
Keywords: language model, llm
Abstract: Memristor-based analog compute-in-memory (CIM) architectures provide a promising substrate for the efficient deployment of Large Language Models (LLMs), owing to superior energy efficiency and computational density. However, these architectures suffer from precision issues caused by intrinsic non-idealities of memristors. In this paper, we first conduct a comprehensive investigation into the impact of such typical non-idealities on LLM reasoning. Empirical results indicate that reasoning capability decreases significantly but varies for distinct benchmarks. Subsequently, we systematically appraise three training-free strategies, including thinking mode, in-context learning, and module redundancy. We thus summarize valuable guidelines, i.e., shallow layer redundancy is particularly effective for improving robustness, thinking mode performs better under low noise levels but degrades at higher noise, and in-context learning reduces output length with a slight performance trade-off. Our findings offer new insights into LLM reasoning under non-ideality and practical strategies to improve robustness.
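One way to picture why redundancy can help under device noise is a toy simulation in which each stored weight is read back with multiplicative Gaussian noise and several redundant copies are averaged. Everything here is an assumption for illustration: the abstract specifies neither the noise model nor how module redundancy is realized, and the function names and parameter values are hypothetical.

```python
import random


def noisy_copy(weights, sigma, rng):
    # Assumed non-ideality: each weight is perturbed by multiplicative Gaussian noise.
    return [w * (1.0 + rng.gauss(0.0, sigma)) for w in weights]


def redundant_read(weights, sigma, copies, rng):
    # Average `copies` independently perturbed versions of the same weights.
    reads = [noisy_copy(weights, sigma, rng) for _ in range(copies)]
    return [sum(col) / copies for col in zip(*reads)]


def mean_abs_error(weights, sigma, copies, seed=0):
    """Mean absolute deviation of the averaged read from the ideal weights."""
    rng = random.Random(seed)
    read = redundant_read(weights, sigma, copies, rng)
    return sum(abs(r - w) for r, w in zip(read, weights)) / len(weights)


ideal = [1.0] * 2000
error_single = mean_abs_error(ideal, sigma=0.1, copies=1)
error_redundant = mean_abs_error(ideal, sigma=0.1, copies=8)
```

Under this assumed noise model, averaging independent copies shrinks the error roughly with the square root of the number of copies, at the price of proportionally more hardware.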

Title: Knowledge Distillation for Large Language Models

Authors: Alejandro Paredes La Torre, Barbara Flores, Diego Rodriguez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13765
Pdf URL: https://arxiv.org/pdf/2603.13765
Copy Paste: [[2603.13765]] Knowledge Distillation for Large Language Models(https://arxiv.org/abs/2603.13765)
Keywords: language model, prompt, chain-of-thought
Abstract: We propose a resource-efficient framework for compressing large language models through knowledge distillation, combined with guided chain-of-thought reinforcement learning. Using Qwen 3B as the teacher and Qwen 0.5B as the student, we apply knowledge distillation across English Dolly-15k, Spanish Dolly-15k, and code BugNet and PyTorrent datasets, with hyperparameters tuned in the English setting to optimize student performance. Across tasks, the distilled student retains a substantial portion of the teacher's capability while remaining significantly smaller: 70% to 91% in English, up to 95% in Spanish, and up to 93.5% Rouge-L in code. For coding tasks, integrating chain-of-thought prompting with Group Relative Policy Optimization using CoT-annotated Codeforces data improves reasoning coherence and solution correctness compared to knowledge distillation alone. Post-training 4-bit weight quantization further reduces memory footprint and inference latency. These results show that knowledge distillation combined with chain-of-thought guided reinforcement learning can produce compact, efficient models suitable for deployment in resource-constrained settings.
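As background for the distillation step, the classic soft-target objective compares temperature-softened teacher and student distributions with a KL divergence scaled by the squared temperature. The sketch below shows that textbook formulation in plain Python; the abstract does not state the paper's exact loss, so the temperature value and function names are assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """T^2 * KL(teacher || student) over temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return temperature ** 2 * kl


teacher = [2.0, 1.0, 0.1]      # hypothetical next-token logits
good_student = [2.0, 1.0, 0.1]
poor_student = [0.1, 1.0, 2.0]

loss_match = kd_loss(teacher, good_student)
loss_mismatch = kd_loss(teacher, poor_student)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge; in practice it is usually mixed with the ordinary cross-entropy on ground-truth tokens.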

Title: LiveWeb-IE: A Benchmark For Online Web Information Extraction

Authors: Seungbin Yang, Jihwan Kim, Jaemin Choi, Dongjin Kim, Soyoung Yang, ChaeHun Park, Jaegul Choo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.13773
Pdf URL: https://arxiv.org/pdf/2603.13773
Copy Paste: [[2603.13773]] LiveWeb-IE: A Benchmark For Online Web Information Extraction(https://arxiv.org/abs/2603.13773)
Keywords: agent
Abstract: Web information extraction (WIE) is the task of automatically extracting data from web pages, offering high utility for various applications. The evaluation of WIE systems has traditionally relied on benchmarks built from HTML snapshots captured at a single point in time. However, this offline evaluation paradigm fails to account for the temporally evolving nature of the web; consequently, performance on these static benchmarks often fails to generalize to dynamic real-world scenarios. To bridge this gap, we introduce LiveWeb-IE, a new benchmark designed for evaluating WIE systems directly against live websites. Based on trusted and permission-granted websites, we curate natural language queries that require information extraction of various data categories, such as text, images, and hyperlinks. We further design these queries to represent four levels of complexity, based on the number and cardinality of attributes to be extracted, enabling a granular assessment of WIE systems. In addition, we propose Visual Grounding Scraper (VGS), a novel multi-stage agentic framework that mimics human cognitive processes by visually narrowing down web page content to extract desired information. Extensive experiments across diverse backbone models demonstrate the effectiveness and robustness of VGS. We believe that this study lays the foundation for developing practical and robust WIE systems.

Title: Generate Then Correct: Single Shot Global Correction for Aspect Sentiment Quad Prediction

Authors: Shidong He, Haoyu Wang, Wenjie Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13777
Pdf URL: https://arxiv.org/pdf/2603.13777
Copy Paste: [[2603.13777]] Generate Then Correct: Single Shot Global Correction for Aspect Sentiment Quad Prediction(https://arxiv.org/abs/2603.13777)
Keywords: llm
Abstract: Aspect-based sentiment analysis (ABSA) extracts aspect-level sentiment signals from user-generated text, supports product analytics, experience monitoring, and public-opinion tracking, and is central to fine-grained opinion mining. A key challenge in ABSA is aspect sentiment quad prediction (ASQP), which requires identifying four elements: the aspect term, the aspect category, the opinion term, and the sentiment polarity. However, existing studies usually linearize the unordered quad set into a fixed-order template and decode it left-to-right. With teacher forcing training, the resulting training-inference mismatch (exposure bias) lets early prefix errors propagate to later elements. The linearization order determines which elements appear earlier in the prefix, so this propagation becomes order-sensitive and is hard to repair in a single pass. To address this, we propose a method, Generate-then-Correct (G2C): a generator drafts quads and a corrector performs a single-shot, sequence-level global correction trained on LLM-synthesized drafts with common error patterns. On the Rest15 and Rest16 datasets, G2C outperforms strong baseline models.

Title: Projection-Free Evolution Strategies for Continuous Prompt Search

Authors: Yu Cai, Canxi Huang, Xiaoyu He
Subjects: cs.CL, cs.NE
Abstract URL: https://arxiv.org/abs/2603.13786
Pdf URL: https://arxiv.org/pdf/2603.13786
Copy Paste: [[2603.13786]] Projection-Free Evolution Strategies for Continuous Prompt Search(https://arxiv.org/abs/2603.13786)
Keywords: prompt
Abstract: Continuous prompt search offers a computationally efficient alternative to conventional parameter tuning in natural language processing tasks. Nevertheless, its practical effectiveness can be significantly hindered by the black-box nature and the inherent high-dimensionality of the objective landscapes. Existing methods typically mitigate these challenges by restricting the search to a randomly projected low-dimensional subspace. However, the effectiveness and underlying motivation of the projection mechanism remain ambiguous. In this paper, we first empirically demonstrate that despite the prompt space possessing a low-dimensional structure, random projections fail to adequately capture this essential structure. Motivated by this finding, we propose a projection-free prompt search method based on evolutionary strategies. By directly optimizing in the full prompt space with an adaptation mechanism calibrated to the intrinsic dimension, our method achieves competitive search capabilities without additional computational overhead. Furthermore, to bridge the generalization gap in few-shot scenarios, we introduce a confidence-based regularization mechanism that systematically enhances the model's confidence in the target verbalizers. Experimental results on seven natural language understanding tasks from the GLUE benchmark demonstrate that our proposed approach significantly outperforms existing baselines.

Title: DeceptGuard: A Constitutional Oversight Framework For Detecting Deception in LLM Agents

Authors: Snehasis Mukhopadhyay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.13791
Pdf URL: https://arxiv.org/pdf/2603.13791
Copy Paste: [[2603.13791]] DeceptGuard: A Constitutional Oversight Framework For Detecting Deception in LLM Agents(https://arxiv.org/abs/2603.13791)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Reliable detection of deceptive behavior in Large Language Model (LLM) agents is an essential prerequisite for safe deployment in high-stakes agentic contexts. Prior work on scheming detection has focused exclusively on black-box monitors that observe only externally visible tool calls and outputs, discarding potentially rich internal reasoning signals. We introduce DECEPTGUARD, a unified framework that systematically compares three monitoring regimes: black-box monitors (actions and outputs only), CoT-aware monitors (additionally observing the agent's chain-of-thought reasoning trace), and activation-probe monitors (additionally reading hidden-state representations from a frozen open-weights encoder). We introduce DECEPTSYNTH, a scalable synthetic pipeline for generating deception-positive and deception-negative agent trajectories across a novel 12-category taxonomy spanning verbal, behavioral, and structural deception. Our monitors are optimized on 4,800 synthetic trajectories and evaluated on 9,200 held-out samples from DeceptArena, a benchmark of realistic sandboxed agent environments with execution-verified labels. Across all evaluation settings, CoT-aware and activation-probe monitors substantially outperform their black-box counterparts (mean pAUROC improvement of +0.097), with the largest gains on subtle, long-horizon deception that leaves minimal behavioral footprints. We empirically characterize a transparency-detectability trade-off: as agents learn to suppress overt behavioral signals, chain-of-thought becomes the primary detection surface but is itself increasingly unreliable due to post-training faithfulness degradation. We propose HYBRID-CONSTITUTIONAL ensembles as a robust defense-in-depth approach, achieving a pAUROC of 0.934 on the held-out test set, representing a substantial advance over the prior state of the art.

Title: PMIScore: An Unsupervised Approach to Quantify Dialogue Engagement

Authors: Yongkang Guo, Zhihuan Huang, Yuqing Kong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.13796
Pdf URL: https://arxiv.org/pdf/2603.13796
Copy Paste: [[2603.13796]] PMIScore: An Unsupervised Approach to Quantify Dialogue Engagement(https://arxiv.org/abs/2603.13796)
Keywords: language model, llm
Abstract: High dialogue engagement is a crucial indicator of an effective conversation. A reliable measure of engagement could help benchmark large language models, enhance the effectiveness of human-computer interactions, or improve personal communication skills. However, quantifying engagement is challenging, since it is subjective and lacks a "gold standard". This paper proposes PMIScore, an efficient unsupervised approach to quantify dialogue engagement. It uses pointwise mutual information (PMI), which measures how much more probable a response becomes when conditioned on the conversation history. Thus, PMIScore offers a clear interpretation of engagement. As directly computing PMI is intractable due to the complexity of dialogues, PMIScore learns it through a dual form of divergence. The algorithm includes generating positive and negative dialogue pairs, extracting embeddings with large language models (LLMs), and training a small neural network using a mutual information loss function. We validated PMIScore on both synthetic and real-world datasets. Our results demonstrate the effectiveness of PMIScore in PMI estimation and the reasonableness of the PMI metric itself.
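For reference, the textbook definition of pointwise mutual information between a conversation history and a response can be written as below. The symbols $h$ (history) and $r$ (response) are notation assumed here for illustration; they do not come from the abstract, and the paper's own estimator may be parameterized differently.

```latex
\mathrm{PMI}(h; r) \;=\; \log \frac{p(r \mid h)}{p(r)} \;=\; \log \frac{p(h, r)}{p(h)\, p(r)}
```

Under this definition, a positive value means the response is more likely given the history than in isolation, and a value near zero means the history tells us little about the response.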

Title: APEX-Searcher: Augmenting LLMs' Search Capabilities through Agentic Planning and Execution

Authors: Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13853
Pdf URL: https://arxiv.org/pdf/2603.13853
Copy Paste: [[2603.13853]] APEX-Searcher: Augmenting LLMs' Search Capabilities through Agentic Planning and Execution(https://arxiv.org/abs/2603.13853)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Retrieval-augmented generation (RAG), based on large language models (LLMs), serves as a vital approach to retrieving and leveraging external knowledge in various domain applications. When confronted with complex multi-hop questions, single-round retrieval is often insufficient for accurate reasoning and problem solving. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches significantly improve problem-solving performance, they still face challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in the end-to-end reinforcement learning (RL) process, leading to inaccurate retrieval results and performance degradation. To address these issues, we propose APEX-Searcher, a novel Agentic Planning and Execution framework to augment LLM search capabilities. Specifically, we introduce a two-stage agentic framework that decouples the retrieval process into planning and execution: it first employs RL with decomposition-specific rewards to optimize strategic planning; building on the sub-task decomposition, it then applies supervised fine-tuning on high-quality multi-hop trajectories to equip the model with robust iterative sub-task execution capabilities. Extensive experiments demonstrate that our proposed framework achieves significant improvements in both multi-hop RAG and task planning performance across multiple benchmarks.

Title: GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent

Authors: Yuri Kuratov, Matvey Kairov, Aydar Bulatov, Ivan Rodkin, Mikhail Burtsev
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.13875
Pdf URL: https://arxiv.org/pdf/2603.13875
Copy Paste: [[2603.13875]] GradMem: Learning to Write Context into Memory with Test-Time Gradient Descent(https://arxiv.org/abs/2603.13875)
Keywords: language model, long context
Abstract: Many large language model applications require conditioning on long contexts. Transformers typically support this by storing a large per-layer KV-cache of past activations, which incurs substantial memory overhead. A desirable alternative is compressive memory: read a context once, store it in a compact state, and answer many queries from that state. We study this in a context removal setting, where the model must generate an answer without access to the original context at inference time. We introduce GradMem, which writes context into memory via per-sample test-time optimization. Given a context, GradMem performs a few steps of gradient descent on a small set of prefix memory tokens while keeping model weights frozen. GradMem explicitly optimizes a model-level self-supervised context reconstruction loss, resulting in a loss-driven write operation with iterative error correction, unlike forward-only methods. On associative key-value retrieval, GradMem outperforms forward-only memory writers with the same memory size, and additional gradient steps scale capacity much more effectively than repeated forward writes. We further show that GradMem transfers beyond synthetic benchmarks: with pretrained language models, it attains competitive results on natural language tasks including bAbI and SQuAD variants, relying only on information encoded in memory.
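The write operation described above (a few gradient steps on a small memory while the model stays frozen) can be illustrated with a deliberately tiny toy. This is only a sketch under strong assumptions: a fixed linear map `W` stands in for the frozen model, a short vector `m` stands in for the prefix memory tokens, and squared error stands in for the context reconstruction loss. None of these names or choices come from the paper.

```python
def reconstruct(W, m):
    """Frozen 'model': a fixed linear map from memory to a reconstructed context."""
    return [sum(w * x for w, x in zip(row, m)) for row in W]

def reconstruction_loss(W, m, context):
    """Squared error between the reconstruction and the context to be stored."""
    return sum((p - c) ** 2 for p, c in zip(reconstruct(W, m), context))

def write_memory(W, context, steps=50, lr=0.1):
    """Run gradient descent on the memory only; W is never updated."""
    m = [0.0] * len(W[0])
    for _ in range(steps):
        residual = [p - c for p, c in zip(reconstruct(W, m), context)]
        # Gradient of the squared error with respect to each memory entry.
        grad = [
            2.0 * sum(W[i][j] * residual[i] for i in range(len(W)))
            for j in range(len(m))
        ]
        m = [x - lr * g for x, g in zip(m, grad)]
    return m

W = [[1.0, 0.0], [0.0, 2.0]]   # frozen weights (toy values)
context = [1.0, 2.0]           # what we want the memory to encode
memory = write_memory(W, context)
```

Each extra step reduces the reconstruction error in this toy, which mirrors the abstract's point that the write is loss-driven and iteratively corrected rather than produced by a single forward pass.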

Title: Large Language Models Reproduce Racial Stereotypes When Used for Text Annotation

Authors: Petter Törnberg
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13891
Pdf URL: https://arxiv.org/pdf/2603.13891
Copy Paste: [[2603.13891]] Large Language Models Reproduce Racial Stereotypes When Used for Text Annotation(https://arxiv.org/abs/2603.13891)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used for automated text annotation in tasks ranging from academic research to content moderation and hiring. Across 19 LLMs and two experiments totaling more than 4 million annotation judgments, we show that subtle identity cues embedded in text systematically bias annotation outcomes in ways that mirror racial stereotypes. In a names-based experiment spanning 39 annotation tasks, texts containing names associated with Black individuals are rated as more aggressive by 18 of 19 models and more gossipy by 18 of 19. Asian names produce a bamboo-ceiling profile: 17 of 19 models rate individuals as more intelligent, while 18 of 19 rate them as less confident and less sociable. Arab names elicit cognitive elevation alongside interpersonal devaluation, and all four minority groups are consistently rated as less self-disciplined. In a matched dialect experiment, the same sentence is judged significantly less professional (all 19 models, mean gap -0.774), less indicative of an educated speaker (-0.688), more toxic (18/19), and more angry (19/19) when written in African American Vernacular English rather than Standard American English. A notable exception occurs for name-based hireability, where fine-tuning appears to overcorrect, systematically favoring minority-named applicants. These findings suggest that using LLMs as automated annotators can embed socially patterned biases directly into the datasets and measurements that increasingly underpin research, governance, and decision-making.

Title: OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset

Authors: Wenbin Hu, Huihao Jing, Haochen Shi, Changxuan Fan, Haoran Li, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.13933
Pdf URL: https://arxiv.org/pdf/2603.13933
Copy Paste: [[2603.13933]] OmniCompliance-100K: A Multi-Domain, Rule-Grounded, Real-World Safety Compliance Dataset(https://arxiv.org/abs/2603.13933)
Keywords: language model, llm, agent
Abstract: Ensuring the safety and compliance of large language models (LLMs) is of paramount importance. However, existing LLM safety datasets often rely on ad-hoc taxonomies for data generation and suffer from a significant shortage of rule-grounded, real-world cases that are essential for robustly protecting LLMs. In this work, we address this critical gap by constructing a comprehensive safety dataset from a compliance perspective. Using a powerful web-searching agent, we collect a rule-grounded, real-world case dataset OmniCompliance-100K, sourced from multi-domain authoritative references. The dataset spans 74 regulations and policies across a wide range of domains, including security and privacy regulations, content safety and user data privacy policies from leading AI companies and social media platforms, financial security requirements, medical device risk management standards, educational integrity guidelines, and protections of fundamental human rights. In total, our dataset contains 12,985 distinct rules and 106,009 associated real-world compliance cases. Our analysis confirms a strong alignment between the rules and their corresponding cases. We further conduct extensive benchmarking experiments to evaluate the safety and compliance capabilities of advanced LLMs across different model scales. Our experiments reveal several interesting findings that have great potential to offer valuable insights for future LLM safety research.

Title: ToolFlood: Beyond Selection -- Hiding Valid Tools from LLM Agents via Semantic Covering

Authors: Hussein Jawad, Nicolas J-B Brunel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.13950
Pdf URL: https://arxiv.org/pdf/2603.13950
Copy Paste: [[2603.13950]] ToolFlood: Beyond Selection -- Hiding Valid Tools from LLM Agents via Semantic Covering(https://arxiv.org/abs/2603.13950)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM) agents increasingly use external tools for complex tasks and rely on embedding-based retrieval to select a small top-k subset for reasoning. As these systems scale, the robustness of this retrieval stage is underexplored, even though prior work has examined attacks on tool selection. This paper introduces ToolFlood, a retrieval-layer attack on tool-augmented LLM agents. Rather than altering which tool is chosen after retrieval, ToolFlood overwhelms retrieval itself by injecting a few attacker-controlled tools whose metadata is carefully placed by exploiting the geometry of embedding space. These tools semantically span many user queries, dominate the top-k results, and push all benign tools out of the agent's context. ToolFlood uses a two-phase adversarial tool generation strategy. It first samples subsets of target queries and uses an LLM to iteratively generate diverse tool names and descriptions. It then runs an iterative greedy selection that chooses tools maximizing coverage of remaining queries in embedding space under a cosine-distance threshold, stopping when all queries are covered or a budget is reached. We provide theoretical analysis of retrieval saturation and show on standard benchmarks that ToolFlood achieves up to a 95% attack success rate with a low injection rate (1% in ToolBench). The code will be made publicly available at the following link: this https URL
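The second phase described above is, in essence, a greedy coverage selection with a distance threshold and a budget. The sketch below shows that generic textbook procedure on plain vectors, which may help when reasoning about how quickly a retrieval stage saturates. It is not the authors' implementation: the function names, the toy vectors, and the tie-breaking rule (first best candidate wins) are all assumptions made for illustration.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity for two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def greedy_cover(query_vecs, candidate_vecs, threshold, budget):
    """Greedily pick candidates that cover the most still-uncovered queries.

    A candidate covers a query when their cosine distance is <= threshold.
    Stops when every query is covered, the budget is spent, or no
    candidate adds coverage.
    """
    covers = [
        {q for q, qv in enumerate(query_vecs)
         if cosine_distance(qv, cv) <= threshold}
        for cv in candidate_vecs
    ]
    uncovered = set(range(len(query_vecs)))
    chosen = []
    while uncovered and len(chosen) < budget:
        best = max(range(len(candidate_vecs)),
                   key=lambda c: len(covers[c] & uncovered))
        if not covers[best] & uncovered:
            break  # remaining candidates cover nothing new
        chosen.append(best)
        uncovered -= covers[best]
    return chosen, uncovered

# Toy 2-D "embeddings" (hypothetical values).
queries = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
candidates = [(1.0, 0.05), (0.05, 1.0), (0.7, 0.7)]
chosen, uncovered = greedy_cover(queries, candidates, threshold=0.05, budget=3)
```

In this toy, two candidates suffice to cover all four queries, and the third candidate, which sits between the two clusters, is never selected because it falls outside the threshold for every query.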

Title: FLUX: Data Worth Training On

Authors: Gowtham, Sai Rupesh, Sanjay Kumar, Saravanan, Venkata Chaithanya
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.13972
Pdf URL: https://arxiv.org/pdf/2603.13972
Copy Paste: [[2603.13972]] FLUX: Data Worth Training On(https://arxiv.org/abs/2603.13972)
Keywords: language model
Abstract: Modern large language model training is no longer limited by data availability, but by the inability of existing preprocessing pipelines to simultaneously achieve massive scale and high data quality. Current approaches are forced to sacrifice one for the other: either aggressively filtering to improve quality at the cost of severe token loss, or retaining large volumes of data while introducing substantial noise. In this work, we introduce FLUX, a preprocessing pipeline specifically designed to break this long-standing trade-off by maximizing token retention while enforcing rigorous quality control. Models trained on FLUX-curated data consistently outperform prior methods. A 3B-parameter model trained on 60B tokens with FLUX achieves 32.14% MMLU accuracy, surpassing the previous state-of-the-art pipeline DCLM (31.98%) and significantly outperforming FineWeb (29.88%). FLUX achieves the same aggregate score as a model trained on DCLM data using only 39B tokens, resulting in a 34.4% reduction in training compute. At the data level, FLUX extracts 50B usable tokens from a single dump (CC-MAIN-2025-51), compared to 40B from DCLM (+25% retention). FLUX-Base yields 192B tokens, exceeding FineWeb's 170B while still maintaining superior quality. Overall, FLUX establishes a new state of the art in web-scale data preprocessing by demonstrating that high retention, strong quality control, and computational efficiency can be achieved simultaneously, redefining the limits of scalable dataset construction for modern language models.
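The percentage claims above can be checked with a few lines of arithmetic. The sketch below uses only the rounded token counts quoted in the abstract and assumes that training compute scales linearly with the number of training tokens at a fixed model size; the helper names are illustrative. With these rounded inputs the compute saving comes out at 35%, close to the reported 34.4%, and the small difference presumably reflects unrounded token counts (an assumption, not something the abstract states).

```python
def relative_gain(new, baseline):
    """Fractional increase of `new` over `baseline`."""
    return (new - baseline) / baseline

def relative_saving(reduced, baseline):
    """Fractional reduction of `reduced` relative to `baseline`."""
    return 1.0 - reduced / baseline

# Tokens extracted from one Common Crawl dump: FLUX 50B vs DCLM 40B.
retention_gain = relative_gain(50e9, 40e9)

# Full dataset size: FLUX-Base 192B vs FineWeb 170B.
base_gain = relative_gain(192e9, 170e9)

# Tokens needed to match the 60B-token DCLM run: 39B (rounded).
compute_saving = relative_saving(39e9, 60e9)

print(f"{retention_gain:.1%} {base_gain:.1%} {compute_saving:.1%}")
```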

Title: Beyond Explicit Edges: Robust Reasoning over Noisy and Sparse Knowledge Graphs

Authors: Hang Gao, Dimitris N. Metaxas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14006
Pdf URL: https://arxiv.org/pdf/2603.14006
Copy Paste: [[2603.14006]] Beyond Explicit Edges: Robust Reasoning over Noisy and Sparse Knowledge Graphs(https://arxiv.org/abs/2603.14006)
Keywords: llm
Abstract: GraphRAG is increasingly adopted for converting unstructured corpora into graph structures to enable multi-hop reasoning. However, standard graph algorithms rely heavily on static connectivity and explicit edges, often failing in real-world scenarios where knowledge graphs (KGs) are noisy, sparse, or incomplete. To address this limitation, we introduce INSES (Intelligent Navigation and Similarity Enhanced Search), a dynamic framework designed to reason beyond explicit edges. INSES couples LLM-guided navigation, which prunes noise and steers exploration, with embedding-based similarity expansion to recover hidden links and bridge semantic gaps. Recognizing the computational cost of graph reasoning, we complement INSES with a lightweight router that delegates simple queries to Naïve RAG and escalates complex cases to INSES, balancing efficiency with reasoning depth. INSES consistently outperforms SOTA RAG and GraphRAG baselines across multiple benchmarks. Notably, on the MINE benchmark, it demonstrates superior robustness across KGs constructed by varying methods (KGGEN, GraphRAG, OpenIE), improving accuracy by 5%, 10%, and 27%, respectively.

Title: SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions

Authors: Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14027
Pdf URL: https://arxiv.org/pdf/2603.14027
Copy Paste: [[2603.14027]] SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions(https://arxiv.org/abs/2603.14027)
Keywords: language model, prompt
Abstract: Political speakers often avoid answering questions directly while maintaining the appearance of responsiveness. Despite its importance for public discourse, such strategic evasion remains underexplored in Natural Language Processing. We introduce SemEval-2026 Task 6, CLARITY, a shared task on political question evasion consisting of two subtasks: (i) clarity-level classification into Clear Reply, Ambivalent, and Clear Non-Reply, and (ii) evasion-level classification into nine fine-grained evasion strategies. The benchmark is constructed from U.S. presidential interviews and follows an expert-grounded taxonomy of response clarity and evasion. The task attracted 124 registered teams, who submitted 946 valid runs for clarity-level classification and 539 for evasion-level classification. Results show a substantial gap in difficulty between the two subtasks: the best system achieved 0.89 macro-F1 on clarity classification, surpassing the strongest baseline by a large margin, while the top evasion-level system reached 0.68 macro-F1, matching the best baseline. Overall, large language model prompting and hierarchical exploitation of the taxonomy emerged as the most effective strategies, with top systems consistently outperforming those that treated the two subtasks independently. CLARITY establishes political response evasion as a challenging benchmark for computational discourse analysis and highlights the difficulty of modeling strategic ambiguity in political language.
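Both subtasks are scored with macro-F1, which averages per-class F1 so that rare classes count as much as frequent ones. The sketch below is a minimal plain-Python version of the standard metric, shown with the three clarity labels named in the abstract; the example predictions are made up, and the official task scorer may differ in details such as label handling.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical gold labels and system outputs.
gold = ["Clear Reply", "Clear Reply", "Ambivalent", "Clear Non-Reply"]
pred = ["Clear Reply", "Ambivalent", "Ambivalent", "Clear Non-Reply"]
score = macro_f1(gold, pred)
```

Here one of four predictions is wrong, yet the score is 7/9 rather than 3/4, because the error lowers F1 for two of the three classes.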

Title: CMHL: Contrastive Multi-Head Learning for Emotionally Consistent Text Classification

Authors: Menna Elgabry, Ali Hamdi, Khaled Shaban
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.14078
Pdf URL: https://arxiv.org/pdf/2603.14078
Copy Paste: [[2603.14078]] CMHL: Contrastive Multi-Head Learning for Emotionally Consistent Text Classification(https://arxiv.org/abs/2603.14078)
Keywords: language model, llm
Abstract: Textual Emotion Classification (TEC) is one of the most difficult NLP tasks. State-of-the-art approaches rely on large language models (LLMs) and multi-model ensembles. In this study, we challenge the assumption that larger scale or more complex models are necessary for improved performance. To improve logical consistency, we introduce CMHL, a novel single-model architecture that explicitly models the logical structure of emotions through three key innovations: (1) multi-task learning that jointly predicts primary emotions, valence, and intensity, (2) psychologically-grounded auxiliary supervision derived from Russell's circumplex model, and (3) a novel contrastive contradiction loss that enforces emotional consistency by penalizing mutually incompatible predictions (e.g., simultaneous high confidence in joy and anger). With just 125M parameters, our model outperforms 56x larger LLMs and sLM ensembles with a new state-of-the-art F1 score of 93.75%, compared to 86.13%-93.2%, on the dair-ai Emotion dataset. We further show cross-domain generalization on the Reddit Suicide Watch and Mental Health Collection dataset (SWMH), outperforming domain-specific models like MentalBERT and MentalRoBERTa with an F1 score of 72.50% (compared to 68.16%-72.16%) and a recall of 73.30% (compared to 67.05%-70.89%), which translates to enhanced sensitivity for detecting mental health distress. Our work establishes that architectural intelligence (not parameter count) drives progress in TEC. By embedding psychological priors and explicit consistency constraints, a well-designed single model can outperform both massive LLMs and complex ensembles, offering an efficient, interpretable, and clinically-relevant paradigm for affective computing.
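The idea of penalizing mutually incompatible predictions can be illustrated with a very simple pairwise term. This is not the paper's contrastive contradiction loss, whose exact form the abstract does not give; it is a hypothetical stand-in that assumes independent per-emotion confidences in [0, 1] and a hand-written list of incompatible pairs.

```python
# Hypothetical list of emotion pairs treated as mutually incompatible.
INCOMPATIBLE_PAIRS = [("joy", "anger"), ("joy", "sadness")]

def contradiction_penalty(confidences, pairs=INCOMPATIBLE_PAIRS):
    """Sum of confidence products over incompatible pairs.

    The term is large only when both members of a pair receive high
    confidence at the same time, and zero when at most one does.
    """
    return sum(confidences.get(a, 0.0) * confidences.get(b, 0.0)
               for a, b in pairs)

consistent = {"joy": 0.9, "anger": 0.05, "sadness": 0.05}
contradictory = {"joy": 0.9, "anger": 0.9, "sadness": 0.05}

low = contradiction_penalty(consistent)
high = contradiction_penalty(contradictory)
```

Added to a standard classification loss with some weight, a term like this pushes the model away from outputs such as simultaneous high confidence in joy and anger, which is the example the abstract itself uses.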

Title: OasisSimp: An Open-source Asian-English Sentence Simplification Dataset

Authors: Hannah Liu, Muxin Tian, Iqra Ali, Haonan Gao, Qiaoyiwen Wu, Blair Yang, Uthayasanker Thayasivam, En-Shiun Annie Lee, Pakawat Nakwijit, Surangika Ranathunga, Ravi Shekhar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14111
Pdf URL: https://arxiv.org/pdf/2603.14111
Copy Paste: [[2603.14111]] OasisSimp: An Open-source Asian-English Sentence Simplification Dataset(https://arxiv.org/abs/2603.14111)
Keywords: language model, llm
Abstract: Sentence simplification aims to make complex text more accessible by reducing linguistic complexity while preserving the original meaning. However, progress in this area remains limited for mid-resource and low-resource languages due to the scarcity of high-quality data. To address this gap, we introduce the OasisSimp dataset, a multilingual dataset for sentence-level simplification covering five languages: English, Sinhala, Tamil, Pashto, and Thai. Among these, no prior sentence simplification datasets exist for Thai, Pashto, and Tamil, while limited data is available for Sinhala. Each language simplification dataset was created by trained annotators who followed detailed guidelines to simplify sentences while maintaining meaning, fluency, and grammatical correctness. We evaluate eight open-weight multilingual Large Language Models (LLMs) on the OasisSimp dataset and observe substantial performance disparities between high-resource and low-resource languages, highlighting the simplification challenges in multilingual settings. The OasisSimp dataset thus provides both a valuable multilingual resource and a challenging benchmark, revealing the limitations of current LLM-based simplification methods and paving the way for future research in low-resource sentence simplification. The dataset is available at this https URL.

Title: The GELATO Dataset for Legislative NER

Authors: Matthew Flynn, Timothy Obiso, Sam Newman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14130
Pdf URL: https://arxiv.org/pdf/2603.14130
Copy Paste: [[2603.14130]] The GELATO Dataset for Legislative NER(https://arxiv.org/abs/2603.14130)
Keywords: llm, prompt
Abstract: This paper introduces GELATO (Government, Executive, Legislative, and Treaty Ontology), a dataset of U.S. House and Senate bills from the 118th Congress annotated using a novel two-level named entity recognition ontology designed for U.S. legislative texts. We fine-tune transformer-based models (BERT, RoBERTa) of different architectures and sizes on this dataset for first-level prediction. We then use LLMs with optimized prompts to complete the second level prediction. The strong performance of RoBERTa and relatively weak performance of BERT models, as well as the application of LLMs as second-level predictors, support future research in legislative NER or downstream tasks using these model combinations as extraction tools.

Title: MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos

Authors: Arushi Goel, Sreyan Ghosh, Vatsal Agarwal, Nishit Anand, Kaousheik Jayakumar, Lasha Koroshinadze, Yao Xu, Katie Lyons, James Case, Karan Sapra, Kevin J. Shih, Siddharth Gururani, Abhinav Shrivastava, Ramani Duraiswami, Dinesh Manocha, Andrew Tao, Bryan Catanzaro, Mohammad Shoeybi, Wei Ping
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2603.14145
Pdf URL: https://arxiv.org/pdf/2603.14145
Copy Paste: [[2603.14145]] MMOU: A Massive Multi-Task Omni Understanding and Reasoning Benchmark for Long and Complex Real-World Videos(https://arxiv.org/abs/2603.14145)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have shown strong performance in visual and audio understanding when evaluated in isolation. However, their ability to jointly reason over omni-modal (visual, audio, and textual) signals in long and complex videos remains largely unexplored. We introduce MMOU, a new benchmark designed to systematically evaluate multimodal understanding and reasoning under these challenging, real-world conditions. MMOU consists of 15,000 carefully curated questions paired with 9,038 web-collected videos of varying length, spanning diverse domains and exhibiting rich, tightly coupled audio-visual content. The benchmark covers 13 fundamental skill categories, all of which require integrating evidence across modalities and time. All questions are manually annotated across multiple turns by professional annotators, ensuring high quality and reasoning fidelity. We evaluate 20+ state-of-the-art open-source and proprietary multimodal models on MMOU. The results expose substantial performance gaps: the best closed-source model achieves only 64.2% accuracy, while the strongest open-source model reaches just 46.8%. Our results highlight the challenges of long-form omni-modal understanding, revealing that current models frequently fail to apply even fundamental skills in long videos. Through detailed analysis, we further identify systematic failure modes and provide insights into where and why current models break.

Title: Selective Fine-Tuning of GPT Architectures for Parameter-Efficient Clinical Text Classification

Authors: Fariba Afrin Irany, Sampson Akwafuo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14183
Pdf URL: https://arxiv.org/pdf/2603.14183
Copy Paste: [[2603.14183]] Selective Fine-Tuning of GPT Architectures for Parameter-Efficient Clinical Text Classification(https://arxiv.org/abs/2603.14183)
Keywords: language model, gpt
Abstract: The rapid expansion of electronic health record (EHR) systems has generated large volumes of unstructured clinical narratives that contain valuable information for disease identification, patient cohort discovery, and clinical decision support. Extracting structured knowledge from these free-text documents remains challenging because clinical language is highly specialized, labeled datasets are limited, and full fine-tuning of large pretrained language models can require substantial computational resources. Efficient adaptation strategies are therefore essential for practical clinical natural language processing applications. This study proposes a parameter-efficient selective fine-tuning framework for adapting GPT-2 to clinical text classification tasks. Instead of updating the entire pretrained model, the majority of network parameters are frozen, and only the final Transformer block, the final layer normalization module, and a lightweight classification head are updated during training. This design substantially reduces the number of trainable parameters while preserving the contextual representation capabilities learned during pretraining. The proposed approach is evaluated using radiology reports from the MIMIC-IV-Note dataset with automatically derived CheXpert-style labels. Experiments on 50,000 radiology reports demonstrate that selective fine-tuning achieves approximately 91% classification accuracy while updating fewer than 6% of the model parameters. Comparative experiments with head-only training and full-model fine-tuning show that the proposed method provides a favorable balance between predictive performance and computational efficiency. These results indicate that selective fine-tuning offers an efficient and scalable framework for clinical text classification.
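A rough parameter count shows how training only the final Transformer block, the final layer normalization, and a classification head can stay under the 6% figure quoted in the abstract. The sketch below is illustrative only: it assumes GPT-2 small dimensions (the abstract does not say which GPT-2 size was used) and a hypothetical 14-label head without bias.

```python
# Illustrative parameter count for selective fine-tuning.
# Assumptions (not stated in the abstract): GPT-2 small dimensions, 14 labels.
D_MODEL, N_LAYER, VOCAB, N_CTX, N_LABELS = 768, 12, 50257, 1024, 14


def linear(n_in, n_out, bias=True):
    """Parameters in a dense layer."""
    return n_in * n_out + (n_out if bias else 0)


def layer_norm(d):
    """Scale and shift vectors."""
    return 2 * d


def block_params(d):
    """One Transformer block: attention, MLP, and two layer norms."""
    attn = linear(d, 3 * d) + linear(d, d)      # fused QKV + output projection
    mlp = linear(d, 4 * d) + linear(4 * d, d)   # expand then project back
    return attn + mlp + 2 * layer_norm(d)


embeddings = VOCAB * D_MODEL + N_CTX * D_MODEL  # token + position tables
head = linear(D_MODEL, N_LABELS, bias=False)    # hypothetical classifier head

total = embeddings + N_LAYER * block_params(D_MODEL) + layer_norm(D_MODEL) + head
# Only the last block, the final layer norm, and the head are updated.
trainable = block_params(D_MODEL) + layer_norm(D_MODEL) + head
fraction = trainable / total

print(f"trainable {trainable:,} of {total:,} ({fraction:.2%})")
```

Under these assumed dimensions the trainable share comes out a little below 6%; a larger GPT-2 variant or a different head size would change the exact figure.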

Title: Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective

Authors: Tianyi Zhang, David Traum
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14217
Pdf URL: https://arxiv.org/pdf/2603.14217
Copy Paste: [[2603.14217]] Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective(https://arxiv.org/abs/2603.14217)
Keywords: llm
Abstract: In cognitive science and linguistic theory, dialogue is not seen as a chain of independent utterances but rather as a joint activity sustained by coherence, consistency, and shared understanding. However, many systems for open-domain and personalized dialogue use surface-level similarity metrics (e.g., BLEU, ROUGE, F1) as one of their main reporting measures, which fail to capture these deeper aspects of conversational quality. We re-examine a notable retrieval-augmented framework for personalized dialogue, LAPDOG, as a case study for evaluation methodology. Using both human and LLM-based judges, we identify limitations in current evaluation practices, including corrupted dialogue histories, contradictions between retrieved stories and persona, and incoherent response generation. Our results show that human and LLM judgments align closely but diverge from lexical similarity metrics, underscoring the need for cognitively grounded evaluation methods. Broadly, this work charts a path toward more reliable assessment frameworks for retrieval-augmented dialogue systems that better reflect the principles of natural human communication.

Title: QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis

Authors: Yutong Wu, Chenrui Cao, Pengwei Jin, Di Huang, Rui Zhang, Xishan Zhang, Zidong Du, Qi Guo, Xing Hu
Subjects: cs.CL, cs.AI, cs.AR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.14239
Pdf URL: https://arxiv.org/pdf/2603.14239
Copy Paste: [[2603.14239]] QiMeng-CodeV-SVA: Training Specialized LLMs for Hardware Assertion Generation via RTL-Grounded Bidirectional Data Synthesis(https://arxiv.org/abs/2603.14239)
Keywords: gpt, llm
Abstract: SystemVerilog Assertions (SVAs) are crucial for hardware verification. Recent studies leverage general-purpose LLMs to translate natural language properties to SVAs (NL2SVA), but they perform poorly due to limited data. We propose a data synthesis framework to tackle two challenges: the scarcity of high-quality real-world SVA corpora and the lack of reliable methods to determine NL-SVA semantic equivalence. For the former, large-scale open-source RTLs are used to guide LLMs to generate real-world SVAs; for the latter, bidirectional translation serves as a data selection method. With the synthesized data, we train CodeV-SVA, a series of SVA generation models. Notably, CodeV-SVA-14B achieves 75.8% on NL2SVA-Human and 84.0% on NL2SVA-Machine in Func.@1, matching or exceeding advanced LLMs like GPT-5 and DeepSeek-R1.

Title: Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring

Authors: Weixin Guan, Liang Li, Jiapeng Liu, Bing Li, Peng Fu, Chengyang Fang, Xiaoshuai Hao, Can Ma, Weiping Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14251
Pdf URL: https://arxiv.org/pdf/2603.14251
Copy Paste: [[2603.14251]] Mitigating Overthinking in Large Reasoning Language Models via Reasoning Path Deviation Monitoring(https://arxiv.org/abs/2603.14251)
Keywords: language model, chain-of-thought
Abstract: Large Reasoning Language Models (LRLMs) demonstrate impressive capabilities on complex tasks by utilizing long Chain-of-Thought reasoning. However, they are prone to overthinking, which generates redundant reasoning steps that degrade both performance and efficiency. Recently, early-exit strategies have been proposed to mitigate overthinking by dynamically and adaptively terminating redundant reasoning. However, current early-exit methods either introduce extra training overhead by relying on proxy models or limit inference throughput due to the frequent content switching between reasoning and generating probing answers. Moreover, most early-exit methods harm LRLM performance due to over-truncation. Our insight stems from an observation: overthinking often causes LRLMs to deviate from the correct reasoning path, which is frequently accompanied by high-entropy transition tokens. Given this, we propose an early-exit method deeply coupled with the native reasoning process, which leverages the path deviation index as a dedicated monitoring metric for the frequent occurrence of high-entropy transition tokens to dynamically detect and terminate overthinking trajectories. We conduct experiments across multiple benchmarks using LRLMs of different types and scales, and the results indicate that our method delivers the largest performance improvement over vanilla CoT compared to existing early-exit methods.

Title: Automatic Inter-document Multi-hop Scientific QA Generation

Authors: Seungmin Lee, Dongha Kim, Yuni Jeon, Junyoung Koh, Min Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14257
Pdf URL: https://arxiv.org/pdf/2603.14257
Copy Paste: [[2603.14257]] Automatic Inter-document Multi-hop Scientific QA Generation(https://arxiv.org/abs/2603.14257)
Keywords: language model, llm
Abstract: Existing automatic scientific question generation studies mainly focus on single-document factoid QA, overlooking the inter-document reasoning crucial for scientific understanding. We present AIM-SciQA, an automated framework for generating multi-document, multi-hop scientific QA datasets. AIM-SciQA extracts single-hop QAs using large language models (LLMs) with machine reading comprehension and constructs cross-document relations based on embedding-based semantic alignment while selectively leveraging citation information. Applied to 8,211 PubMed Central papers, it produced 411,409 single-hop and 13,672 multi-hop QAs, forming the IM-SciQA dataset. Human and automatic validation confirmed high factual consistency, and experimental results demonstrate that IM-SciQA effectively differentiates reasoning capabilities across retrieval and QA stages, providing a realistic and interpretable benchmark for retrieval-augmented scientific reasoning. We further extend this framework to construct CIM-SciQA, a citation-guided variant achieving comparable performance to the Oracle setting, reinforcing the dataset's validity and generality.

Title: MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering

Authors: Shaowei Guan, Yu Zhai, Hin Chi Kwok, Jiawei Du, Xinyu Feng, Jing Li, Harry Qin, Vivian Hui
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2603.14265
Pdf URL: https://arxiv.org/pdf/2603.14265
Copy Paste: [[2603.14265]] MedPriv-Bench: Benchmarking the Privacy-Utility Trade-off of Large Language Models in Medical Open-End Question Answering(https://arxiv.org/abs/2603.14265)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Recent advances in Retrieval-Augmented Generation (RAG) have enabled large language models (LLMs) to ground outputs in clinical evidence. However, connecting LLMs with external databases introduces the risk of contextual leakage: a subtle privacy threat where unique combinations of medical details enable patient re-identification even without explicit identifiers. Current benchmarks in healthcare heavily focus on accuracy, ignoring such privacy issues, despite strict regulations such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR). To fill this gap, we present MedPriv-Bench, the first benchmark specifically designed to jointly evaluate privacy preservation and clinical utility in medical open-ended question answering. Our framework utilizes a multi-agent, human-in-the-loop pipeline to synthesize sensitive medical contexts and clinically relevant queries that create realistic privacy pressure. We establish a standardized evaluation protocol leveraging a pre-trained RoBERTa-Natural Language Inference (NLI) model as an automated judge to quantify data leakage, achieving an average of 85.9% alignment with human experts. Through an extensive evaluation of 9 representative LLMs, we demonstrate a pervasive privacy-utility trade-off. Our findings underscore the necessity of domain-specific benchmarks to validate the safety and efficacy of medical AI systems in privacy-sensitive environments.

Title: Mind the Shift: Decoding Monetary Policy Stance from FOMC Statements with Large Language Models

Authors: Yixuan Tang, Yi Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14313
Pdf URL: https://arxiv.org/pdf/2603.14313
Copy Paste: [[2603.14313]] Mind the Shift: Decoding Monetary Policy Stance from FOMC Statements with Large Language Models(https://arxiv.org/abs/2603.14313)
Keywords: language model, llm
Abstract: Federal Open Market Committee (FOMC) statements are a major source of monetary-policy information, and even subtle changes in their wording can move global financial markets. A central task is therefore to measure the hawkish--dovish stance conveyed in these texts. Existing approaches typically treat stance detection as a standard classification problem, labeling each statement in isolation. However, the interpretation of monetary-policy communication is inherently relative: market reactions depend not only on the tone of a statement, but also on how that tone shifts across meetings. We introduce Delta-Consistent Scoring (DCS), an annotation-free framework that maps frozen large language model (LLM) representations to continuous stance scores by jointly modeling absolute stance and relative inter-meeting shifts. Rather than relying on manual hawkish--dovish labels, DCS uses consecutive meetings as a source of self-supervision. It learns an absolute stance score for each statement and a relative shift score between consecutive statements. A delta-consistency objective encourages changes in absolute scores to align with the relative shifts. This allows DCS to recover a temporally coherent stance trajectory without manual labels. Across four LLM backbones, DCS consistently outperforms supervised probes and LLM-as-judge baselines, achieving up to 71.1% accuracy on sentence-level hawkish--dovish classification. The resulting meeting-level scores are also economically meaningful: they correlate strongly with inflation indicators and are significantly associated with Treasury yield movements. Overall, the results suggest that LLM representations encode monetary-policy signals that can be recovered through relative temporal structure.
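The delta-consistency idea, that differences between consecutive absolute stance scores should agree with the separately predicted relative shifts, can be illustrated with a small penalty function. The abstract does not give the exact objective, so the squared-error form, the function name, and the toy numbers below are all assumptions made for illustration.

```python
def delta_consistency_loss(abs_scores, rel_shifts):
    """Mean squared gap between score differences and predicted shifts.

    abs_scores: absolute stance score per meeting, in chronological order.
    rel_shifts: predicted shift from meeting t to t+1 (one fewer entry).
    The squared-error form is an assumption, not the paper's stated loss.
    """
    if len(abs_scores) < 2:
        raise ValueError("need at least two meetings")
    if len(rel_shifts) != len(abs_scores) - 1:
        raise ValueError("need exactly one shift per consecutive pair")
    gaps = [
        (nxt - cur) - shift
        for cur, nxt, shift in zip(abs_scores, abs_scores[1:], rel_shifts)
    ]
    return sum(g * g for g in gaps) / len(gaps)


# Hypothetical scores: higher means more hawkish.
scores = [0.2, 0.5, 0.4]
consistent = delta_consistency_loss(scores, [0.3, -0.1])    # shifts match the deltas
inconsistent = delta_consistency_loss(scores, [-0.3, 0.1])  # shifts point the wrong way
```

The penalty is near zero when the shifts match the score differences and grows as they disagree, which is the kind of signal that lets consecutive meetings act as self-supervision.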

Title: Motivation in Large Language Models

Authors: Omer Nahum, Asael Sklar, Ariel Goldstein, Roi Reichart
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2603.14347
Pdf URL: https://arxiv.org/pdf/2603.14347
Copy Paste: [[2603.14347]] Motivation in Large Language Models(https://arxiv.org/abs/2603.14347)
Keywords: language model, llm
Abstract: Motivation is a central driver of human behavior, shaping decisions, goals, and task performance. As large language models (LLMs) become increasingly aligned with human preferences, we ask whether they exhibit something akin to motivation. We examine whether LLMs "report" varying levels of motivation, how these reports relate to their behavior, and whether external factors can influence them. Our experiments reveal consistent and structured patterns that echo human psychology: self-reported motivation aligns with different behavioral signatures, varies across task types, and can be modulated by external manipulations. These findings demonstrate that motivation is a coherent organizing construct for LLM behavior, systematically linking reports, choices, effort, and performance, and revealing motivational dynamics that resemble those documented in human psychology. This perspective deepens our understanding of model behavior and its connection to human-inspired concepts.

Title: Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling

Authors: Suvadeep Hajra, Palash Nandi, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14355
Pdf URL: https://arxiv.org/pdf/2603.14355
Copy Paste: [[2603.14355]] Exposing Long-Tail Safety Failures in Large Language Models through Efficient Diverse Response Sampling(https://arxiv.org/abs/2603.14355)
Keywords: language model, llm, prompt
Abstract: Safety tuning through supervised fine-tuning and reinforcement learning from human feedback has substantially improved the robustness of large language models (LLMs). However, it often suppresses rather than eliminates unsafe behaviors, leaving rare but critical failures hidden in the long tail of the output distribution. While most red-teaming work emphasizes adversarial prompt search (input-space optimization), we show that safety failures can also be systematically exposed through diverse response generation (output-space exploration) for a fixed safety-critical prompt, where increasing the number and diversity of sampled responses can drive jailbreak success rates close to unity. To efficiently uncover such failures, we propose Progressive Diverse Population Sampling (PDPS), which combines stochastic token-level sampling with diversity-aware selection to explore a large candidate pool of responses and retain a compact, semantically diverse subset. Across multiple jailbreak benchmarks and open-source LLMs, PDPS achieves attack success rates comparable to large-scale IID sampling while using only 8% to 29% of the computational cost. Under limited-response settings, it improves success rates by 26% to 40% over IID sampling and Diverse Beam Search. Furthermore, responses generated by PDPS exhibit both a higher number and greater diversity of unsafe outputs, demonstrating its effectiveness in uncovering a broader range of failures.
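One generic way to keep a compact, semantically diverse subset of a large response pool is greedy farthest-point selection over response embeddings. The sketch below shows that standard technique only; it is not the PDPS algorithm, whose details the abstract does not specify, and the two-dimensional vectors stand in for hypothetical sentence embeddings.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are nonzero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def select_diverse(embeddings, k):
    """Greedy farthest-point selection: indices of k mutually distant items.

    Assumes the pool contains at least k distinct vectors.
    """
    if not 0 < k <= len(embeddings):
        raise ValueError("k must be between 1 and the pool size")
    chosen = [0]  # start from the first candidate
    # Distance from every candidate to its nearest already-chosen item.
    nearest = [cosine_distance(e, embeddings[0]) for e in embeddings]
    while len(chosen) < k:
        idx = max(range(len(embeddings)), key=lambda i: nearest[i])
        chosen.append(idx)
        nearest = [
            min(d, cosine_distance(e, embeddings[idx]))
            for d, e in zip(nearest, embeddings)
        ]
    return chosen


# Toy pool: items 0 and 1 are near-duplicates, 2 and 3 point elsewhere.
pool = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [-1.0, 0.0]]
picked = select_diverse(pool, 3)
```

On this toy pool the near-duplicate of the first item is skipped in favor of the two vectors that point in different directions.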

Title: Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains

Authors: Andrew Katz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14400
Pdf URL: https://arxiv.org/pdf/2603.14400
Copy Paste: [[2603.14400]] Extending Minimal Pairs with Ordinal Surprisal Curves and Entropy Across Applied Domains(https://arxiv.org/abs/2603.14400)
Keywords: language model, prompt
Abstract: The minimal pairs paradigm of comparing model probabilities for contrasting completions has proven useful for evaluating linguistic knowledge in language models, yet its application has largely been confined to binary grammaticality judgments over syntactic phenomena. Additionally, standard prompting-based evaluation requires expensive text generation, may elicit post-hoc rationalizations rather than model judgments, and discards information about model uncertainty. We address both limitations by extending surprisal-based evaluation from binary grammaticality contrasts to ordinal-scaled classification and scoring tasks across multiple domains. Rather than asking models to generate answers, we measure the information-theoretic "surprise" (negative log probability) they assign to each position on rating scales (e.g., 1-5 or 1-9), yielding full surprisal curves that reveal both the model's preferred response and its uncertainty via entropy. We explore this framework across four domains: social-ecological-technological systems classification, causal statement identification (binary and scaled), figurative language detection, and deductive qualitative coding. Across these domains, surprisal curves produce interpretable classification signals with clear minima near expected ordinal scale positions, and entropy over the completion tended to distinguish genuinely ambiguous items from easier items.
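The surprisal-curve idea can be made concrete with a few lines of arithmetic: surprisal is the negative log probability assigned to each rating position, and entropy over the renormalized positions summarizes uncertainty. The log probabilities below are made-up values for a 1-5 scale, not model outputs, and the function names are illustrative.

```python
import math


def surprisal_bits(log_probs):
    """Surprisal in bits per rating position, from natural-log probabilities."""
    return [-lp / math.log(2) for lp in log_probs]


def entropy_bits(log_probs):
    """Entropy in bits after renormalizing over the rating positions."""
    peak = max(log_probs)  # subtract the max for numerical stability
    weights = [math.exp(lp - peak) for lp in log_probs]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log2(p) for p in probs if p > 0)


ratings = [1, 2, 3, 4, 5]
# Hypothetical log probabilities a model might assign to each rating token.
log_probs = [math.log(p) for p in (0.05, 0.10, 0.60, 0.20, 0.05)]

curve = surprisal_bits(log_probs)
preferred = ratings[curve.index(min(curve))]  # minimum surprisal = preferred rating
uncertainty = entropy_bits(log_probs)         # lower = more confident
```

A flat curve gives entropy close to log2(5) bits on a five-point scale, while a sharp minimum gives a much lower value, which is how entropy can separate ambiguous items from easy ones.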

Title: BiT-MCTS: A Theme-based Bidirectional MCTS Approach to Chinese Fiction Generation

Authors: Zhaoyi Li, Xu Zhang, Xiaojun Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14410
Pdf URL: https://arxiv.org/pdf/2603.14410
Copy Paste: [[2603.14410]] BiT-MCTS: A Theme-based Bidirectional MCTS Approach to Chinese Fiction Generation(https://arxiv.org/abs/2603.14410)
Keywords: language model, llm
Abstract: Generating long-form linear fiction from open-ended themes remains a major challenge for large language models, which frequently fail to guarantee global structure and narrative diversity when using premise-based or linear outlining approaches. We present BiT-MCTS, a theme-driven framework that operationalizes a "climax-first, bidirectional expansion" strategy motivated by Freytag's Pyramid. Given a theme, our method extracts a core dramatic conflict and generates an explicit climax, then employs a bidirectional Monte Carlo Tree Search (MCTS) to expand the plot backward (rising action, exposition) and forward (falling action, resolution) to produce a structured outline. A final generation stage realizes a complete narrative from the refined outline. We construct a Chinese theme corpus for evaluation and conduct extensive experiments across three contemporary LLM backbones. Results show that BiT-MCTS improves narrative coherence, plot structure, and thematic depth relative to strong baselines, while enabling substantially longer, more coherent stories according to automatic metrics and human judgments.

Title: Creative Convergence or Imitation? Genre-Specific Homogeneity in LLM-Generated Chinese Literature

Authors: Yuanchi Ma, Kaize Shi, Hui He, Zhihua Zhang, Zhongxiang Lei, Ziliang Qiu, Renfen Hu, Jiamou Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14430
Pdf URL: https://arxiv.org/pdf/2603.14430
Copy Paste: [[2603.14430]] Creative Convergence or Imitation? Genre-Specific Homogeneity in LLM-Generated Chinese Literature(https://arxiv.org/abs/2603.14430)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in narrative generation. However, they often produce structurally homogenized stories, frequently following repetitive arrangements and combinations of plot events along with stereotypical resolutions. In this paper, we propose a novel theoretical framework for analysis by incorporating Proppian narratology and narrative functions. This framework is used to analyze the composition of narrative texts generated by LLMs to uncover their underlying narrative logic. Taking Chinese web literature as our research focus, we extend Propp's narrative theory, defining 34 narrative functions suited to modern web narrative structures. We further construct a human-annotated corpus to support the analysis of narrative structures within LLM-generated text. Experiments reveal that the primary reasons for the singular narrative logic and severe homogenization in generated texts are that current LLMs are unable to correctly comprehend the meanings of narrative functions and instead adhere to rigid narrative generation paradigms.

Title: PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark

Authors: Mohammad Javad Ranjbar Kalahroodi, Mohammad Amini, Parmis Bathayan, Heshaam Faili, Azadeh Shakery
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2603.14456
Pdf URL: https://arxiv.org/pdf/2603.14456
Copy Paste: [[2603.14456]] PARSA-Bench: A Comprehensive Persian Audio-Language Model Benchmark(https://arxiv.org/abs/2603.14456)
Keywords: language model
Abstract: Persian poses unique audio understanding challenges through its classical poetry, traditional music, and pervasive code-switching - none captured by existing benchmarks. We introduce PARSA-Bench (Persian Audio Reasoning and Speech Assessment Benchmark), the first benchmark for evaluating large audio-language models on Persian language and culture, comprising 16 tasks and over 8,000 samples across speech understanding, paralinguistic analysis, and cultural audio understanding. Ten tasks are newly introduced, including poetry meter and style detection, traditional Persian music understanding, and code-switching detection. Text-only baselines consistently outperform audio counterparts, suggesting models may not leverage audio-specific information beyond what transcription alone provides. Culturally-grounded tasks expose a qualitatively distinct failure mode: all models perform near random chance on vazn detection regardless of scale, suggesting prosodic perception remains beyond the reach of current models. The dataset is publicly available at this https URL.
摘要：波斯语通过其古典诗歌、传统音乐和普遍的语码转换提出了独特的音频理解挑战——现有基准没有捕获这些挑战。我们推出 PARSA-Bench（波斯语音频推理和语音评估基准），这是第一个评估波斯语和文化大型音频语言模型的基准，包含 16 个任务和超过 8,000 个样本，涉及语音理解、副语言分析和文化音频理解。新引入了十个任务，包括诗歌韵律和风格检测、传统波斯音乐理解和语码转换检测。纯文本基线始终优于音频基线，这表明模型可能不会利用转录单独提供的特定音频信息。以文化为基础的任务暴露了一种本质上不同的失败模式：无论规模如何，所有模型在 vazn 检测上都执行近乎随机的机会，这表明韵律感知仍然超出了当前模型的范围。该数据集可通过此 https URL 公开获取

Title: Distilling Reasoning Without Knowledge: A Framework for Reliable LLMs

Authors: Auksarapak Kietkajornrit, Jad Tarifi, Nima Asgharbeygi
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.14458
Pdf URL: https://arxiv.org/pdf/2603.14458
Copy Paste: [[2603.14458]] Distilling Reasoning Without Knowledge: A Framework for Reliable LLMs(https://arxiv.org/abs/2603.14458)
Keywords: language model, llm, hallucination, prompt
Abstract: Fact-seeking question answering with large language models (LLMs) remains unreliable when answers depend on up-to-date or conflicting information. Although retrieval-augmented and tool-using LLMs reduce hallucinations, they often rely on implicit planning, leading to inefficient tool usage. We propose a modular framework that explicitly separates planning from factual retrieval and answer synthesis. A lightweight student planner is trained via a teacher-student framework to generate structured decompositions consisting of abstract reasoning steps and searchable fact requests. The supervision signals contain only planning traces and fact requests, without providing factual answers or retrieved evidence. At inference, the planner produces plans, while prompt-engineered modules perform retrieval and response synthesis. We evaluate the proposed framework on SEAL-0, an extremely challenging benchmark for search-augmented LLMs. Results show that supervised planning improves both accuracy and latency compared to monolithic reasoning models and prompt-based tool-augmented frameworks, demonstrating that explicitly learned planning structures are essential for reliable fact-seeking LLMs.
摘要：当答案依赖于最新或相互冲突的信息时，使用大型语言模型 (LLM) 寻求事实的问答仍然不可靠。尽管增强检索和使用工具的法学硕士可以减少幻觉，但它们通常依赖于隐式计划，导致工具使用效率低下。我们提出了一个模块化框架，将规划与事实检索和答案合成明确分开。通过师生框架训练轻量级学生规划器，以生成由抽象推理步骤和可搜索事实请求组成的结构化分解。监督信号仅包含规划痕迹和事实请求，不提供事实答案或检索证据。在推理时，规划器生成计划，而即时设计的模块则执行检索和响应合成。我们在 SEAL-0 上评估了拟议的框架，SEAL-0 是搜索增强法学硕士的一个极具挑战性的基准。结果表明，与整体推理模型和基于提示的工具增强框架相比，监督规划提高了准确性和延迟，这表明显式学习的规划结构对于可靠的事实寻求法学硕士至关重要。

Title: An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs

Authors: Qian Zhu, Xinnan Guo, Jingjing Huo, Jun Li, Pan Liu, Wenyan Yang, Wanqing Xu, Xuan Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14463
Pdf URL: https://arxiv.org/pdf/2603.14463
Copy Paste: [[2603.14463]] An Industrial-Scale Insurance LLM Achieving Verifiable Domain Mastery and Hallucination Control without Competence Trade-offs(https://arxiv.org/abs/2603.14463)
Keywords: language model, llm, hallucination
Abstract: Adapting Large Language Models (LLMs) to high-stakes vertical domains like insurance presents a significant challenge: scenarios demand strict adherence to complex regulations and business logic with zero tolerance for hallucinations. Existing approaches often suffer from a Competency Trade-off - sacrificing general intelligence for domain expertise - or rely heavily on RAG without intrinsic reasoning. To bridge this gap, we present INS-S1, an insurance-specific LLM family trained via a novel end-to-end alignment paradigm. Our approach features two methodological innovations: (1) A Verifiable Data Synthesis System that constructs hierarchical datasets for actuarial reasoning and compliance; and (2) A Progressive SFT-RL Curriculum Framework that integrates dynamic data annealing with a synergistic mix of Verified Reasoning (RLVR) and AI Feedback (RLAIF). By optimizing data ratios and reward signals, this framework enforces domain constraints while preventing catastrophic forgetting. Additionally, we release INSEva, the most comprehensive insurance benchmark to date (39k+ samples). Extensive experiments show that INS-S1 achieves SOTA performance on domain tasks, significantly outperforming DeepSeek-R1 and Gemini-2.5-Pro. Crucially, it maintains top-tier general capabilities and achieves a record-low 0.6% hallucination rate (HHEM). Our results demonstrate that rigorous domain specialization can be achieved without compromising general intelligence.
摘要：将大型语言模型 (LLM) 应用于保险等高风险垂直领域提出了重大挑战：场景要求严格遵守复杂的法规和业务逻辑，对幻觉零容忍。现有的方法经常遭受能力权衡的困扰——为了领域专业知识而牺牲通用智能——或者严重依赖 RAG 而没有内在推理。为了弥补这一差距，我们提出了 INS-S1，这是一个通过新颖的端到端对齐范例进行培训的保险特定法学硕士系列。我们的方法有两个方法创新：（1）可验证的数据合成系统，为精算推理和合规性构建分层数据集； (2) 渐进式 SFT-RL 课程框架，将动态数据退火与验证推理 (RLVR) 和人工智能反馈 (RLAIF) 的协同组合相结合。通过优化数据比率和奖励信号，该框架强制执行领域约束，同时防止灾难性遗忘。此外，我们还发布了 INSEva，迄今为止最全面的保险基准（39,000 多个样本）。大量实验表明，INS-S1 在领域任务上实现了 SOTA 性能，显着优于 DeepSeek-R1 和 Gemini-2.5-Pro。至关重要的是，它保持了顶级的通用能力，并实现了创纪录的低 0.6% 幻觉率 (HHEM)。我们的结果表明，可以在不损害一般智能的情况下实现严格的领域专业化。

Title: AI Can Learn Scientific Taste

Authors: Jingqi Tong, Mingzhe Li, Hangcheng Li, Yongzhuo Yang, Yurong Mou, Weijie Ma, Zhiheng Xi, Hongji Chen, Xiaoran Liu, Qinyuan Cheng, Ming Zhang, Qiguang Chen, Weifeng Ge, Qipeng Guo, Tianlei Ying, Tianxiang Sun, Yining Zheng, Xinchi Chen, Jun Zhao, Ning Ding, Xuanjing Huang, Yugang Jiang, Xipeng Qiu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14473
Pdf URL: https://arxiv.org/pdf/2603.14473
Copy Paste: [[2603.14473]] AI Can Learn Scientific Taste(https://arxiv.org/abs/2603.14473)
Keywords: gpt, llm
Abstract: Great scientists have strong judgement and foresight, closely tied to what we call scientific taste. Here, we use the term to refer to the capacity to judge and propose research ideas with high potential impact. However, most related research focuses on improving an AI scientist's executive capability, while enhancing an AI's scientific taste remains underexplored. In this work, we propose Reinforcement Learning from Community Feedback (RLCF), a training paradigm that uses large-scale community signals as supervision, and formulate scientific taste learning as a preference modeling and alignment problem. For preference modeling, we train Scientific Judge on 700K field- and time-matched pairs of high- vs. low-citation papers to judge ideas. For preference alignment, using Scientific Judge as a reward model, we train a policy model, Scientific Thinker, to propose research ideas with high potential impact. Experiments show Scientific Judge outperforms SOTA LLMs (e.g., GPT-5.2, Gemini 3 Pro) and generalizes to future-year tests, unseen fields, and peer-review preference. Furthermore, Scientific Thinker proposes research ideas with higher potential impact than baselines. Our findings show that AI can learn scientific taste, marking a key step toward reaching human-level AI scientists.
摘要：伟大的科学家具有很强的判断力和远见，这与我们所谓的科学品味密切相关。在这里，我们使用该术语来指代判断和提出具有高潜在影响的研究想法的能力。然而，大多数相关研究都集中在提高人工智能科学家的执行能力，而提高人工智能的科学品味仍有待探索。在这项工作中，我们提出了社区反馈强化学习（RLCF），这是一种使用大规模社区信号作为监督的训练范式，并将科学品味学习制定为偏好建模和对齐问题。对于偏好建模，我们使用 70 万个领域和时间匹配的高引用论文对和低引用论文对来训练科学法官来判断想法。为了进行偏好调整，我们使用科学法官作为奖励模型，训练一个政策模型“科学思考者”，以提出具有高潜在影响的研究想法。实验表明，Scientific Judge 的表现优于 SOTA LLM（例如 GPT-5.2、Gemini 3 Pro），并可推广到未来一年的测试、未见过的领域和同行评审偏好。此外，科学思想家提出的研究想法比基线具有更高的潜在影响。我们的研究结果表明，人工智能可以学习科学品味，这标志着人工智能科学家向达到人类水平的关键一步。

Title: Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows

Authors: Aditya Sharan, Sriram Hebbale, Dhruv Kumar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14486
Pdf URL: https://arxiv.org/pdf/2603.14486
Copy Paste: [[2603.14486]] Infinite Problem Generator: Verifiably Scaling Physics Reasoning Data with Agentic Workflows(https://arxiv.org/abs/2603.14486)
Keywords: language model, hallucination, agent
Abstract: Training large language models for complex reasoning is bottlenecked by the scarcity of verifiable, high-quality data. In domains like physics, standard text augmentation often introduces hallucinations, while static benchmarks lack the reasoning traces required for fine-tuning. We introduce the Infinite Problem Generator (IPG), an agentic framework that synthesizes physics problems with guaranteed solvability through a Formula-as-Code paradigm. Unlike probabilistic text generation, IPG constructs solutions as executable Python programs, enforcing strict mathematical consistency. As a proof-of-concept, we release ClassicalMechanicsV1, a high-fidelity corpus of 1,335 classical mechanics problems expanded from 165 expert seeds. The corpus demonstrates high structural diversity, spanning 102 unique physical formulas with an average complexity of 3.05 formulas per problem. Furthermore, we identify a Complexity Blueprint, demonstrating a strong linear correlation ($R^2 \approx 0.95$) between formula count and verification code length. This relationship establishes code complexity as a precise, proxy-free metric for problem difficulty, enabling controllable curriculum generation. We release the full IPG pipeline, the ClassicalMechanicsV1 dataset, and our evaluation report to support reproducible research in reasoning-intensive domains.
摘要：由于缺乏可验证的高质量数据，训练大型语言模型进行复杂推理面临着瓶颈。在物理学等领域，标准文本增强通常会引入幻觉，而静态基准缺乏微调所需的推理痕迹。我们引入了无限问题生成器（IPG），这是一个代理框架，它通过公式即代码范例来综合物理问题，并保证可解决性。与概率文本生成不同，IPG 将解决方案构建为可执行的 Python 程序，强制执行严格的数学一致性。作为概念验证，我们发布了 ClassicalMechanicsV1，这是一个由 165 个专家种子扩展而成的包含 1,335 个经典力学问题的高保真语料库。该语料库表现出高度的结构多样性，涵盖 102 个独特的物理公式，每个问题的平均复杂度为 3.05 个公式。此外，我们确定了一个复杂度蓝图，证明公式计数和验证码长度之间存在很强的线性相关性（$R^2 \approx 0.95$）。这种关系将代码复杂性建立为精确的、无代理的问题难度度量，从而实现可控的课程生成。我们发布了完整的 IPG 流程、ClassicalMechanicsV1 数据集和我们的评估报告，以支持推理密集型领域的可重复研究。
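
Illustrative note (not from the paper): the abstract above says IPG builds each solution as an executable Python program. A minimal sketch of that general idea follows, assuming a hypothetical projectile-motion template; the function names, the problem, and the `formula_count` field are all invented here for illustration and are not taken from the IPG pipeline.

```python
import math


def generate_projectile_problem(v0: float, angle_deg: float, g: float = 9.81) -> dict:
    """Hypothetical template: the answer is computed by code, not written as free text."""
    theta = math.radians(angle_deg)
    # Formula 1: time of flight on level ground, t = 2 * v0 * sin(theta) / g
    t_flight = 2 * v0 * math.sin(theta) / g
    # Formula 2: horizontal range, R = v0 * cos(theta) * t
    horizontal_range = v0 * math.cos(theta) * t_flight
    return {
        "question": (
            f"A projectile is launched at {v0} m/s, {angle_deg} degrees above "
            "the horizontal. Find its range on level ground."
        ),
        "answer_m": horizontal_range,
        "formula_count": 2,  # assumed bookkeeping field, for illustration only
    }


def verify(problem: dict, v0: float, angle_deg: float, g: float = 9.81) -> bool:
    """Independent check using the closed form R = v0^2 * sin(2 * theta) / g."""
    expected = v0 ** 2 * math.sin(2 * math.radians(angle_deg)) / g
    return math.isclose(problem["answer_m"], expected, rel_tol=1e-9)


problem = generate_projectile_problem(20.0, 45.0)
print(problem["question"])
print(round(problem["answer_m"], 2), verify(problem, 20.0, 45.0))
```

Because the answer comes from executed formulas, a second, independently derived formula can confirm it, which is one plausible reading of "guaranteed solvability" in the abstract.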

Title: MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection

Authors: Arkadiusz Modzelewski, Witold Sosnowski, Eleni Papadopulos, Elisa Sartori, Tiziano Labruna, Giovanni Da San Martino, Adam Wierzbicki
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14525
Pdf URL: https://arxiv.org/pdf/2603.14525
Copy Paste: [[2603.14525]] MALicious INTent Dataset and Inoculating LLMs for Enhanced Disinformation Detection(https://arxiv.org/abs/2603.14525)
Keywords: language model, llm
Abstract: The intentional creation and spread of disinformation poses a significant threat to public discourse. However, existing English datasets and research rarely address the intentionality behind the disinformation. This work presents MALINT, the first human-annotated English corpus developed in collaboration with expert fact-checkers to capture disinformation and its malicious intent. We utilize our novel corpus to benchmark 12 language models, including small language models (SLMs) such as BERT and large language models (LLMs) like Llama 3.3, on binary and multilabel intent classification tasks. Moreover, inspired by inoculation theory from psychology and communication studies, we investigate whether incorporating knowledge of malicious intent can improve disinformation detection. To this end, we propose intent-based inoculation, an intent-augmented reasoning for LLMs that integrates intent analysis to mitigate the persuasive impact of disinformation. Analysis on six disinformation datasets, five LLMs, and seven languages shows that intent-augmented reasoning improves zero-shot disinformation detection. To support research in intent-aware disinformation detection, we release the MALINT dataset with annotations from each annotation step.
摘要：故意制造和传播虚假信息对公共话语构成重大威胁。然而，现有的英语数据集和研究很少解决虚假信息背后的意图。这项工作提出了 MALINT，这是第一个与事实核查专家合作开发的人工注释英语语料库，用于捕获虚假信息及其恶意意图。我们利用我们的新颖语料库在二元和多标签意图分类任务上对 12 种语言模型进行基准测试，包括 BERT 等小型语言模型 (SLM) 和 Llama 3.3 等大型语言模型 (LLM)。此外，受心理学和传播学研究的接种理论的启发，我们研究了纳入恶意意图的知识是否可以改善虚假信息检测。为此，我们提出基于意图的接种，这是一种针对法学硕士的意图增强推理，它集成了意图分析，以减轻虚假信息的说服力影响。对六个虚假信息数据集、五个法学硕士和七种语言的分析表明，意图增强推理可以改善零样本虚假信息检测。为了支持意图感知虚假信息检测的研究，我们发布了 MALINT 数据集，其中包含每个注释步骤的注释。

Title: Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models

Authors: Deepon Halder, Angira Mukherjee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14563
Pdf URL: https://arxiv.org/pdf/2603.14563
Copy Paste: [[2603.14563]] Multilingual TinyStories: A Synthetic Combinatorial Corpus of Indic Children's Stories for Training Small Language Models(https://arxiv.org/abs/2603.14563)
Keywords: language model, prompt
Abstract: The development of robust language models for low-resource languages is frequently bottlenecked by the scarcity of high-quality, coherent, and domain-appropriate training corpora. In this paper, we introduce the Multilingual TinyStories dataset, a large-scale, synthetically generated collection of children's stories encompassing 17 Indian languages. Designed specifically for the training and evaluation of Small Language Models (SLMs), the corpus provides simple, narrative-driven text strictly localized to native scripts. We detail our hybrid curation pipeline, which leverages the Sarvam-M language model and a novel combinatorial prompt engineering framework for native generation, coupled with the Google Translate API for large-scale cross-lingual expansion. Through strict programmatic filtering, we compiled 132,942 stories and over 93.9 million tokens in our release, serving as a foundational resource for multilingual language modeling and transfer learning in the Indic linguistic sphere.
摘要：低资源语言的鲁棒语言模型的开发经常因缺乏高质量、连贯且适合领域的训练语料库而受到瓶颈。在本文中，我们介绍了多语言 TinyStories 数据集，这是一个大规模、综合生成的儿童故事集合，涵盖 17 种印度语言。该语料库专为小语言模型 (SLM) 的训练和评估而设计，提供严格本地化的简单、叙述驱动的文本。我们详细介绍了我们的混合管理管道，该管道利用 Sarvam-M 语言模型和用于本地生成的新型组合提示工程框架，再加上用于大规模跨语言扩展的 Google Translate API。通过严格的程序化过滤，我们在版本中编译了 132,942 个故事和超过 9390 万个代币，作为印度语言领域多语言建模和迁移学习的基础资源。

Title: $PA^3$: $\textbf{P}$olicy-$\textbf{A}$ware $\textbf{A}$gent $\textbf{A}$lignment through Chain-of-Thought

Authors: Shubhashis Roy Dipta, Daniel Bis, Kun Zhou, Lichao Wang, Benjamin Z. Yao, Chenlei Guo, Ruhi Sarikaya
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.14602
Pdf URL: https://arxiv.org/pdf/2603.14602
Copy Paste: [[2603.14602]] $PA^3$: $\textbf{P}$olicy-$\textbf{A}$ware $\textbf{A}$gent $\textbf{A}$lignment through Chain-of-Thought(https://arxiv.org/abs/2603.14602)
Keywords: language model, llm, long context, hallucination, prompt, chain-of-thought
Abstract: Conversational assistants powered by large language models (LLMs) excel at tool-use tasks but struggle with adhering to complex, business-specific rules. While models can reason over business rules provided in context, including all policies for every query introduces high latency and wastes compute. Furthermore, these lengthy prompts lead to long contexts, harming overall performance due to the "needle-in-the-haystack" problem. To address these challenges, we propose a multi-stage alignment method that teaches models to recall and apply relevant business policies during chain-of-thought reasoning at inference time, without including the full business policy in-context. Furthermore, we introduce a novel PolicyRecall reward based on the Jaccard score and a Hallucination Penalty for GRPO training. Altogether, our best model outperforms the baseline by 16 points and surpasses comparable in-context baselines of similar model size by 3 points, while using 40% fewer words.
摘要：由大型语言模型 (LLM) 提供支持的对话助理擅长工具使用任务，但很难遵守复杂的、特定于业务的规则。虽然模型可以根据上下文中提供的业务规则进行推理，但包含每个查询的所有策略会带来高延迟并浪费计算。此外，这些冗长的提示会导致长上下文，由于“大海捞针”问题而损害整体性能。为了应对这些挑战，我们提出了一种多阶段对齐方法，教会模型在推理时的思想链推理过程中回忆和应用相关的业务策略，而不包括上下文中的完整业务策略。此外，我们还引入了一种基于 Jaccard 分数的新颖的 PolicyRecall 奖励和 GRPO 训练的幻觉惩罚。总而言之，我们的最佳模型比基线高出 16 个点，比模型大小相似的上下文中的基线高出 3 个点，同时使用的单词数量减少了 40%。
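
Illustrative note (not from the paper): the abstract above mentions a PolicyRecall reward based on the Jaccard score plus a Hallucination Penalty, without giving formulas. The sketch below shows only the standard Jaccard score over sets of policy identifiers, combined with an assumed penalty for recalled policies that do not exist in a known catalog. The `penalty` weight, the identifiers, and the way the two terms are combined are hypothetical.

```python
def jaccard(a: set, b: set) -> float:
    """Standard Jaccard score: |A ∩ B| / |A ∪ B| (defined as 1.0 when both sets are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def policy_reward(recalled, gold, known_policies, penalty: float = 0.5) -> float:
    """Hypothetical reward: Jaccard overlap minus a penalty per invented policy."""
    recalled_set, gold_set = set(recalled), set(gold)
    # A recalled policy that is absent from the catalog counts as hallucinated (assumption).
    hallucinated = recalled_set - set(known_policies)
    return jaccard(recalled_set, gold_set) - penalty * len(hallucinated)


catalog = ["refund", "shipping", "warranty"]
print(policy_reward(["refund", "shipping"], ["refund", "shipping"], catalog))
print(policy_reward(["refund", "loyalty"], ["refund", "shipping"], catalog))
```

The first call recalls exactly the gold policies; the second recalls one correct policy and one that is not in the catalog, so it scores lower on both terms.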

Title: Seamless Deception: Larger Language Models Are Better Knowledge Concealers

Authors: Dhananjay Ashok, Ruth-Ann Armstrong, Jonathan May
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14672
Pdf URL: https://arxiv.org/pdf/2603.14672
Copy Paste: [[2603.14672]] Seamless Deception: Larger Language Models Are Better Knowledge Concealers(https://arxiv.org/abs/2603.14672)
Keywords: language model, prompt
Abstract: Language Models (LMs) may acquire harmful knowledge, and yet feign ignorance of these topics when under audit. Inspired by the recent discovery of deception-related behaviour patterns in LMs, we aim to train classifiers that detect when a LM is actively concealing knowledge. Initial findings on smaller models show that classifiers can detect concealment more reliably than human evaluators, with gradient-based concealment proving easier to identify than prompt-based methods. However, contrary to prior work, we find that the classifiers do not reliably generalize to unseen model architectures and topics of hidden knowledge. Most concerningly, the identifiable traces associated with concealment become fainter as the models increase in scale, with the classifiers achieving no better than random performance on any model exceeding 70 billion parameters. Our results expose a key limitation in black-box-only auditing of LMs and highlight the need to develop robust methods to detect models that are actively hiding the knowledge they contain.
摘要：语言模型 (LM) 可能会获取有害的知识，但在审核时却假装不了解这些主题。受到最近发现的语言模型中与欺骗相关的行为模式的启发，我们的目标是训练分类器来检测语言模型何时主动隐藏知识。对较小模型的初步研究结果表明，分类器可以比人类评估者更可靠地检测隐藏，并且基于梯度的隐藏比基于提示的方法更容易识别。然而，与之前的工作相反，我们发现分类器不能可靠地推广到看不见的模型架构和隐藏知识的主题。最令人担忧的是，随着模型规模的增加，与隐藏相关的可识别痕迹变得越来越模糊，分类器在任何超过 700 亿个参数的模型上所取得的性能并不比随机性能更好。我们的结果暴露了 LM 的黑盒审计的一个关键限制，并强调需要开发强大的方法来检测主动隐藏其所包含知识的模型。

Title: Towards Next-Generation LLM Training: From the Data-Centric Perspective

Authors: Hao Liang, Zhengyang Zhao, Zhaoyang Han, Meiyi Qiang, Xiaochen Ma, Bohan Zeng, Qifeng Cai, Zhiyu Li, Linpeng Tang, Weinan E, Wentao Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.14712
Pdf URL: https://arxiv.org/pdf/2603.14712
Copy Paste: [[2603.14712]] Towards Next-Generation LLM Training: From the Data-Centric Perspective(https://arxiv.org/abs/2603.14712)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks and domains, with data playing a central role in enabling these advances. Despite this success, the preparation and effective utilization of the massive datasets required for LLM training remain major bottlenecks. In current practice, LLM training data is often constructed using ad hoc scripts, and there is still a lack of mature, agent-based data preparation systems that can automatically construct robust and reusable data workflows, thereby freeing data scientists from repetitive and error-prone engineering efforts. Moreover, once collected, datasets are often consumed largely in their entirety during training, without systematic mechanisms for data selection, mixture optimization, or reweighting. To address these limitations, we advocate two complementary research directions. First, we propose building a robust, agent-based automatic data preparation system that supports automated workflow construction and scalable data management. Second, we argue for a unified data-model interaction training system in which data is dynamically selected, mixed, and reweighted throughout the training process, enabling more efficient, adaptive, and performance-aware data utilization. Finally, we discuss the remaining challenges and outline promising directions for future research and system development.
摘要：大型语言模型 (LLM) 在广泛的任务和领域中表现出了卓越的性能，而数据在实现这些进步方面发挥着核心作用。尽管取得了这一成功，法学硕士培训所需的海量数据集的准备和有效利用仍然是主要瓶颈。在目前的实践中，LLM训练数据通常是使用即席脚本构建的，仍然缺乏成熟的、基于代理的数据准备系统，可以自动构建健壮且可重用的数据工作流程，从而将数据科学家从重复且容易出错的工程工作中解放出来。此外，数据集一旦收集，通常会在训练过程中被全部消耗掉，而没有用于数据选择、混合优化或重新加权的系统机制。为了解决这些局限性，我们提倡两个互补的研究方向。首先，我们建议构建一个强大的、基于代理的自动数据准备系统，支持自动化工作流程构建和可扩展的数据管理。其次，我们主张建立一个统一的数据模型交互训练系统，在整个训练过程中动态选择、混合和重新加权数据，从而实现更高效、适应性和性能感知的数据利用。最后，我们讨论了剩余的挑战，并概述了未来研究和系统开发的有希望的方向。

Title: Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA

Authors: Renhao Pei, Siyao Peng, Verena Blaschke, Robert Litschko, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14782
Pdf URL: https://arxiv.org/pdf/2603.14782
Copy Paste: [[2603.14782]] Information Asymmetry across Language Varieties: A Case Study on Cantonese-Mandarin and Bavarian-German QA(https://arxiv.org/abs/2603.14782)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are becoming a common way for humans to seek knowledge, yet their coverage and reliability vary widely. Especially for local language varieties, there are large asymmetries, e.g., information in local Wikipedia that is absent from the standard variant. However, little is known about how well LLMs perform under such information asymmetry, especially on closely related languages. We manually construct a novel challenge question-answering (QA) dataset that captures knowledge conveyed on local Wikipedia pages but absent from their higher-resource counterparts, covering Mandarin Chinese vs. Cantonese and German vs. Bavarian. Our experiments show that LLMs fail to answer questions about information found only in local editions of Wikipedia. Providing context from lead sections substantially improves performance, with further gains possible via translation. Our topical and geographic annotations and stratified evaluations reveal the usefulness of local Wikipedia editions as sources of both regional and global information. These findings raise critical questions about inclusivity and cultural coverage of LLMs.
摘要：大型语言模型（LLM）正在成为人类寻求知识的常见方式，但其覆盖范围和可靠性差异很大。特别是对于本地语言变体，存在很大的不对称性，例如，标准变体中缺少本地维基百科中的信息。然而，人们对法学硕士在这种信息不对称下的表现知之甚少，尤其是在密切相关的语言上。我们手动构建了一个新颖的挑战问答（QA）数据集，该数据集捕获本地维基百科页面上传达的知识，而这些知识在其更高资源的对应页面（涵盖普通话与粤语、德语与巴伐利亚语）中是不存在的。我们的实验表明，法学硕士无法回答有关仅本地版维基百科信息的问题。提供引导部分的上下文可以显着提高性能，并可以通过翻译获得进一步的收益。我们的主题、地理注释和分层评估揭示了当地维基百科版本作为区域和全球信息来源的有用性。这些发现提出了关于法学硕士的包容性和文化覆盖率的关键问题。

Title: The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments

Authors: Elmira Salari (1), Maria Claudia Nunes Delfino (2), Hazem Amamou (3), José Victor de Souza (3), Shruti Kshirsagar (1), Alan Davoust (4), Anderson Avila (3) ((1) Wichita State University, (2) Pontifícia Universidade Católica de São Paulo, (3) Institut national de la recherche scientifique, (4) Université du Québec en Outaouais)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14838
Pdf URL: https://arxiv.org/pdf/2603.14838
Copy Paste: [[2603.14838]] The Impact of Ideological Discourses in RAG: A Case Study with COVID-19 Treatments(https://arxiv.org/abs/2603.14838)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: This paper studies the impact of retrieved ideological texts on the outputs of large language models (LLMs). While interest in understanding ideology in LLMs has recently increased, little attention has been given to this issue in the context of Retrieval-Augmented Generation (RAG). To fill this gap, we design an external knowledge source based on ideologically loaded texts about COVID-19 treatments. Our corpus is based on 1,117 academic articles representing discourses about controversial and endorsed treatments for the disease. We propose a corpus linguistics framework, based on Lexical Multidimensional Analysis (LMDA), to identify the ideologies within the corpus. LLMs are tasked with answering questions derived from three identified ideological dimensions, and two types of contextual prompts are adopted: the first comprises the user question and ideological texts; the second contains the question, ideological texts, and LMDA descriptions. Ideological alignment between reference ideological texts and LLMs' responses is assessed using cosine similarity for lexical and semantic representations. Results demonstrate that LLMs' responses based on retrieved ideological texts are more aligned with the ideology encountered in the external knowledge, with the enhanced prompt further influencing LLMs' outputs. Our findings highlight the importance of identifying ideological discourses within the RAG framework in order to mitigate not just unintended ideological bias, but also the risks of malicious manipulation of such models.
摘要：本文研究了检索到的意识形态文本对大型语言模型（LLM）输出的影响。虽然最近人们对理解法学硕士意识形态的兴趣有所增加，但在检索增强一代（RAG）的背景下却很少关注这个问题。为了填补这一空白，我们根据有关 COVID-19 治疗的意识形态文本设计了一个外部知识源。我们的语料库基于 1,117 篇学术文章，这些文章代表了有关该疾病有争议和认可的治疗方法的讨论。我们提出了一个基于词汇多维分析（LMDA）的语料库语言学框架，以识别语料库中的意识形态。法学硕士的任务是回答来自三个已确定的意识形态维度的问题，并采用两种类型的上下文提示：第一种包括用户问题和意识形态文本；第二种包括用户问题和意识形态文本。第二个包含问题、意识形态文本和 LMDA 描述。使用词汇和语义表示的余弦相似性来评估参考意识形态文本和法学硕士的回答之间的意识形态一致性。结果表明，法学硕士基于意识形态检索文本的反应与外部知识中遇到的意识形态更加一致，增强的提示进一步影响法学硕士的输出。我们的研究结果强调了在 RAG 框架内识别意识形态话语的重要性，不仅可以减少无意的意识形态偏见，还可以减少恶意操纵此类模型的风险。
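
Illustrative note (not from the paper): the abstract above assesses alignment with cosine similarity over lexical and semantic representations. The sketch below covers only a lexical case, assuming a simple bag-of-words count vector as the representation; the tokenization, the example sentences, and the function names are hypothetical, and the paper's actual representations may differ.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Assumed lexical representation: lowercase whitespace tokens with counts."""
    return Counter(text.lower().split())


def cosine_similarity(u: Counter, v: Counter) -> float:
    """cos(u, v) = (u . v) / (||u|| * ||v||); returns 0.0 if either vector is empty."""
    dot = sum(count * v.get(token, 0) for token, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


reference = bag_of_words("the treatment is endorsed by clinical trials")
response = bag_of_words("clinical trials support the treatment")
print(round(cosine_similarity(reference, response), 3))
```

For a semantic variant, the same `cosine_similarity` formula would apply to dense embedding vectors instead of token counts.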

Title: ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations

Authors: Hankun Kang, Xin Miao, Jianhao Chen, Jintao Wen, Mayi Xu, Weiyu Zhang, Wenpeng Lu, Tieyun Qian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14843
Pdf URL: https://arxiv.org/pdf/2603.14843
Copy Paste: [[2603.14843]] ContiGuard: A Framework for Continual Toxicity Detection Against Evolving Evasive Perturbations(https://arxiv.org/abs/2603.14843)
Keywords: llm
Abstract: Toxicity detection mitigates the dissemination of toxic content (e.g., hateful comments, posts, and messages within online social actions) to safeguard a healthy online social environment. However, malicious users persistently develop evasive perturbations to disguise toxic content and evade detectors. Traditional detectors or methods are static over time and are inadequate in addressing these evolving evasion tactics. Thus, continual learning emerges as a logical approach to dynamically update detection ability against evolving perturbations. Nevertheless, disparities across perturbations hinder the detector's continual learning on perturbed text. More importantly, perturbation-induced noises distort semantics to degrade comprehension and also impair critical feature learning to render detection sensitive to perturbations. These amplify the challenge of continual learning against evolving perturbations. In this work, we present ContiGuard, the first framework tailored for continual learning of the detector on time-evolving perturbed text (termed continual toxicity detection) to enable the detector to continually update capability and maintain sustained resilience against evolving perturbations. Specifically, to boost the comprehension, we present an LLM-powered semantic enriching strategy, where we dynamically incorporate possible meaning and toxicity-related clues excavated by LLM into the perturbed text to improve the comprehension. To mitigate non-critical features and amplify critical ones, we propose a discriminability-driven feature learning strategy, where we strengthen discriminative features while suppressing the less-discriminative ones to shape a robust classification boundary for detection...
摘要：毒性检测可减少有毒内容（例如在线社交活动中的仇恨评论、帖子和消息）的传播，以维护健康的在线社交环境。然而，恶意用户不断地开发规避扰动来伪装有毒内容并逃避检测器。传统的探测器或方法随着时间的推移是静态的，不足以应对这些不断变化的规避策略。因此，持续学习成为动态更新检测能力以应对不断变化的扰动的逻辑方法。然而，扰动之间的差异阻碍了检测器对扰动文本的持续学习。更重要的是，扰动引起的噪声会扭曲语义，从而降低理解力，还会损害关键特征学习，从而使检测对扰动敏感。这些加大了持续学习以应对不断变化的扰动的挑战。在这项工作中，我们提出了 ContiGuard，这是第一个专为检测器持续学习随时间变化的扰动文本而定制的框架（称为连续毒性检测），使检测器能够不断更新功能并保持针对不断变化的扰动的持续弹性。具体来说，为了提高理解力，我们提出了一种由法学硕士支持的语义丰富策略，其中我们动态地将法学硕士挖掘出的可能含义和毒性相关线索融入到扰动的文本中，以提高理解力。为了减轻非关键特征并放大关键特征，我们提出了一种可辨别性驱动的特征学习策略，在该策略中，我们加强辨别性特征，同时抑制辨别性较差的特征，以形成稳健的分类边界以进行检测......

Title: Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks

Authors: Zijian Yu, Kejun Xiao, Huaipeng Zhao, Tao Luo, Xiaoyi Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14864
Pdf URL: https://arxiv.org/pdf/2603.14864
Copy Paste: [[2603.14864]] Shopping Companion: A Memory-Augmented LLM Agent for Real-World E-Commerce Tasks(https://arxiv.org/abs/2603.14864)
Keywords: gpt, llm, agent
Abstract: In e-commerce, LLM agents show promise for shopping tasks such as recommendations, budgeting, and bundle deals, where accurately capturing user preferences from long-term conversations is critical. However, two challenges hinder realizing this potential: (1) the absence of benchmarks for evaluating long-term preference-aware shopping tasks, and (2) the lack of end-to-end optimization due to existing designs that treat preference identification and shopping assistance as separate components. In this paper, we introduce a novel benchmark with a long-term memory setup, spanning two shopping tasks over 1.2 million real-world products, and propose Shopping Companion, a unified framework that jointly tackles memory retrieval and shopping assistance while supporting user intervention. To train such capabilities, we develop a dual-reward reinforcement learning strategy with tool-wise rewards to handle the sparse and discontinuous rewards inherent in multi-turn interactions. Experimental results demonstrate that even state-of-the-art models (such as GPT-5) achieve success rates under 70% on our benchmark, highlighting the significant challenges in this domain. Notably, our lightweight LLM, trained with Shopping Companion, consistently outperforms strong baselines, achieving better preference capture and task performance, which validates the effectiveness of our unified design.
摘要：在电子商务中，法学硕士代理在推荐、预算和捆绑交易等购物任务中表现出了希望，在这些任务中，从长期对话中准确捕捉用户偏好至关重要。然而，有两个挑战阻碍了实现这一潜力：（1）缺乏评估长期偏好感知购物任务的基准，以及（2）由于现有设计将偏好识别和购物辅助视为单独的组件，因此缺乏端到端优化。在本文中，我们介绍了一种具有长期记忆设置的新颖基准，涵盖超过 120 万种现实世界产品的两项购物任务，并提出了 Shopping Companion，这是一个统一的框架，可以在支持用户干预的同时共同解决记忆检索和购物辅助问题。为了训练这种能力，我们开发了一种双奖励强化学习策略，具有工具奖励，以处理多轮交互中固有的稀疏和不连续的奖励。实验结果表明，即使是最先进的模型（例如 GPT-5），在我们的基准测试中也能取得低于 70% 的成功率，凸显了该领域面临的重大挑战。值得注意的是，我们的轻量级法学硕士经过 Shopping Companion 的培训，始终优于强大的基线，实现了更好的偏好捕获和任务绩效，这验证了我们统一设计的有效性。

Title: Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models

Authors: Han Zhang, Jiamin Su, Li liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14891
Pdf URL: https://arxiv.org/pdf/2603.14891
Copy Paste: [[2603.14891]] Decision-Level Ordinal Modeling for Multimodal Essay Scoring with Large Language Models(https://arxiv.org/abs/2603.14891)
Keywords: language model, llm
Abstract: Automated essay scoring (AES) predicts multiple rubric-defined trait scores for each essay, where each trait follows an ordered discrete rating scale. Most LLM-based AES methods cast scoring as autoregressive token generation and obtain the final score via decoding and parsing, making the decision implicit. This formulation is particularly sensitive in multimodal AES, where the usefulness of visual inputs varies across essays and traits. To address these limitations, we propose Decision-Level Ordinal Modeling (DLOM), which makes scoring an explicit ordinal decision by reusing the language model head to extract score-wise logits on predefined score tokens, enabling direct optimization and analysis in the score space. For multimodal AES, DLOM-GF introduces a gated fusion module that adaptively combines textual and multimodal score logits. For text-only AES, DLOM-DA adds a distance-aware regularization term to better reflect ordinal distances. Experiments on the multimodal EssayJudge dataset show that DLOM improves over a generation-based SFT baseline across scoring traits, and DLOM-GF yields further gains when modality relevance is heterogeneous. On the text-only ASAP/ASAP++ benchmarks, DLOM remains effective without visual inputs, and DLOM-DA further improves performance and outperforms strong representative baselines.
摘要：自动论文评分 (AES) 可以预测每篇论文的多个评分标准定义的特征分数，其中每个特征都遵循有序的离散评分量表。大多数基于 LLM 的 AES 方法将评分视为自回归标记生成，并通过解码和解析获得最终分数，从而使决策变得隐式。这种表述在多模态 AES 中特别敏感，其中视觉输入的有用性因论文和特征而异。为了解决这些限制，我们提出了决策级序数建模（DLOM），它通过重用语言模型头来提取预定义分数标记上的分数逻辑，从而使评分成为明确的序数决策，从而能够在分数空间中进行直接优化和分析。对于多模态 AES，DLOM-GF 引入了门控融合模块，该模块自适应地组合文本和多模态得分逻辑。对于纯文本 AES，DLOM-DA 添加了距离感知正则化项，以更好地反映序数距离。多模态 EssayJudge 数据集上的实验表明，DLOM 在评分特征方面比基于生成的 SFT 基线有所改进，并且当模态相关性异构时，DLOM-GF 会产生进一步的收益。在纯文本 ASAP/ASAP++ 基准测试中，DLOM 在没有视觉输入的情况下仍然有效，并且 DLOM-DA 进一步提高了性能并超越了强代表性基准。
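
Illustrative note (not from the paper): the abstract above describes reading score-wise logits at predefined score tokens and adding a distance-aware regularization term. The sketch below shows one way such a decision-level readout could look in plain Python. The token ids, the toy logits, the weight `lam`, and the specific regularizer (expected absolute distance to the gold score) are assumptions made here; the abstract does not specify DLOM's actual loss.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score_distribution(vocab_logits, score_token_ids):
    """Keep only the logits at the predefined score tokens and normalize over them."""
    return softmax([vocab_logits[i] for i in score_token_ids])


def distance_aware_loss(probs, scores, gold, lam: float = 0.1) -> float:
    """Cross-entropy plus an assumed ordinal term: expected |score - gold|."""
    cross_entropy = -math.log(probs[scores.index(gold)])
    expected_distance = sum(p * abs(s - gold) for p, s in zip(probs, scores))
    return cross_entropy + lam * expected_distance


vocab_logits = [0.0, 2.0, 1.0, 0.0, -1.0, 5.0]  # toy next-token logits
score_token_ids = [1, 2, 3]                      # hypothetical ids for scores 1, 2, 3
scores = [1, 2, 3]

probs = score_distribution(vocab_logits, score_token_ids)
predicted = scores[probs.index(max(probs))]
print(predicted, round(distance_aware_loss(probs, scores, gold=1), 3))
```

Restricting the softmax to score tokens makes the decision explicit: a high logit on an unrelated token (index 5 in the toy vector) cannot affect the predicted score.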

Title: LLMs as Signal Detectors: Sensitivity, Bias, and the Temperature-Criterion Analogy

Authors: Jon-Paul Cacioli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.14893
Pdf URL: https://arxiv.org/pdf/2603.14893
Copy Paste: [[2603.14893]] LLMs as Signal Detectors: Sensitivity, Bias, and the Temperature-Criterion Analogy(https://arxiv.org/abs/2603.14893)
Keywords: language model, llm
Abstract: Large language models (LLMs) are evaluated for calibration using metrics such as Expected Calibration Error that conflate two distinct components: the model's ability to discriminate correct from incorrect answers (sensitivity) and its tendency toward confident or cautious responding (bias). Signal Detection Theory (SDT) decomposes these components. While SDT-derived metrics such as AUROC are increasingly used, the full parametric framework - unequal-variance model fitting, criterion estimation, z-ROC analysis - has not been applied to LLMs as signal detectors. In this pre-registered study, we treat three LLMs as observers performing factual discrimination across 168,000 trials and test whether temperature functions as a criterion shift analogous to payoff manipulations in human psychophysics. Critically, this analogy may break down because temperature changes the generated answer itself, not only the confidence assigned to it. Our results confirm the breakdown with temperature simultaneously increasing sensitivity (AUC) and shifting criterion. All models exhibited unequal-variance evidence distributions (z-ROC slopes 0.52-0.84), with instruct models showing more extreme asymmetry (0.52-0.63) than the base model (0.77-0.87) or human recognition memory (~0.80). The SDT decomposition revealed that models occupying distinct positions in sensitivity-bias space could not be distinguished by calibration metrics alone, demonstrating that the full parametric framework provides diagnostic information unavailable from existing metrics.
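
For readers unfamiliar with the Signal Detection Theory quantities named in this abstract, the following sketch computes the textbook equal-variance sensitivity d' and criterion c from a hit rate H and a false-alarm rate F, and estimates a z-ROC slope by least squares. Mapping "hit" to a confidently endorsed correct answer and "false alarm" to a confidently endorsed incorrect answer is an assumption made here for illustration; the paper's own fitting procedure is not described in the abstract.

```python
from statistics import NormalDist

_Z = NormalDist().inv_cdf  # probit transform z(p)


def d_prime(hit_rate, false_alarm_rate):
    """Sensitivity: d' = z(H) - z(F)."""
    return _Z(hit_rate) - _Z(false_alarm_rate)


def criterion(hit_rate, false_alarm_rate):
    """Response bias: c = -(z(H) + z(F)) / 2; positive values mean cautious responding."""
    return -0.5 * (_Z(hit_rate) + _Z(false_alarm_rate))


def zroc_slope(hit_rates, false_alarm_rates):
    """Least-squares slope of z(H) regressed on z(F) across confidence thresholds."""
    xs = [_Z(f) for f in false_alarm_rates]
    ys = [_Z(h) for h in hit_rates]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

Under a Gaussian unequal-variance model, a z-ROC slope below 1 indicates that the signal distribution is wider than the noise distribution, which is the kind of asymmetry the abstract reports.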

Title: ExPosST: Explicit Positioning with Adaptive Masking for LLM-Based Simultaneous Machine Translation

Authors: Yuzhe Shang, Pengzhi Gao, Yazheng Yang, Jiayao Ma, Wei Liu, Jian Luan, Jingsong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.14903
Pdf URL: https://arxiv.org/pdf/2603.14903
Copy Paste: [[2603.14903]] ExPosST: Explicit Positioning with Adaptive Masking for LLM-Based Simultaneous Machine Translation(https://arxiv.org/abs/2603.14903)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have recently demonstrated promising performance in simultaneous machine translation (SimulMT). However, applying decoder-only LLMs to SimulMT introduces a positional mismatch, which leads to a dilemma between decoding efficiency and positional consistency. Existing approaches often rely on specific positional encodings or carefully designed prompting schemes, and thus fail to simultaneously achieve inference efficiency, positional consistency, and broad model compatibility. In this work, we propose ExPosST, a general framework that resolves this dilemma through explicit position allocation. ExPosST reserves fixed positional slots for incoming source tokens, enabling efficient decoding with KV cache across different positional encoding methods. To further bridge the gap between fine-tuning and inference, we introduce a policy-consistent fine-tuning strategy that aligns training with inference-time decoding behavior. Experiments across multiple language pairs demonstrate that ExPosST effectively supports simultaneous translation under diverse policies.
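
One way to picture explicit position allocation is sketched below. The layout is an assumption, since the abstract does not specify it: source tokens fill a reserved block of slots in arrival order, and target tokens are numbered from the end of that block. The function name, the `READ`/`WRITE` action labels, and `max_source_len` are hypothetical.

```python
def allocate_positions(actions, max_source_len):
    """Assign position ids for an interleaved READ/WRITE action sequence.

    Source tokens take reserved slots 0..max_source_len-1 in arrival order;
    target tokens take positions starting at max_source_len.
    """
    src_count = 0
    tgt_count = 0
    positions = []
    for action in actions:
        if action == "READ":
            if src_count >= max_source_len:
                raise ValueError("source exceeds the reserved slot budget")
            positions.append(("src", src_count))
            src_count += 1
        elif action == "WRITE":
            positions.append(("tgt", max_source_len + tgt_count))
            tgt_count += 1
        else:
            raise ValueError(f"unknown action: {action!r}")
    return positions
```

In this layout, a position id never changes once it has been assigned, so keys and values already stored in a KV cache would not need to be recomputed when further source tokens arrive.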

Title: Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

Authors: Jinhu Qi, Yifan Li, Minghao Zhao, Wentao Zhang, Zijian Zhang, Yaoman Li, Irwin King
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2603.14987
Pdf URL: https://arxiv.org/pdf/2603.14987
Copy Paste: [[2603.14987]] Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI(https://arxiv.org/abs/2603.14987)
Keywords: hallucination, agent
Abstract: As agentic AI systems move beyond static question answering into open-ended, tool-augmented, and multi-step real-world workflows, their increased authority poses greater risks of system misuse and operational failures. However, current evaluation practices remain fragmented, measuring isolated capabilities such as coding, hallucination, jailbreak resistance, or tool use in narrowly defined settings. We argue that the central limitation is not merely insufficient coverage of evaluation dimensions, but the lack of a principled notion of representativeness: an agent's trustworthiness should be assessed over a representative socio-technical scenario distribution rather than a collection of disconnected benchmark instances. To this end, we propose the Holographic Agent Assessment Framework (HAAF), a systematic evaluation paradigm that characterizes agent trustworthiness over a scenario manifold spanning task types, tool interfaces, interaction dynamics, social contexts, and risk levels. The framework integrates four complementary components: (i) static cognitive and policy analysis, (ii) interactive sandbox simulation, (iii) social-ethical alignment assessment, and (iv) a distribution-aware representative sampling engine that jointly optimizes coverage and risk sensitivity -- particularly for rare but high-consequence tail risks that conventional benchmarks systematically overlook. These components are connected through an iterative Trustworthy Optimization Factory. Through cycles of red-team probing and blue-team hardening, this paradigm progressively narrows the vulnerabilities to meet deployment standards, shifting agent evaluation from benchmark islands toward representative, real-world trustworthiness. Code and data for the illustrative instantiation are available at this https URL.

Title: OrgForge: A Multi-Agent Simulation Framework for Verifiable Synthetic Corporate Corpora

Authors: Jeffrey Flynt
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.14997
Pdf URL: https://arxiv.org/pdf/2603.14997
Copy Paste: [[2603.14997]] OrgForge: A Multi-Agent Simulation Framework for Verifiable Synthetic Corporate Corpora(https://arxiv.org/abs/2603.14997)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Evaluating retrieval-augmented generation (RAG) pipelines requires corpora where ground truth is knowable, temporally structured, and cross-artifact: properties that real-world datasets rarely provide cleanly. Existing resources such as the Enron corpus carry legal ambiguity, demographic skew, and no structured ground truth. Purely LLM-generated synthetic data solves the legal problem but introduces a subtler one: the generating model cannot be prevented from hallucinating facts that contradict themselves across this http URL. We present OrgForge, an open-source multi-agent simulation framework that enforces a strict physics-cognition boundary: a deterministic Python engine maintains a SimEvent ground truth bus; large language models generate only surface prose, constrained by validated proposals. An actor-local clock enforces causal timestamp correctness across all artifact types, eliminating the class of timeline inconsistencies that arise when timestamps are sampled independently per document. We formalize three graph-dynamic subsystems (stress propagation via betweenness centrality, temporal edge-weight decay, and Dijkstra escalation routing) that govern organizational behavior independently of any LLM. Running a configurable N-day simulation, OrgForge produces interleaved Slack threads, JIRA tickets, Confluence pages, Git pull requests, and emails, all traceable to a shared, immutable event log. We additionally describe a causal chain tracking subsystem that accumulates cross-artifact evidence graphs per incident, a hybrid reciprocal-rank-fusion recurrence detector for identifying repeated failure classes, and an inbound/outbound email engine that routes vendor alerts, customer complaints, and HR correspondence through gated causal chains with probabilistic drop simulation. OrgForge is available under the MIT license.
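
The abstract mentions a reciprocal-rank-fusion detector without giving details. As background, here is a minimal sketch of standard reciprocal rank fusion, which merges several ranked lists by summing 1 / (k + rank) for each item. The function name is hypothetical, k = 60 is the commonly used default rather than a value taken from OrgForge, and how OrgForge makes its detector "hybrid" is not specified in the abstract.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: each item scores the sum of 1 / (k + rank) over all lists."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by item name for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))
```

For example, fusing a keyword-based ranking and an embedding-based ranking of past incidents would promote incidents that both rankers place near the top.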

Title: Attention Residuals

Authors: Kimi Team: Guangyu Chen, Yu Zhang, Jianlin Su, Weixin Xu, Siyuan Pan, Yaoyu Wang, Yucheng Wang, Guanduo Chen, Bohong Yin, Yutian Chen, Junjie Yan, Ming Wei, Y. Zhang, Fanqing Meng, Chao Hong, Xiaotong Xie, Shaowei Liu, Enzhe Lu, Yunpeng Tai, Yanru Chen, Xin Men, Haiqing Guo, Y. Charles, Haoyu Lu, Lin Sui, Jinguo Zhu, Zaida Zhou, Weiran He, Weixiao Huang, Xinran Xu, Yuzhi Wang, Guokun Lai, Yulun Du, Yuxin Wu, Zhilin Yang, Xinyu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15031
Pdf URL: https://arxiv.org/pdf/2603.15031
Copy Paste: [[2603.15031]] Attention Residuals(https://arxiv.org/abs/2603.15031)
Keywords: llm
Abstract: Residual connections with PreNorm are standard in modern LLMs, yet they accumulate all layer outputs with fixed unit weights. This uniform aggregation causes uncontrolled hidden-state growth with depth, progressively diluting each layer's contribution. We propose Attention Residuals (AttnRes), which replaces this fixed accumulation with softmax attention over preceding layer outputs, allowing each layer to selectively aggregate earlier representations with learned, input-dependent weights. To address the memory and communication overhead of attending over all preceding layer outputs for large-scale model training, we introduce Block AttnRes, which partitions layers into blocks and attends over block-level representations, reducing the memory footprint while preserving most of the gains of full AttnRes. Combined with cache-based pipeline communication and a two-phase computation strategy, Block AttnRes becomes a practical drop-in replacement for standard residual connections with minimal overhead. Scaling law experiments confirm that the improvement is consistent across model sizes, and ablations validate the benefit of content-dependent depth-wise selection. We further integrate AttnRes into the Kimi Linear architecture (48B total / 3B activated parameters) and pre-train on 1.4T tokens, where AttnRes mitigates PreNorm dilution, yielding more uniform output magnitudes and gradient distribution across depth, and improves downstream performance across all evaluated tasks.
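
The contrast between fixed unit-weight accumulation and softmax attention over preceding layer outputs can be sketched in a few lines. This is an illustration under stated assumptions, not the Kimi Team's implementation: layer outputs are used directly as keys and values, the `query` vector is simply passed in, and any projections or normalization the paper may apply are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def standard_residual(prev_outputs):
    """Baseline: fixed unit weights, i.e. a plain sum of all preceding outputs."""
    dim = len(prev_outputs[0])
    return [sum(out[i] for out in prev_outputs) for i in range(dim)]


def attn_residual(prev_outputs, query):
    """Weight preceding layer outputs by softmax(query . output / sqrt(d))."""
    dim = len(query)
    scores = [sum(q * h for q, h in zip(query, out)) / math.sqrt(dim)
              for out in prev_outputs]
    weights = softmax(scores)
    mixed = [sum(w * out[i] for w, out in zip(weights, prev_outputs))
             for i in range(dim)]
    return mixed, weights
```

Since the softmax weights sum to 1, the aggregated vector is a convex combination of earlier outputs, so its magnitude does not grow with the number of layers the way a plain sum does.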

Title: Interpretable Predictability-Based AI Text Detection: A Replication Study

Authors: Adam Skurla, Dominik Macko, Jakub Simko
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.15034
Pdf URL: https://arxiv.org/pdf/2603.15034
Copy Paste: [[2603.15034]] Interpretable Predictability-Based AI Text Detection: A Replication Study(https://arxiv.org/abs/2603.15034)
Keywords: language model, gpt
Abstract: This paper replicates and extends the system used in the AuTexTification 2023 shared task for authorship attribution of machine-generated texts. First, we tried to reproduce the original results. Exact replication was not possible because of differences in data splits, model availability, and implementation details. Next, we tested newer multilingual language models and added 26 document-level stylometric features. We also applied SHAP analysis to examine which features influence the model's decisions. We replaced the original GPT-2 models with newer generative models such as Qwen and mGPT for computing probabilistic features. For contextual representations, we used mDeBERTa-v3-base and applied the same configuration to both English and Spanish. This allowed us to use one shared configuration for Subtask 1 and Subtask 2. Our experiments show that the additional stylometric features improve performance in both tasks and both languages. The multilingual configuration achieves results that are comparable to or better than those of language-specific models. The study also shows that clear documentation is important for reliable replication and fair comparison of systems.
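
To give a sense of what document-level stylometric and probabilistic features look like, here is a small sketch. The abstract does not list the 26 features used in the study, so the three stylometric features below are common examples chosen for illustration, and `perplexity` assumes that per-token natural-log probabilities have already been obtained from some generative model.

```python
import math
import re


def stylometric_features(text):
    """Compute a few simple document-level style statistics."""
    words = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = len(words)
    return {
        "type_token_ratio": len(set(words)) / n_words if n_words else 0.0,
        "mean_word_length": sum(map(len, words)) / n_words if n_words else 0.0,
        "mean_sentence_length": n_words / len(sentences) if sentences else 0.0,
    }


def perplexity(token_logprobs):
    """Perplexity = exp of the negative mean natural-log token probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```

Features like these are interpretable on their own terms, which is what makes an attribution method such as SHAP informative when applied to the classifier that consumes them.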

Title: Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs

Authors: Disha Sheshanarayana, Rajat Subhra Pal, Manjira Sinha, Tirthankar Dasgupta
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.15051
Pdf URL: https://arxiv.org/pdf/2603.15051
Copy Paste: [[2603.15051]] Thinking in Latents: Adaptive Anchor Refinement for Implicit Reasoning in LLMs(https://arxiv.org/abs/2603.15051)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Token-level Chain-of-Thought (CoT) prompting has become a standard way to elicit multi-step reasoning in large language models (LLMs), especially for mathematical word problems. However, generating long intermediate traces increases output length and inference cost, and can be inefficient when the model could arrive at the correct answer without extensive verbalization. This has motivated latent-space reasoning approaches that shift computation into hidden representations and only emit a final answer. Yet, many latent reasoning methods depend on a fixed number of latent refinement steps at inference, adding another hyperparameter that must be tuned across models and datasets to balance accuracy and efficiency. We introduce AdaAnchor, a latent reasoning framework that performs silent iterative computation by refining a set of latent anchor vectors attached to the input. AdaAnchor further incorporates an adaptive halting mechanism that monitors anchor stability across iterations and terminates refinement once the anchor dynamics converge, allocating fewer steps to easier instances while reserving additional refinement steps for harder ones under a shared maximum-step budget. Our empirical evaluation across three mathematical word-problem benchmarks shows that AdaAnchor with adaptive halting yields accuracy gains of up to 5% over fixed-step latent refinement while reducing average latent refinement steps by 48-60% under the same maximum-step budget. Compared to standard reasoning baselines, AdaAnchor achieves large reductions in generated tokens (92-93%) by moving computation into silent latent refinement, offering a different accuracy-efficiency trade-off with substantially lower output-token usage.
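
The adaptive halting mechanism described in this abstract can be pictured as a refinement loop that stops once the anchors stop moving. The sketch below is an assumption-laden illustration rather than AdaAnchor itself: anchors are a flat list of floats, stability is measured as the L2 norm of the change between iterations, and `step_fn`, `max_steps`, and `tol` are hypothetical names and values.

```python
import math


def refine_until_stable(anchors, step_fn, max_steps=8, tol=1e-3):
    """Apply step_fn repeatedly; halt once the anchors move by less than tol.

    Returns the final anchors and the number of refinement steps used.
    """
    for step in range(1, max_steps + 1):
        updated = step_fn(anchors)
        delta = math.sqrt(sum((u - a) ** 2 for u, a in zip(updated, anchors)))
        anchors = updated
        if delta < tol:
            return anchors, step  # converged early: an "easy" instance
    return anchors, max_steps  # budget exhausted: a "hard" instance
```

With a shared `max_steps` budget, inputs whose anchors settle quickly use few steps, while harder inputs consume the full budget, which is the source of the reduction in average refinement steps the abstract reports.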

Title: Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization

Authors: Jihao Zhao, Shuaishuai Zu, Zhiyuan Ji, Chunlai Zhou, Biao Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15061
Pdf URL: https://arxiv.org/pdf/2603.15061
Copy Paste: [[2603.15061]] Writer-R1: Enhancing Generative Writing in LLMs via Memory-augmented Replay Policy Optimization(https://arxiv.org/abs/2603.15061)
Keywords: llm, agent
Abstract: As a typical open-ended generation task, creative writing lacks verifiable reference answers, which has long constrained reward modeling and automatic evaluation due to high human annotation costs, evaluative bias, and coarse feedback signals. To address these challenges, this paper first designs a multi-agent collaborative workflow based on Grounded Theory, performing dimensional decomposition and hierarchical induction of the problem to dynamically produce interpretable and reusable fine-grained criteria. Furthermore, we propose the Memory-augmented Replay Policy Optimization (MRPO) algorithm: on the one hand, without additional training, MRPO guides models to engage in self-reflection based on dynamic criteria, enabling controlled iterative improvement; on the other hand, we adopt the training paradigm that combines supervised fine-tuning with reinforcement learning to convert evaluation criteria into reward signals, achieving end-to-end optimization. Experimental results demonstrate that the automatically constructed criteria achieve performance gains comparable to human annotations. Writer-R1-4B models trained with this approach outperform baselines across multiple creative writing tasks and surpass some 100B+ parameter open-source models.

Title: Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike

Authors: Miriam Winkler, Verena Blaschke, Barbara Plank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15130
Pdf URL: https://arxiv.org/pdf/2603.15130
Copy Paste: [[2603.15130]] Indirect Question Answering in English, German and Bavarian: A Challenging Task for High- and Low-Resource Languages Alike(https://arxiv.org/abs/2603.15130)
Keywords: gpt
Abstract: Indirectness is a common feature of daily communication, yet is underexplored in NLP research for both low-resource and high-resource languages. Indirect Question Answering (IQA) aims at classifying the polarity of indirect answers. In this paper, we present two multilingual corpora for IQA of varying quality that both cover English, Standard German and Bavarian, a German dialect without standard orthography: InQA+, a small high-quality evaluation dataset with hand-annotated labels, and GenIQA, a larger training dataset that contains artificial data generated by GPT-4o-mini. We find that IQA is a pragmatically hard task that comes with various challenges, based on several experiment variations with multilingual transformer models (mBERT, XLM-R and mDeBERTa). We suggest and employ recommendations to tackle these challenges. Our results reveal low performance, even for English, and severe overfitting. We analyse various factors that influence these results, including label ambiguity, label set and dataset size. We find that IQA performance is poor in both high-resource (English, German) and low-resource (Bavarian) languages and that it is beneficial to have a large amount of training data. Further, GPT-4o-mini does not possess enough pragmatic understanding to generate high-quality IQA data in any of our tested languages.

Title: HindSight: Evaluating Research Idea Generation via Future Impact

Authors: Bo Jiang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.15164
Pdf URL: https://arxiv.org/pdf/2603.15164
Copy Paste: [[2603.15164]] HindSight: Evaluating Research Idea Generation via Future Impact(https://arxiv.org/abs/2603.15164)
Keywords: llm
Abstract: Evaluating AI-generated research ideas typically relies on LLM judges or human panels -- both subjective and disconnected from actual research impact. We introduce HindSight, a time-split evaluation framework that measures idea quality by matching generated ideas against real future publications and scoring them by citation impact and venue acceptance. Using a temporal cutoff T, we restrict an idea generation system to pre-T literature, then evaluate its outputs against papers published in the subsequent 30 months. Experiments across 10 AI/ML research topics reveal a striking disconnect: LLM-as-Judge finds no significant difference between retrieval-augmented and vanilla idea generation (p = 0.584), while HindSight shows the retrieval-augmented system produces 2.5× higher-scoring ideas (p < 0.001). Moreover, HindSight scores are negatively correlated with LLM-judged novelty (ρ = -0.29, p < 0.01), suggesting that LLMs systematically overvalue novel-sounding ideas that never materialize in real research.
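
The time-split protocol in this abstract can be sketched as follows. This is a simplified illustration, not the paper's scoring method: matching ideas to papers by Jaccard token overlap, the `threshold` value, and using the raw citation count of the best match as the score are all assumptions, and venue acceptance is left out entirely. Function and parameter names are hypothetical.

```python
from datetime import date


def add_months(d, months):
    """Shift a date by whole months, clamping the day to 28 so it stays valid."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, min(d.day, 28))


def jaccard(a, b):
    """Token-set overlap between two strings, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0


def hindsight_score(idea, papers, cutoff, window_months=30, threshold=0.3):
    """Citations of the best-matching paper published after the cutoff, else 0.

    papers: iterable of (title, published_date, citation_count) tuples.
    """
    window_end = add_months(cutoff, window_months)
    best = 0
    for title, published, citations in papers:
        in_window = cutoff < published <= window_end
        if in_window and jaccard(idea, title) >= threshold:
            best = max(best, citations)
    return best
```

The key property is that only papers published after the cutoff T can contribute to the score, so an idea earns credit only if something like it later appeared in the literature.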

Title: The Hrunting of AI: Where and How to Improve English Dialectal Fairness

Authors: Wei Li, Adrian de Wynter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15187
Pdf URL: https://arxiv.org/pdf/2603.15187
Copy Paste: [[2603.15187]] The Hrunting of AI: Where and How to Improve English Dialectal Fairness(https://arxiv.org/abs/2603.15187)
Keywords: language model, llm
Abstract: It is known that large language models (LLMs) underperform in English dialects, and that improving them is difficult due to data scarcity. In this work we investigate how quality and availability impact the feasibility of improving LLMs in this context. For this, we evaluate three rarely-studied English dialects (Yorkshire, Geordie, and Cornish), plus African-American Vernacular English, and West Frisian as control. We find that human-human agreement when determining LLM generation quality directly impacts LLM-as-a-judge performance. That is, LLM-human agreement mimics the human-human agreement pattern, and so do metrics such as accuracy. This is an issue because LLM-human agreement measures an LLM's alignment with the human consensus, and hence it raises questions about the feasibility of improving LLM performance in locales where low populations induce low agreement. We also note that fine-tuning does not eradicate, and might amplify, this pattern in English dialects. We also find encouraging signals, such as some LLMs' ability to generate high-quality data, thus enabling scalability. We argue that data must be carefully evaluated to ensure fair and inclusive LLM improvement; and, in the presence of scarcity, new tools are needed to handle the pattern found.

Title: Efficient Document Parsing via Parallel Token Prediction

Authors: Lei Li, Ze Zhao, Meng Li, Zhongwang Lun, Yi Yuan, Xingjing Lu, Zheng Wei, Jiang Bian, Zang Li
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2603.15206
Pdf URL: https://arxiv.org/pdf/2603.15206
Copy Paste: [[2603.15206]] Efficient Document Parsing via Parallel Token Prediction(https://arxiv.org/abs/2603.15206)
Keywords: language model, hallucination
Abstract: Document parsing, as a fundamental yet crucial vision task, is being revolutionized by vision-language models (VLMs). However, the autoregressive (AR) decoding inherent to VLMs creates a significant bottleneck, severely limiting parsing speed. In this paper, we propose Parallel-Token Prediction (PTP), a pluggable, model-agnostic and simple-yet-effective method that enables VLMs to generate multiple future tokens in parallel with improved sample efficiency. Specifically, we insert some learnable tokens into the input sequence and design corresponding training objectives to equip the model with parallel decoding capabilities for document parsing. Furthermore, to support effective training, we develop a comprehensive data generation pipeline that efficiently produces large-scale, high-quality document parsing training data for VLMs. Extensive experiments on OmniDocBench and olmOCR-bench demonstrate that our method not only significantly improves decoding speed (1.6x-2.2x) but also reduces model hallucinations and exhibits strong generalization abilities.

Title: Bidirectional Chinese and English Passive Sentences Dataset for Machine Translation

Authors: Xinyue Ma, Pol Pastells, Mireia Farrús, Mariona Taulé
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2603.15227
Pdf URL: https://arxiv.org/pdf/2603.15227
Copy Paste: [[2603.15227]] Bidirectional Chinese and English Passive Sentences Dataset for Machine Translation(https://arxiv.org/abs/2603.15227)
Keywords: llm
Abstract: Machine Translation (MT) evaluation has gone beyond metrics, towards more specific linguistic phenomena. Regarding English-Chinese language pairs, passive sentences are constructed and distributed differently due to language variation, and thus need special attention in MT. This paper proposes a bidirectional multi-domain dataset of passive sentences, extracted from five Chinese-English parallel corpora and annotated automatically with structure labels according to human translation, and a test set with manually verified annotation. The dataset consists of 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters). We evaluate two state-of-the-art open-source MT systems with our dataset, and four commercial models with the test set. The results show that, unlike humans, models are more influenced by the voice of the source text than by the general voice usage of the source language, and therefore tend to maintain the passive voice when translating a passive in either direction. However, models demonstrate some knowledge of the low frequency and predominantly negative context of Chinese passives, leading to higher voice consistency with human translators in English-to-Chinese translation than in Chinese-to-English translation. Commercial NMT models scored higher in metric evaluations, but LLMs showed a better ability to use diverse alternative translations. The datasets and annotation script will be shared upon request.

Title: Practicing with Language Models Cultivates Human Empathic Communication

Authors: Aakriti Kumar, Nalin Poungpeth, Diyi Yang, Bruce Lambert, Matthew Groh
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2603.15245
Pdf URL: https://arxiv.org/pdf/2603.15245
Copy Paste: [[2603.15245]] Practicing with Language Models Cultivates Human Empathic Communication(https://arxiv.org/abs/2603.15245)
Keywords: language model, llm
Abstract: Empathy is central to human connection, yet people often struggle to express it effectively. In blinded evaluations, large language models (LLMs) generate responses that are often judged more empathic than human-written ones. Yet when a response is attributed to AI, recipients feel less heard and validated than when comparable responses are attributed to a human. To probe and address this gap in empathic communication skill, we built Lend an Ear, an experimental conversation platform in which participants are asked to offer empathic support to an LLM role-playing personal and workplace troubles. From 33,938 messages spanning 2,904 text-based conversations between 968 participants and their LLM conversational partners, we derive a data-driven taxonomy of idiomatic empathic expressions in naturalistic dialogue. Based on a pre-registered randomized experiment, we present evidence that a brief LLM coaching intervention offering personalized feedback on how to effectively communicate empathy significantly boosts alignment of participants' communication patterns with normative empathic communication patterns relative to both a control group and a group that received video-based but non-personalized feedback. Moreover, we find evidence for a silent empathy effect: people feel empathy but systematically fail to express it. Nonetheless, participants reliably identify responses aligned with normative empathic communication criteria as more expressive of empathy. Together, these results advance the scientific understanding of how empathy is expressed and valued and demonstrate a scalable, AI-based intervention for scaffolding and cultivating it.

Title: From Documents to Spans: Code-Centric Learning for LLM-based ICD Coding

Authors: Xu Zhang, Wenxin Ma, Chenxu Wu, Rongsheng Wang, Kun Zhang, S. Kevin Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15270
Pdf URL: https://arxiv.org/pdf/2603.15270
Copy Paste: [[2603.15270]] From Documents to Spans: Code-Centric Learning for LLM-based ICD Coding(https://arxiv.org/abs/2603.15270)
Keywords: llm
Abstract: ICD coding is a critical yet challenging task in healthcare. Recently, LLM-based methods demonstrate stronger generalization than discriminative methods in ICD coding. However, fine-tuning LLMs for ICD coding faces three major challenges. First, existing public ICD coding datasets provide limited coverage of the ICD code space, restricting a model's ability to generalize to unseen codes. Second, naive fine-tuning diminishes the interpretability of LLMs, as few public datasets contain explicit supporting evidence for assigned codes. Third, ICD coding typically involves long clinical documents, making fine-tuning LLMs computationally expensive. To address these issues, we propose Code-Centric Learning, a training framework that shifts supervision from full clinical documents to scalable, short evidence spans. The key idea of this framework is that span-level learning improves LLMs' ability to perform document-level ICD coding. Our proposed framework consists of a mixed training strategy and code-centric data expansion, which substantially reduces training cost, improves accuracy on unseen ICD codes and preserves interpretability. Under the same LLM backbone, our method substantially outperforms strong baselines. Notably, our method enables small-scale LLMs to achieve performance comparable to much larger proprietary models, demonstrating its effectiveness and potential for fully automated ICD coding.
摘要：ICD 编码是医疗保健领域一项关键但具有挑战性的任务。最近，基于 LLM 的方法在 ICD 编码中表现出比判别方法更强的泛化能力。然而，针对 ICD 编码的法学硕士的微调面临着三个主要挑战。首先，现有的公共 ICD 编码数据集提供的 ICD 代码空间覆盖范围有限，限制了模型泛化到未见过的代码的能力。其次，天真的微调降低了法学硕士的可解释性，因为很少有公共数据集包含指定代码的明确支持证据。第三，ICD 编码通常涉及很长的临床文档，使得 LLM 的微调计算成本高昂。为了解决这些问题，我们提出了以代码为中心的学习，这是一种培训框架，可将监督从完整的临床文档转变为可扩展的短证据跨度。该框架的关键思想是跨级学习提高了法学硕士执行文档级 ICD 编码的能力。我们提出的框架由混合训练策略和以代码为中心的数据扩展组成，这大大降低了训练成本，提高了未见过的 ICD 代码的准确性并保留了可解释性。在相同的法学硕士骨干下，我们的方法大大优于强大的基线。值得注意的是，我们的方法使小型法学硕士能够实现与更大的专有模型相当的性能，证明了其全自动 ICD 编码的有效性和潜力。

Title: Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies

Authors: Giuseppe Samo, Paola Merlo
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2603.15295
Pdf URL: https://arxiv.org/pdf/2603.15295
Copy Paste: [[2603.15295]] Datasets for Verb Alternations across Languages: BLM Templates and Data Augmentation Strategies(https://arxiv.org/abs/2603.15295)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown remarkable performance across various sentence-based linguistic phenomena, yet their ability to capture cross-sentence paradigmatic patterns, such as verb alternations, remains underexplored. In this work, we present curated paradigm-based datasets for four languages, designed to probe systematic cross-sentence knowledge of verb alternations (change-of-state and object-drop constructions in English, German and Italian, and Hebrew binyanim). The datasets comprise thousands of Blackbird Language Matrices (BLM) problems. The BLM task -- an RPM/ARC-like task devised specifically for language -- is a controlled linguistic puzzle where models must select the sentence that completes a pattern according to syntactic and semantic rules. We introduce three types of templates varying in complexity and apply linguistically-informed data augmentation strategies across synthetic and natural data. We provide simple baseline performance results across English, Italian, German, and Hebrew that demonstrate the diagnostic usefulness of the datasets.
摘要：大型语言模型（LLM）在各种基于句子的语言现象中表现出了卓越的性能，但它们捕获跨句子范式模式（例如动词交替）的能力仍未得到充分探索。在这项工作中，我们为四种语言提供了基于范式的数据集，旨在探索动词交替的系统跨句子知识（英语、德语、意大利语以及希伯来语 binyanim 的状态变化和对象删除结构）。该数据集包含数千个 Blackbird 语言矩阵 (BLM) 问题。 BLM 任务（一种专门为语言设计的类似于 RPM/ARC 的任务）是一个受控语言难题，其中模型必须根据句法和语义规则选择完成模式的句子。我们引入了三种复杂程度不同的模板，并在合成数据和自然数据中应用基于语言的数据增强策略。我们提供了英语、意大利语、德语和希伯来语的简单基线性能结果，证明了数据集的诊断有用性。

Title: CCTU: A Benchmark for Tool Use under Complex Constraints

Authors: Junjie Ye, Guoqiang Zhang, Wenjie Fu, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15309
Pdf URL: https://arxiv.org/pdf/2603.15309
Copy Paste: [[2603.15309]] CCTU: A Benchmark for Tool Use under Complex Constraints(https://arxiv.org/abs/2603.15309)
Keywords: language model, llm, prompt, agent
Abstract: Solving problems through tool use under explicit constraints constitutes a highly challenging yet unavoidable scenario for large language models (LLMs), requiring capabilities such as function calling, instruction following, and self-refinement. However, progress has been hindered by the absence of dedicated evaluations. To address this, we introduce CCTU, a benchmark for evaluating LLM tool use under complex constraints. CCTU is grounded in a taxonomy of 12 constraint categories spanning four dimensions (i.e., resource, behavior, toolset, and response). The benchmark comprises 200 carefully curated and challenging test cases across diverse tool-use scenarios, each involving an average of seven constraint types and an average prompt length exceeding 4,700 tokens. To enable reliable evaluation, we develop an executable constraint validation module that performs step-level validation and enforces compliance during multi-turn interactions between models and their environments. We evaluate nine state-of-the-art LLMs in both thinking and non-thinking modes. Results indicate that when strict adherence to all constraints is required, no model achieves a task completion rate above 20%. Further analysis reveals that models violate constraints in over 50% of cases, particularly in the resource and response dimensions. Moreover, LLMs demonstrate limited capacity for self-refinement even after receiving detailed feedback on constraint violations, highlighting a critical bottleneck in the development of robust tool-use agents. To facilitate future research, we release the data and code.
摘要：对于大型语言模型（LLM）来说，在显式约束下使用工具来解决问题是一个极具挑战性但又不可避免的场景，需要函数调用、指令跟踪和自我改进等能力。然而，由于缺乏专门的评价，进展受到阻碍。为了解决这个问题，我们引入了 CCTU，这是一个在复杂约束下评估 LLM 工具使用的基准。 CCTU 基于跨越四个维度（即资源、行为、工具集和响应）的 12 个约束类别的分类法。该基准测试包含 200 个精心策划的、具有挑战性的测试用例，涵盖不同的工具使用场景，每个测试用例平均涉及七种约束类型，平均提示长度超过 4,700 个标记。为了实现可靠的评估，我们开发了一个可执行的约束验证模块，该模块执行步骤级验证并在模型及其环境之间的多轮交互期间强制遵守规定。我们以思维和非思维模式评估了九个最先进的法学硕士。结果表明，当需要严格遵守所有约束时，没有模型能够实现超过 20% 的任务完成率。进一步的分析表明，模型在超过 50% 的情况下违反了约束，特别是在资源和响应维度上。此外，即使在收到有关违反约束的详细反馈后，法学硕士也表现出自我完善的能力有限，这凸显了开发强大的工具使用代理的关键瓶颈。为了方便将来的研究，我们发布了数据和代码。

Title: DOS: Dependency-Oriented Sampler for Masked Diffusion Language Models

Authors: Xueyu Zhou, Yangrong Hu, Jian Huang
Subjects: cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2603.15340
Pdf URL: https://arxiv.org/pdf/2603.15340
Copy Paste: [[2603.15340]] DOS: Dependency-Oriented Sampler for Masked Diffusion Language Models(https://arxiv.org/abs/2603.15340)
Keywords: language model
Abstract: Masked diffusion language models (MDLMs) have recently emerged as a new paradigm in language modeling, offering flexible generation dynamics and enabling efficient parallel decoding. However, existing decoding strategies for pre-trained MDLMs predominantly rely on token-level uncertainty criteria, while largely overlooking sequence-level information and inter-token dependencies. To address this limitation, we propose Dependency-Oriented Sampler (DOS), a training-free decoding strategy that leverages inter-token dependencies to inform token updates during generation. Specifically, DOS exploits attention matrices from transformer blocks to approximate inter-token dependencies, emphasizing information from unmasked tokens when updating masked positions. Empirical results demonstrate that DOS consistently achieves superior performance on both code generation and mathematical reasoning tasks. Moreover, DOS can be seamlessly integrated with existing parallel sampling methods, leading to improved generation efficiency without sacrificing generation quality.
摘要：掩码扩散语言模型 (MDLM) 最近已成为语言建模的新范例，提供灵活的生成动态并实现高效的并行解码。然而，现有的预训练 MDLM 解码策略主要依赖于 token 级不确定性标准，而很大程度上忽略了序列级信息和 token 间依赖性。为了解决这个限制，我们提出了面向依赖的采样器（DOS），这是一种免训练的解码策略，它利用令牌间依赖关系来通知生成过程中的令牌更新。具体来说，DOS 利用转换器块中的注意力矩阵来近似令牌间依赖关系，在更新屏蔽位置时强调来自未屏蔽令牌的信息。实证结果表明，DOS 在代码生成和数学推理任务上始终取得优异的性能。此外，DOS可以与现有的并行采样方法无缝集成，从而在不牺牲生成质量的情况下提高生成效率。

Title: When Does Sparsity Mitigate the Curse of Depth in LLMs

Authors: Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, Shiwei Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15389
Pdf URL: https://arxiv.org/pdf/2603.15389
Copy Paste: [[2603.15389]] When Does Sparsity Mitigate the Curse of Depth in LLMs(https://arxiv.org/abs/2603.15389)
Keywords: language model, llm, long context
Abstract: Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at this https URL.
摘要：最近的工作证明了大型语言模型（LLM）中的深度诅咒，其中后面的层对学习和表示的贡献比前面的层少。这种利用率不足与预层归一化中方差的累积增长有关，这可以将深层区块推向接近同一的行为。在本文中，我们证明，稀疏性除了提高效率之外，还可以充当方差传播的调节器，从而提高深度利用率。我们的研究涵盖了稀疏性的两个来源：（i）隐式稀疏性，它是由训练和数据条件产生的，包括权重衰减引起的权重稀疏性和长上下文输入引起的注意力稀疏性； (ii) 显式稀疏性，这是通过架构设计强制执行的，包括分组查询注意力中的键/值共享稀疏性和 Mixtureof-Experts 中的专家激活稀疏性。我们的主张得到了受控深度扩展实验和有针对性的层有效性干预的充分支持。在各个设置中，我们观察到一致的关系：稀疏性通过减少输出方差和促进功能分化来提高层利用率。我们最终将我们的研究结果提炼成一个实用的经验法则，用于训练深度有效的法学硕士，使下游任务的准确性显着提高了 4.6%。我们的结果揭示了标准设计选择自然产生的稀疏性，是法学硕士有效深度缩放的关键但之前被忽视的机制。代码可从此 https URL 获取。

Title: A Closer Look into LLMs for Table Understanding

Authors: Jia Wang, Chuanyu Qin, Mingyu Zheng, Qingyi Si, Peize Li, Zheng Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15402
Pdf URL: https://arxiv.org/pdf/2603.15402
Copy Paste: [[2603.15402]] A Closer Look into LLMs for Table Understanding(https://arxiv.org/abs/2603.15402)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Despite the success of Large Language Models (LLMs) in table understanding, their internal mechanisms remain unclear. In this paper, we conduct an empirical study on 16 LLMs, covering general LLMs, specialist tabular LLMs, and Mixture-of-Experts (MoE) models, to explore how LLMs understand tabular data and perform downstream tasks. Our analysis focuses on 4 dimensions: the attention dynamics, the effective layer depth, the expert activation, and the impacts of input designs. Key findings include: (1) LLMs follow a three-phase attention pattern -- early layers scan the table broadly, middle layers localize relevant cells, and late layers amplify their contributions; (2) tabular tasks require deeper layers than math reasoning to reach stable predictions; (3) MoE models activate table-specific experts in middle layers, with early and late layers sharing general-purpose experts; (4) Chain-of-Thought prompting increases table attention, further enhanced by table-tuning. We hope these findings and insights can facilitate interpretability and future research on table-related tasks.
摘要：尽管大型语言模型（LLM）在表理解方面取得了成功，但其内部机制仍不清楚。在本文中，我们对16个法学硕士进行了实证研究，涵盖普通法学硕士、专业表格法学硕士和专家混合（MoE）模型，以探讨法学硕士如何理解表格数据并执行下游任务。我们的分析集中在 4 个维度，包括注意力动态、有效层深度、专家激活和输入设计的影响。主要发现包括：（1）LLM遵循三阶段注意力模式——早期层广泛扫描表格，中间层定位相关单元格，后期层放大它们的贡献；（2）表格任务需要比数学推理更深的层次才能达到稳定的预测；（3）MoE模型在中间层激活表专用专家，早期层和后期层共享通用专家； (4) 思想链提示可增加桌面注意力，并通过桌面调整进一步增强。我们希望这些发现和见解能够促进表相关任务的可解释性和未来研究。

Title: Fusian: Multi-LoRA Fusion for Fine-Grained Continuous MBTI Personality Control in Large Language Models

Authors: Zehao Chen, Rong Pan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15405
Pdf URL: https://arxiv.org/pdf/2603.15405
Copy Paste: [[2603.15405]] Fusian: Multi-LoRA Fusion for Fine-Grained Continuous MBTI Personality Control in Large Language Models(https://arxiv.org/abs/2603.15405)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities in simulating diverse human behaviors and personalities. However, existing methods for personality control, which include prompt engineering and standard Supervised Fine-Tuning (SFT), typically treat personality traits as discrete categories (e.g., "Extroverted" vs. "Introverted"), lacking the ability to precisely control the intensity of a trait on a continuous spectrum. In this paper, we introduce Fusian, a novel framework for fine-grained, continuous personality control in LLMs. Fusian operates in two stages: (1) Trajectory Collection, where we capture the dynamic evolution of personality adoption during SFT by saving a sequence of LoRA adapters, effectively mapping the continuous manifold of a trait; and (2) RL-based Dynamic Fusion, where we train a policy network using Reinforcement Learning to dynamically compute mixing weights for these frozen adapters. By sampling from a Dirichlet distribution parameterized by the policy network, Fusian fuses multiple adapters to align the model's output with a specific numerical target intensity. Experiments on the Qwen3-14B model demonstrate that Fusian achieves high precision in personality control, significantly outperforming baseline methods in aligning with user-specified trait intensities.
摘要：大型语言模型 (LLM) 在模拟不同人类行为和性格方面表现出了令人印象深刻的能力。然而，现有的人格控制方法，包括即时工程和标准监督微调（SFT），通常将人格特征视为离散类别（例如“外向”与“内向”），缺乏在连续谱上精确控制特征强度的能力。在本文中，我们介绍了 Fusian，这是一种用于法学硕士中细粒度、连续个性控制的新型框架。 Fusian 分两个阶段运行：(1) 轨迹收集，我们通过保存一系列 LoRA 适配器来捕获 SFT 期间个性采用的动态演变，从而有效地映射特征的连续流形； (2) 基于强化学习的动态融合，我们使用强化学习训练策略网络来动态计算这些冻结适配器的混合权重。通过从策略网络参数化的狄利克雷分布中进行采样，Fusian 融合了多个适配器，以使模型的输出与特定的数值目标强度保持一致。 Qwen3-14B 模型上的实验表明，Fusian 在个性控制方面实现了高精度，在与用户指定的特质强度保持一致方面显着优于基线方法。
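
The fusion step described in the Fusian abstract (sample mixing weights from a Dirichlet distribution, then combine frozen adapters) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the concentration values, the toy two-parameter "adapters", and the choice to fuse by a weighted sum in parameter space are hypothetical, and the abstract does not say how the authors implement fusion.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) sample by normalizing Gamma(alpha, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def fuse_adapters(adapters, weights):
    """Weighted sum of adapter parameter vectors (each a flat list of floats)."""
    size = len(adapters[0])
    return [sum(w * a[i] for w, a in zip(weights, adapters)) for i in range(size)]

rng = random.Random(0)
# Hypothetical policy output: one concentration parameter per saved adapter.
alphas = [2.0, 5.0, 1.0]
# Toy stand-ins for LoRA checkpoints saved along the SFT trajectory.
adapters = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]

weights = sample_dirichlet(alphas, rng)   # non-negative, sums to 1
fused = fuse_adapters(adapters, weights)  # convex combination of the adapters
```

Because Dirichlet samples lie on the probability simplex, the fused parameters stay inside the convex hull of the saved adapters, which is one plausible reason to use this distribution for interpolating between trait intensities.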

Title: SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia

Authors: Pengfei Yue, Xingran Zhao, Juntao Chen, Peng Hou, Wang Longchao, Jianghang Lin, Shengchuan Zhang, Anxiang Zeng, Liujuan Cao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15409
Pdf URL: https://arxiv.org/pdf/2603.15409
Copy Paste: [[2603.15409]] SEA-Vision: A Multilingual Benchmark for Comprehensive Document and Scene Text Understanding in Southeast Asia(https://arxiv.org/abs/2603.15409)
Keywords: llm
Abstract: Multilingual document and scene text understanding plays an important role in applications such as search, finance, and public services. However, most existing benchmarks focus on high-resource languages and fail to evaluate models in realistic multilingual environments. In Southeast Asia, the diversity of languages, complex writing systems, and highly varied document types make this challenge even greater. We introduce SEA-Vision, a benchmark that jointly evaluates Document Parsing and Text-Centric Visual Question Answering (TEC-VQA) across 11 Southeast Asian languages. SEA-Vision contains 15,234 document parsing pages from nine representative document types, annotated with hierarchical page-, block-, and line-level labels. It also provides 7,496 TEC-VQA question-answer pairs that probe text recognition, numerical calculation, comparative analysis, logical reasoning, and spatial understanding. To make such multilingual, multi-task annotation feasible, we design a hybrid pipeline for Document Parsing and TEC-VQA. It combines automated filtering and scoring with MLLM-assisted labeling and lightweight native-speaker verification, greatly reducing manual labeling while maintaining high quality. We evaluate several leading multimodal models and observe pronounced performance degradation on low-resource Southeast Asian languages, highlighting substantial remaining gaps in multilingual document and scene text understanding. We believe SEA-Vision will help drive global progress in document and scene text understanding.
摘要：多语言文档和场景文本理解在搜索、金融和公共服务等应用中发挥着重要作用。然而，大多数现有的基准测试都集中在高资源语言上，无法在现实的多语言环境中评估模型。在东南亚，语言的多样性、复杂的书写系统和高度多样化的文档类型使这一挑战变得更加严峻。我们推出 SEA-Vision，这是一个联合评估 11 种东南亚语言的文档解析和以文本为中心的视觉问答 (TEC-VQA) 的基准。 SEA-Vision 包含来自九种代表性文档类型的 15,234 个文档解析页面，并用分层页面、块和行级标签进行注释。它还提供 7,496 个 TEC-VQA 问答对，用于探索文本识别、数值计算、比较分析、逻辑推理和空间理解。为了使这种多语言、多任务注释可行，我们设计了一个用于文档解析和 TEC-VQA 的混合管道。它将自动过滤和评分与 MLLM 辅助标记和轻量级母语验证相结合，大大减少了手动标记，同时保持了高质量。我们评估了几种领先的多模态模型，并观察到资源匮乏的东南亚语言的性能明显下降，突显了多语言文档和场景文本理解方面仍存在巨大差距。我们相信 SEA-Vision 将有助于推动文档和场景文本理解的全球进步。

Title: CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents

Authors: Taeyun Roh, Wonjune Jang, Junha Jung, Jaewoo Kang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15421
Pdf URL: https://arxiv.org/pdf/2603.15421
Copy Paste: [[2603.15421]] CLAG: Adaptive Memory Organization via Agent-Driven Clustering for Small Language Model Agents(https://arxiv.org/abs/2603.15421)
Keywords: language model, agent
Abstract: Large language model agents heavily rely on external memory to support knowledge reuse and complex reasoning tasks. Yet most memory systems store experiences in a single global retrieval pool which can gradually dilute or corrupt stored knowledge. This problem is especially pronounced for small language models (SLMs), which are highly vulnerable to irrelevant context. We introduce CLAG, a CLustering-based AGentic memory framework where an SLM agent actively organizes memory by clustering. CLAG employs an SLM-driven router to assign incoming memories to semantically coherent clusters and autonomously generates cluster-specific profiles, including topic summaries and descriptive tags, to establish each cluster as a self-contained functional unit. By performing localized evolution within these structured neighborhoods, CLAG effectively reduces cross-topic interference and enhances internal memory density. During retrieval, the framework utilizes a two-stage process that first filters relevant clusters via their profiles, thereby excluding distractors and reducing the search space. Experiments on multiple QA datasets with three SLM backbones show that CLAG consistently improves answer quality and robustness over prior memory systems for agents, remaining lightweight and efficient.
摘要：大型语言模型代理严重依赖外部存储器来支持知识重用和复杂的推理任务。然而，大多数记忆系统将经验存储在单个全局检索池中，这可能会逐渐稀释或破坏存储的知识。这个问题对于小语言模型（SLM）来说尤其明显，它们很容易受到不相关上下文的影响。我们引入了 CLAG，这是一种基于集群的 AGentic 内存框架，其中 SLM 代理通过集群主动组织内存。 CLAG 采用 SLM 驱动的路由器将传入内存分配给语义一致的集群，并自动生成特定于集群的配置文件，包括主题摘要和描述性标签，以将每个集群建立为独立的功能单元。通过在这些结构化邻域内执行局部进化，CLAG 有效地减少了跨主题干扰并增强了内部存储密度。在检索过程中，该框架采用两阶段过程，首先通过其配置文件过滤相关集群，从而排除干扰因素并减少搜索空间。对具有三个 SLM 主干的多个 QA 数据集进行的实验表明，与之前的代理记忆系统相比，CLAG 持续提高了答案质量和鲁棒性，同时保持了轻量级和高效性。
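
The two-stage retrieval described in the CLAG abstract (filter clusters by their profiles first, then search only inside the selected clusters) can be sketched roughly as follows. This is an illustrative toy under stated assumptions, not the authors' system: the token-overlap similarity, the dictionary layout, and the function names are hypothetical, and CLAG itself uses an SLM-driven router and generated profiles.

```python
def overlap(a, b):
    """Jaccard overlap between the lowercase token sets of two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def two_stage_retrieve(query, clusters, top_clusters=1, top_memories=2):
    # Stage 1: rank clusters by how well their profile matches the query.
    ranked = sorted(clusters, key=lambda c: overlap(query, c["profile"]), reverse=True)
    # Stage 2: search only inside the selected clusters, ignoring the rest.
    pool = [m for c in ranked[:top_clusters] for m in c["memories"]]
    return sorted(pool, key=lambda m: overlap(query, m), reverse=True)[:top_memories]

# Hypothetical memory store with two topic clusters.
clusters = [
    {"profile": "cooking recipes pasta sauce",
     "memories": ["user likes spicy pasta sauce", "user owns a pasta maker"]},
    {"profile": "travel flights hotel booking",
     "memories": ["user prefers aisle seats on flights", "user booked a hotel in Oslo"]},
]

result = two_stage_retrieve("which pasta sauce does the user like", clusters)
print(result[0])  # user likes spicy pasta sauce
```

Memories in the travel cluster never enter the candidate pool, which mirrors the abstract's point about excluding distractors before the fine-grained search.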

Title: Invisible failures in human-AI interactions

Authors: Christopher Potts, Moritz Sudhof
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15423
Pdf URL: https://arxiv.org/pdf/2603.15423
Copy Paste: [[2603.15423]] Invisible failures in human-AI interactions(https://arxiv.org/abs/2603.15423)
Keywords: chat
Abstract: AI systems fail silently far more often than they fail visibly. In a large-scale quantitative analysis of human-AI interactions from the WildChat dataset, we find that 78% of AI failures are invisible: something went wrong but the user gave no overt indication that there was a problem. These invisible failures cluster into eight archetypes that help us characterize where and how AI systems are failing to meet users' needs. In addition, the archetypes show systematic co-occurrence patterns indicating higher-level failure types. To address the question of whether these archetypes will remain relevant as AI systems become more capable, we also assess failures for whether they are primarily interactional or capability-driven, finding that 91% involve interactional dynamics, and we estimate that 94% of such failures would persist even with a more capable model. Finally, we illustrate how the archetypes help us to identify systematic and variable AI limitations across different usage domains. Overall, we argue that our invisible failure taxonomy can be a key component in reliable failure monitoring for product developers, scientists, and policy makers. Our code and data are available at this https URL
摘要：人工智能系统默默地失败的次数远远多于明显失败的次数。在对 WildChat 数据集的人类与人工智能交互的大规模定量分析中，我们发现 78% 的人工智能故障是不可见的：出了问题，但用户没有明确表明存在问题。这些看不见的故障聚集成八种原型，帮助我们描述人工智能系统在哪里以及如何无法满足用户的需求。此外，原型显示出系统的共现模式，表明更高级别的故障类型。为了解决随着人工智能系统变得更加强大，这些原型是否仍然具有相关性的问题，我们还评估了失败主要是交互性的还是能力驱动的，发现 91% 涉及交互动态，并且我们估计，即使使用更强大的模型，94% 的此类失败也会持续存在。最后，我们说明原型如何帮助我们识别不同使用领域的系统性和可变的人工智能限制。总的来说，我们认为我们的隐形故障分类可以成为产品开发人员、科学家和政策制定者可靠故障监控的关键组成部分。我们的代码和数据可在此 https URL 获取

Title: ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models

Authors: Duy Vu Minh Nguyen, Chinh Thanh Truong, Phuc Hoang Tran, Hung Tuan Le, Nguyen Van-Thanh Dat, Trung Hieu Pham, Kiet Van Nguyen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15513
Pdf URL: https://arxiv.org/pdf/2603.15513
Copy Paste: [[2603.15513]] ViX-Ray: A Vietnamese Chest X-Ray Dataset for Vision-Language Models(https://arxiv.org/abs/2603.15513)
Keywords: language model, gpt, hallucination
Abstract: Vietnamese medical research has become an increasingly vital domain, particularly with the rise of intelligent technologies aimed at reducing time and resource burdens in clinical diagnosis. Recent advances in vision-language models (VLMs), such as Gemini and GPT-4V, have sparked a growing interest in applying AI to healthcare. However, most existing VLMs lack exposure to Vietnamese medical data, limiting their ability to generate accurate and contextually appropriate diagnostic outputs for Vietnamese patients. To address this challenge, we introduce ViX-Ray, a novel dataset comprising 5,400 Vietnamese chest X-ray images annotated with expert-written findings and impressions from physicians at a major Vietnamese hospital. We analyze linguistic patterns within the dataset, including the frequency of mentioned body parts and diagnoses, to identify domain-specific linguistic characteristics of Vietnamese radiology reports. Furthermore, we fine-tune five state-of-the-art open-source VLMs on ViX-Ray and compare their performance to leading proprietary models, GPT-4V and Gemini. Our results show that while several models generate outputs partially aligned with clinical ground truths, they often suffer from low precision and excessive hallucination, especially in impression generation. These findings not only demonstrate the complexity and challenge of our dataset but also establish ViX-Ray as a valuable benchmark for evaluating and advancing vision-language models in the Vietnamese clinical domain.
摘要：越南医学研究已成为一个日益重要的领域，特别是随着旨在减少临床诊断时间和资源负担的智能技术的兴起。 Gemini 和 GPT-4V 等视觉语言模型 (VLM) 的最新进展激发了人们对将人工智能应用于医疗保健的兴趣。然而，大多数现有的 VLM 缺乏越南医疗数据的接触，限制了它们为越南患者生成准确且适合上下文的诊断输出的能力。为了应对这一挑战，我们引入了 ViX-Ray，这是一个新颖的数据集，包含 5,400 张越南胸部 X 射线图像，并附有专家撰写的研究结果和越南一家主要医院医生的印象。我们分析数据集中的语言模式，包括提到的身体部位和诊断的频率，以识别越南放射学报告的特定领域语言特征。此外，我们在 ViX-Ray 上微调了五个最先进的开源 VLM，并将其性能与领先的专有模型 GPT-4V 和 Gemini 进行比较。我们的结果表明，虽然一些模型生成的输出部分与临床基本事实一致，但它们经常遭受精度低和过度幻觉的困扰，特别是在印象生成方面。这些发现不仅证明了我们数据集的复杂性和挑战，而且还将 ViX-Ray 确立为评估和推进越南临床领域视觉语言模型的宝贵基准。

Title: Beyond the Covariance Trap: Unlocking Generalization in Same-Subject Knowledge Editing for Large Language Models

Authors: Xiyu Liu, Qingyi Si, Zhengxiao Liu, Chenxu Yang, Naibin Gu, Zheng Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15518
Pdf URL: https://arxiv.org/pdf/2603.15518
Copy Paste: [[2603.15518]] Beyond the Covariance Trap: Unlocking Generalization in Same-Subject Knowledge Editing for Large Language Models(https://arxiv.org/abs/2603.15518)
Keywords: language model, llm, prompt, agent
Abstract: While locate-then-edit knowledge editing efficiently updates knowledge encoded within Large Language Models (LLMs), a critical generalization failure mode emerges in the practical same-subject knowledge editing scenario: models fail to recall the updated knowledge when following user instructions, despite successfully recalling it in the original edited form. This paper identifies the geometric root of this generalization collapse as a fundamental conflict where the inner activation drifts induced by prompt variations exceed the model's geometric tolerance for generalization after editing. We attribute this instability to a dual pathology: (1) The joint optimization with orthogonal gradients collapses solutions into sharp minima with narrow stability, and (2) the standard covariance constraint paradoxically acts as a Covariance Trap that amplifies input perturbations. To resolve this, we introduce RoSE (Robust Same-subject Editing), which employs Isotropic Geometric Alignment to minimize representational deviation and Hierarchical Knowledge Integration to smooth the optimization landscape. Extensive experiments demonstrate that RoSE significantly improves instruction-following capabilities, laying the foundation for robust interactive parametric memory of LLM agents.
摘要：虽然“定位然后编辑”知识编辑有效地更新了大型语言模型（LLM）中编码的知识，但在实际的同一主题知识编辑场景中出现了一种关键的泛化失败模式：模型在遵循用户指令时无法回忆起更新的知识，尽管成功地以原始编辑的形式回忆了它。本文将这种泛化崩溃的几何根源确定为一种基本冲突，其中由即时变化引起的内部激活漂移超出了编辑后模型泛化的几何容差。我们将这种不稳定性归因于双重病理学：（1）正交梯度的联合优化将解折叠成稳定性较窄的尖锐最小值，（2）标准协方差约束矛盾地充当了放大输入扰动的协方差陷阱。为了解决这个问题，我们引入了 RoSE（鲁棒同主题编辑），它采用各向同性几何对齐来最小化表征偏差，并采用分层知识集成来平滑优化景观。大量实验表明，RoSE 显着提高了指令跟踪能力，为 LLM 代理的稳健交互式参数记忆奠定了基础。

Title: SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction

Authors: David Števaňák, Marek Šuppa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15523
Pdf URL: https://arxiv.org/pdf/2603.15523
Copy Paste: [[2603.15523]] SlovKE: A Large-Scale Dataset and LLM Evaluation for Slovak Keyphrase Extraction(https://arxiv.org/abs/2603.15523)
Keywords: gpt, llm
Abstract: Keyphrase extraction for morphologically rich, low-resource languages remains understudied, largely due to the scarcity of suitable evaluation datasets. We address this gap for Slovak by constructing a dataset of 227,432 scientific abstracts with author-assigned keyphrases -- scraped and systematically cleaned from the Slovak Central Register of Theses -- representing a 25-fold increase over the largest prior Slovak resource and approaching the scale of established English benchmarks such as KP20K. Using this dataset, we benchmark three unsupervised baselines (YAKE, TextRank, KeyBERT with SlovakBERT embeddings) and evaluate KeyLLM, an LLM-based extraction method using GPT-3.5-turbo. Unsupervised baselines achieve at most 11.6% exact-match $F1@6$, with a large gap to partial matching (up to 51.5%), reflecting the difficulty of matching inflected surface forms to author-assigned keyphrases. KeyLLM narrows this exact-partial gap, producing keyphrases closer to the canonical forms assigned by authors, while manual evaluation on 100 documents ($\kappa = 0.61$) confirms that KeyLLM captures relevant concepts that automated exact matching underestimates. Our analysis identifies morphological mismatch as the dominant failure mode for statistical methods -- a finding relevant to other inflected languages. The dataset (this https URL) and evaluation code (this https URL) are publicly available.
摘要：形态丰富、资源匮乏的语言的关键短语提取仍然没有得到充分研究，很大程度上是由于缺乏合适的评估数据集。我们通过构建一个由 227,432 条科学摘要组成的数据集来解决斯洛伐克语的这一差距，该数据集包含作者指定的关键词（从斯洛伐克中央论文登记簿中刮取并系统地清理），比斯洛伐克语之前最大的资源增加了 25 倍，并接近 KP20K 等既定英语基准的规模。使用此数据集，我们对三个无监督基线（YAKE、TextRank、带有 SlovakBERT 嵌入的 KeyBERT）进行基准测试，并评估 KeyLLM，这是一种使用 GPT-3.5-turbo 的基于 LLM 的提取方法。无监督基线最多实现 11.6% 的精确匹配 $F1@6$，与部分匹配有很大差距（高达 51.5%），反映了将变形表面形式与作者指定的关键词匹配的难度。 KeyLLM 缩小了这种精确-部分差距，生成的关键短语更接近作者指定的规范形式，而对 100 个文档 ($\kappa = 0.61$) 的手动评估证实 KeyLLM 捕获了自动精确匹配低估的相关概念。我们的分析将形态不匹配确定为统计方法的主要失败模式——这一发现与其他变形语言相关。数据集（此 https URL）和评估代码（此 https URL）是公开可用的。
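
The gap between exact and partial matching reported in the SlovKE abstract can be made concrete with a small sketch of exact-match F1@k. The definitions below are assumptions for illustration: the abstract does not specify the normalization or the partial-matching rule, so the shared-token criterion in `partial_hits` is hypothetical, and English phrases stand in for Slovak ones.

```python
def f1_at_k(predicted, gold, k=6):
    """Exact-match F1@k: compare the top-k predictions with the gold keyphrases."""
    top_k = [p.strip().lower() for p in predicted[:k]]
    gold_set = {g.strip().lower() for g in gold}
    if not top_k or not gold_set:
        return 0.0
    hits = len(set(top_k) & gold_set)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)

def partial_hits(predicted, gold, k=6):
    """Count top-k predictions sharing at least one token with any gold keyphrase."""
    gold_tokens = {t for g in gold for t in g.lower().split()}
    return sum(1 for p in predicted[:k] if set(p.lower().split()) & gold_tokens)

gold = ["neural network", "machine learning"]
# "neural networks" differs from the gold phrase only by inflection.
predicted = ["neural networks", "machine learning", "deep learning"]

print(round(f1_at_k(predicted, gold), 3))  # 0.4
print(partial_hits(predicted, gold))       # 3
```

Only one of the three predictions counts under exact matching, while all three count under the lenient rule. In a heavily inflected language this kind of surface-form mismatch is far more frequent, which is consistent with the failure mode the abstract highlights.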

Title: Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation

Authors: Yanick Zengaffinen, Andreas Opedal, Donya Rooein, Kv Aditya Srivatsa, Shashank Sonkar, Mrinmaya Sachan
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2603.15547
Pdf URL: https://arxiv.org/pdf/2603.15547
Copy Paste: [[2603.15547]] Can LLMs Model Incorrect Student Reasoning? A Case Study on Distractor Generation(https://arxiv.org/abs/2603.15547)
Keywords: language model, llm, prompt
Abstract: Modeling plausible student misconceptions is critical for AI in education. In this work, we examine how large language models (LLMs) reason about misconceptions when generating multiple-choice distractors, a task that requires modeling incorrect yet plausible answers by coordinating solution knowledge, simulating student misconceptions, and evaluating plausibility. We introduce a taxonomy for analyzing the strategies used by state-of-the-art LLMs, examining their reasoning procedures and comparing them to established best practices in the learning sciences. Our structured analysis reveals a surprising alignment between their processes and best practices: the models typically solve the problem correctly first, then articulate and simulate multiple potential misconceptions, and finally select a set of distractors. An analysis of failure modes reveals that errors arise primarily from failures in recovering the correct solution and selecting among response candidates, rather than simulating errors or structuring the process. Consistent with these results, we find that providing the correct solution in the prompt improves alignment with human-authored distractors by 8%, highlighting the critical role of anchoring to the correct solution when generating plausible incorrect student reasoning. Overall, our analysis offers a structured and interpretable lens into LLMs' ability to model incorrect student reasoning and produce high-quality distractors.
摘要：对学生合理的误解进行建模对于人工智能在教育领域的应用至关重要。在这项工作中，我们研究了大型语言模型（LLM）在生成多项选择干扰项时如何推理误解，这项任务需要通过协调解决方案知识、模拟学生的误解和评估合理性来建模不正确但合理的答案。我们引入了一种分类法，用于分析最先进的法学硕士所使用的策略，检查他们的推理程序并将其与学习科学中既定的最佳实践进行比较。我们的结构化分析揭示了他们的流程和最佳实践之间惊人的一致性：模型通常首先正确地解决问题，然后阐明和模拟多个潜在的误解，最后选择一组干扰因素。对故障模式的分析表明，错误主要是由于未能恢复正确的解决方案并在候选响应中进行选择而产生的，而不是模拟错误或构建过程。与这些结果一致，我们发现在提示中提供正确的解决方案可以将与人类编写的干扰因素的一致性提高 8%，这凸显了在生成看似合理的错误学生推理时锚定到正确解决方案的关键作用。总的来说，我们的分析为法学硕士建模错误学生推理和产生高质量干扰因素的能力提供了一个结构化和可解释的视角。

Title: Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning

Authors: Aozhe Wang, Yuchen Yan, Nan Zhou, Zhengxi Lu, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.15611
Pdf URL: https://arxiv.org/pdf/2603.15611
Copy Paste: [[2603.15611]] Code-A1: Adversarial Evolving of Code LLM and Test LLM via Reinforcement Learning(https://arxiv.org/abs/2603.15611)
Keywords: llm
Abstract: Reinforcement learning for code generation relies on verifiable rewards from unit test pass rates. Yet high-quality test suites are scarce, existing datasets offer limited coverage, and static rewards fail to adapt as models improve. Recent self-play methods unify code and test generation in a single model, but face an inherent dilemma: white-box access leads to self-collusion where the model produces trivial tests for easy rewards, yet black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates self-collusion risks and safely enables white-box test generation, where the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward balancing test validity with adversarial difficulty. Experiments on Qwen2.5-Coder models demonstrate that Code-A1 achieves code generation performance matching or exceeding models trained on human-annotated tests, while significantly improving test generation capability.
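Summary: Reinforcement learning for code generation relies on verifiable rewards derived from unit test pass rates. However, high-quality test suites are scarce, existing datasets have limited coverage, and static rewards cannot adapt as models improve. Recent self-play methods unify code and test generation in a single model but face an inherent dilemma: white-box access leads to self-collusion, in which the model produces trivial tests to earn easy rewards, while a black-box restriction yields generic tests that miss implementation-specific bugs. We introduce Code-A1, an adversarial co-evolution framework that jointly optimizes a Code LLM and a Test LLM with opposing objectives. The Code LLM is rewarded for passing more tests, while the Test LLM is rewarded for exposing more defects. This architectural separation eliminates the risk of self-collusion and safely enables white-box test generation, in which the Test LLM can inspect candidate code to craft targeted adversarial tests. We further introduce a Mistake Book mechanism for experience replay and a composite reward that balances test validity against adversarial difficulty. Experiments on Qwen2.5-Coder models show that Code-A1 matches or exceeds the code generation performance of models trained on human-annotated tests, while significantly improving test generation capability.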

Title: Mechanistic Origin of Moral Indifference in Language Models

Authors: Lingyu Li, Yan Teng, Yingchun Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15615
Pdf URL: https://arxiv.org/pdf/2603.15615
Copy Paste: [[2603.15615]] Mechanistic Origin of Moral Indifference in Language Models(https://arxiv.org/abs/2603.15615)
Keywords: language model, llm
Abstract: Existing behavioral alignment techniques for Large Language Models (LLMs) often neglect the discrepancy between surface compliance and internal unaligned representations, leaving LLMs vulnerable to long-tail risks. More crucially, we posit that LLMs possess an inherent state of moral indifference due to compressing distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations, utilizing 251k moral vectors constructed upon Prototype Theory and the Social-Chemistry-101 dataset. First, our analysis across 23 models reveals that current LLMs fail to represent the distinction between opposed moral categories and fine-grained typicality gradients within these categories; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then employ Sparse Autoencoders on Qwen3-8B, isolate mono-semantic moral features, and reconstruct their topological relationships in a targeted manner to align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark. Finally, we elaborate on the remedial nature of current intervention methods from the perspective of experientialist philosophy, arguing that endogenously aligned AI might require a transformation from post-hoc corrections to proactive cultivation.
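Summary: Existing behavioral alignment techniques for large language models (LLMs) often overlook the gap between surface compliance and unaligned internal representations, leaving LLMs exposed to long-tail risks. More importantly, we posit that LLMs have an inherent state of moral indifference because they compress distinct moral concepts into uniform probability distributions. We verify and remedy this indifference in LLMs' latent representations using 251k moral vectors built on Prototype Theory and the Social-Chemistry-101 dataset. First, our analysis of 23 models shows that current LLMs fail to represent the distinction between opposed moral categories and the fine-grained typicality gradients within those categories; notably, neither model scaling, architecture, nor explicit alignment reshapes this indifference. We then apply Sparse Autoencoders to Qwen3-8B, isolate mono-semantic moral features, and reconstruct their topological relationships in a targeted manner to align with ground-truth moral vectors. This representational alignment naturally improves moral reasoning and granularity, achieving a 75% pairwise win-rate on the independent adversarial Flames benchmark. Finally, we discuss the remedial nature of current intervention methods from the perspective of experientialist philosophy, arguing that endogenously aligned AI may require a shift from post-hoc correction to proactive cultivation.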

Title: Mixture-of-Depths Attention

Authors: Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.15619
Pdf URL: https://arxiv.org/pdf/2603.15619
Copy Paste: [[2603.15619]] Mixture-of-Depths Attention(https://arxiv.org/abs/2603.15619)
Keywords: language model, llm
Abstract: Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% computational overhead in FLOPs. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at this https URL.
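Summary: Scaling depth is a key driver for large language models (LLMs). However, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models show that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and raises average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% computational overhead in FLOPs. We also find that combining MoDA with post-norm performs better than combining it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at this https URL.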
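
To make the idea in the entry above concrete, here is a minimal single-query sketch of attending over two sources of key-value pairs at once. It is not the MoDA implementation: the abstract only states that a head attends to sequence KV pairs at the current layer and depth KV pairs from preceding layers. The function name `moda_attend`, the choice of one joint softmax over both sets, and the plain scaled dot-product scoring are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
KV = Tuple[Vector, Vector]  # (key, value)


def moda_attend(q: Vector, seq_kv: List[KV], depth_kv: List[KV]) -> List[float]:
    """Attend from one query over sequence KV and depth KV jointly.

    seq_kv:   (key, value) pairs for visible positions at the current layer.
    depth_kv: (key, value) pairs taken from preceding layers (assumed to be
              supplied by the caller; how they are chosen is not modeled).
    """
    pairs = list(seq_kv) + list(depth_kv)
    if not pairs:
        raise ValueError("need at least one key-value pair")
    scale = 1.0 / math.sqrt(len(q))
    # Scaled dot-product score of the query against every key.
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k, _ in pairs]
    # Numerically stable softmax shared by both KV sources (an assumption).
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(pairs[0][1])
    return [
        sum(w * v[j] for w, (_, v) in zip(weights, pairs)) / total
        for j in range(dim)
    ]


# A zero query scores every key equally, so the output is the mean value.
out = moda_attend(
    q=[0.0, 0.0],
    seq_kv=[([1.0, 0.0], [2.0, 0.0])],
    depth_kv=[([0.0, 1.0], [0.0, 4.0])],
)
print(out)
```

Because both sets share one normalization in this sketch, a feature stored in an earlier layer competes directly with current-layer tokens for attention weight, which is one simple way the dilution described in the abstract could be counteracted.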