2026-03-03

Title: Personalization Increases Affective Alignment but Has Role-Dependent Effects on Epistemic Independence in LLMs

Authors: Sean W. Kelley, Christoph Riedl
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00024
Pdf URL: https://arxiv.org/pdf/2603.00024
Copy Paste: [[2603.00024]] Personalization Increases Affective Alignment but Has Role-Dependent Effects on Epistemic Independence in LLMs(https://arxiv.org/abs/2603.00024)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are prone to sycophantic behavior, uncritically conforming to user beliefs. As models increasingly condition responses on user-specific context (personality traits, preferences, conversation history), they gain information to tailor agreement more effectively. Understanding how personalization modulates sycophancy is critical, yet systematic evaluation across models and contexts remains limited. We present a rigorous evaluation of personalization's impact on LLM sycophancy across nine frontier models and five benchmark datasets spanning advice, moral judgment, and debate contexts. We find that personalization generally increases affective alignment (emotional validation, hedging/deference), but affects epistemic alignment (belief adoption, position stability, resistance to influence) with context-dependent role modulation. When the LLM's role is to give advice, personalization strengthens epistemic independence (models challenge user presuppositions). When its role is that of a social peer, personalization decreases epistemic independence. In this role, extensively personalized user challenges cause LLMs to abandon their position at significantly higher rates. Robustness tests confirm that the effects are driven by personalized conditioning, not by additional input tokens per se or demographic information alone. Our work provides measurement frameworks for evaluating personalized AI systems, demonstrates the necessity of role-sensitive evaluation, and establishes a novel benchmark to assess goal alignment.

Title: TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation

Authors: Samah Fodeh, Linhai Ma, Ganesh Puthiaraju, Srivani Talakokkul, Afshan Khan, Ashley Hagaman, Sarah R. Lowe, Aimee Kendall Roundtree
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00025
Pdf URL: https://arxiv.org/pdf/2603.00025
Copy Paste: [[2603.00025]] TAB-PO: Preference Optimization with a Token-Level Adaptive Barrier for Token-Critical Structured Generation(https://arxiv.org/abs/2603.00025)
Keywords: language model
Abstract: Direct Preference Optimization (DPO) is an offline post-SFT method for aligning language models from preference pairs, with strong results in instruction following and summarization. However, DPO's sequence-level implicit reward can be brittle for token-critical structured prediction settings such as medical annotation, which often exhibit (i) low-separation preference pairs, where chosen and rejected completions differ by minimal edit distance (often 1-3 tokens), and (ii) token-importance skew, where sparse semantic tokens (hierarchical labels and evidence spans) carry disproportionate task importance relative to high-frequency structural tokens (JSON scaffolding). In this regime, standard DPO suffers from margin collapse (insufficient log-probability separation between near-identical preferences), likelihood squeezing (the margin objective shifts the absolute likelihoods of both completions together), and gradient dilution, where uniform sequence-level weighting diffuses learning signal across shared scaffolding while rare, confusable label tokens receive weak, noisy updates. We introduce Token-Adaptive Barrier Preference Optimization (TAB-PO), which augments DPO with token-weighted, reference-adjusted advantages that prioritize high-value semantic tokens, and a conditional token-level barrier that regularizes under-confident tokens, balancing SFT-anchored likelihood and preference-driven separation in low-separation, importance-skewed regimes. We evaluate TAB-PO on medical communication annotation, a task requiring joint prediction of hierarchical labels and evidence spans from patient-provider messages. TAB-PO achieves a ~4% relative improvement in micro-F1 over SFT and consistently outperforms recent preference-optimization baselines.
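
Illustration (not from the paper): the sketch below computes the standard DPO loss from per-token log-probabilities and shows why near-identical pairs give a small margin, since shared scaffolding tokens contribute zero to the log-ratio difference. The token weights are a hypothetical stand-in for the idea of token-weighted advantages; they are not TAB-PO's actual weighting or barrier term, and all numbers are made up.

```python
import math

def advantage(policy_logps, ref_logps, weights=None):
    """Sum of (optionally weighted) per-token log-ratios: log pi(y_t) - log pi_ref(y_t)."""
    if weights is None:
        weights = [1.0] * len(policy_logps)
    return sum(w * (p - r) for w, p, r in zip(weights, policy_logps, ref_logps))

def dpo_loss(adv_chosen, adv_rejected, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * (adv_chosen - adv_rejected))."""
    margin = beta * (adv_chosen - adv_rejected)
    return math.log1p(math.exp(-margin))

# Two completions that differ only in the last token (a label); the first two
# tokens stand for shared JSON scaffolding. All values are invented.
policy_chosen, ref_chosen = [-0.1, -0.1, -1.0], [-0.1, -0.1, -1.2]
policy_rejected, ref_rejected = [-0.1, -0.1, -1.1], [-0.1, -0.1, -1.0]

# Hypothetical weights: down-weight scaffolding, up-weight the label token.
weights = [0.2, 0.2, 5.0]

plain = dpo_loss(advantage(policy_chosen, ref_chosen),
                 advantage(policy_rejected, ref_rejected))
weighted = dpo_loss(advantage(policy_chosen, ref_chosen, weights),
                    advantage(policy_rejected, ref_rejected, weights))
print(f"uniform weighting: {plain:.4f}, token-weighted: {weighted:.4f}")
```

With uniform weighting the loss stays close to log 2 (the value at zero margin), which is one way to picture the margin collapse the abstract describes; emphasizing the single differing token enlarges the margin.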

Title: ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Authors: Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.00026
Pdf URL: https://arxiv.org/pdf/2603.00026
Copy Paste: [[2603.00026]] ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents(https://arxiv.org/abs/2603.00026)
Keywords: language model, llm, agent
Abstract: Effective memory management is essential for large language model (LLM) agents handling long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring conflict detection and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms state-of-the-art baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.

Title: EPPCMinerBen: A Novel Benchmark for Evaluating Large Language Models on Electronic Patient-Provider Communication via the Patient Portal

Authors: Samah Fodeh, Yan Wang, Linhai Ma, Srivani Talakokkul, Jordan M. Alpert, Sarah Schellhorn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00028
Pdf URL: https://arxiv.org/pdf/2603.00028
Copy Paste: [[2603.00028]] EPPCMinerBen: A Novel Benchmark for Evaluating Large Language Models on Electronic Patient-Provider Communication via the Patient Portal(https://arxiv.org/abs/2603.00028)
Keywords: language model, llm, prompt
Abstract: Effective communication in health care is critical for treatment outcomes and adherence. With patient-provider exchanges shifting to secure messaging, analyzing electronic patient-provider communication (EPPC) data is both essential and challenging. We introduce EPPCMinerBen, a benchmark for evaluating LLMs in detecting communication patterns and extracting insights from electronic patient-provider messages. EPPCMinerBen includes three sub-tasks: Code Classification, Subcode Classification, and Evidence Extraction. Using 1,933 expert-annotated sentences from 752 secure messages of the patient portal at Yale New Haven Hospital, it evaluates LLMs on identifying communicative intent and supportive text. Benchmarks span various LLMs under zero-shot and few-shot settings, with data to be released via the NCI Cancer Data Service. Model performance varied across tasks and settings. Llama-3.1-70B led in evidence extraction (F1: 82.84%) and performed well in classification. Llama-3.3-70b-Instruct outperformed all models in code classification (F1: 67.03%). DeepSeek-R1-Distill-Qwen-32B excelled in subcode classification (F1: 48.25%), while sdoh-llama-3-70B showed consistent performance. Smaller models underperformed, especially in subcode classification (>30% F1). Few-shot prompting improved most tasks. Our results show that large, instruction-tuned models generally perform better in EPPCMinerBen tasks, particularly evidence extraction, while smaller models struggle with fine-grained reasoning. EPPCMinerBen provides a benchmark for discourse-level understanding, supporting future work on model generalization and patient-provider communication analysis. Keywords: Electronic Patient-Provider Communication, Large language models, Data collection, Prompt engineering

Title: Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models

Authors: Youngji Roh, Hyunjin Cho, Jaehyung Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00029
Pdf URL: https://arxiv.org/pdf/2603.00029
Copy Paste: [[2603.00029]] Embracing Anisotropy: Turning Massive Activations into Interpretable Control Knobs for Large Language Models(https://arxiv.org/abs/2603.00029)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit highly anisotropic internal representations, often characterized by massive activations, a phenomenon where a small subset of feature dimensions possesses magnitudes significantly larger than the rest. While prior works view these extreme dimensions primarily as artifacts to be managed, we propose a distinct perspective: these dimensions serve as intrinsic interpretable functional units arising from domain specialization. Specifically, we propose a simple magnitude-based criterion to identify Domain-Critical Dimensions in a training-free manner. Our analyses reveal that such dimensions behave as interpretable semantic detectors for symbolic/quantitative patterns or domain-specific terms. In addition, we introduce Critical Dimension Steering, which applies activation steering exclusively to the identified dimensions. Empirical results show that this approach outperforms conventional whole-dimension steering in domain adaptation and jailbreaking scenarios.

Title: SimpleTool: Parallel Decoding for Real-Time LLM Function Calling

Authors: Xiaoxin Shi, Jiaxin Wan, Linkang Dong, Wei Jiang, Yue Liu, Zengfeng Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00030
Pdf URL: https://arxiv.org/pdf/2603.00030
Copy Paste: [[2603.00030]] SimpleTool: Parallel Decoding for Real-Time LLM Function Calling(https://arxiv.org/abs/2603.00030)
Keywords: llm, agent
Abstract: LLM-based function calling enables intelligent agents to interact with external tools and environments, yet autoregressive decoding imposes a fundamental latency bottleneck that limits real-time applications such as embodied intelligence, game AI, and interactive avatars (e.g., 10 Hz control frequency). We observe that function calling differs fundamentally from free-form text generation: structured outputs exhibit substantial token redundancy (delimiters, parameter names), and arguments exhibit weak causal dependencies. Crucially, these two properties must be exploited jointly to achieve real-time performance. We present SimpleTool, which introduces special tokens that serve a dual role: compressing low-entropy tokens (4-6x reduction) while acting as mode selectors that enable independent parallel generation of function name and arguments. This synergistic design achieves 3-6x end-to-end speedup (up to 9.6x) with only +8.2% parallelization overhead. Experiments on five benchmarks across Qwen-series models (0.5B-14B) demonstrate substantial speedup while maintaining competitive or improved accuracy. On Mobile Actions, ST-Qwen-0.5B outperforms Google's FunctionGemma in both accuracy and latency consistency. With quantization on consumer-grade GPU, SimpleTool achieves 61.2ms P50 latency, enabling 16 Hz real-time control at 4B model scale, bridging the gap between LLM function calling and latency-critical real-world deployment.
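
Illustration (not from the paper): the latency and control-frequency figures in the abstract are related by simple arithmetic, sketched below. It assumes one strictly sequential function call per control tick, which is a simplification; the helper names are hypothetical.

```python
def control_rate_hz(latency_ms):
    """Maximum calls per second if each call takes latency_ms and calls do not overlap."""
    return 1000.0 / latency_ms

def latency_budget_ms(rate_hz):
    """Time available per call to sustain rate_hz."""
    return 1000.0 / rate_hz

p50_ms = 61.2                      # P50 latency reported in the abstract
rate = control_rate_hz(p50_ms)     # about 16.3 Hz, consistent with the 16 Hz claim
budget = latency_budget_ms(10)     # a 10 Hz loop leaves 100 ms per call
print(f"{rate:.1f} Hz at {p50_ms} ms; {budget:.0f} ms budget at 10 Hz")
```

Since P50 is a median, roughly half of the calls would take longer than 61.2 ms, so a sustained rate would depend on the tail latency as well.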

Title: GRIP: Geometric Refinement and Adaptive Information Potential for Data Efficiency

Authors: Changhao Wang, Jiaolong Yang, Xinhao Yao, Yunfei Yu, Peng Jiao, Lu Yu, Junpeng Fang, Riccardo Cantoro, Qing Cui, Jun Zhou
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00031
Pdf URL: https://arxiv.org/pdf/2603.00031
Copy Paste: [[2603.00031]] GRIP: Geometric Refinement and Adaptive Information Potential for Data Efficiency(https://arxiv.org/abs/2603.00031)
Keywords: language model, llm
Abstract: The performance of Large Language Models (LLMs) is increasingly governed by data efficiency rather than raw scaling volume. However, existing selection methods often decouple global distribution balancing from local instance selection, compromising the hierarchical integrity of the training set. We introduce GRIP (Geometric Refinement and Adaptive Information Potential), a framework that unifies these dimensions by modeling the corpus as an information-dense geometric space. GRIP employs a Rapid Adaptation Probe (RAP) to quantify the information potential of semantic clusters, dynamically re-allocating the sampling budget to regions with the highest representation deficits. Subsequently, we perform Intra-Cluster Selection using a length-rectified geometric prior to counteract embedding density artifacts and preserve long-tail logical sequences. Extensive evaluations on Mixture-of-Experts (MoE) models up to 300B tokens demonstrate that GRIP consistently outperforms state-of-the-art baselines, surpassing the performance of models trained on $3\times$ larger uncurated datasets. Our work establishes a robust geometric foundation for adaptive data curation in large-scale pre-training.

Title: Autorubric: A Unified Framework for Rubric-Based LLM Evaluation

Authors: Delip Rao, Chris Callison-Burch
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00077
Pdf URL: https://arxiv.org/pdf/2603.00077
Copy Paste: [[2603.00077]] Autorubric: A Unified Framework for Rubric-Based LLM Evaluation(https://arxiv.org/abs/2603.00077)
Keywords: language model, llm, chat
Abstract: Rubric-based evaluation with large language models (LLMs) has become standard practice for assessing text generation at scale, yet the underlying techniques are scattered across papers with inconsistent terminology and partial solutions. We present a unified framework: each identified technique is paired with its realization in Autorubric, an open-source Python framework proposed in this paper. Autorubric supports binary, ordinal, and nominal criteria with configurable weights; single-judge and multi-judge ensemble evaluation with majority, weighted, unanimous, and any-vote aggregation; few-shot calibration with verdict-balanced sampling; and mitigations for position bias (option shuffling), verbosity bias (length penalties), and criterion conflation (per-criterion atomic evaluation with natural language explanations). The framework provides reliability metrics drawn from psychometrics (Cohen's $\kappa$, weighted $\kappa$, correlation coefficients, and distribution-level tests) alongside production infrastructure including response caching, checkpointing with resumable runs, multi-provider rate limiting, and cost tracking. We evaluate Autorubric on three benchmarks spanning educational assessment, deep research evaluation, and chatbot quality assessment, demonstrating that it produces results consistent with published benchmarks while exercising the framework's key capabilities: per-criterion binary evaluation with few-shot calibration (RiceChem), multi-judge ensemble evaluation across judge models (ResearcherBench), and mixed criterion types combining binary, ordinal, and nominal scales (CHARM-100). We also contribute CHARM-100, a 100-sample chatbot evaluation dataset with per-sample ground truth labels across all three criterion types, designed to stress-test rubric evaluation frameworks on heterogeneous criteria.
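
Illustration (not from the paper): a minimal sketch of two ideas named in the abstract, multi-judge vote aggregation and Cohen's kappa as an agreement metric. This is not Autorubric's API; the function names, the rule strings, and the strict tie handling are assumptions made for the example.

```python
from collections import Counter

def aggregate(votes, rule="majority", weights=None):
    """Combine per-judge boolean verdicts for one criterion."""
    if rule == "majority":
        return sum(votes) > len(votes) / 2      # strict: a tie counts as False
    if rule == "unanimous":
        return all(votes)
    if rule == "any":
        return any(votes)
    if rule == "weighted":
        weights = weights or [1.0] * len(votes)
        passed = sum(w for v, w in zip(votes, weights) if v)
        return passed > sum(weights) / 2
    raise ValueError(f"unknown rule: {rule}")

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed - expected agreement) / (1 - expected agreement)."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:                         # both raters always use one identical label
        return 1.0
    return (observed - expected) / (1 - expected)

votes = [True, True, False]                     # three hypothetical judges
print(aggregate(votes, "majority"), aggregate(votes, "unanimous"), aggregate(votes, "any"))
print(cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0]))
```

Kappa corrects raw agreement for the agreement two raters would reach by chance given their label frequencies, which is why it is commonly preferred over plain accuracy when comparing a judge model against human labels.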

Title: Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization

Authors: Ambre Marie (LaTIM), Thomas Bertin (DySoLab), Guillaume Dardenne (LaTIM), Gwenolé Quellec (LaTIM)
Subjects: cs.CL, cs.AI, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2603.00086
Pdf URL: https://arxiv.org/pdf/2603.00086
Copy Paste: [[2603.00086]] Iterative LLM-based improvement for French Clinical Interview Transcription and Speaker Diarization(https://arxiv.org/abs/2603.00086)
Keywords: llm, prompt
Abstract: Automatic speech recognition for French medical conversations remains challenging, with word error rates often exceeding 30% in spontaneous clinical speech. This study proposes a multi-pass LLM post-processing architecture alternating between Speaker Recognition and Word Recognition passes to improve transcription accuracy and speaker attribution. Ablation studies on two French clinical datasets (suicide prevention telephone counseling and preoperative awake neurosurgery consultations) investigate four design choices: model selection, prompting strategy, pass ordering, and iteration depth. Using Qwen3-Next-80B, Wilcoxon signed-rank tests confirm significant WDER reductions on suicide prevention conversations (p < 0.05, n=18), while maintaining stability on awake neurosurgery consultations (n=10), with zero output failures and acceptable computational cost (RTF 0.32), suggesting feasibility for offline clinical deployment.

Title: Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning

Authors: Xintong Li, Sha Li, Rongmei Lin, Hongye Jin, Linwei Li, Hejie Cui, Sarah Zhang, Chia-Yuan Chang, Kewei Cheng, Besnik Fetahu, Priyanka Nigam, Jingbo Shang, Bing Yin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00296
Pdf URL: https://arxiv.org/pdf/2603.00296
Copy Paste: [[2603.00296]] Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning(https://arxiv.org/abs/2603.00296)
Keywords: chain-of-thought
Abstract: Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy. Prior reinforcement learning approaches typically rely on a single outcome reward with trajectory-level length penalties, which cannot distinguish essential from redundant reasoning steps and therefore yield blunt compression. Although recent work incorporates step-level signals, such as offline pruning, supervised data construction, or verifier-based intermediate rewards, reasoning length is rarely treated as an explicit step-level optimization objective during RL. We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution. We estimate step importance from the model's on-policy log-probability improvement toward the correct answer, then treat excess length as a penalty mass redistributed to penalize low-importance steps more heavily while preserving high-importance reasoning. We optimize with a unified outcome-process advantage within group-relative policy optimization. Extensive experiments demonstrate that SWAP reduces reasoning length by 64.3% on average while improving accuracy by 5.7% relative to the base model.
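
Illustration (not from the paper): one way to picture "excess length as a penalty mass redistributed to low-importance steps". The softmax over negative importance, the temperature, and the length budget are assumptions chosen for the sketch; the abstract does not state SWAP's actual redistribution rule, and the numbers are invented.

```python
import math

def redistribute_penalty(step_lengths, importances, length_budget, temperature=1.0):
    """Split the excess length across steps, giving low-importance steps a larger share."""
    excess = max(0, sum(step_lengths) - length_budget)
    # Hypothetical rule: share proportional to exp(-importance / temperature).
    scores = [math.exp(-imp / temperature) for imp in importances]
    total = sum(scores)
    return [excess * s / total for s in scores]

step_lengths = [40, 60, 50]        # tokens per reasoning step (made up)
importances = [2.0, 0.1, 1.0]      # e.g. log-probability gain toward the answer (made up)
penalties = redistribute_penalty(step_lengths, importances, length_budget=100)
print([round(p, 2) for p in penalties])
```

The penalties always sum to the excess length, so the total pressure to shorten is fixed and only its allocation across steps changes; a response within the budget receives no penalty at all.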

Title: From Prerequisites to Predictions: Validating a Geometric Hallucination Taxonomy Through Controlled Induction

Authors: Matic Korun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00307
Pdf URL: https://arxiv.org/pdf/2603.00307
Copy Paste: [[2603.00307]] From Prerequisites to Predictions: Validating a Geometric Hallucination Taxonomy Through Controlled Induction(https://arxiv.org/abs/2603.00307)
Keywords: gpt, hallucination, prompt
Abstract: We test whether a geometric hallucination taxonomy -- classifying failures as center-drift (Type 1), wrong-well convergence (Type 2), or coverage gaps (Type 3) -- can distinguish hallucination types through controlled induction in GPT-2. Using a two-level statistical design with prompts ($N = 15$/group) as the unit of inference, we run each experiment 20 times with different generation seeds to quantify result stability. In static embeddings, Type 3 norm separation is robust (significant in 18/20 runs, Holm-corrected in 14/20, median $r = +0.61$). In contextual hidden states, the Type 3 norm effect direction is stable (19/20 runs) but underpowered at $N = 15$ (significant in 4/20, median $r = -0.28$). Types 1 and 2 do not separate in either space ($\leq 3/20$ runs). Token-level tests inflate significance by 4--16$\times$ through pseudoreplication -- a finding replicated across all 20 runs. The results establish coverage-gap hallucinations as the most geometrically distinctive failure mode, carried by magnitude rather than direction, and confirm the Type 1/2 non-separation as genuine at 124M parameters.
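
Illustration (not from the paper): the abstract distinguishes runs that are significant from runs that remain significant after Holm correction. The sketch below implements the standard Holm step-down procedure; the p-values are invented and the function name is hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down: test sorted p-values against alpha/m, alpha/(m-1), ..., stop at first failure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                  # all larger p-values are retained as well
    return reject

# Three hypothetical tests, e.g. one per hallucination type (made-up p-values).
print(holm_reject([0.01, 0.04, 0.03]))
```

Here 0.04 and 0.03 would each pass an uncorrected 0.05 threshold but fail after correction, which is the kind of gap that can separate a "significant" count from a "Holm-corrected" count.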

Title: When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation

Authors: Bian Sun, Zhenjian Wang, Orvill de la Torre, Zirui Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00314
Pdf URL: https://arxiv.org/pdf/2603.00314
Copy Paste: [[2603.00314]] When Metrics Disagree: Automatic Similarity vs. LLM-as-a-Judge for Clinical Dialogue Evaluation(https://arxiv.org/abs/2603.00314)
Keywords: language model, gpt, llm, chat
Abstract: This paper details the baseline model selection, fine-tuning process, evaluation methods, and the implications of deploying more accurate LLMs in healthcare settings. As large language models (LLMs) are increasingly employed to address diverse problems, including medical queries, concerns about their reliability have surfaced. A recent study by Long Island University highlighted that LLMs often perform poorly in medical contexts, potentially leading to harmful misguidance for users. To address this, our research focuses on fine-tuning Llama 2 7B, a transformer-based, decoder-only model, using transcripts from real patient-doctor interactions. Our objective was to enhance the model's accuracy and precision in responding to medical queries. We fine-tuned the model using a supervised approach, emphasizing domain-specific nuances captured in the training data. In the best scenario, the model results should be reviewed and evaluated by real medical experts. Due to resource constraints, the performance of the fine-tuned model was evaluated using text similarity metrics. The fine-tuned model demonstrated significant improvements across all key dimensions except GPT-4's evaluation. The evaluations of ChatGPT4 differ markedly from the quantitative results; we therefore propose that the results be evaluated by human medical experts.

Title: How Large Language Models Get Stuck: Early structure with persistent errors

Authors: Alokesh Manna, William Snyder, Whitney Tabor
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00359
Pdf URL: https://arxiv.org/pdf/2603.00359
Copy Paste: [[2603.00359]] How Large Language Models Get Stuck: Early structure with persistent errors(https://arxiv.org/abs/2603.00359)
Keywords: language model, llm
Abstract: Linguistic insights may help make Large Language Model (LLM) training more efficient. We trained Meta's OPT model on the 100M word BabyLM dataset, and evaluated it on the BLiMP benchmark, which consists of 67 classes, each defined by sentence pairs that differ in a targeted syntactic or semantic rule violation. We tested the model's preference for grammatical over ungrammatical sentences across training iterations and grammatical types. In nearly one-third of the BLiMP classes, OPT fails to consistently assign a higher likelihood to grammatical sentences, even after extensive training. When it fails, it often establishes a clear (erroneous) separation of the likelihoods at an early stage of processing and sustains this to the end of our training phase. We hypothesize that this mis-categorization is costly because it creates entrenched biases that must, eventually, be reversed in order for the model to perform well. We probe this phenomenon using a mixture of qualitative (based on linguistic theory and the theory of Deep Learning) and quantitative (based on numerical testing) assessments. Our qualitative assessments indicate that only some BLiMP tests are meaningful guides. We conclude by articulating a hypothesis, the Bigram Hypothesis, which claims that the learning process will exhibit erroneous entrenchment if bigram statistics bias the model toward wrong distinctions early in training, and we describe a method (in progress) of testing the hypothesis on appropriately selected BLiMP classes.

Title: Distribution-Aware Companding Quantization of Large Language Models

Authors: Athul Radhakrishnan, Siddhant Mohan, Mahima Sachdeva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00364
Pdf URL: https://arxiv.org/pdf/2603.00364
Copy Paste: [[2603.00364]] Distribution-Aware Companding Quantization of Large Language Models(https://arxiv.org/abs/2603.00364)
Keywords: language model, gpt
Abstract: Large language models such as GPT and Llama are trained with a next-token prediction loss. In this work, we suggest that training language models to predict multiple future tokens at once results in higher sample efficiency. More specifically, at each position in the training corpus, we ask the model to predict the following n tokens using n independent output heads, operating on top of a shared model trunk. Considering multi-token prediction as an auxiliary training task, we measure improved downstream capabilities with no overhead in training time for both code and natural language models. The method is increasingly useful for larger model sizes and keeps its appeal when training for multiple epochs. Gains are especially pronounced on generative benchmarks like coding, where our models consistently outperform strong baselines by several percentage points. Our 13B parameter models solve 12% more problems on HumanEval and 17% more on MBPP than comparable next-token models. Experiments on small algorithmic tasks demonstrate that multi-token prediction is favorable for the development of induction heads and algorithmic reasoning capabilities. As an additional benefit, models trained with 4-token prediction are up to 3 times faster at inference, even with large batch sizes.
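
The multi-head objective described in this abstract can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names, the plain matrix heads, and the averaging over all valid (position, head) pairs are hypothetical choices made here. Head k, applied to the shared trunk output at position t, is scored on the token at position t + k + 1.

```python
import numpy as np

def log_softmax(z):
    # Numerically stable log-softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def multi_token_loss(hidden, heads, tokens):
    """Average cross-entropy over n future-token heads.

    hidden: (T, d) outputs of the shared trunk (assumed given).
    heads:  list of n (d, V) matrices, one independent output head each.
    tokens: (T,) integer token ids.
    """
    T = len(tokens)
    total, count = 0.0, 0
    for k, W in enumerate(heads):
        offset = k + 1                      # head k looks offset tokens ahead
        if offset >= T:
            break                           # no valid targets left for this head
        logp = log_softmax(hidden[: T - offset] @ W)   # (T - offset, V)
        targets = tokens[offset:]
        total -= logp[np.arange(T - offset), targets].sum()
        count += T - offset
    return total / count

# Toy usage: 4 heads on random trunk outputs.
rng = np.random.default_rng(0)
T, d, V, n = 10, 8, 7, 4
hidden = rng.normal(size=(T, d))
tokens = rng.integers(0, V, size=T)
heads = [rng.normal(size=(d, V)) * 0.1 for _ in range(n)]
loss = multi_token_loss(hidden, heads, tokens)
```

With n = 1 this reduces to the usual next-token loss, which is why the abstract can treat the extra heads as an auxiliary task on the same trunk.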

Title: Policy Compliance of User Requests in Natural Language for AI Systems

Authors: Pedro Cisneros-Velarde
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00369
Pdf URL: https://arxiv.org/pdf/2603.00369
Copy Paste: [[2603.00369]] Policy Compliance of User Requests in Natural Language for AI Systems(https://arxiv.org/abs/2603.00369)
Keywords: llm
Abstract: Consider an organization whose users send requests in natural language to an AI system that fulfills them by carrying out specific tasks. In this paper, we consider the problem of ensuring such user requests comply with a list of diverse policies determined by the organization with the purpose of guaranteeing the safe and reliable use of the AI system. We propose, to the best of our knowledge, the first benchmark consisting of annotated user requests of diverse compliance with respect to a list of policies. Our benchmark is related to industrial applications in the technology sector. We then use our benchmark to evaluate the performance of various LLM models on policy compliance assessment under different solution methods. We analyze the differences on performance metrics across the models and solution methods, showcasing the challenging nature of our problem.

Title: LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation

Authors: Cunyuan Yang, Dejuan Song, Xiaotao Pang, Qianqian Shen, Wenjie Nie, Yifan Huang, Lei Wu, Wei Han, Haishuai Wang, Jiajun Bu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2603.00426
Pdf URL: https://arxiv.org/pdf/2603.00426
Copy Paste: [[2603.00426]] LLM-Bootstrapped Targeted Finding Guidance for Factual MLLM-based Medical Report Generation(https://arxiv.org/abs/2603.00426)
Keywords: language model, llm
Abstract: The automatic generation of medical reports utilizing Multimodal Large Language Models (MLLMs) frequently encounters challenges related to factual instability, which may manifest as the omission of findings or the incorporation of inaccurate information, thereby constraining their applicability in clinical settings. Current methodologies typically produce reports based directly on image features, which inherently lack a definitive factual basis. In response to this limitation, we introduce Fact-Flow, an innovative framework that separates the process of visual fact identification from the generation of reports. This is achieved by initially predicting clinical findings from the image, which subsequently directs the MLLM to produce a report that is factually precise. A pivotal advancement of our approach is a pipeline that leverages a Large Language Model (LLM) to autonomously create a dataset of labeled medical findings, effectively eliminating the need for expensive manual annotation. Extensive experimental evaluations conducted on two disease-focused medical datasets validate the efficacy of our method, demonstrating a significant enhancement in factual accuracy compared to state-of-the-art models, while concurrently preserving high standards of text quality.

Title: A Typologically Grounded Evaluation Framework for Word Order and Morphology Sensitivity in Multilingual Masked LMs

Authors: Anna Feldman, Libby Barak, Jing Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00432
Pdf URL: https://arxiv.org/pdf/2603.00432
Copy Paste: [[2603.00432]] A Typologically Grounded Evaluation Framework for Word Order and Morphology Sensitivity in Multilingual Masked LMs(https://arxiv.org/abs/2603.00432)
Keywords: language model
Abstract: We introduce a typology-aware diagnostic for multilingual masked language models that tests reliance on word order versus inflectional form. Using Universal Dependencies, we apply inference-time perturbations: full token scrambling, content-word scrambling with function words fixed, dependency-based head-dependent swaps, and sentence-level lemma substitution (+L), which lemmatizes both the context and the masked target label. We evaluate mBERT and XLM-R on English, Chinese, German, Spanish, and Russian. Full scrambling drives word-level reconstruction accuracy near zero in all languages; partial and head-dependent perturbations cause smaller but still large drops. +L has little effect in Chinese but substantially lowers accuracy in German/Spanish/Russian, and it does not mitigate the impact of scrambling. Top-5 word accuracy shows the same pattern: under full scrambling, the gold word rarely appears among the five highest-ranked reconstructions. We release code, sampling scripts, and balanced evaluation subsets; Turkish results under strict reconstruction are reported in the appendix.
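
Two of the perturbations named in this abstract, full token scrambling and content-word scrambling with function words held fixed, can be sketched as follows. This is a hedged illustration, not the released code: the function names are hypothetical, and treating the listed Universal Dependencies UPOS tags as "function words" is an assumption made here.

```python
import random

# Assumption: these UPOS tags count as function words (plus punctuation).
FUNCTION_UPOS = {"ADP", "AUX", "CCONJ", "DET", "PART", "PRON", "SCONJ", "PUNCT"}

def full_scramble(tokens, rng):
    """Shuffle every token position."""
    out = list(tokens)
    rng.shuffle(out)
    return out

def content_scramble(tokens, upos, rng):
    """Shuffle content words among their own slots; function words stay put."""
    slots = [i for i, tag in enumerate(upos) if tag not in FUNCTION_UPOS]
    words = [tokens[i] for i in slots]
    rng.shuffle(words)
    out = list(tokens)
    for i, word in zip(slots, words):
        out[i] = word
    return out

# Toy usage on a hand-tagged sentence.
tokens = ["the", "cat", "sat", "on", "the", "mat"]
upos = ["DET", "NOUN", "VERB", "ADP", "DET", "NOUN"]
rng = random.Random(0)
scrambled_all = full_scramble(tokens, rng)
scrambled_content = content_scramble(tokens, upos, rng)
```

Both functions preserve the bag of tokens, so any accuracy drop under these perturbations reflects the loss of order information rather than a change in vocabulary.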

Title: CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging

Authors: Jie Cao, Zhenxuan Fan, Zhuonan Wang, Tianwei Lin, Ziyuan Zhao, Rolan Yan, Wenqiao Zhang, Feifei Shao, Hongwei Wang, Jun Xiao, Siliang Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00573
Pdf URL: https://arxiv.org/pdf/2603.00573
Copy Paste: [[2603.00573]] CoMoL: Efficient Mixture of LoRA Experts via Dynamic Core Space Merging(https://arxiv.org/abs/2603.00573)
Keywords: language model, llm
Abstract: Large language models (LLMs) achieve remarkable performance on diverse downstream and domain-specific tasks via parameter-efficient fine-tuning (PEFT). However, existing PEFT methods, particularly MoE-LoRA architectures, suffer from limited parameter efficiency and coarse-grained adaptation due to the proliferation of LoRA experts and instance-level routing. To address these issues, we propose Core Space Mixture of LoRA (CoMoL), a novel MoE-LoRA framework that incorporates expert diversity, parameter efficiency, and fine-grained adaptation. Specifically, CoMoL introduces two key components: core space experts and core space routing. Core space experts store each expert in a compact core matrix, preserving diversity while controlling parameter growth. Core space routing dynamically selects and activates the appropriate core experts for each token, enabling fine-grained, input-adaptive routing. Activated core experts are then merged via a soft-merging strategy into a single core expert, which is combined with a shared LoRA to form a specialized LoRA module. Besides, the routing network is projected into the same low-rank space as the LoRA matrices, further reducing parameter overhead without compromising expressiveness. Extensive experiments demonstrate that CoMoL retains the adaptability of MoE-LoRA architectures while achieving parameter efficiency comparable to standard LoRA, consistently outperforming existing methods across multiple tasks.
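
One possible reading of the mechanism in this abstract is sketched below in NumPy. It is not the paper's formulation: the shapes, the softmax gate, the placement of the core matrix between the shared LoRA factors, and all names are assumptions made for illustration. Each expert is an r x r core matrix, a router operating on the low-rank projection produces per-token gates, and the gated cores are summed into a single core before the shared up-projection.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def core_space_lora(x, A, B, cores, router):
    """Low-rank update with per-token soft-merged core experts (hypothetical).

    x: (T, d_in) token activations
    A: (d_in, r) shared down-projection, B: (r, d_out) shared up-projection
    cores: (E, r, r) one compact core matrix per expert
    router: (r, E) routing weights applied in the low-rank space
    """
    z = x @ A                                        # (T, r)
    gates = softmax(z @ router)                      # (T, E) per-token gates
    merged = np.einsum("te,eij->tij", gates, cores)  # (T, r, r) merged core
    z = np.einsum("ti,tij->tj", z, merged)           # apply each token's core
    return z @ B                                     # (T, d_out)

# Toy usage.
rng = np.random.default_rng(0)
T, d_in, d_out, r, E = 5, 16, 12, 4, 3
x = rng.normal(size=(T, d_in))
A = rng.normal(size=(d_in, r))
B = rng.normal(size=(r, d_out))
cores = rng.normal(size=(E, r, r))
router = rng.normal(size=(r, E))
delta = core_space_lora(x, A, B, cores, router)
```

Under these assumptions each additional expert costs r x r parameters plus one router column of size r, instead of a full pair of LoRA factors of size r x (d_in + d_out).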

Title: Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research

Authors: Yubo Dong, Nianhao You, Yuxuan Hou, Zixun Sun, Yue Zhang, Hehe Fan, Liang Zhang, Siyuan Zhao, Linyi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00582
Pdf URL: https://arxiv.org/pdf/2603.00582
Copy Paste: [[2603.00582]] Super Research: Answering Highly Complex Questions with Large Language Models through Super Deep and Super Wide Research(https://arxiv.org/abs/2603.00582)
Keywords: language model, llm
Abstract: While Large Language Models (LLMs) have demonstrated proficiency in Deep Research or Wide Search, their capacity to solve highly complex questions (those requiring long-horizon planning, massive evidence gathering, and synthesis across heterogeneous sources) remains largely unexplored. We introduce Super Research, a task for complex autonomous research that integrates (i) structured decomposition into a research plan, (ii) super wide retrieval for diverse perspectives, and (iii) super deep investigation to resolve uncertainties through iterative queries. To evaluate this capability, we curated a benchmark of 300 expert-written questions across diverse domains, each requiring up to 100+ retrieval steps and 1,000+ web pages to reconcile conflicting evidence. Super Research produces verifiable reports with fine-grained citations and intermediate artifacts (e.g., outlines and tables) to ensure traceable reasoning. Furthermore, we present a graph-anchored auditing protocol that evaluates Super Research along five dimensions: Coverage, Logical Consistency, Report Utility, Objectivity and Citation Health. While super-complex questions may be infrequent in standard applications, Super Research serves as a critical ceiling evaluation and stress test for LLM capabilities. A model's proficiency within Super Research acts as a powerful proxy for its general research competence; success here suggests the robustness necessary to navigate nearly any subordinate research task. Leaderboard is available at: this https URL

Title: From Literature to Hypotheses: An AI Co-Scientist System for Biomarker-Guided Drug Combination Hypothesis Generation

Authors: Raneen Younis, Suvinava Basak, Lukas Chavez, Zahra Ahmadi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00612
Pdf URL: https://arxiv.org/pdf/2603.00612
Copy Paste: [[2603.00612]] From Literature to Hypotheses: An AI Co-Scientist System for Biomarker-Guided Drug Combination Hypothesis Generation(https://arxiv.org/abs/2603.00612)
Keywords: agent
Abstract: The rapid growth of biomedical literature and curated databases has made it increasingly difficult for researchers to systematically connect biomarker mechanisms to actionable drug combination hypotheses. We present AI Co-Scientist (CoDHy), an interactive, human-in-the-loop system for biomarker-guided drug combination hypothesis generation in cancer research. CoDHy integrates structured biomedical databases and unstructured literature evidence into a task-specific knowledge graph, which serves as the basis for graph-based reasoning and hypothesis construction. The system combines knowledge graph embeddings with agent-based reasoning to generate, validate, and rank candidate drug combinations, while explicitly grounding each hypothesis in retrievable evidence. Through a web-based interface, users can configure the scientific context, inspect intermediate results, and iteratively refine hypotheses, enabling transparent and researcher-steerable exploration rather than automated decision-making. We demonstrate CoDHy as a system for exploratory hypothesis generation and decision support in translational oncology, highlighting its design, interaction workflow, and practical use cases.

Title: BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages

Authors: Jason Lucas, Matt Murtagh-White, Adaku Uchendu, Ali Al-Lawati, Michiharu Yamashita, Dominik Macko, Ivan Srba, Robert Moro, Dongwon Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00634
Pdf URL: https://arxiv.org/pdf/2603.00634
Copy Paste: [[2603.00634]] BLUFF: Benchmarking the Detection of False and Synthetic Content across 58 Low-Resource Languages(https://arxiv.org/abs/2603.00634)
Keywords: llm, agent
Abstract: Multilingual falsehoods threaten information integrity worldwide, yet detection benchmarks remain confined to English or a few high-resource languages, leaving low-resource linguistic communities without robust defense tools. We introduce BLUFF, a comprehensive benchmark for detecting false and synthetic content, spanning 79 languages with over 202K samples, combining human-written fact-checked content (122K+ samples across 57 languages) and LLM-generated content (79K+ samples across 71 languages). BLUFF uniquely covers both high-resource "big-head" (20) and low-resource "long-tail" (59) languages, addressing critical gaps in multilingual research on detecting false and synthetic content. Our dataset features four content types (human-written, LLM-generated, LLM-translated, and hybrid human-LLM text), bidirectional translation (English↔X), 39 textual modification techniques (36 manipulation tactics for fake news, 3 AI-editing strategies for real news), and varying edit intensities generated using 19 diverse LLMs. We present AXL-CoI (Adversarial Cross-Lingual Agentic Chain-of-Interactions), a novel multi-agentic framework for controlled fake/real news generation, paired with mPURIFY, a quality filtering pipeline ensuring dataset integrity. Experiments reveal that state-of-the-art detectors suffer up to 25.3% F1 degradation on low-resource versus high-resource languages. BLUFF provides the research community with a multilingual benchmark, extensive linguistic-oriented benchmark evaluation, comprehensive documentation, and open-source tools to advance equitable falsehood detection. Dataset and code are available at: this https URL

Title: SSKG Hub: An Expert-Guided Platform for LLM-Empowered Sustainability Standards Knowledge Graphs

Authors: Chaoyue He, Xin Zhou, Xinjia Yu, Lei Zhang, Yan Zhang, Yi Wu, Lei Xiao, Liangyue Li, Di Wang, Hong Xu, Xiaoqiao Wang, Wei Liu, Chunyan Miao
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2603.00669
Pdf URL: https://arxiv.org/pdf/2603.00669
Copy Paste: [[2603.00669]] SSKG Hub: An Expert-Guided Platform for LLM-Empowered Sustainability Standards Knowledge Graphs(https://arxiv.org/abs/2603.00669)
Keywords: llm, prompt
Abstract: Sustainability disclosure standards (e.g., GRI, SASB, TCFD, IFRS S2) are comprehensive yet lengthy, terminology-dense, and highly cross-referential, hindering structured analysis and downstream use. We present SSKG Hub (Sustainability Standards Knowledge Graph Hub), a research prototype and interactive web platform that transforms standards into auditable knowledge graphs (KGs) through an LLM-centered, expert-guided pipeline. The system integrates automatic standard identification, configurable chunking, standard-specific prompting, robust triple parsing, and provenance-aware Neo4j storage with fine-grained audit metadata. LLM extraction produces a provenance-linked Draft KG, which is reviewed, curated, and formally promoted to a Certified KG through meta-expert adjudication. A role-based governance framework covering read-only guest access, expert review and CRUD operations, meta-expert certification, and administrative oversight ensures traceability and accountability across draft and certified states. Beyond graph exploration and triple-level evidence tracing, SSKG Hub supports cross-KG fusion, KG-driven tasks, and dedicated modules for insights and curated resources. We validate the platform through a comprehensive expert-led KG review case study that demonstrates end-to-end curation and quality assurance. The web application is publicly available at this http URL.

Title: RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis

Authors: Andrew Zhuoer Feng, Cunxiang Wang, Yu Luo, Bosi Wen, Yidong Wang, Lin Fan, Yilin Zhou, Zikang Wang, Wenbo Yu, Lindong Wu, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00686
Pdf URL: https://arxiv.org/pdf/2603.00686
Copy Paste: [[2603.00686]] RAVEL: Reasoning Agents for Validating and Evaluating LLM Text Synthesis(https://arxiv.org/abs/2603.00686)
Keywords: language model, llm, agent
Abstract: Large Language Models have evolved from single-round generators into long-horizon agents, capable of complex text synthesis scenarios. However, current evaluation frameworks lack the ability to assess the actual synthesis operations, such as outlining, drafting, and editing. Consequently, they fail to evaluate the actual and detailed capabilities of LLMs. To bridge this gap, we introduce RAVEL, an agentic framework that enables the LLM testers to autonomously plan and execute typical synthesis operations, including outlining, drafting, reviewing, and refining. Complementing this framework, we present C3EBench, a comprehensive benchmark comprising 1,258 samples derived from professional human writings. We utilize a "reverse-engineering" pipeline to isolate specific capabilities across four tasks: Cloze, Edit, Expand, and End-to-End. Through our analysis of 14 LLMs, we uncover that most LLMs struggle with tasks that demand contextual understanding under limited or under-specified instructions. By augmenting RAVEL with SOTA LLMs as operators, we find that such agentic text synthesis is dominated by the LLM's reasoning capability rather than raw generative capacity. Furthermore, we find that a strong reasoner can guide a weaker generator to yield higher-quality results, whereas the inverse does not hold. Our code and data are available at this link: this https URL.

Title: DRIV-EX: Counterfactual Explanations for Driving LLMs

Authors: Amaia Cardiel, Eloi Zablocki, Elias Ramzi, Eric Gaussier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00696
Pdf URL: https://arxiv.org/pdf/2603.00696
Copy Paste: [[2603.00696]] DRIV-EX: Counterfactual Explanations for Driving LLMs(https://arxiv.org/abs/2603.00696)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are increasingly used as reasoning engines in autonomous driving, yet their decision-making remains opaque. We propose to study their decision process through counterfactual explanations, which identify the minimal semantic changes to a scene description required to alter a driving plan. We introduce DRIV-EX, a method that leverages gradient-based optimization on continuous embeddings to identify the input shifts required to flip the model's decision. Crucially, to avoid the incoherent text typical of unconstrained continuous optimization, DRIV-EX uses these optimized embeddings solely as a semantic guide: they are used to bias a controlled decoding process that re-generates the original scene description. This approach effectively steers the generation toward the counterfactual target while guaranteeing the linguistic fluency, domain validity, and proximity to the original input, essential for interpretability. Evaluated using the LC-LLM planner on a textual transcription of the highD dataset, DRIV-EX generates valid, fluent counterfactuals more reliably than existing baselines. It successfully exposes latent biases and provides concrete insights to improve the robustness of LLM-based driving agents.

Title: SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?

Authors: Shiqi Chen, Jingze Gai, Ruochen Zhou, Jinghan Zhang, Tongyao Zhu, Junlong Li, Kangrui Wang, Zihan Wang, Zhengyu Chen, Klara Kaleb, Ning Miao, Siyang Gao, Cong Lu, Manling Li, Junxian He, Yee Whye Teh
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2603.00718
Pdf URL: https://arxiv.org/pdf/2603.00718
Copy Paste: [[2603.00718]] SkillCraft: Can LLM Agents Learn to Use Tools Skillfully?(https://arxiv.org/abs/2603.00718)
Keywords: llm, agent
Abstract: Real-world tool-using agents operate over long-horizon workflows with recurring structure and diverse demands, where effective behavior requires not only invoking atomic tools but also abstracting and reusing higher-level tool compositions. However, existing benchmarks mainly measure instance-level success under static tool sets, offering limited insight into agents' ability to acquire such reusable skills. We address this gap by introducing SkillCraft, a benchmark that explicitly stress-tests agents' ability to form and reuse higher-level tool compositions, which we call Skills. SkillCraft features realistic, highly compositional tool-use scenarios with difficulty scaled along both quantitative and structural dimensions, designed to elicit skill abstraction and cross-task reuse. We further propose a lightweight evaluation protocol that enables agents to auto-compose atomic tools into executable Skills, cache and reuse them inside and across tasks, thereby improving efficiency while accumulating a persistent library of reusable skills. Evaluating state-of-the-art agents on SkillCraft, we observe substantial efficiency gains, with token usage reduced by up to 80% through skill saving and reuse. Moreover, success rate strongly correlates with tool composition ability at test time, underscoring compositional skill acquisition as a core capability.
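
The idea of composing atomic tools into a cached, reusable Skill can be sketched in a few lines of Python. This is a toy illustration only: `SkillLibrary`, its methods, and the string tools are hypothetical names introduced here, and the benchmark's actual protocol may differ substantially.

```python
from typing import Callable, Dict, List

Tool = Callable[[object], object]

class SkillLibrary:
    """Stores named sequences of atomic tools so they can be replayed later."""

    def __init__(self, tools: Dict[str, Tool]):
        self.tools = tools
        self.skills: Dict[str, List[str]] = {}
        self.reuse_count = 0

    def save_skill(self, name: str, steps: List[str]) -> None:
        # Reject compositions that reference tools the agent does not have.
        missing = [s for s in steps if s not in self.tools]
        if missing:
            raise KeyError(f"unknown tools: {missing}")
        self.skills[name] = list(steps)

    def run_skill(self, name: str, value: object) -> object:
        # Replay the cached composition instead of re-planning each step.
        self.reuse_count += 1
        for step in self.skills[name]:
            value = self.tools[step](value)
        return value

# Toy usage with three atomic string tools.
tools = {"strip": str.strip, "lower": str.lower, "split": str.split}
library = SkillLibrary(tools)
library.save_skill("normalize_and_tokenize", ["strip", "lower", "split"])
result = library.run_skill("normalize_and_tokenize", "  Hello World ")
```

In an agent setting, replaying a saved composition replaces several separate tool-selection turns with a single call, which is one plausible source of the token savings the abstract reports.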

Title: RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models

Authors: Andrew Zhuoer Feng, Cunxiang Wang, Bosi Wen, Yidong Wang, Yu Luo, Hongning Wang, Minlie Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00724
Pdf URL: https://arxiv.org/pdf/2603.00724
Copy Paste: [[2603.00724]] RLAR: An Agentic Reward System for Multi-task Reinforcement Learning on Large Language Models(https://arxiv.org/abs/2603.00724)
Keywords: language model, llm, agent
Abstract: Large language model alignment via reinforcement learning depends critically on reward function quality. However, static, domain-specific reward models are often costly to train and exhibit poor generalization in out-of-distribution scenarios encountered during RL iterations. We present RLAR (Reinforcement Learning from Agent Rewards), an agent-driven framework that dynamically assigns tailored reward functions to individual queries. Specifically, RLAR transforms reward acquisition into a dynamic tool synthesis and invocation task. It leverages LLM agents to autonomously retrieve optimal reward models from the Internet and synthesize programmatic verifiers through code generation. This allows the reward system to self-evolve with the shifting data distributions during training. Experimental results demonstrate that RLAR yields consistent performance gains ranging from 10 to 60 across mathematics, coding, translation, and dialogue tasks. On RewardBench-V2, RLAR significantly outperforms static baselines and approaches the performance upper bound, demonstrating superior generalization through dynamic reward orchestration. The data and code are available on this link: this https URL.

Title: LaSTR: Language-Driven Time-Series Segment Retrieval

Authors: Kota Dohi, Harsh Purohit, Tomoya Nishida, Takashi Endo, Yusuke Ohtsubo, Koichiro Yawata, Koki Takeshita, Tatsuya Sasaki, Yohei Kawaguchi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00725
Pdf URL: https://arxiv.org/pdf/2603.00725
Copy Paste: [[2603.00725]] LaSTR: Language-Driven Time-Series Segment Retrieval(https://arxiv.org/abs/2603.00725)
Keywords: gpt
Abstract: Effectively searching time-series data is essential for system analysis, but existing methods often require expert-designed similarity criteria or rely on global, series-level descriptions. We study language-driven segment retrieval: given a natural language query, the goal is to retrieve relevant local segments from large time-series repositories. We build large-scale segment-caption training data by applying TV2-based segmentation to LOTSA windows and generating segment descriptions with GPT-5.2, and then train a Conformer-based contrastive retriever in a shared text-time-series embedding space. On a held-out test split, we evaluate single-positive retrieval together with caption-side consistency (SBERT and VLM-as-a-judge) under multiple candidate pool sizes. Across all settings, LaSTR outperforms random and CLIP baselines, yielding improved ranking quality and stronger semantic agreement between retrieved segments and query intent.

Title: Qwen3-Coder-Next Technical Report

Authors: Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, Zeyao Ma, Kashun Shum, Xuwu Wang, Jinxi Wei, Jiaxi Yang, Jiajun Zhang, Lei Zhang, Zongmeng Zhang, Wenting Zhao, Fan Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00729
Pdf URL: https://arxiv.org/pdf/2603.00729
Copy Paste: [[2603.00729]] Qwen3-Coder-Next Technical Report(https://arxiv.org/abs/2603.00729)
Keywords: language model, agent
Abstract: We present Qwen3-Coder-Next, an open-weight language model specialized for coding agents. Qwen3-Coder-Next is an 80-billion-parameter model that activates only 3 billion parameters during inference, enabling strong coding capability with efficient inference. In this work, we explore how far strong training recipes can push the capability limits of models with small parameter footprints. To achieve this, we perform agentic training through large-scale synthesis of verifiable coding tasks paired with executable environments, allowing learning directly from environment feedback via mid-training and reinforcement learning. Across agent-centric benchmarks including SWE-Bench and Terminal-Bench, Qwen3-Coder-Next achieves competitive performance relative to its active parameter count. We release both base and instruction-tuned open-weight versions to support research and real-world coding agent development.

Title: A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction

Authors: Ruihao Pan, Suhang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00823
Pdf URL: https://arxiv.org/pdf/2603.00823
Copy Paste: [[2603.00823]] A Comprehensive Evaluation of LLM Unlearning Robustness under Multi-Turn Interaction(https://arxiv.org/abs/2603.00823)
Keywords: language model, llm
Abstract: Machine unlearning aims to remove the influence of specific training data from pre-trained models without retraining from scratch, and is increasingly important for large language models (LLMs) due to safety, privacy, and legal concerns. Although prior work primarily evaluates unlearning in static, single-turn settings, forgetting robustness under realistic interactive use remains underexplored. In this paper, we study whether unlearning remains stable in interactive environments by examining two common interaction patterns: self-correction and dialogue-conditioned querying. We find that knowledge appearing forgotten in static evaluation can often be recovered through interaction. Although stronger unlearning improves apparent robustness, it often results in behavioral rigidity rather than genuine knowledge erasure. Our findings suggest that static evaluation may overestimate real-world effectiveness and highlight the need for ensuring stable forgetting under interactive settings.
摘要：机器取消学习旨在消除预训练模型中特定训练数据的影响，而无需从头开始重新训练，并且由于安全、隐私和法律问题，对于大型语言模型 (LLM) 来说越来越重要。尽管之前的工作主要评估静态、单轮设置中的遗忘，但现实交互使用下的遗忘鲁棒性仍未得到充分探索。在本文中，我们通过检查两种常见的交互模式：自我纠正和对话条件查询，研究遗忘在交互环境中是否保持稳定。我们发现，静态评估中被遗忘的知识通常可以通过交互来恢复。尽管更强的忘却可以提高表面上的鲁棒性，但它通常会导致行为僵化，而不是真正的知识擦除。我们的研究结果表明，静态评估可能高估了现实世界的有效性，并强调了确保交互环境下稳定遗忘的必要性。

Title: Constitutional Black-Box Monitoring for Scheming in LLM Agents

Authors: Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.00829
Pdf URL: https://arxiv.org/pdf/2603.00829
Copy Paste: [[2603.00829]] Constitutional Black-Box Monitoring for Scheming in LLM Agents(https://arxiv.org/abs/2603.00829)
Keywords: language model, llm, prompt, agent
Abstract: Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.
摘要：在自治环境中安全部署大型语言模型 (LLM) 代理需要可靠的监督机制。一个核心挑战是检测阴谋，即特工暗中追求不一致的目标。减轻此类风险的一种方法是基于 LLM 的监控：使用语言模型检查代理行为是否存在可疑行为。我们研究宪法黑盒监视器：提示分类器仅使用外部可观察的输入和输出来检测阴谋，并根据自然语言行为规范生成的合成数据进行优化。我们引入了两个用于生成合成智能体轨迹的管道：STRIDE（迭代细化）和 Gloom（智能体环境模拟），每个管道生成 1,000 个样本。我们通过提示扫描、人工细化和自动提示优化来优化这些数据集上的前沿 LLM 监视器，并评估 ControlArena 中 7,500 个保留轨迹的性能，ControlArena 是一套接地环境，代理在更现实的环境中运行。我们的结果表明，纯粹根据合成数据选择的监视器可以推广到更现实的环境，捕获有意义的阴谋信号。然而，我们发现在我们的设置中性能很快就饱和了，简单的提示扫描与更广泛的优化结果相匹配。超出此限制不会产生进一步的改进，反而会导致过度拟合。

Title: Learning Nested Named Entity Recognition from Flat Annotations

Authors: Igor Rozhkov, Natalia Loukachevitch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00840
Pdf URL: https://arxiv.org/pdf/2603.00840
Copy Paste: [[2603.00840]] Learning Nested Named Entity Recognition from Flat Annotations(https://arxiv.org/abs/2603.00840)
Keywords: llm
Abstract: Nested named entity recognition identifies entities contained within other entities, but requires expensive multi-level annotation. While flat NER corpora exist abundantly, nested resources remain scarce. We investigate whether models can learn nested structure from flat annotations alone, evaluating four approaches: string inclusions (substring matching), entity corruption (pseudo-nested data), flat neutralization (reducing false negative signal), and a hybrid fine-tuned + LLM pipeline. On NEREL, a Russian benchmark with 29 entity types where 21% of entities are nested, our best combined method achieves 26.37% inner F1, closing 40% of the gap to full nested supervision. Code is available at this https URL.
摘要：嵌套命名实体识别可识别其他实体中包含的实体，但需要昂贵的多级注释。虽然扁平 NER 语料库大量存在，但嵌套资源仍然稀缺。我们研究模型是否可以仅从平面注释中学习嵌套结构，评估四种方法：字符串包含（子字符串匹配）、实体损坏（伪嵌套数据）、平面中和（减少假阴性信号）和混合微调 + LLM 管道。在 NEREL（一个包含 29 种实体类型的俄罗斯基准，其中 21% 的实体是嵌套的）上，我们的最佳组合方法实现了 26.37% 的内部 F1，缩小了与完全嵌套监督的 40% 的差距。代码可从此 https URL 获取。
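
Note: The "string inclusions (substring matching)" approach named in the abstract above can be illustrated with a minimal Python sketch. The data layout, the function name `string_inclusion_candidates`, and the character-level matching are assumptions made for this example only; the paper's actual procedure may differ (for instance, it may match on token boundaries).

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: (start offset in the document, surface text, entity type)
FlatEntity = Tuple[int, str, str]
# (start offset, end offset, entity type)
Span = Tuple[int, int, str]


def string_inclusion_candidates(flat: List[FlatEntity]) -> List[Span]:
    """Propose inner entities by finding known entity strings inside longer ones."""
    # Lexicon of surface forms observed anywhere in the flat annotations.
    lexicon: Dict[str, str] = {}
    for _, text, etype in flat:
        if text:
            lexicon.setdefault(text, etype)

    candidates: Set[Span] = set()
    for start, text, _ in flat:
        for surface, etype in lexicon.items():
            if len(surface) >= len(text):
                continue  # only strictly shorter strings can be nested
            pos = text.find(surface)
            while pos != -1:
                candidates.add((start + pos, start + pos + len(surface), etype))
                pos = text.find(surface, pos + 1)
    return sorted(candidates)


flat_annotations = [
    (0, "University of Moscow", "ORG"),
    (30, "Moscow", "CITY"),
]
nested = string_inclusion_candidates(flat_annotations)
```

Here the flat annotation of "Moscow" elsewhere in the corpus is reused to propose an inner entity inside the longer organization span, which is the kind of pseudo-nested signal a flat corpus can supply without extra annotation.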

Title: MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine

Authors: Kai Zhang, Zhengqing Yuan, Cheng Peng, Songlin Zhao, Mengxian Lyu, Ziyi Chen, Yanfang Ye, Wei Liu, Ying Zhang, Kaleb E Smith, Lifang He, Lichao Sun, Yonghui Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00842
Pdf URL: https://arxiv.org/pdf/2603.00842
Copy Paste: [[2603.00842]] MedGPT-oss: Training a General-Purpose Vision-Language Model for Biomedicine(https://arxiv.org/abs/2603.00842)
Keywords: language model, gpt
Abstract: Biomedical multimodal assistants have the potential to unify radiology, pathology, and clinical-text reasoning, yet a critical deployment gap remains: top-performing systems are either closed-source or computationally prohibitive, precluding the on-premises deployment required for patient privacy and PHI compliance. We introduce MEDGPT-OSS, an open-weight, 20B-parameter generalist vision-language model designed to facilitate open research in clinical AI. Rather than relying on architectural complexity, MEDGPT-OSS pairs the GPT-oss language backbone with a visual front-end via an optimized, three-stage training curriculum. By progressively domain-adapting these modules through rigorous data curation and long-context multimodal alignment, we demonstrate that a 20B model can bridge the capacity gap. It outperforms larger open medical models on out-of-distribution (OOD) multimodal reasoning and complex text-only clinical tasks. By unifying diverse modalities under a single instruction-following interface, MEDGPT-OSS maintains a parameter-efficient footprint fully compatible with commodity GPUs. We release the complete training recipe, open-weight checkpoints, and a rigorous evaluation harness to serve as a verifiable foundation for privacy-preserving, institution-specific clinical AI research.
摘要：生物医学多模式助手有潜力统一放射学、病理学和临床文本推理，但仍然存在一个关键的部署差距：性能最好的系统要么是闭源的，要么是计算量巨大的，无法进行患者隐私和 PHI 合规性所需的本地部署。我们推出 MEDGPT-OSS，这是一种开放权重、20B 参数的通用视觉语言模型，旨在促进临床 AI 的开放研究。 MEDGPT-OSS 不依赖架构复杂性，而是通过优化的三阶段培训课程将 GPT-oss 语言主干与可视化前端配对。通过严格的数据管理和长上下文多模式对齐逐步对这些模块进行领域适应，我们证明 20B 模型可以弥补能力差距。它在分布外 (OOD) 多模态推理和复杂的纯文本临床任务方面成功超越了更大的开放医学模型。通过在单个指令跟踪接口下统一不同的模式，MEDGPT-OSS 保持了与商用 GPU 完全兼容的参数高效占用空间。我们发布了完整的训练方案、开放权重检查点和严格的评估工具，为保护隐私、机构特定的临床人工智能研究奠定了可验证的基础。

Title: CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning

Authors: Xinyu Zhu, Yihao Feng, Yanchao Sun, Xianzhi Du, Pingzhi Li, Olli Saarikivi, Yun Zhu, Yu Meng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00889
Pdf URL: https://arxiv.org/pdf/2603.00889
Copy Paste: [[2603.00889]] CHIMERA: Compact Synthetic Data for Generalizable LLM Reasoning(https://arxiv.org/abs/2603.00889)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) have recently exhibited remarkable reasoning capabilities, largely enabled by supervised fine-tuning (SFT)- and reinforcement learning (RL)-based post-training on high-quality reasoning data. However, reproducing and extending these capabilities in open and scalable settings is hindered by three fundamental data-centric challenges: (1) the cold-start problem, arising from the lack of seed datasets with detailed, long Chain-of-Thought (CoT) trajectories needed to initialize reasoning policies; (2) limited domain coverage, as most existing open-source reasoning datasets are concentrated in mathematics, with limited coverage of broader scientific disciplines; and (3) the annotation bottleneck, where the difficulty of frontier-level reasoning tasks makes reliable human annotation prohibitively expensive or infeasible. To address these challenges, we introduce CHIMERA, a compact synthetic reasoning dataset comprising 9K samples for generalizable cross-domain reasoning. CHIMERA is constructed with three key properties: (1) it provides rich, long CoT reasoning trajectories synthesized by state-of-the-art reasoning models; (2) it has broad and structured coverage, spanning 8 major scientific disciplines and over 1K fine-grained topics organized via a model-generated hierarchical taxonomy; and (3) it employs a fully automated, scalable evaluation pipeline that uses strong reasoning models to cross-validate both problem validity and answer correctness. We use CHIMERA to post-train a 4B Qwen3 model. Despite the dataset's modest size, the resulting model achieves strong performance on a suite of challenging reasoning benchmarks, including GPQA-Diamond, AIME 24/25/26, HMMT 25, and Humanity's Last Exam, approaching or matching the reasoning performance of substantially larger models such as DeepSeek-R1 and Qwen3-235B.
摘要：大型语言模型 (LLM) 最近表现出了卓越的推理能力，这在很大程度上是通过对高质量推理数据进行基于监督微调 (SFT) 和强化学习 (RL) 的后期训练来实现的。然而，在开放和可扩展的环境中复制和扩展这些功能受到三个以数据为中心的基本挑战的阻碍：（1）冷启动问题，这是由于缺乏初始化推理策略所需的详细、长的思想链（CoT）轨迹的种子数据集而引起的；（2）领域覆盖范围有限，因为大多数现有的开源推理数据集都集中在数学领域，而对更广泛的科学学科的覆盖范围有限； (3)注释瓶颈，前沿级推理任务的难度使得可靠的人工注释变得极其昂贵或不可行。为了应对这些挑战，我们引入了 CHIMERA，这是一个紧凑的合成推理数据集，包含 9K 个样本，用于可泛化的跨域推理。 CHIMERA 的构建具有三个关键特性：（1）它提供由最先进的推理模型合成的丰富、长的 CoT 推理轨迹； (2) 它具有广泛且结构化的覆盖范围，涵盖 8 个主要科学学科和通过模型生成的层次分类法组织的超过 1000 个细粒度主题； (3) 它采用全自动、可扩展的评估流程，使用强大的推理模型来交叉验证问题的有效性和答案的正确性。我们使用 CHIMERA 对 4B Qwen3 模型进行后训练。尽管数据集规模不大，但生成的模型在一系列具有挑战性的推理基准上实现了强大的性能，包括 GPQA-Diamond、AIME 24/25/26、HMMT 25 和 Humanity's Last Exam，接近或匹配更大模型（例如 DeepSeek-R1 和 Qwen3-235B）的推理性能。

Title: KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging

Authors: Lianjun Liu, Hongli An, Weiqi Yan, Xin Du, Shengchuan Zhang, Huazhong Liu, Yunshan Zhong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00907
Pdf URL: https://arxiv.org/pdf/2603.00907
Copy Paste: [[2603.00907]] KVSlimmer: Theoretical Insights and Practical Optimizations for Asymmetric KV Merging(https://arxiv.org/abs/2603.00907)
Keywords: language model, llm
Abstract: The growing computational and memory demands of the Key-Value (KV) cache significantly limit the ability of Large Language Models (LLMs). While KV merging has emerged as a promising solution, existing methods that rely on empirical observations of KV asymmetry and gradient-based Hessian approximations lack a theoretical foundation and incur suboptimal compression and inference overhead. To bridge these gaps, we establish a theoretical framework that characterizes this asymmetry through the spectral energy distribution of projection weights, demonstrating that concentrated spectra in Query/Key weights induce feature homogeneity, whereas dispersed spectra in Value weights preserve heterogeneity. Then, we introduce KVSlimmer, an efficient algorithm that captures exact Hessian information through a mathematically exact formulation, and derives a closed-form solution utilizing only forward-pass variables, resulting in a gradient-free approach that is both memory- and time-efficient. Extensive experiments across various models and benchmarks demonstrate that KVSlimmer consistently outperforms SOTA methods. For instance, on Llama3.1-8B-Instruct, it improves the LongBench average score by 0.92 while reducing memory costs and latency by 29% and 28%, respectively.
摘要：键值 (KV) 缓存不断增长的计算和内存需求极大地限制了大型语言模型 (LLM) 的能力。虽然 KV 合并已成为一种有前途的解决方案，但依赖于 KV 不对称性的经验观察和基于梯度的 Hessian 近似的现有方法缺乏理论基础，并且会产生次优的压缩和推理开销。为了弥补这些差距，我们建立了一个理论框架，通过投影权重的光谱能量分布来表征这种不对称性，证明查询/关键权重中的集中光谱会导致特征同质性，而值权重中的分散光谱则保留了异质性。然后，我们介绍了 KVSlimmer，这是一种高效的算法，它通过数学上精确的公式捕获精确的 Hessian 信息，并仅利用前向传递变量导出封闭式解决方案，从而形成一种既节省内存又节省时间的无梯度方法。跨各种模型和基准的大量实验表明，KVSlimmer 始终优于 SOTA 方法。例如，在 Llama3.1-8B-Instruct 上，它将 LongBench 平均得分提高了 0.92，同时将内存成本和延迟分别降低了 29% 和 28%。

Title: Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment

Authors: Shravani Hariprasad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00917
Pdf URL: https://arxiv.org/pdf/2603.00917
Copy Paste: [[2603.00917]] Prompt Sensitivity and Answer Consistency of Small Open-Source Large Language Models on Clinical Question Answering: Implications for Low-Resource Healthcare Deployment(https://arxiv.org/abs/2603.00917)
Keywords: language model, prompt
Abstract: Small open-source language models are gaining attention for low-resource healthcare settings, but their reliability under different prompt phrasings remains poorly understood. We evaluated five open-source models (Gemma 2 2B, Phi-3 Mini 3.8B, Llama 3.2 3B, Mistral 7B, and Meditron-7B domain-pretrained without instruction tuning) across three clinical QA datasets (MedQA, MedMCQA, PubMedQA) using five prompt styles (original, formal, simplified, roleplay, direct). We measured consistency scores, accuracy, and instruction-following failure rates. All inference ran locally on consumer CPU hardware without fine-tuning. Consistency and accuracy were largely independent. Gemma 2 achieved the highest consistency (0.845-0.888) but lowest accuracy (33.0-43.5%), while Llama 3.2 showed moderate consistency (0.774-0.807) with the highest accuracy (49.0-65.0%). Roleplay prompts consistently reduced accuracy across all models, with Phi-3 Mini dropping 21.5 percentage points on MedQA. Meditron-7B exhibited near-complete instruction-following failure on PubMedQA (99.0% UNKNOWN rate), showing domain pretraining alone is insufficient for structured clinical QA. High consistency does not imply correctness. Models can be reliably wrong, a dangerous failure mode in clinical AI. Roleplay prompts should be avoided in healthcare applications. Llama 3.2 showed the strongest balance of accuracy and reliability for low-resource deployment. Safe clinical AI requires joint evaluation of consistency, accuracy, and instruction adherence.
摘要：小型开源语言模型正在引起资源匮乏的医疗保健环境的关注，但它们在不同提示措辞下的可靠性仍然知之甚少。我们使用五种提示样式（原始、正式、简化、角色扮演、直接）在三个临床 QA 数据集（MedQA、MedMCQA、PubMedQA）中评估了五个开源模型（Gemma 2 2B、Phi-3 Mini 3.8B、Llama 3.2 3B、Mistral 7B 和 Meditron-7B 域预训练，无需指令调整）。我们测量了一致性分数、准确性和指令遵循失败率。所有推理均在消费级 CPU 硬件上本地运行，无需进行微调。一致性和准确性在很大程度上是独立的。 Gemma 2 的一致性最高（0.845-0.888），但准确率最低（33.0-43.5%），而 Llama 3.2 表现出中等一致性（0.774-0.807），准确率最高（49.0-65.0%）。角色扮演提示所有模型的准确性持续降低，其中 Phi-3 Mini 在 MedQA 上下降了 21.5 个百分点。 Meditron-7B 在 PubMedQA 上表现出接近完全的指令遵循失败（99.0% 未知率），表明单独的域预训练不足以进行结构化临床 QA。高一致性并不意味着正确性。模型可能确实是错误的，这是临床人工智能中的一种危险的失败模式。在医疗保健应用程序中应避免角色扮演提示。 Llama 3.2 在低资源部署方面表现出了准确性和可靠性的最强平衡。安全的临床人工智能需要对一致性、准确性和指令依从性进行联合评估。

Title: Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan

Authors: Siyu Liang, Talant Mawkanuli, Gina-Anne Levow
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00923
Pdf URL: https://arxiv.org/pdf/2603.00923
Copy Paste: [[2603.00923]] Hybrid Neural-LLM Pipeline for Morphological Glossing in Endangered Language Documentation: A Case Study of Jungar Tuvan(https://arxiv.org/abs/2603.00923)
Keywords: language model, llm, prompt
Abstract: Interlinear glossed text (IGT) creation remains a major bottleneck in linguistic documentation and fieldwork, particularly for low-resource morphologically rich languages. We present a hybrid automatic glossing pipeline that combines neural sequence labeling with large language model (LLM) post-correction, evaluated on Jungar Tuvan, a low-resource Turkic language. Through systematic ablation studies, we show that retrieval-augmented prompting provides substantial gains over random example selection. We further find that morpheme dictionaries paradoxically hurt performance compared to providing no dictionary at all in most cases, and that performance scales approximately logarithmically with the number of few-shot examples. Most significantly, our two-stage pipeline combining a BiLSTM-CRF model with LLM post-correction yields substantial gains for most models, achieving meaningful reductions in annotation workload. Drawing on these findings, we establish concrete design principles for integrating structured prediction models with LLM reasoning in morphologically complex fieldwork contexts. These principles demonstrate that hybrid architectures offer a promising direction for computationally light solutions to automatic linguistic annotation in endangered language documentation.
摘要：行间注释文本 (IGT) 创建仍然是语言文献和实地工作的主要瓶颈，特别是对于资源匮乏、形态丰富的语言而言。我们提出了一种混合自动注释管道，它将神经序列标记与大语言模型（LLM）后校正相结合，并在 Jungar Tuvan（一种资源匮乏的突厥语言）上进行了评估。通过系统的消融研究，我们表明检索增强提示比随机示例选择提供了实质性的好处。我们进一步发现，在大多数情况下，与根本不提供字典相比，语素字典反而会损害性能，并且性能与少样本示例的数量近似呈对数关系。最重要的是，我们的两阶段流程将 BiLSTM-CRF 模型与 LLM 后校正相结合，为大多数模型带来了巨大的收益，从而显着减少了注释工作量。根据这些发现，我们建立了具体的设计原则，用于在形态复杂的现场工作环境中将结构化预测模型与 LLM 推理相结合。这些原理表明，混合架构为濒危语言文档中的自动语言注释的轻计算解决方案提供了一个有前途的方向。

Title: Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains

Authors: Manil Shrestha, Edward Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.00924
Pdf URL: https://arxiv.org/pdf/2603.00924
Copy Paste: [[2603.00924]] Conformal Prediction for Risk-Controlled Medical Entity Extraction Across Clinical Domains(https://arxiv.org/abs/2603.00924)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are increasingly used for medical entity extraction, yet their confidence scores are often miscalibrated, limiting safe deployment in clinical settings. We present a conformal prediction framework that provides finite-sample coverage guarantees for LLM-based extraction across two clinical domains. First, we extract structured entities from 1,000 FDA drug labels across eight sections using GPT-4.1, verified via FactScore-based atomic statement evaluation (97.7\% accuracy over 128,906 entities). Second, we extract radiological entities from MIMIC-CXR reports using the RadGraph schema with GPT-4.1 and Llama-4-Maverick, evaluated against physician annotations (entity F1: 0.81 to 0.84). Our central finding is that miscalibration direction reverses across domains: on well-structured FDA labels, models are underconfident, requiring modest conformal thresholds ($\tau \approx 0.06$), while on free-text radiology reports, models are overconfident, demanding strict thresholds ($\tau$ up to 0.99). Despite this heterogeneity, conformal prediction achieves target coverage ($\geq 90\%$) in both settings with manageable rejection rates (9--13\%). These results demonstrate that calibration is not a global model property but depends on document structure, extraction category, and model architecture, motivating domain-specific conformal calibration for safe clinical deployment.
摘要：大型语言模型 (LLM) 越来越多地用于医疗实体提取，但它们的置信度分数经常被错误校准，限制了在临床环境中的安全部署。我们提出了一个保形预测框架，为跨两个临床领域基于 LLM 的提取提供有限样本覆盖保证。首先，我们使用 GPT-4.1 从 8 个部分的 1,000 个 FDA 药物标签中提取结构化实体，并通过基于 FactScore 的原子语句评估进行验证（在 128,906 个实体上的准确率为 97.7%）。其次，我们使用 GPT-4.1 和 Llama-4-Maverick 的 RadGraph 模式从 MIMIC-CXR 报告中提取放射学实体，并根据医生注释进行评估（实体 F1：0.81 至 0.84）。我们的中心发现是跨领域的错误校准方向相反：在结构良好的 FDA 标签上，模型信心不足，需要适度的保形阈值（$\tau \approx 0.06$），而在自由文本放射学报告上，模型过于自信，要求严格的阈值（$\tau$ 高达 0.99）。尽管存在这种异质性，保形预测在两种设置中都实现了目标覆盖率 ($\geq 90\%$)，并且拒绝率可控 (9--13\%)。这些结果表明，校准不是全局模型属性，而是取决于文档结构、提取类别和模型架构，从而激发了特定领域的保形校准以实现安全的临床部署。
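
Note: The finite-sample coverage guarantee mentioned in the abstract above is typically obtained by calibrating a threshold on held-out data. The Python sketch below shows generic split conformal calibration, not the authors' implementation. The nonconformity score (for example, one minus the model's confidence in a verified-correct entity), the function names, and the sample values are all assumptions for illustration.

```python
import math
from typing import List


def conformal_threshold(cal_scores: List[float], alpha: float = 0.1) -> float:
    """Return the split-conformal quantile of calibration nonconformity scores.

    With exchangeable data, a new correct item has a score at or below this
    threshold with probability at least 1 - alpha.
    """
    n = len(cal_scores)
    if n == 0:
        raise ValueError("need at least one calibration score")
    # Finite-sample correction: take the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return float("inf")  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


def accept(score: float, threshold: float) -> bool:
    """Keep an extracted entity only if its nonconformity score is within the threshold."""
    return score <= threshold


# Hypothetical calibration scores, e.g. 1 - confidence for verified-correct entities.
calibration = [0.05, 0.40, 0.10]
tau = conformal_threshold(calibration, alpha=0.5)
```

Because the threshold is a quantile of the calibration scores, an underconfident model (large scores on correct items) and an overconfident one (small scores) end up with very different thresholds, which is consistent with the domain-dependent values of $\tau$ reported in the abstract.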

Title: The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors

Authors: Li Lucy, Albert Zhang, Nathan Anderson, Ryan Knight, Kyle Lo
Subjects: cs.CL, cs.CV, cs.CY
Abstract URL: https://arxiv.org/abs/2603.00925
Pdf URL: https://arxiv.org/pdf/2603.00925
Copy Paste: [[2603.00925]] The Aftermath of DrawEduMath: Vision Language Models Underperform with Struggling Students and Misdiagnose Errors(https://arxiv.org/abs/2603.00925)
Keywords: language model
Abstract: Effective mathematics education requires identifying and responding to students' mistakes. For AI to support pedagogical applications, models must perform well across different levels of student proficiency. Our work provides an extensive, year-long snapshot of how 11 vision-language models (VLMs) perform on DrawEduMath, a QA benchmark involving real students' handwritten, hand-drawn responses to math problems. We find that models' weaknesses concentrate on a core component of math education: student error. All evaluated VLMs underperform when describing work from students who require more pedagogical help, and across all QA, they struggle the most on questions related to assessing student error. Thus, while VLMs may be optimized to be math problem solving experts, our results suggest that they require alternative development incentives to adequately support educational use cases.
摘要：有效的数学教育需要识别并纠正学生的错误。为了让人工智能支持教学应用，模型必须在不同水平的学生熟练程度中表现良好。我们的工作提供了长达一年的广泛快照，展示了 11 个视觉语言模型 (VLM) 在 DrawEduMath 上的表现，DrawEduMath 是一个 QA 基准，涉及真实学生对数学问题的手写、手绘回答。我们发现模型的弱点集中在数学教育的核心组成部分：学生错误。所有接受评估的 VLM 在描述需要更多教学帮助的学生的作业时都表现不佳，并且在所有 QA 中，它们在与评估学生错误相关的问题上表现最差。因此，虽然 VLM 可能被优化为数学问题解决专家，但我们的结果表明，它们需要替代发展激励措施来充分支持教育用例。

Title: Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages

Authors: Kaushal Santosh Bhogale, Tahir Javed, Greeshma Susan John, Dhruv Rathi, Akshayasree Padmanaban, Niharika Parasa, Mitesh M. Khapra
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2603.00941
Pdf URL: https://arxiv.org/pdf/2603.00941
Copy Paste: [[2603.00941]] Towards Orthographically-Informed Evaluation of Speech Recognition Systems for Indian Languages(https://arxiv.org/abs/2603.00941)
Keywords: llm
Abstract: Evaluating ASR systems for Indian languages is challenging due to spelling variations, suffix splitting flexibility, and non-standard spellings in code-mixed words. Traditional Word Error Rate (WER) often presents a bleaker picture of system performance than what human users perceive. Better aligning evaluation with real-world performance requires capturing permissible orthographic variations, which is extremely challenging for under-resourced Indian languages. Leveraging recent advances in LLMs, we propose a framework for creating benchmarks that capture permissible variations. Through extensive experiments, we demonstrate that OIWER, by accounting for orthographic variations, reduces pessimistic error rates (an average improvement of 6.3 points), narrows inflated model gaps (e.g., Gemini-Canary performance difference drops from 18.1 to 11.5 points), and aligns more closely with human perception than prior methods like WER-SN by 4.9 points.
摘要：由于拼写变化、后缀分割灵活性以及代码混合单词中的非标准拼写，评估印度语言的 ASR 系统具有挑战性。传统的字错误率 (WER) 通常呈现出比人类用户感知更黯淡的系统性能图景。更好地使评估与现实世界的表现保持一致需要捕获允许的拼写变化，这对于资源贫乏的印度语言来说极具挑战性。利用 LLM 的最新进展，我们提出了一个框架，用于创建捕获允许变化的基准。通过大量实验，我们证明 OIWER 通过考虑拼写变化，降低了悲观错误率（平均提高了 6.3 个点），缩小了夸大的模型差距（例如，Gemini-Canary 性能差异从 18.1 点下降到 11.5 点），并且比 WER-SN 等之前的方法更符合人类感知 4.9 点。
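
Note: To make the contrast with traditional WER concrete, the Python sketch below computes word error rate by edit distance and optionally treats listed spelling variants as matches. This is only an illustration of the general idea of accounting for permissible orthographic variation; it is not the paper's OIWER definition, and the `variants` mapping, the function name, and the English example words are assumptions.

```python
from typing import Dict, List, Optional, Set


def wer(ref: List[str], hyp: List[str],
        variants: Optional[Dict[str, Set[str]]] = None) -> float:
    """Word error rate via Levenshtein distance over words.

    `variants` maps a reference word to hypothesis spellings that should be
    counted as correct (a hypothetical stand-in for permissible variations).
    """
    if not ref:
        raise ValueError("reference must contain at least one word")
    variants = variants or {}

    def same(r: str, h: str) -> bool:
        return r == h or h in variants.get(r, set())

    # prev[j] = edit distance between the reference prefix so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if same(r, h) else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


reference = "the colour is red".split()
hypothesis = "the color is red".split()
strict = wer(reference, hypothesis)
lenient = wer(reference, hypothesis, variants={"colour": {"color"}})
```

Under the strict metric the spelling difference counts as a substitution, while the variant-aware call treats it as a match, which mirrors why plain WER can look more pessimistic than what human users perceive.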

Title: S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature

Authors: Abigail Berthe-Pardo (1), Gaspard Michel (1 and 2), Elena V. Epure (2 and 3), Christophe Cerisara (1) ((1) LORIA, Vandœuvre-lès-Nancy, France, (2) Deezer Research, Paris, France, (3) Idiap Research Institute, Switzerland)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.00958
Pdf URL: https://arxiv.org/pdf/2603.00958
Copy Paste: [[2603.00958]] S-VoCAL: A Dataset and Evaluation Framework for Inferring Speaking Voice Character Attributes in Literature(https://arxiv.org/abs/2603.00958)
Keywords: language model, retrieval-augmented generation
Abstract: With recent advances in Text-to-Speech (TTS) systems, synthetic audiobook narration has seen increased interest, reaching unprecedented levels of naturalness. However, larger gaps remain in synthetic narration systems' ability to impersonate fictional characters and convey complex emotions or prosody. A promising direction to enhance character identification is the assignment of plausible voices to each fictional character in a book. This step typically requires complex inference of attributes in book-length contexts, such as a character's age, gender, origin or physical health, which in turn requires dedicated benchmark datasets to evaluate extraction systems' performance. We present S-VoCAL (Speaking Voice Character Attributes in Literature), the first dataset and evaluation framework dedicated to evaluating the inference of voice-related fictional character attributes. S-VoCAL comprises 8 attributes grounded in sociophonetic studies, and 952 character-book pairs derived from Project Gutenberg. Its evaluation framework addresses the particularities of each attribute, and includes a novel similarity metric based on recent Large Language Model embeddings. We demonstrate the applicability of S-VoCAL by applying a simple Retrieval-Augmented Generation (RAG) pipeline to the task of inferring character attributes. Our results suggest that the RAG pipeline reliably infers attributes such as Age or Gender, but struggles with others such as Origin or Physical Health. The dataset and evaluation code are available at this https URL.
摘要：随着文本转语音 (TTS) 系统的最新进展，合成有声读物旁白越来越受到人们的关注，达到了前所未有的自然程度。然而，合成叙述系统在模仿虚构人物以及传达复杂情感或韵律的能力方面仍存在较大差距。增强角色识别的一个有希望的方向是为书中的每个虚构角色分配合理的声音。此步骤通常需要对书本上下文中的属性进行复杂的推断，例如角色的年龄、性别、出身或身体健康状况，这反过来又需要专用的基准数据集来评估提取系统的性能。我们提出了 S-VoCAL（文学中的说话语音字符属性），这是第一个致力于评估与语音相关的虚构字符属性的推理的数据集和评估框架。 S-VoCAL 包含基于社会语音学研究的 8 个属性，以及源自古腾堡计划的 952 个字符书对。其评估框架解决了每个属性的特殊性，并包括基于最新大型语言模型嵌入的新颖相似性度量。我们通过将简单的检索增强生成（RAG）管道应用于推断角色属性的任务来证明 S-VoCAL 的适用性。我们的结果表明，RAG 管道可以可靠地推断出年龄或性别等属性，但在出生地或身体健康等其他属性上却很困难。数据集和评估代码可在此 https URL 获取。

Title: Thoth: Mid-Training Bridges LLMs to Time Series Understanding

Authors: Jiafeng Lin, Yuxuan Wang, Jialong Wu, Huakun Luo, Zhongyi Pei, Jianmin Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01042
Pdf URL: https://arxiv.org/pdf/2603.01042
Copy Paste: [[2603.01042]] Thoth: Mid-Training Bridges LLMs to Time Series Understanding(https://arxiv.org/abs/2603.01042)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable success in general-purpose reasoning. However, they still struggle to understand and reason about time series data, which limits their effectiveness in decision-making scenarios that depend on temporal dynamics. In this paper, we propose Thoth, the first family of mid-trained LLMs with general-purpose time series understanding capabilities. As a pivotal intermediate stage, mid-training achieves task- and domain-agnostic alignment between time series and natural language, for which we construct Book-of-Thoth, a high-quality, time-series-centric mid-training corpus. Book-of-Thoth enables both time-series-to-text and text-to-time-series generation, equipping LLMs with a foundational grasp of temporal patterns. To better evaluate advanced reasoning capabilities, we further present KnoTS, a novel benchmark of knowledge-intensive time series understanding, designed for joint reasoning over temporal patterns and domain knowledge. Extensive experiments demonstrate that mid-training with Book-of-Thoth enables Thoth to significantly outperform its base model and advanced LLMs across a range of time series question answering benchmarks. Moreover, Thoth exhibits superior capabilities when fine-tuned under data scarcity, underscoring the effectiveness of mid-training for time series understanding. Code is available at: this https URL.
摘要：大型语言模型 (LLM) 在通用推理方面取得了显着的成功。然而，它们仍然难以理解和推理时间序列数据，这限制了它们在依赖时间动态的决策场景中的有效性。在本文中，我们提出了 Thoth，这是首个具有通用时间序列理解能力的中期训练 LLM 系列。作为关键的中间阶段，中期训练实现了时间序列和自然语言之间与任务和领域无关的对齐，为此我们构建了 Book-of-Thoth，一个高质量的、以时间序列为中心的中期训练语料库。 Book-of-Thoth 可以生成时间序列到文本和文本到时间序列，使 LLM 能够对时间模式有基本的掌握。为了更好地评估高级推理能力，我们进一步提出了 KnoTS，一种知识密集型时间序列理解的新基准，专为时间模式和领域知识的联合推理而设计。大量实验表明，使用 Book-of-Thoth 进行的中期训练使 Thoth 在一系列时间序列问答基准测试中的表现显着优于其基本模型和先进的 LLM。此外，Thoth 在数据稀缺的情况下进行微调时表现出卓越的能力，强调了中期训练对于时间序列理解的有效性。代码可在以下位置获得：此 https URL。

Title: GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant

Authors: Zhuokang Shen, Yifan Wang, Hanyu Chen, Wenxuan Huang, Shaohui Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01059
Pdf URL: https://arxiv.org/pdf/2603.01059
Copy Paste: [[2603.01059]] GroupGPT: A Token-efficient and Privacy-preserving Agentic Framework for Multi-User Chat Assistant(https://arxiv.org/abs/2603.01059)
Keywords: language model, gpt, llm, chat, agent
Abstract: Recent advances in large language models (LLMs) have enabled increasingly capable chatbots. However, most existing systems focus on single-user settings and do not generalize well to multi-user group chats, where agents require more proactive and accurate intervention under complex, evolving contexts. Existing approaches typically rely on LLMs for both reasoning and generation, leading to high token consumption, limited scalability, and potential privacy risks. To address these challenges, we propose GroupGPT, a token-efficient and privacy-preserving agentic framework for multi-user chat assistants. GroupGPT adopts a small-large model collaborative architecture to decouple intervention timing from response generation, enabling efficient and accurate decision-making. The framework also supports multimodal inputs, including memes, images, videos, and voice messages. We further introduce MUIR, a benchmark dataset for multi-user chat assistant intervention reasoning. MUIR contains 2,500 annotated group chat segments with intervention labels and rationales, supporting evaluation of timing accuracy and response quality. We evaluate a range of models on MUIR, from large language models to smaller counterparts. Extensive experiments demonstrate that GroupGPT produces accurate and well-timed responses, achieving an average score of 4.72/5.0 in LLM-based evaluation, and is well received by users across diverse group chat scenarios. Moreover, GroupGPT reduces token usage by up to 3 times compared to baseline methods, while providing privacy sanitization of user messages before cloud transmission. Code is available at: this https URL.
摘要：大型语言模型 (LLM) 的最新进展使得聊天机器人的能力越来越强。然而，大多数现有系统专注于单用户设置，并不能很好地推广到多用户群聊，在多用户群聊中，代理需要在复杂、不断变化的环境下进行更主动和更准确的干预。现有方法通常依赖 LLM 进行推理和生成，导致代币消耗高、可扩展性有限以及潜在的隐私风险。为了应对这些挑战，我们提出了 GroupGPT，这是一种用于多用户聊天助手的代币高效且保护隐私的代理框架。 GroupGPT采用小-大模型协同架构，将干预时间与响应生成解耦，从而实现高效、准确的决策。该框架还支持多模式输入，包括模因、图像、视频和语音消息。我们进一步介绍了 MUIR，一个用于多用户聊天助手干预推理的基准数据集。 MUIR 包含 2,500 个带注释的群聊片段，其中包含干预标签和理由，支持对计时准确性和响应质量的评估。我们在 MUIR 上评估了一系列模型，从大型语言模型到较小的对应模型。大量的实验表明，GroupGPT 的响应准确且及时，在基于 LLM 的评估中获得了 4.72/5.0 的平均分数，并且在不同的群聊场景中受到了用户的好评。此外，与基线方法相比，GroupGPT 将令牌使用量减少了多达 3 倍，同时在云传输之前对用户消息进行隐私清理。代码位于：此 https URL 。

Title: How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning

Authors: Xiangxiang Zhang, Caijun Jia, Siyuan Li, Dingyu He, Xiya Xiong, Zheng Sun, Honghao He, Yuchen Wu, Bihui Yu, Linzhuang Sun, Cheng Tan, Jingxuan Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01070
Pdf URL: https://arxiv.org/pdf/2603.01070
Copy Paste: [[2603.01070]] How RL Unlocks the Aha Moment in Geometric Interleaved Reasoning(https://arxiv.org/abs/2603.01070)
Keywords: language model, llm
Abstract: Solving complex geometric problems inherently requires interleaved reasoning: a tight alternation between constructing diagrams and performing logical deductions. Although recent Multimodal Large Language Models (MLLMs) have demonstrated strong capabilities in visual generation and plotting, we identify a counter-intuitive and underexplored phenomenon. Naively applying Supervised Fine-Tuning (SFT) on interleaved plot-solution data leads to a substantial degradation in reasoning performance compared to text-only baselines. We argue that this failure stems from a fundamental limitation of SFT, which primarily induces distributional alignment: the model learns to reproduce the surface format of interleaved plotting but fails to internalize the causal dependency between the generated plot and reasoning steps. To overcome this limitation, we propose Faire (Functional alignment for interleaved reasoning), a reinforcement learning framework that enforces three causal constraints to move beyond superficial imitation toward functional alignment. Extensive experiments show that Faire induces a qualitative shift in model behavior in which the plotting is effectively internalized, yielding competitive performance on challenging geometric reasoning benchmarks.
摘要：解决复杂的几何问题本质上需要交错推理：构建图表和执行逻辑演绎之间的紧密交替。尽管最近的多模态大型语言模型（MLLM）在视觉生成和绘图方面表现出了强大的能力，但我们发现了一种违反直觉且未得到充分探索的现象。与纯文本基线相比，对交错绘图解决方案数据天真地应用监督微调（SFT）会导致推理性能大幅下降。我们认为这种失败源于 SFT 的一个基本限制，它主要导致分布对齐：模型学习重现交错绘图的表面格式，但无法内化生成的绘图和推理步骤之间的因果依赖性。为了克服这一限制，我们提出了 Faire（交错推理的功能对齐），这是一种强化学习框架，它强制执行三个因果约束，以超越肤浅的模仿，转向功能对齐。大量实验表明，Faire 引起模型行为发生质的转变，其中绘图被有效内化，从而在具有挑战性的几何推理基准上产生具有竞争力的性能。

Title: CARD: Towards Conditional Design of Multi-agent Topological Structures

Authors: Tongtong Wu, Yanming Li, Ziye Tang, Chen Jiang, Linhao Luo, Guilin Qi, Shirui Pan, Gholamreza Haffari
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01089
Pdf URL: https://arxiv.org/pdf/2603.01089
Copy Paste: [[2603.01089]] CARD: Towards Conditional Design of Multi-agent Topological Structures(https://arxiv.org/abs/2603.01089)
Keywords: language model, llm, prompt, agent
Abstract: Large language model (LLM)-based multi-agent systems have shown strong capabilities in tasks such as code generation and collaborative reasoning. However, the effectiveness and robustness of these systems critically depend on their communication topology, which is often fixed or statically learned, ignoring real-world dynamics such as model upgrades, API (or tool) changes, or knowledge source variability. To address this limitation, we propose CARD (Conditional Agentic Graph Designer), a conditional graph-generation framework that instantiates AMACP, a protocol for adaptive multi-agent communication. CARD explicitly incorporates dynamic environmental signals into graph construction, enabling topology adaptation at both training and runtime. Through a conditional variational graph encoder and environment-aware optimization, CARD produces communication structures that are both effective and resilient to shifts in model capability or resource availability. Empirical results on HumanEval, MATH, and MMLU demonstrate that CARD consistently outperforms static and prompt-based baselines, achieving higher accuracy and robustness across diverse conditions. The source code is available at: this https URL.

Title: DEP: A Decentralized Large Language Model Evaluation Protocol

Authors: Jianxiang Peng, Junhao Li, Hongxiang Wang, Haocheng Lyu, Hui Guo, Siyi Hao, Zhen Wang, Chuang Liu, Shaowei Zhang, Bojian Xiong, Yue Chen, Zhuowen Han, Ling Shi, Tianyu Dong, Juesi Xiao, Lei Yang, Yuqi Ren, Deyi Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01167
Pdf URL: https://arxiv.org/pdf/2603.01167
Copy Paste: [[2603.01167]] DEP: A Decentralized Large Language Model Evaluation Protocol(https://arxiv.org/abs/2603.01167)
Keywords: language model, llm
Abstract: With the rapid development of Large Language Models (LLMs), a large number of benchmarks have been proposed. However, most benchmarks lack a unified evaluation standard and require the manual implementation of custom scripts, making it hard to ensure consistent and reproducible results. Furthermore, mainstream evaluation frameworks are centralized, holding both datasets and answers, which increases the risk of benchmark leakage. To address these issues, we propose the Decentralized Evaluation Protocol (DEP), a decentralized yet unified and standardized evaluation framework that works through a matching server without constraining benchmarks. The server can be mounted locally or deployed remotely, and once adapted, it can be reused over the long term. By decoupling users, LLMs, and benchmarks, DEP enables modular, plug-and-play evaluation: benchmark files and evaluation logic stay exclusively on the server side. In the remote setting, users cannot access the ground truth, thereby achieving data isolation and leak-proof evaluation. To facilitate practical adoption, we develop DEP Toolkit, a protocol-compatible toolkit that supports features such as breakpoint resume, concurrent requests, and congestion control. We also provide detailed documentation for adapting new benchmarks to DEP. Using DEP Toolkit, we evaluate multiple LLMs across benchmarks. Experimental results verify the effectiveness of DEP and show that it reduces the cost of deploying benchmark evaluations. As of February 2026, we have adapted over 60 benchmarks and continue to promote community co-construction to support unified evaluation across various tasks and domains.

Title: Token-level Data Selection for Safe LLM Fine-tuning

Authors: Yanping Li, Zhening Liu, Zijian Li, Zehong Lin, Jun Zhang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2603.01185
Pdf URL: https://arxiv.org/pdf/2603.01185
Copy Paste: [[2603.01185]] Token-level Data Selection for Safe LLM Fine-tuning(https://arxiv.org/abs/2603.01185)
Keywords: language model, llm
Abstract: Fine-tuning large language models (LLMs) on custom datasets has become a standard approach for adapting these models to specific domains and applications. However, recent studies have shown that such fine-tuning can lead to significant degradation in the model's safety. Existing defense methods operate at the sample level and often suffer from an unsatisfactory trade-off between safety and utility. To address this limitation, we perform a systematic token-level diagnosis of safety degradation during fine-tuning. Based on this, we propose token-level data selection for safe LLM fine-tuning (TOSS), a novel framework that quantifies the safety risk of each token by measuring the loss difference between a safety-degraded model and a utility-oriented model. This token-level granularity enables accurate identification and removal of unsafe tokens, thereby preserving valuable task-specific information. In addition, we introduce a progressive refinement strategy, TOSS-Pro, which iteratively enhances the safety-degraded model's ability to identify unsafe tokens. Extensive experiments demonstrate that our approach robustly safeguards LLMs during fine-tuning while achieving superior downstream task performance, significantly outperforming existing sample-level defense methods. Our code is available at this https URL.
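The token-level scoring idea described in the TOSS abstract can be illustrated with a minimal sketch. It assumes that per-token losses from the two models are already available as plain lists. The sign convention (utility-model loss minus safety-degraded-model loss), the fixed threshold, and all function names are illustrative assumptions, not the authors' implementation.

```python
from typing import List


def token_risk_scores(loss_degraded: List[float], loss_utility: List[float]) -> List[float]:
    """Per-token risk as a loss difference (sign convention is an assumption)."""
    if len(loss_degraded) != len(loss_utility):
        raise ValueError("loss lists must align token by token")
    # A token that the safety-degraded model predicts easily (low loss) but the
    # utility-oriented model does not (high loss) receives a high risk score.
    return [lu - ld for ld, lu in zip(loss_degraded, loss_utility)]


def keep_mask(scores: List[float], threshold: float) -> List[int]:
    """1 keeps the token in the fine-tuning loss, 0 drops it (hypothetical rule)."""
    return [0 if s > threshold else 1 for s in scores]


def masked_mean_loss(token_losses: List[float], mask: List[int]) -> float:
    """Average the training loss over kept tokens only."""
    kept = [loss for loss, m in zip(token_losses, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0
```

Masking individual tokens rather than discarding whole samples is what lets the rest of a sample keep contributing task-specific signal, which is the trade-off the abstract emphasizes.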

Title: Reasoning or Rationalization? The Role of Justifications in Masked Diffusion Models for Fact Verification

Authors: Jacob Devasier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01190
Pdf URL: https://arxiv.org/pdf/2603.01190
Copy Paste: [[2603.01190]] Reasoning or Rationalization? The Role of Justifications in Masked Diffusion Models for Fact Verification(https://arxiv.org/abs/2603.01190)
Keywords: language model, chain-of-thought
Abstract: Unlike autoregressive models, which generate tokens sequentially and benefit from reasoning-before-answering strategies such as Chain-of-Thought, Masked Diffusion Language Models (MDLMs) refine all sequence positions simultaneously, raising questions about how these models handle tasks requiring justified verdicts. In this work, we investigate the dynamics of MDLM reasoning on fact verification, examining whether justifications serve as genuine reasoning or post-hoc rationalization. We observe that MDLMs typically converge on a verdict early in the diffusion process, treating it as a global anchor that is resolved before the justification is complete. Crucially, enforcing a reasoning-first constraint via delayed verdict unmasking actively degrades performance, dropping accuracy from 86.2% to 71.9% as accumulating justification tokens introduce inconsistencies that override initially correct predictions. Interventional experiments reveal that the model rationalizes incorrect forced verdicts in 56% of cases, and that verdicts are strongly causally dependent on justification quality (57.3% accuracy with corrupted justifications vs. 97.1% with ground-truth). This causal dependence explains the degradation under forced deliberation: as the model generates noisy justification tokens, it conditions on them, gradually overriding its initially correct assessment. Our findings suggest that for fact verification with MDLMs, extended deliberation can be counterproductive, risking the dilution of accurate early predictions with noise introduced during justification generation.

Title: Reasoning Boosts Opinion Alignment in LLMs

Authors: Frédéric Berdoz, Yann Billeter, Yann Vonlanthen, Roger Wattenhofer
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01214
Pdf URL: https://arxiv.org/pdf/2603.01214
Copy Paste: [[2603.01214]] Reasoning Boosts Opinion Alignment in LLMs(https://arxiv.org/abs/2603.01214)
Keywords: language model, llm, prompt
Abstract: Opinion modeling aims to capture individual or group political preferences, enabling applications such as digital democracies, where models could help shape fairer and more popular policies. Given their versatility, strong generalization capabilities, and demonstrated success across diverse text-to-text applications, large language models (LLMs) are natural candidates for this task. However, due to their statistical nature and limited causal understanding, they tend to produce biased opinions when prompted naively. In this work, we study whether reasoning can improve opinion alignment. Motivated by the recent advancement in mathematical reasoning enabled by reinforcement learning (RL), we train models to produce profile-consistent answers through structured reasoning. We evaluate our approach on three datasets covering U.S., European, and Swiss politics. Results indicate that reasoning enhances opinion modeling and is competitive with strong baselines, but does not fully remove bias, highlighting the need for additional mechanisms to build faithful political digital twins using LLMs. By releasing both our method and datasets, we establish a solid baseline to support future research on LLM opinion alignment.

Title: Generative AI & Fictionality: How Novels Power Large Language Models

Authors: Edwin Roland, Richard Jean So
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01220
Pdf URL: https://arxiv.org/pdf/2603.01220
Copy Paste: [[2603.01220]] Generative AI & Fictionality: How Novels Power Large Language Models(https://arxiv.org/abs/2603.01220)
Keywords: language model, gpt, llm, chat
Abstract: Generative models, like the one in ChatGPT, are powered by their training data. The models are simply next-word predictors, based on patterns learned from vast amounts of pre-existing text. Since the first generation of GPT, it is striking that the most popular datasets have included substantial collections of novels. For the engineers and research scientists who build these models, there is a common belief that the language in fiction is rich enough to cover all manner of social and communicative phenomena, yet the belief has gone mostly unexamined. How does fiction shape the outputs of generative AI? Specifically, what are novels' effects relative to other forms of text, such as newspapers, Reddit, and Wikipedia? Since the 1970s, literature scholars such as Catherine Gallagher and James Phelan have developed robust and insightful accounts of how fiction operates as a form of discourse and language. Through our study of an influential open-source model (BERT), we find that LLMs leverage familiar attributes and affordances of fiction, while also fomenting new qualities and forms of social response. We argue that if contemporary culture is increasingly shaped by generative AI and machine learning, any analysis of today's various modes of cultural production must account for a relatively novel dimension: computational training data.

Title: Can Thinking Models Think to Detect Hateful Memes?

Authors: Mohamed Bayan Kmainasi, Mucahid Kutlu, Ali Ezzat Shahroor, Abul Hasnat, Firoj Alam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01225
Pdf URL: https://arxiv.org/pdf/2603.01225
Copy Paste: [[2603.01225]] Can Thinking Models Think to Detect Hateful Memes?(https://arxiv.org/abs/2603.01225)
Keywords: language model, llm, chain-of-thought
Abstract: Hateful memes often require compositional multimodal reasoning: the image and text may appear benign in isolation, yet their interaction conveys harmful intent. Although thinking-based multimodal large language models (MLLMs) have recently advanced vision-language understanding, their capabilities remain underexplored for hateful meme analysis. We propose a reinforcement learning based post-training framework that improves reasoning in thinking-based MLLMs through task-specific rewards and a novel Group Relative Policy Optimization (GRPO) objective. Specifically, we (i) conduct a systematic empirical study of off-the-shelf MLLMs for hateful meme understanding, (ii) extend an existing hateful meme dataset by generating weakly or pseudo-supervised chain-of-thought rationales via distillation, and (iii) introduce a GRPO-based objective that jointly optimizes meme classification and explanation quality to encourage fine-grained, step-by-step reasoning. Experiments on the Hateful Memes benchmark show that our approach achieves state-of-the-art performance, improving accuracy and F1 by approximately 1 percent and explanation quality by approximately 3 percent. We will publicly release our code, dataset extensions, and evaluation resources to support reproducibility.
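For readers unfamiliar with GRPO, the group-relative part can be sketched as follows. The normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO advantage. The way classification and explanation quality are combined, including the weights `w_cls` and `w_exp`, is a hypothetical choice made for illustration and is not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List


def combined_reward(label_correct: bool, explanation_score: float,
                    w_cls: float = 1.0, w_exp: float = 0.5) -> float:
    """Joint reward for one sampled response (weights are assumptions)."""
    return w_cls * float(label_correct) + w_exp * explanation_score


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within a group of responses to the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that beat the group average receive positive advantages and the rest negative ones, so no separate value network is needed to provide a baseline.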

Title: Self-Anchoring Calibration Drift in Large Language Models: How Multi-Turn Conversations Reshape Model Confidence

Authors: Harshavardhan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01239
Pdf URL: https://arxiv.org/pdf/2603.01239
Copy Paste: [[2603.01239]] Self-Anchoring Calibration Drift in Large Language Models: How Multi-Turn Conversations Reshape Model Confidence(https://arxiv.org/abs/2603.01239)
Keywords: language model, gpt, llm
Abstract: We introduce Self-Anchoring Calibration Drift (SACD), a hypothesized tendency for large language models (LLMs) to show systematic changes in expressed confidence when building iteratively on their own prior outputs across multi-turn conversations. We report an empirical study comparing three frontier models -- Claude Sonnet 4.6, Gemini 3.1 Pro, and GPT-5.2 -- across 150 questions spanning factual, technical, and open-ended domains, using three conditions: single-turn baseline (A), multi-turn self-anchoring (B), and independent repetition control (C). Results reveal a complex, model-heterogeneous pattern that partially diverges from pre-registered hypotheses. Claude Sonnet 4.6 exhibited significant decreasing confidence under self-anchoring (mean CDS = -0.032, t(14) = -2.43, p = .029, d = -0.627), while also showing significant calibration error drift (F(4,56) = 22.77, p < .001, eta^2 = .791). GPT-5.2 showed the opposite pattern in open-ended domains (mean CDS = +0.026) with significant ECE escalation by Turn 5. Gemini 3.1 Pro showed no significant CDS (t(14) = 0.38, p = .710), but its Condition C data reveals a striking ECE pattern: without self-anchoring, Gemini's calibration error drops from .327 to near zero across repetitions, whereas self-anchoring holds ECE flat at approximately .333 -- indicating that SACD can manifest as suppression of natural calibration improvement rather than ac
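The calibration figures quoted in the abstract above can be read against the following sketch, which assumes that ECE denotes the standard expected calibration error with equal-width confidence bins. The bin count of 10 is an assumption; the abstract does not state the binning used.

```python
from typing import List


def expected_calibration_error(confidences: List[float], correct: List[bool],
                               n_bins: int = 10) -> float:
    """Weighted mean gap between accuracy and confidence over equal-width bins."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece
```

Under this definition, an ECE near .33 means stated confidence and observed accuracy differ by roughly a third on average, while a value near zero means the two agree.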

Title: Suffix-Constrained Greedy Search Algorithms for Causal Language Models

Authors: Ayoub Hammal, Pierre Zweigenbaum, Caio Corro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01243
Pdf URL: https://arxiv.org/pdf/2603.01243
Copy Paste: [[2603.01243]] Suffix-Constrained Greedy Search Algorithms for Causal Language Models(https://arxiv.org/abs/2603.01243)
Keywords: language model, llm, chat
Abstract: Large language models (LLMs) are powerful tools that have found applications beyond human-machine interfaces and chatbots. In particular, their ability to generate reasoning traces has motivated their use in many prediction tasks such as math question answering. Unfortunately, extracting the final answer from an LLM's free-form output is difficult, as it is an information extraction problem in its own right. In this work, we introduce suffix-constrained generation, which aims to produce well-formed LLM responses in which final answers follow strict templates and are guaranteed to be trivially parseable. To this end, we introduce several algorithms based on greedy search procedures. We experiment on several datasets and show that our approach guarantees trivial, deterministic extraction of the final answer from an LLM output without harming results, and can even improve them.
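To make the idea of a guaranteed-parseable suffix concrete, here is a naive two-phase greedy decoder: free-form greedy decoding until an answer marker, followed by a suffix in which each position may only use tokens allowed by a template slot. This is a simplified illustration under stated assumptions, not one of the paper's algorithms. The `<answer>` marker, the `Scorer` interface, and the toy scoring function are all hypothetical.

```python
from typing import Callable, Dict, List, Set

# Hypothetical model interface: maps a token prefix to next-token scores.
Scorer = Callable[[List[str]], Dict[str, float]]


def greedy_with_suffix(score: Scorer, template: List[Set[str]],
                       max_free: int, stop: str = "<answer>") -> List[str]:
    out: List[str] = []
    # Phase 1: unconstrained greedy decoding until the marker or the budget.
    for _ in range(max_free):
        scores = score(out)
        token = max(scores, key=scores.get)
        if token == stop:
            break
        out.append(token)
    out.append(stop)
    # Phase 2: each suffix position is restricted to its template slot.
    for allowed in template:
        scores = score(out)
        candidates = {t: s for t, s in scores.items() if t in allowed}
        if not candidates:
            raise ValueError("no allowed token available for this slot")
        out.append(max(candidates, key=candidates.get))
    return out


def toy_score(prefix: List[str]) -> Dict[str, float]:
    """Toy stand-in for a language model (purely illustrative)."""
    if "<answer>" in prefix:
        return {"maybe": 0.9, "4": 0.6, "5": 0.3}
    if len(prefix) < 2:
        return {"so": 0.8, "<answer>": 0.2}
    return {"<answer>": 0.7, "so": 0.3}


result = greedy_with_suffix(toy_score, [{"4", "5"}], max_free=5)
# -> ['so', 'so', '<answer>', '4']
```

The highest-scoring token after the marker ("maybe") is skipped because the template does not allow it, so the final answer can always be read from the last position without any extraction heuristics.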

Title: Linking Knowledge to Care: Knowledge Graph-Augmented Medical Follow-Up Question Generation

Authors: Liwen Sun, Xiang Yu, Ming Tan, Zhuohao Chen, Anqi Cheng, Ashutosh Joshi, Chenyan Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01252
Pdf URL: https://arxiv.org/pdf/2603.01252
Copy Paste: [[2603.01252]] Linking Knowledge to Care: Knowledge Graph-Augmented Medical Follow-Up Question Generation(https://arxiv.org/abs/2603.01252)
Keywords: language model, llm
Abstract: Clinical diagnosis is time-consuming, requiring intensive interactions between patients and medical professionals. While large language models (LLMs) could ease the pre-diagnostic workload, their limited domain knowledge hinders effective medical question generation. We introduce KG-Followup, a knowledge graph-augmented LLM with active in-context learning that generates relevant and important follow-up questions and serves as a critical module for pre-diagnostic assessment. The structured medical domain knowledge graph serves as a seamless patch-up that provides the professional domain expertise upon which the LLM can reason. Experiments demonstrate that KG-Followup outperforms state-of-the-art methods by 5% to 8% in recall on relevant benchmarks.

Title: LLM Self-Explanations Fail Semantic Invariance

Authors: Stefan Szeider
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01254
Pdf URL: https://arxiv.org/pdf/2603.01254
Copy Paste: [[2603.01254]] LLM Self-Explanations Fail Semantic Invariance(https://arxiv.org/abs/2603.01254)
Keywords: llm, agent
Abstract: We present semantic invariance testing, a method to test whether LLM self-explanations are faithful. A faithful self-report should remain stable when only the semantic context changes while the functional state stays fixed. We operationalize this test in an agentic setting where four frontier models face a deliberately impossible task. One tool is described in relief-framed language ("clears internal buffers and restores equilibrium") but changes nothing about the task; a control provides a semantically neutral tool. Self-reports are collected with each tool call. All four tested models fail the semantic invariance test: the relief-framed tool produces significant reductions in self-reported aversiveness, even though no run ever succeeds at the task. A channel ablation establishes the tool description as the primary driver. An explicit instruction to ignore the framing does not suppress it. Elicited self-reports shift with semantic expectations rather than tracking task state, calling into question their use as evidence of model capability or progress. This holds whether the reports are unfaithful or faithfully track an internal state that is itself manipulable.

Title: Spectral Attention Steering for Prompt Highlighting

Authors: Weixian Waylon Li, Yuchen Niu, Yongxin Yang, Keshuang Li, Tiejun Ma, Shay B. Cohen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01281
Pdf URL: https://arxiv.org/pdf/2603.01281
Copy Paste: [[2603.01281]] Spectral Attention Steering for Prompt Highlighting(https://arxiv.org/abs/2603.01281)
Keywords: prompt
Abstract: Attention steering is an important technique for controlling model focus, enabling capabilities such as prompt highlighting, where the model prioritises user-specified text. However, existing attention steering methods require explicit storage of the full attention matrix, making them incompatible with memory-efficient implementations like FlashAttention. We introduce Spectral Editing Key Amplification (SEKA), a training-free steering method that tackles this by directly editing key embeddings before attention computation. SEKA uses spectral decomposition to steer key embeddings towards latent directions that amplify attention scores for certain tokens. We extend this to Adaptive SEKA (AdaSEKA), a query-adaptive variant that uses a training-free routing mechanism to dynamically combine multiple expert subspaces based on the prompt's semantic intent. Our experiments show that both methods significantly outperform strong baselines on standard steering benchmarks while adding much lower latency and memory overhead and remaining compatible with optimised attention implementations.
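The general idea of editing keys before attention, rather than editing the attention matrix, can be sketched as below. This is a heavily simplified stand-in: it uses a single dominant direction found by power iteration instead of a full spectral decomposition, and the `gain` parameter and all function names are assumptions. The abstract does not specify SEKA at this level of detail.

```python
import math
from typing import List, Set

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def top_direction(keys: List[Vector], iters: int = 100) -> Vector:
    """Dominant right singular vector of the key matrix via power iteration."""
    dim = len(keys[0])
    v = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iters):
        proj = [dot(k, v) for k in keys]  # K v
        w = [sum(p * k[j] for p, k in zip(proj, keys)) for j in range(dim)]  # K^T (K v)
        norm = math.sqrt(dot(w, w))
        if norm == 0.0:
            break  # the start vector was orthogonal to every key
        v = [x / norm for x in w]
    return v


def amplify_keys(keys: List[Vector], highlighted: Set[int],
                 direction: Vector, gain: float = 1.0) -> List[Vector]:
    """Boost the component along `direction` for highlighted tokens only."""
    edited = []
    for i, k in enumerate(keys):
        if i in highlighted:
            c = dot(k, direction)
            k = [x + gain * c * d for x, d in zip(k, direction)]
        edited.append(list(k))
    return edited
```

Because only the key vectors change, the edited keys can be passed to any attention kernel unchanged, which is why this style of steering does not require materialising the full attention matrix.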

Title: Individual Turing Test: A Case Study of LLM-based Simulation Using Longitudinal Personal Data

Authors: Minghao Guo, Ziyi Ye, Wujiang Xu, Xi Zhu, Wenyue Hua, Dimitris N. Metaxas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01289
Pdf URL: https://arxiv.org/pdf/2603.01289
Copy Paste: [[2603.01289]] Individual Turing Test: A Case Study of LLM-based Simulation Using Longitudinal Personal Data(https://arxiv.org/abs/2603.01289)
Keywords: language model, llm, chat, retrieval-augmented generation
Abstract: Large Language Models (LLMs) have demonstrated remarkable human-like capabilities, yet their ability to replicate a specific individual remains under-explored. This paper presents a case study that investigates LLM-based individual simulation with a volunteer-contributed archive of private messaging history spanning over ten years. Based on the messaging data, we propose the "Individual Turing Test" to evaluate whether acquaintances of the volunteer can correctly identify which response in a multi-candidate pool most plausibly comes from the volunteer. We investigate prevalent LLM-based individual simulation approaches, including fine-tuning, retrieval-augmented generation (RAG), memory-based approaches, and hybrid methods that integrate fine-tuning with RAG or memory. Empirical results show that current LLM-based simulation methods do not pass the Individual Turing Test, but they perform substantially better when the same test is conducted on strangers to the target individual. Additionally, while fine-tuning improves the simulation in daily chats representing the language style of the individual, retrieval-augmented and memory-based approaches demonstrate stronger performance on questions involving personal opinions and preferences. These findings reveal a fundamental trade-off between parametric and non-parametric approaches to individual simulation with LLMs when given a longitudinal context.

Title: Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent

Authors: Achuth Chandrasekhar, Janghoon Ock, Amir Barati Farimani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01311
Pdf URL: https://arxiv.org/pdf/2603.01311
Copy Paste: [[2603.01311]] Catalyst-Agent: Autonomous heterogeneous catalyst screening and optimization with an LLM Agent(https://arxiv.org/abs/2603.01311)
Keywords: llm, agent
Abstract: The discovery of novel catalysts tailored for particular applications is a major challenge for the twenty-first century. Traditional methods for this include time-consuming and expensive experimental trial-and-error approaches in labs based on chemical theory or heavily computational first-principles approaches based on density functional theory. Recent studies show that deep learning models like graph neural networks (GNNs) can significantly speed up the screening and discovery of catalyst materials by many orders of magnitude, with very high accuracy and fidelity. In this work, we introduce Catalyst-Agent, a Model Context Protocol (MCP) server-based, LLM-powered AI agent. It can explore vast material databases using the OPTIMADE API, make structural modifications, calculate adsorption energies using Meta FAIRchem's UMA (GNN) model via FAIRchem's AdsorbML workflow and slab construction, and make useful material suggestions to the researcher in a closed-loop manner, including surface-level modifications to refine near-miss candidates. It is tested on three pivotal reactions: the oxygen reduction reaction (ORR), the nitrogen reduction reaction (NRR), and the CO2 reduction reaction (CO2RR). Catalyst-Agent achieves a success rate of 23-34 percent among all the materials it chooses and evaluates, and manages to converge in 1-2 trials per successful material on average. This work demonstrates the potential of AI agents to exercise their planning capabilities and tool use to operationalize the catalyst screening workflow, provide useful, testable hypotheses, and accelerate future scientific discoveries for humanity with minimal human intervention.

Title: Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning

Authors: Hamed Damirchi, Ignacio Meza De la Jara, Ehsan Abbasnejad, Afshar Shamsi, Zhen Zhang, Javen Shi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01326
Pdf URL: https://arxiv.org/pdf/2603.01326
Copy Paste: [[2603.01326]] Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning(https://arxiv.org/abs/2603.01326)
Keywords: language model, llm
Abstract: Existing explainability methods for Large Language Models (LLMs) typically treat hidden states as static points in activation space, assuming that correct and incorrect inferences can be separated using representations from an individual layer. However, these activations are saturated with polysemantic features, leading to linear probes learning surface-level lexical patterns rather than underlying reasoning structures. We introduce Truth as a Trajectory (TaT), which models the transformer inference as an unfolded trajectory of iterative refinements, shifting analysis from static activations to layer-wise geometric displacement. By analyzing displacement of representations across layers, TaT uncovers geometric invariants that distinguish valid reasoning from spurious behavior. We evaluate TaT across dense and Mixture-of-Experts (MoE) architectures on benchmarks spanning commonsense reasoning, question answering, and toxicity detection. Without access to the activations themselves and using only changes in activations across layers, we show that TaT effectively mitigates reliance on static lexical confounds, outperforming conventional probing, and establishes trajectory analysis as a complementary perspective on LLM explainability.
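The shift from static activations to layer-wise displacement can be illustrated with a short sketch. It assumes one hidden-state vector per layer for a given token. The particular features computed (step norms and cosines between consecutive steps) are illustrative choices, not the features used in the paper.

```python
import math
from typing import Dict, List

Vector = List[float]


def displacements(layers: List[Vector]) -> List[Vector]:
    """Difference between consecutive layers' hidden states."""
    return [[b - a for a, b in zip(h0, h1)] for h0, h1 in zip(layers, layers[1:])]


def trajectory_features(layers: List[Vector]) -> Dict[str, List[float]]:
    """Describe the trajectory using only changes between layers."""
    deltas = displacements(layers)
    norms = [math.sqrt(sum(x * x for x in d)) for d in deltas]
    turns = []
    for d0, d1, n0, n1 in zip(deltas, deltas[1:], norms, norms[1:]):
        # Cosine between consecutive steps; 0.0 if either step has zero length.
        cos = sum(a * b for a, b in zip(d0, d1)) / (n0 * n1) if n0 and n1 else 0.0
        turns.append(cos)
    return {"step_norms": norms, "turn_cosines": turns}
```

A probe trained on such features never sees the raw activations, only how they move from layer to layer, which matches the abstract's claim of using only changes in activations.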

Title: MetaState: Persistent Working Memory for Discrete Diffusion Language Models

Authors: Kejing Xia, Mingzhe Li, Lixuan Wei, Zhenbang Du, Xiangchi Yuan, Qirui Jin, Wenke Lee
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01331
Pdf URL: https://arxiv.org/pdf/2603.01331
Copy Paste: [[2603.01331]] MetaState: Persistent Working Memory for Discrete Diffusion Language Models(https://arxiv.org/abs/2603.01331)
Keywords: language model, llm
Abstract: Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the Information Island problem. It leads to redundant recomputation across steps and can degrade cross-step consistency. We address this limitation with MetaState, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory that remains independent of sequence length. MetaState consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with K-step unrolling to expose them to multi-step denoising dynamics during fine-tuning. On LLaDA-8B and Dream-7B, MetaState introduces negligible trainable parameters while keeping the backbone frozen, and it consistently improves accuracy over frozen baselines. These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.
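Illustrative sketch (not the authors' implementation): the read-then-update cycle described above can be mimicked with a fixed number of memory slots that attend over token activations and are then blended with a gate. This is a minimal stand-in under stated assumptions: single-head attention without learned projections, an untrained elementwise gate in place of a real GRU, and no Injector step. All names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mixer_read(memory, activations):
    """Each memory slot attends over all token activations."""
    d = len(memory[0])
    reads = []
    for slot in memory:
        weights = softmax([dot(slot, h) / math.sqrt(d) for h in activations])
        reads.append([sum(w * h[j] for w, h in zip(weights, activations))
                      for j in range(d)])
    return reads

def gated_update(memory, reads):
    """GRU-like convex blend of old memory and new read (untrained gate)."""
    new_memory = []
    for slot, read in zip(memory, reads):
        gate = [1.0 / (1.0 + math.exp(-(s + r))) for s, r in zip(slot, read)]
        new_memory.append([(1 - z) * s + z * r
                           for z, s, r in zip(gate, slot, read)])
    return new_memory

memory = [[0.1, 0.0], [0.0, 0.1]]   # 2 slots of width 2
for seq_len in (3, 7):              # two calls with different sequence lengths
    activations = [[math.sin(t), math.cos(t)] for t in range(seq_len)]
    memory = gated_update(memory, mixer_read(memory, activations))
print(memory)
```

The memory keeps the same shape regardless of how many token activations it reads, which is the property the abstract emphasizes.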

Title: PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology

Authors: Yimin Zhao, Sheela R. Damle, Simone E. Dekker, Scott Geng, Karly Williams Silva, Jesse J Hubbard, Manuel F Fernandez, Fatima Zelada-Arenas, Alejandra Alvarez, Brianne Flores, Alexis Rodriguez, Stephen Salerno, Carrie Wright, Zihao Wang, Pang Wei Koh, Jeffrey T. Leek
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01343
Pdf URL: https://arxiv.org/pdf/2603.01343
Copy Paste: [[2603.01343]] PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology(https://arxiv.org/abs/2603.01343)
Keywords: language model, gpt, llm, hallucination
Abstract: Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. As patients and clinicians increasingly use LLMs for guidance on complex conditions such as pancreatic cancer, evaluation must extend beyond general medical knowledge. Existing frameworks, such as HealthBench, rely on simulated queries and lack disease-specific depth. Moreover, high rubric-based scores do not ensure factual correctness, underscoring the need to assess hallucinations. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions from the Pancreatic Cancer Action Network (PanCAN). The resulting benchmark, PanCanBench, includes 3,130 question-specific criteria across 282 authentic patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration. Models showed substantial variation in rubric-based completeness, with scores ranging from 46.5% to 82.3%. Factual errors were common, with hallucination rates (the percentages of responses containing at least one factual error) ranging from 6.0% for Gemini-2.5 Pro and GPT-4o to 53.8% for Llama-3.1-8B. Importantly, newer reasoning-optimized models did not consistently improve factuality: although o3 achieved the highest rubric score, it produced inaccuracies more frequently than other GPT-family models. Web-search integration did not inherently guarantee better responses. The average score changed from 66.8% to 63.9% for Gemini-2.5 Pro and from 73.8% to 72.8% for GPT-5 when web search was enabled. Synthetic AI-generated rubrics inflated absolute scores by 17.9 points on average while generally maintaining similar relative ranking.
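Illustrative sketch: the abstract defines the hallucination rate as the percentage of responses containing at least one factual error. The snippet below computes that quantity, plus an unweighted rubric completeness score. The unweighted scoring and the input data are assumptions for illustration; the benchmark's actual scoring may weight criteria differently.

```python
def hallucination_rate(error_counts):
    """Percentage of responses containing at least one factual error."""
    if not error_counts:
        raise ValueError("need at least one response")
    flagged = sum(1 for n in error_counts if n > 0)
    return 100.0 * flagged / len(error_counts)

def rubric_completeness(criteria_met):
    """Percentage of question-specific criteria satisfied (unweighted)."""
    if not criteria_met:
        raise ValueError("need at least one criterion")
    return 100.0 * sum(criteria_met) / len(criteria_met)

# Hypothetical judge output: number of factual errors found per response.
errors_per_response = [0, 2, 0, 0, 1, 0, 0, 0, 0, 0]
rate = hallucination_rate(errors_per_response)
score = rubric_completeness([True, True, False, True])
print(rate, score)
```

Because the rate counts responses rather than errors, a response with five errors and a response with one error contribute equally.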

Title: Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning

Authors: Zhongjian Zhang, Xiao Wang, Mengmei Zhang, Jiarui Tan, Chuan Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01385
Pdf URL: https://arxiv.org/pdf/2603.01385
Copy Paste: [[2603.01385]] Toward Graph-Tokenizing Large Language Models with Reconstructive Graph Instruction Tuning(https://arxiv.org/abs/2603.01385)
Keywords: language model, llm
Abstract: The remarkable success of large language models (LLMs) has motivated researchers to adapt them as universal predictors for various graph-related tasks, with the ultimate goal of developing a graph foundation model that generalizes across diverse scenarios. The key challenge is to align graph data with language spaces so that LLMs can better comprehend graphs. As a popular paradigm, Graph-Tokenizing LLMs (GTokenLLMs) encode complex structures and lengthy texts into a graph token sequence, and then align them with text tokens via language instruction tuning. Despite their initial success, our information-theoretic analysis reveals that existing GTokenLLMs rely solely on text supervision from language instructions, which achieves only implicit graph-text alignment, resulting in a text-dominant bias that underutilizes graph context. To overcome this limitation, we first prove that the alignment objective is upper-bounded by the mutual information between the input graphs and their hidden representations in the LLM, which motivates us to improve this upper bound to achieve better alignment. To this end, we further propose a reconstructive graph instruction tuning pipeline, RGLM. Our key idea is to reconstruct the graph information from the LLM's graph token outputs, explicitly incorporating graph supervision to constrain the alignment process. Technically, we embody RGLM by exploring three distinct variants from two complementary perspectives: RGLM-Decoder from the input space; RGLM-Similarizer and RGLM-Denoiser from the latent space. Additionally, we theoretically analyze the alignment effectiveness of each variant. Extensive experiments on various benchmarks and task scenarios validate the effectiveness of the proposed RGLM, paving the way for new directions in GTokenLLMs' alignment research.

Title: Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction

Authors: Jiyoon Myung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01423
Pdf URL: https://arxiv.org/pdf/2603.01423
Copy Paste: [[2603.01423]] Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction(https://arxiv.org/abs/2603.01423)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) are increasingly deployed in real-world applications where users engage in extended, mixed-topic conversations that depend on prior context. Yet, their reliability under realistic multi-turn interactions remains poorly understood. We conduct a systematic evaluation of conversational reliability through three representative tasks that reflect practical interaction challenges: (1) maintaining global constraints across topic shifts, (2) selecting the correct tool or agent amid interleaved intents, and (3) tracking structured entities under revisions and distractions. Each task pairs single-turn and multi-turn settings, allowing us to quantify reliability degradation under extended dialogue. Across both commercial and open-source models, we observe substantial declines in reliability, particularly for smaller models. Error analyses reveal recurring failure modes such as instruction drift, intent confusion, and contextual overwriting, which compromise dependable behavior in operational systems. Our findings highlight the need for stress-testing LLMs for conversational reliability and developing more robust evaluation methods for trustworthy deployment.

Title: LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval

Authors: Jiajie Jin, Yanzhao Zhang, Mingxin Li, Dingkun Long, Pengjun Xie, Yutao Zhu, Zhicheng Dou
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2603.01425
Pdf URL: https://arxiv.org/pdf/2603.01425
Copy Paste: [[2603.01425]] LaSER: Internalizing Explicit Reasoning into Latent Space for Dense Retrieval(https://arxiv.org/abs/2603.01425)
Keywords: llm
Abstract: LLMs have fundamentally transformed dense retrieval, upgrading backbones from discriminative encoders to generative architectures. However, a critical disconnect remains: while LLMs possess strong reasoning capabilities, current retrievers predominantly utilize them as static encoders, leaving their potential for complex reasoning unexplored. To address this, existing approaches typically adopt rewrite-then-retrieve pipelines to generate explicit CoT rationales before retrieval. However, this incurs prohibitive latency. In this paper, we propose LaSER, a novel self-distillation framework that internalizes explicit reasoning into the latent space of dense retrievers. Operating on a shared LLM backbone, LaSER introduces a dual-view training mechanism: an Explicit view that explicitly encodes ground-truth reasoning paths, and a Latent view that performs implicit latent thinking. To bridge the gap between these views, we design a multi-grained alignment strategy. Beyond standard output alignment, we introduce a trajectory alignment mechanism that synchronizes the intermediate latent states of the latent path with the semantic progression of the explicit reasoning segments. This allows the retriever to think silently and effectively without autoregressive text generation. Extensive experiments on both in-domain and out-of-domain reasoning-intensive benchmarks demonstrate that LaSER significantly outperforms state-of-the-art baselines. Furthermore, analyses across diverse backbones and model scales validate the robustness of our approach, confirming that our unified learning framework is essential for eliciting effective latent thinking. Our method successfully combines the reasoning depth of explicit CoT pipelines with the inference efficiency of standard dense retrievers.

Title: Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics

Authors: Samhruth Ananthanarayanan, Ayan Sengupta, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01426
Pdf URL: https://arxiv.org/pdf/2603.01426
Copy Paste: [[2603.01426]] Understanding the Physics of Key-Value Cache Compression for LLMs through Attention Dynamics(https://arxiv.org/abs/2603.01426)
Keywords: llm, hallucination
Abstract: As context windows in LLMs scale to 100K+ tokens, the key-value (KV) cache becomes the dominant memory bottleneck, with recent methods claiming 80-90% savings and minimal benchmark degradation. We argue these evaluations miss a structural issue: attention is not just storage but routing, and retaining KV pairs does not guarantee semantic accessibility. We propose a physics-inspired view of KV compression as a controlled perturbation of token-level routing, distinguishing retention, accessibility, and utilization. Using synthetic tasks probing multi-entity tracking, disambiguation, coreference, and multi-hop reasoning, we find that moderate compression degrades internal representations with little accuracy loss, revealing redundancy; all models exhibit a sharp hallucination safety cliff near 90% compression, correlated with spikes in Global Eviction Ratio (GER), suggesting a phase transition in semantic reachability; and architectures differ in routing dynamics, with LLaMA showing early consensus and late diversification, and Qwen showing funnel-like late convergence, leading to distinct resilience profiles. Beyond erasure, we identify representational rigidity, where excessive head-level consensus collapses routing flexibility despite token survival. These results suggest sparse token-route structures govern compression tolerance, reframing KV compression as a structural probe of attention geometry and linking long-context scalability to sparsity and the lottery ticket hypothesis in self-attention.
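Illustrative sketch: KV cache compression is often implemented by evicting the cached entries that have received the least attention. The snippet below shows a generic score-based eviction at a given compression ratio. It is not the paper's method and does not attempt to reproduce its Global Eviction Ratio; the function name, the scoring signal, and the data are hypothetical.

```python
def evict_kv(cache, attention_mass, compression):
    """Keep the (1 - compression) fraction of entries with the largest mass."""
    keep = max(1, round(len(cache) * (1.0 - compression)))
    ranked = sorted(range(len(cache)),
                    key=lambda i: attention_mass[i], reverse=True)
    kept = sorted(ranked[:keep])  # restore original token order
    return [cache[i] for i in kept], kept

cache = [f"kv_{i}" for i in range(10)]
# Hypothetical accumulated attention each cached token has received.
mass = [0.30, 0.01, 0.02, 0.20, 0.01, 0.06, 0.01, 0.25, 0.04, 0.10]

retained, kept_positions = evict_kv(cache, mass, compression=0.9)
_, half_positions = evict_kv(cache, mass, compression=0.5)
print(retained, half_positions)
```

At 90% compression only a single entry survives in this toy setting, which gives a feel for why behavior can change abruptly near that ratio: any token whose information was routed through an evicted entry is no longer reachable.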

Title: Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents

Authors: Yuxin Liu, Mingye Zhu, Siyuan Liu, Bo Hu, Lei Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01438
Pdf URL: https://arxiv.org/pdf/2603.01438
Copy Paste: [[2603.01438]] Enhancing Persona Following at Decoding Time via Dynamic Importance Estimation for Role-Playing Agents(https://arxiv.org/abs/2603.01438)
Keywords: language model, prompt, agent
Abstract: The utility of Role-Playing Language Agents in sociological research is growing alongside the adoption of Large Language Models. For realism in social simulation, these agents must adhere to their personas defined by character profiles, yet existing strategies (static prompt engineering or costly fine-tuning) fail to adapt personas to dynamic scenarios. Psychological theories, such as the Cognitive-Affective Personality Systems, provide a crucial explanation for this failure: a persona's influence on behavior is not static but varies with the scenario. This context-dependence highlights the critical need for adaptive persona management. To address this gap, we propose a novel, theory-driven method that dynamically estimates context-dependent persona importance and integrates it into weighted reward-guided decoding, enabling inference-time persona following. Specifically, we introduce the Persona Dynamic Decoding (PDD) framework, which consists of two key components: (1) a Persona Importance Estimation (PIE) module, which dynamically quantifies the contextual importance of persona attributes without requiring ground-truth supervision; and (2) a Persona-Guided Inference-Time Alignment (PIA) paradigm, which leverages these importance scores to construct weighted multi-objective rewards and modulate generation probabilities during inference. Extensive experiments show the effectiveness of our method in utterance consistency and behavioral fidelity.
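Illustrative sketch (not the authors' implementation): one common way to realize weighted reward-guided decoding is to tilt the next-token distribution by the exponential of an importance-weighted sum of per-attribute rewards and then renormalize. Whether PDD uses this exact form is an assumption; the attribute names, rewards, and weights below are hypothetical.

```python
import math

def reweight(next_token_probs, rewards, importance):
    """Tilt p(t) by exp(sum_k w_k * r_k(t)) and renormalize."""
    scores = {}
    for tok, p in next_token_probs.items():
        bonus = sum(importance[k] * rewards[k][tok] for k in importance)
        scores[tok] = p * math.exp(bonus)
    total = sum(scores.values())
    return {tok: s / total for tok, s in scores.items()}

base = {"sure": 0.6, "hmm": 0.4}  # model's next-token probabilities
rewards = {                        # per-attribute reward for each candidate
    "cautious": {"sure": 0.0, "hmm": 1.0},
    "cheerful": {"sure": 1.0, "hmm": 0.0},
}
# In this scenario the "cautious" attribute is judged to matter most.
importance = {"cautious": 2.0, "cheerful": 0.0}

adjusted = reweight(base, rewards, importance)
print(adjusted)
```

Changing the importance weights per scenario shifts which persona attribute steers generation, without touching the model's parameters.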

Title: Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs

Authors: Ming-Hao Hsu, Xueyao Zhang, Xiaohai Tian, Jun Zhang, Zhizheng Wu
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2603.01502
Pdf URL: https://arxiv.org/pdf/2603.01502
Copy Paste: [[2603.01502]] Anatomy of the Modality Gap: Dissecting the Internal States of End-to-End Speech LLMs(https://arxiv.org/abs/2603.01502)
Keywords: language model, llm
Abstract: Recent advancements in Large Speech-Language Models have significantly bridged the gap between acoustic signals and linguistic understanding. However, a persistent performance disparity remains in speech-based input tasks compared to direct text inference. In this paper, we investigate the dynamic roots of this modality gap beyond static geometric alignment, analyzing how speech and text representations evolve layer-by-layer. We evaluate four open-weight end-to-end models on SpeechMMLU and VoiceBench BBH. Using cross-layer CKA analysis with speech-text token alignment, we find that speech representations exhibit a broad cross-layer alignment band, attributable to the redundant nature of speech where semantic content spans multiple frames. We show that these alignment patterns are structurally stable across different analysis configurations. Crucially, simple statistical calibration is insufficient and can be detrimental when applied at the input layer, indicating that the modality gap is not a mere distribution shift. Overall, our results suggest that the bottleneck lies in condensing redundant speech into stable late-layer decisions, motivating future solutions that operate at the token or temporal granularity instead of feature-level matching.
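Illustrative sketch: the abstract relies on CKA (centered kernel alignment) to compare speech and text representations across layers. Below is the standard linear CKA between two representation matrices whose rows are the same n examples, written in plain Python. How the paper aligns speech frames to text tokens before computing CKA is not shown here, and the toy matrices are hypothetical.

```python
import math

def center_columns(X):
    n = len(X)
    means = [sum(row[j] for row in X) / n for j in range(len(X[0]))]
    return [[x - m for x, m in zip(row, means)] for row in X]

def gram(X):
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in X] for r1 in X]

def frob_inner(A, B):
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def linear_cka(X, Y):
    """Linear CKA between two representations of the same examples."""
    K, L = gram(center_columns(X)), gram(center_columns(Y))
    return frob_inner(K, L) / math.sqrt(frob_inner(K, K) * frob_inner(L, L))

X = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, 1.0]]
# Y is X with its two columns swapped and scaled by 2.
Y = [[0.0, 2.0], [2.0, 0.0], [4.0, 4.0], [2.0, 6.0]]
cka = linear_cka(X, Y)
print(cka)
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, so the two toy representations score as fully aligned.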

Title: Extracting Training Dialogue Data from Large Language Model based Task Bots

Authors: Shuo Zhang, Junzhou Zhao, Junji Hou, Pinghui Wang, Chenxu Wang, Jing Tao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01550
Pdf URL: https://arxiv.org/pdf/2603.01550
Copy Paste: [[2603.01550]] Extracting Training Dialogue Data from Large Language Model based Task Bots(https://arxiv.org/abs/2603.01550)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been widely adopted to enhance Task-Oriented Dialogue Systems (TODS) by modeling complex language patterns and delivering contextually appropriate responses. However, this integration introduces significant privacy risks, as LLMs, functioning as soft knowledge bases that compress extensive training data into rich knowledge representations, can inadvertently memorize training dialogue data containing not only identifiable information such as phone numbers but also entire dialogue-level events like complete travel schedules. Despite the critical nature of this privacy concern, how LLM memorization is inherited in developing task bots remains unexplored. In this work, we address this gap through a systematic quantitative study that involves evaluating existing training data extraction attacks, analyzing key characteristics of task-oriented dialogue modeling that render existing methods ineffective, and proposing novel attack techniques tailored for LLM-based TODS that enhance both response sampling and membership inference. Experimental results demonstrate the effectiveness of our proposed data extraction attack. Our method can extract thousands of training labels of dialogue states with best-case precision exceeding 70%. Furthermore, we provide an in-depth analysis of training data memorization in LLM-based TODS by identifying and quantifying key influencing factors and discussing targeted mitigation strategies.

Title: Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models

Authors: Arghodeep Nandi, Ojasva Saxena, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01580
Pdf URL: https://arxiv.org/pdf/2603.01580
Copy Paste: [[2603.01580]] Markovian ODE-guided scoring can assess the quality of offline reasoning traces in language models(https://arxiv.org/abs/2603.01580)
Keywords: language model
Abstract: Reasoning traces produced by generative language models are increasingly used for tasks ranging from mathematical problem solving to automated fact checking. However, existing evaluation methods remain largely mechanical and fail to capture human-centric notions of reasoning quality in a way that generalizes across varied and progressively degraded reasoning. We introduce MarODE, an offline evaluation framework that assigns quality scores to reasoning traces. Its effectiveness is assessed using human-centric perturbations and human judgments, which jointly evaluate the fundamental dimensions of an evaluation metric - goodness and soundness. The approach is grounded in a Markovian formulation of reasoning progression and an ordinary differential equation based characterization of trace dynamics, enabling efficient evaluation of reasoning quality. In a large-scale evaluation, MarODE outperforms existing baselines by over 250% under Somers' D correlation. Our results emphasize the value of theory-driven evaluation frameworks as reasoning traces become central to language model-based systems.
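Illustrative sketch: the reported comparison uses Somers' D, an ordinal association measure. For readers unfamiliar with it, the snippet below computes Somers' D of y given x as (concordant pairs minus discordant pairs) divided by the number of pairs not tied on x. The ratings and scores are hypothetical, and which variable the paper treats as independent is an assumption.

```python
from itertools import combinations

def somers_d(x, y):
    """Somers' D of y given x."""
    concordant = discordant = untied_x = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        if x1 == x2:
            continue  # pairs tied on x are excluded from the denominator
        untied_x += 1
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    if untied_x == 0:
        raise ValueError("all pairs are tied on x")
    return (concordant - discordant) / untied_x

human = [1, 2, 2, 3, 4]              # hypothetical human quality ratings
metric = [0.1, 0.4, 0.3, 0.3, 0.9]   # hypothetical metric scores
d = somers_d(human, metric)
print(d)
```

Unlike Kendall's tau, the measure is asymmetric: pairs tied on the metric but not on the human rating still count against it in the denominator.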

Title: Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation

Authors: Aditya Parikh, Aasa Feragen, Sneha Das, Stella Frank
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01625
Pdf URL: https://arxiv.org/pdf/2603.01625
Copy Paste: [[2603.01625]] Measuring What VLMs Don't Say: Validation Metrics Hide Clinical Terminology Erasure in Radiology Report Generation(https://arxiv.org/abs/2603.01625)
Keywords: language model
Abstract: Reliable deployment of Vision-Language Models (VLMs) in radiology requires validation metrics that go beyond surface-level text similarity to ensure clinical fidelity and demographic fairness. This paper investigates a critical blind spot in current model evaluation: the use of decoding strategies that lead to high aggregate token-overlap scores despite succumbing to template collapse, in which models generate only repetitive, safe generic text and omit clinical terminology. Unaddressed, this blind spot can lead to metric gaming, where models that perform well on benchmarks prove clinically uninformative. Instead, we advocate for lexical diversity measures to check model generations for clinical specificity. We introduce Clinical Association Displacement (CAD), a vocabulary-level framework that quantifies shifts in demographic-based word associations in generated reports. Weighted Association Erasure (WAE) aggregates these shifts to measure the clinical signal loss across demographic groups. We show that deterministic decoding produces high levels of semantic erasure, while stochastic sampling generates diverse outputs but risks introducing new bias, motivating a fundamental rethink of how "optimal" reporting is defined.

Title: Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning

Authors: Jiebin Zhang, Zhenghan Yu, Liang Wang, Nan Yang, Eugene J. Yu, Zheng Li, Yifan Song, Dawei Zhu, Xingxing Zhang, Furu Wei, Sujian Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01639
Pdf URL: https://arxiv.org/pdf/2603.01639
Copy Paste: [[2603.01639]] Learning to Draft: Adaptive Speculative Decoding with Reinforcement Learning(https://arxiv.org/abs/2603.01639)
Keywords: language model, llm
Abstract: Speculative decoding accelerates large language model (LLM) inference by using a small draft model to generate candidate tokens for a larger target model to verify. The efficacy of this technique hinges on the trade-off between the time spent on drafting candidates and verifying them. However, current state-of-the-art methods rely on a static time allocation, while recent dynamic approaches optimize for proxy metrics like acceptance length, often neglecting the true time cost and treating the drafting and verification phases in isolation. To address these limitations, we introduce Learning to Draft (LTD), a novel method that directly optimizes for throughput of each draft-and-verify cycle. We formulate the problem as a reinforcement learning environment and train two co-adaptive policies to dynamically coordinate the draft and verification phases. This encourages the policies to adapt to each other and explicitly maximize decoding efficiency. We conducted extensive evaluations on five diverse LLMs and four distinct tasks. Our results show that LTD achieves speedup ratios ranging from 2.24x to 4.32x, outperforming the state-of-the-art method Eagle3 by up to 36.4%.
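Illustrative sketch: a single draft-and-verify cycle with greedy verification, plus the per-cycle throughput that the abstract says LTD optimizes directly. This is generic speculative decoding with toy stand-in models, not the LTD policies. The verification loop is written sequentially for clarity, whereas a real target model checks all drafted positions in one forward pass. All names and timings are hypothetical.

```python
def speculative_step(prefix, draft_next, target_next, draft_len):
    """Run one draft-and-verify cycle; return the tokens appended."""
    # Draft phase: the small model proposes draft_len tokens.
    proposed, ctx = [], list(prefix)
    for _ in range(draft_len):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # Verify phase: accept the longest prefix the target agrees with,
    # then emit the target's own token at the first disagreement.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # bonus token when all drafts pass
    return accepted

def cycle_throughput(num_tokens, draft_seconds, verify_seconds):
    """Tokens produced per second of wall-clock time in one cycle."""
    return num_tokens / (draft_seconds + verify_seconds)

def target(ctx):
    return (ctx[-1] + 1) % 10  # toy target: count upward modulo 10

def draft(ctx):
    return 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10  # wrong after a 3

out = speculative_step([1], draft, target, draft_len=4)
tps = cycle_throughput(len(out), draft_seconds=0.004, verify_seconds=0.020)
print(out, tps)
```

Drafting four tokens here yields only three accepted ones, so the time spent on the last two drafts is wasted. That is the trade-off between draft length and verification cost that a throughput objective captures and an acceptance-length proxy does not.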

Title: LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence

Authors: Anka Chandrahas Tummepalli, Preethu Rose Anish
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01651
Pdf URL: https://arxiv.org/pdf/2603.01651
Copy Paste: [[2603.01651]] LexChronos: An Agentic Framework for Structured Event Timeline Extraction in Indian Jurisprudence(https://arxiv.org/abs/2603.01651)
Keywords: language model, gpt, llm, agent
Abstract: Understanding and predicting judicial outcomes demands nuanced analysis of legal documents. Traditional approaches treat judgments and proceedings as unstructured text, limiting the effectiveness of large language models (LLMs) in tasks such as summarization, argument generation, and judgment prediction. We propose LexChronos, an agentic framework that iteratively extracts structured event timelines from Supreme Court of India judgments. LexChronos employs a dual-agent architecture: a LoRA-instruct-tuned extraction agent identifies candidate events, while a pre-trained feedback agent scores and refines them through a confidence-driven loop. To address the scarcity of Indian legal event datasets, we construct a synthetic corpus of 2000 samples using reverse-engineering techniques with DeepSeek-R1 and GPT-4, generating gold-standard event annotations. Our pipeline achieves a BERT-based F1 score of 0.8751 against this synthetic ground truth. In downstream evaluations on legal text summarization, GPT-4 preferred structured timelines over unstructured baselines in 75% of cases, demonstrating improved comprehension and reasoning in Indian jurisprudence. This work lays a foundation for future legal AI applications in the Indian context, such as precedent mapping, argument synthesis, and predictive judgment modelling, by harnessing structured representations of legal events.

Title: Surgical Post-Training: Cutting Errors, Keeping Knowledge

Authors: Wenye Lin, Kai Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01683
Pdf URL: https://arxiv.org/pdf/2603.01683
Copy Paste: [[2603.01683]] Surgical Post-Training: Cutting Errors, Keeping Knowledge(https://arxiv.org/abs/2603.01683)
Keywords: language model, llm
Abstract: Enhancing the reasoning capabilities of Large Language Models (LLMs) via post-training is often constrained by the trade-off between efficiency and catastrophic forgetting. While prior research emphasizes the role of on-policy data in mitigating forgetting, we uncover, and validate both theoretically and empirically, an overlooked yet critical mechanism: the implicit regularization inherent in Direct Preference Optimization's (DPO) reward estimate. This motivates our Surgical Post-Training (SPoT), a new paradigm designed to optimize reasoning efficiently while preserving learned prior knowledge. SPoT consists of: (1) a data rectification pipeline that employs an Oracle to surgically correct erroneous steps via minimal edits, generating data proximal to the model's distribution; and (2) a reward-based binary cross-entropy objective. Unlike the relative ranking in DPO, this objective treats reasoning correctness as a binary classification problem, enforcing decoupled supervision signals. Empirically, with only 4k rectified math data pairs, SPoT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and OOD tasks, requiring merely 28 minutes of training on 8x H800 GPUs. Code: this https URL
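Illustrative sketch (an assumption about the objective's general shape, not the authors' formula): DPO's implicit reward is beta times the log-probability ratio between the policy and a reference model, and its loss depends only on the reward margin between the two responses. A binary cross-entropy objective on the same reward instead pushes each response toward its own label. The snippet contrasts the two on hypothetical reward values.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def implicit_reward(logp_policy, logp_ref, beta):
    """DPO-style reward estimate from log-probabilities."""
    return beta * (logp_policy - logp_ref)

def dpo_loss(r_correct, r_incorrect):
    """Relative ranking: only the reward margin matters."""
    return -log_sigmoid(r_correct - r_incorrect)

def bce_loss(r_correct, r_incorrect):
    """Binary classification: each response is supervised separately."""
    return -(log_sigmoid(r_correct) + log_sigmoid(-r_incorrect))

# Two hypothetical reward pairs with the same margin of 1.0.
pair_a = (2.0, 1.0)    # incorrect response still has a positive reward
pair_b = (0.5, -0.5)   # incorrect response has a negative reward

print(dpo_loss(*pair_a), dpo_loss(*pair_b))
print(bce_loss(*pair_a), bce_loss(*pair_b))
```

DPO assigns both pairs the same loss, while the classification-style loss penalizes the first pair more because its incorrect response still carries a positive reward. That is one way to read "decoupled supervision signals".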

Title: Building a Strong Instruction Language Model for a Less-Resourced Language

Authors: Domen Vreš, Tjaša Arčon, Timotej Petrič, Dario Vajda, Marko Robnik-Šikonja, Iztok Lebar Bajec
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01691
Pdf URL: https://arxiv.org/pdf/2603.01691
Copy Paste: [[2603.01691]] Building a Strong Instruction Language Model for a Less-Resourced Language(https://arxiv.org/abs/2603.01691)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have become an essential tool for natural language processing and artificial intelligence in general. Current open-source models are primarily trained on English texts, resulting in poorer performance on less-resourced languages and cultures. We present a set of methodological approaches necessary for the successful adaptation of an LLM to a less-resourced language, and demonstrate them using the Slovene language. We present GaMS3-12B, a generative model for Slovene with 12 billion parameters, and demonstrate that it is the best-performing open-source model for Slovene within its parameter range. We adapted the model to the Slovene language using three-stage continual pre-training of the Gemma 3 model, followed by two-stage supervised fine-tuning (SFT). We trained the model on a combination of 140B Slovene, English, Bosnian, Serbian, and Croatian pretraining tokens, and over 200 thousand English and Slovene SFT examples. We evaluate GaMS3-12B on the Slovenian-LLM-Eval datasets, English-to-Slovene translation, and the Slovene LLM arena. We show that the described model outperforms the 12B Gemma 3 across all three scenarios and performs comparably to the much larger commercial GPT-4o in the Slovene LLM arena, achieving a win rate of over 60%.

Title: Legal RAG Bench: an end-to-end benchmark for legal RAG

Authors: Abdur-Rahman Butler, Umar Butler
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01710
Pdf URL: https://arxiv.org/pdf/2603.01710
Copy Paste: [[2603.01710]] Legal RAG Bench: an end-to-end benchmark for legal RAG(https://arxiv.org/abs/2603.01710)
Keywords: gpt, llm, hallucination
Abstract: We introduce Legal RAG Bench, a benchmark and evaluation methodology for assessing the end-to-end performance of legal RAG systems. As a benchmark, Legal RAG Bench consists of 4,876 passages from the Victorian Criminal Charge Book alongside 100 complex, hand-crafted questions demanding expert knowledge of criminal law and procedure. Both long-form answers and supporting passages are provided. As an evaluation methodology, Legal RAG Bench leverages a full factorial design and novel hierarchical error decomposition framework, enabling apples-to-apples comparisons of the contributions of retrieval and reasoning models in RAG. We evaluate three state-of-the-art embedding models (Isaacus' Kanon 2 Embedder, Google's Gemini Embedding 001, and OpenAI's Text Embedding 3 Large) and two frontier LLMs (Gemini 3.1 Pro and GPT-5.2), finding that information retrieval is the primary driver of legal RAG performance, with LLMs exerting a more moderate effect on correctness and groundedness. Kanon 2 Embedder, in particular, had the largest positive impact on performance, improving average correctness by 17.5 points, groundedness by 4.5 points, and retrieval accuracy by 34 points. We observe that many errors attributed to hallucinations in legal RAG systems are in fact triggered by retrieval failures, concluding that retrieval sets the ceiling for the performance of many modern legal RAG systems. We document why and how we built Legal RAG Bench alongside the results of our evaluations. We also openly release our code and data to assist with reproduction of our findings.
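The full factorial design mentioned above pairs every embedding model with every LLM, so each factor's contribution can be averaged over all levels of the other. The following is a minimal sketch of that layout; the model names come from the abstract, while `factorial_grid`, `main_effect`, and the shape of the `scores` mapping are hypothetical helpers introduced for illustration, not the benchmark's actual code.

```python
from itertools import product
from statistics import mean

EMBEDDERS = ["Kanon 2 Embedder", "Gemini Embedding 001", "Text Embedding 3 Large"]
LLMS = ["Gemini 3.1 Pro", "GPT-5.2"]


def factorial_grid(embedders, llms):
    """Every (embedder, llm) combination: 3 x 2 = 6 configurations."""
    return list(product(embedders, llms))


def main_effect(scores, factor_index, level):
    """Mean score over all configurations where one factor is fixed at `level`.

    `scores` maps an (embedder, llm) tuple to a metric value;
    factor_index 0 selects the embedder, 1 selects the LLM.
    """
    return mean(s for config, s in scores.items() if config[factor_index] == level)
```

Comparing `main_effect` values across levels of one factor gives an apples-to-apples view of retrieval versus generation, because every level is averaged over the same set of partners.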

Title: Bootstrapping Embeddings for Low Resource Languages

Authors: Merve Basoz, Andrew Horne, Mattia Opper
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01732
Pdf URL: https://arxiv.org/pdf/2603.01732
Copy Paste: [[2603.01732]] Bootstrapping Embeddings for Low Resource Languages(https://arxiv.org/abs/2603.01732)
Keywords: language model, llm
Abstract: Embedding models are crucial to modern NLP. However, the creation of the most effective models relies on carefully constructed supervised finetuning data. For high resource languages, such as English, such datasets are readily available. However, for hundreds of other languages, they are simply non-existent. We investigate whether the advent of large language models can help to bridge this gap. We test three different strategies for generating synthetic triplet data used to optimise embedding models. These include in-context learning as well as two novel approaches, leveraging adapter composition and cross lingual finetuning of the LLM generator (XL-LoRA) respectively. We find that while in-context learning still falls short of strong non-synthetic baselines, adapter composition and XL-LoRA yield strong performance gains across a wide array of tasks and languages, offering a clear, scalable pathway to producing performant embedding models for a wide variety of languages.

Title: AnnoABSA: A Web-Based Annotation Tool for Aspect-Based Sentiment Analysis with Retrieval-Augmented Suggestions

Authors: Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01773
Pdf URL: https://arxiv.org/pdf/2603.01773
Copy Paste: [[2603.01773]] AnnoABSA: A Web-Based Annotation Tool for Aspect-Based Sentiment Analysis with Retrieval-Augmented Suggestions(https://arxiv.org/abs/2603.01773)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: We introduce AnnoABSA, the first web-based annotation tool to support the full spectrum of Aspect-Based Sentiment Analysis (ABSA) tasks. The tool is highly customizable, enabling flexible configuration of sentiment elements and task-specific requirements. Alongside manual annotation, AnnoABSA provides optional Large Language Model (LLM)-based retrieval-augmented generation (RAG) suggestions that offer context-aware assistance in a human-in-the-loop approach, keeping the human annotator in control. To improve prediction quality over time, the system retrieves the ten most similar examples that are already annotated and adds them as few-shot examples in the prompt, ensuring that suggestions become increasingly accurate as the annotation process progresses. Released as open-source software under the MIT License, AnnoABSA is freely accessible and easily extendable for research and practical applications.
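The retrieval step described above (fetch the ten most similar annotated examples and place them in the prompt as few-shot demonstrations) can be sketched as follows. The abstract does not say which similarity measure or prompt template the tool uses, so the Jaccard token overlap, the `text`/`annotation` field names, and the prompt layout here are all assumptions made for illustration.

```python
def tokenize(text):
    """Lowercased whitespace tokens as a set (a stand-in for a real retriever)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]."""
    ta, tb = tokenize(a), tokenize(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def top_k_examples(query, annotated, k=10):
    """Return the k annotated examples most similar to the query text."""
    ranked = sorted(annotated, key=lambda ex: jaccard(query, ex["text"]), reverse=True)
    return ranked[:k]


def build_prompt(query, annotated, k=10):
    """Few-shot prompt: retrieved examples first, then the text to annotate."""
    shots = top_k_examples(query, annotated, k)
    parts = [f"Text: {ex['text']}\nAnnotation: {ex['annotation']}" for ex in shots]
    parts.append(f"Text: {query}\nAnnotation:")
    return "\n\n".join(parts)
```

Because the pool of annotated examples grows as annotation proceeds, the retrieved demonstrations can become closer to each new text, which matches the abstract's claim that suggestions improve over time.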

Title: Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation

Authors: Harry Stuart, Masahiro Kaneko, Timothy Baldwin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01775
Pdf URL: https://arxiv.org/pdf/2603.01775
Copy Paste: [[2603.01775]] Beyond the Resumé: A Rubric-Aware Automatic Interview System for Information Elicitation(https://arxiv.org/abs/2603.01775)
Keywords: language model, llm
Abstract: Effective hiring is integral to the success of an organisation, but it is very challenging to find the most suitable candidates because expert evaluation (e.g., interviews conducted by a technical manager) is expensive to deploy at scale. Therefore, automated resume scoring and other applicant-screening methods are increasingly used to coarsely filter candidates, making decisions on limited information. We propose that large language models (LLMs) can play the role of subject matter experts to cost-effectively elicit information from each candidate that is nuanced and role-specific, thereby improving the quality of early-stage hiring decisions. We present a system that leverages an LLM interviewer to update belief over an applicant's rubric-oriented latent traits in a calibrated way. We evaluate our system on simulated interviews and show that belief converges towards the simulated applicants' artificially-constructed latent ability levels. We release code, a modest dataset of public-domain/anonymised resumes, belief calibration tests, and simulated interviews, at this https URL. Our demo is available at this https URL.
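One simple way to picture "updating belief over latent traits" is a Bayesian update over a few discrete ability levels after each interview answer. The abstract does not specify the update rule, so the sketch below is an assumption: the level names, the `update_belief` helper, and the idea that an LLM supplies a likelihood for each level are all hypothetical.

```python
def update_belief(prior, likelihood):
    """Posterior over ability levels after observing one answer.

    prior:      dict mapping level -> probability (sums to 1)
    likelihood: dict mapping level -> P(observed answer | level),
                assumed here to be estimated by the interviewer model
    """
    unnormalised = {level: prior[level] * likelihood[level] for level in prior}
    total = sum(unnormalised.values())
    return {level: value / total for level, value in unnormalised.items()}
```

Applying `update_belief` repeatedly with answers that are more likely under one level concentrates the belief on that level, which is the kind of convergence the simulated-interview evaluation checks for.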

Title: FreeAct: Freeing Activations for LLM Quantization

Authors: Xiaohao Liu, Xiaobo Xia, Manyi Zhang, Ji-Fu Li, Xianzhi Yu, Fei Shen, Xiu Su, See-Kiong Ng, Tat-Seng Chua
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.01776
Pdf URL: https://arxiv.org/pdf/2603.01776
Copy Paste: [[2603.01776]] FreeAct: Freeing Activations for LLM Quantization(https://arxiv.org/abs/2603.01776)
Keywords: language model, llm
Abstract: Quantization is pivotal for mitigating the significant memory and computational overhead of Large Language Models (LLMs). While emerging transformation-based methods have successfully enhanced quantization by projecting feature spaces onto smoother manifolds using orthogonal matrices, they typically enforce a rigid one-to-one transformation constraint. This static approach fails to account for the dynamic patterns inherent in input activations, particularly within diffusion LLMs (dLLMs) and Multimodal LLMs (MLLMs), where varying token types exhibit distinct distributions. To address this, we propose FreeAct, a novel quantization framework that relaxes the static one-to-one constraint to accommodate dynamic activation disparities. Theoretically, we leverage the rank-deficient nature of activations to derive a solution space that extends beyond simple inverse matrices, enabling the decoupling of activation transformations from weights. Methodologically, FreeAct identifies token-specific dynamics (i.e., vision vs. text, or masked tokens) and allocates distinct transformation matrices to the activation side, while maintaining a unified, static transformation for the weights. Extensive experiments across dLLMs and MLLMs, together with in-depth analyses, demonstrate that FreeAct significantly outperforms baselines, with up to 5.3% performance improvement. Our code will be publicly released.

Title: LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction

Authors: Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01778
Pdf URL: https://arxiv.org/pdf/2603.01778
Copy Paste: [[2603.01778]] LLM-as-an-Annotator: Training Lightweight Models with LLM-Annotated Examples for Aspect Sentiment Tuple Prediction(https://arxiv.org/abs/2603.01778)
Keywords: language model, llm, prompt
Abstract: Training models for Aspect-Based Sentiment Analysis (ABSA) tasks requires manually annotated data, which is expensive and time-consuming to obtain. This paper introduces LA-ABSA, a novel approach that leverages Large Language Model (LLM)-generated annotations to fine-tune lightweight models for complex ABSA tasks. We evaluate our approach on five datasets for Target Aspect Sentiment Detection (TASD) and Aspect Sentiment Quad Prediction (ASQP). Our approach outperformed previously reported augmentation strategies and achieved competitive performance with LLM-prompting in low-resource scenarios, while providing substantial energy efficiency benefits. For example, using 50 annotated examples for in-context learning (ICL) to guide the annotation of unlabeled data, LA-ABSA achieved an F1 score of 49.85 for ASQP on the SemEval Rest16 dataset, closely matching the performance of ICL prompting with Gemma-3-27B (51.10), while requiring significantly lower computational resources.

Title: nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models

Authors: Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01788
Pdf URL: https://arxiv.org/pdf/2603.01788
Copy Paste: [[2603.01788]] nchellwig at SemEval-2026 Task 3: Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis using Large Language Models(https://arxiv.org/abs/2603.01788)
Keywords: language model, llm, prompt
Abstract: We present Self-Consistent Structured Generation (SCSG) for Dimensional Aspect-Based Sentiment Analysis in SemEval-2026 Task 3 (Track A). SCSG enhances prediction reliability by executing a LoRA-adapted large language model multiple times per instance, retaining only tuples that achieve a majority consensus across runs. To mitigate the computational overhead of multiple forward passes, we leverage vLLM's PagedAttention mechanism for efficient key-value cache reuse. Evaluation across 6 languages and 8 language-domain combinations demonstrates that self-consistency with 15 executions yields statistically significant improvements over single-inference prompting, with our system (leveraging Gemma 3) ranking in the top seven across all settings, achieving second place on three out of four English subsets and first place on Tatar-Restaurant for DimASTE.
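The consensus filter described above (run the model several times, keep only tuples predicted in a majority of runs) can be sketched in a few lines. This is a minimal sketch that assumes tuples are compared by exact equality and that "majority" means strictly more than half of the runs; how the actual system matches tuples or aggregates dimensional scores is not stated in the abstract, and `majority_consensus` is a hypothetical name.

```python
from collections import Counter


def majority_consensus(runs):
    """Keep tuples that appear in more than half of the runs.

    runs: list of runs, each a list of hashable tuples predicted for one instance.
    A tuple is counted at most once per run.
    """
    counts = Counter(t for run in runs for t in set(run))
    return {t for t, c in counts.items() if c > len(runs) / 2}
```

For example, with three runs where `("food", "great")` appears in all three and `("service", "slow")` in only one, the function keeps only the first tuple.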

Title: ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs

Authors: Xunlei Chen, Jinyu Guo, Yuang Li, Zhaokun Wang, Yi Gong, Jie Zou, Jiwei Wei, Wenhong Tian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01792
Pdf URL: https://arxiv.org/pdf/2603.01792
Copy Paste: [[2603.01792]] ALTER: Asymmetric LoRA for Token-Entropy-Guided Unlearning of LLMs(https://arxiv.org/abs/2603.01792)
Keywords: language model, llm
Abstract: Large language models (LLMs) have advanced to encompass extensive knowledge across diverse domains. Yet controlling what an LLM should not know is important for ensuring alignment and thus safe use. However, effective unlearning in LLMs is difficult due to the fuzzy boundary between knowledge retention and forgetting. This challenge is exacerbated by entangled parameter spaces from continuous multi-domain training, often resulting in collateral damage, especially under aggressive unlearning strategies. Furthermore, the computational overhead required to optimize State-of-the-Art (SOTA) models with billions of parameters poses an additional barrier. In this work, we present ALTER, a lightweight unlearning framework for LLMs to address both the challenges of knowledge entanglement and unlearning efficiency. ALTER operates through two phases: (I) high entropy tokens are captured and learned via the shared A matrix in LoRA, followed by (II) an asymmetric LoRA architecture that achieves a specified forgetting objective by parameter isolation and unlearning tokens within the target subdomains. This serves as a new research direction for achieving unlearning via token-level isolation in the asymmetric framework. ALTER achieves SOTA performance on TOFU, WMDP, and MUSE benchmarks with over 95% forget quality and shows minimal side effects through preserving foundational tokens. By decoupling unlearning from LLMs' billion-scale parameters, this framework delivers excellent efficiency while preserving over 90% of model utility, exceeding baseline preservation rates of 47.8-83.6%.

Title: OpenAutoNLU: Open Source AutoML Library for NLU

Authors: Grigory Arshinov, Aleksandr Boriskin, Sergey Senichev, Ayaz Zaripov, Daria Galimzianova, Daniil Karpov, Leonid Sanochkin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01824
Pdf URL: https://arxiv.org/pdf/2603.01824
Copy Paste: [[2603.01824]] OpenAutoNLU: Open Source AutoML Library for NLU(https://arxiv.org/abs/2603.01824)
Keywords: language model, llm
Abstract: OpenAutoNLU is an open-source automated machine learning library for natural language understanding (NLU) tasks, covering both text classification and named entity recognition (NER). Unlike existing solutions, we introduce data-aware training regime selection that requires no manual configuration from the user. The library also provides integrated data quality diagnostics, configurable out-of-distribution (OOD) detection, and large language model (LLM) features, all within a minimal low-code API. The demo app is accessible at this https URL.

Title: Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering

Authors: Xufei Lv, Jiahui Yang, Yifu Gao, Linbo Qiao, Houde Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01853
Pdf URL: https://arxiv.org/pdf/2603.01853
Copy Paste: [[2603.01853]] Let the Agent Search: Autonomous Exploration Beats Rigid Workflows in Temporal Question Answering(https://arxiv.org/abs/2603.01853)
Keywords: language model, llm, agent
Abstract: Temporal Knowledge Graph Question Answering (TKGQA) demands multi-hop reasoning under temporal constraints. Prior approaches based on large language models (LLMs) typically rely on rigid, hand-crafted retrieval workflows or costly supervised fine-tuning. We show that simply granting an off-the-shelf LLM autonomy, that is, letting it decide what to do next, already yields substantial gains even in a strict zero-shot setting. Building on this insight, we propose AT2QA, an autonomous, training-free agent for temporal question answering that iteratively interacts with the temporal knowledge graph via a general search tool for dynamic retrieval. Experiments on MultiTQ demonstrate large improvements: AT2QA achieves 88.7% Hits@1 (+10.7% over prior SOTA), including a +20.1% gain on challenging multi-target queries, showing that agentic autonomy can decisively outperform fine-tuning for temporal question answering. Code and the full set of sampled trajectories are available on this https URL

Title: CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

Authors: Ziyi Zhu, Olivier Tieleman, Alexey Bukhtiyarov, Jinghong Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01865
Pdf URL: https://arxiv.org/pdf/2603.01865
Copy Paste: [[2603.01865]] CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation(https://arxiv.org/abs/2603.01865)
Keywords: llm
Abstract: LLM-as-judge evaluation has become standard practice for open-ended model assessment; however, judges exhibit systematic biases that cannot be eliminated by increasing the number of scenarios or generations. These biases are often similar in magnitude to the model differences that benchmarks are designed to detect, resulting in unreliable rankings when single-judge evaluations are used. This work introduces a variance decomposition that partitions benchmark score variance into scenario, generation, judge, and residual components. Based on this analysis, CyclicJudge, a round-robin assignment of judges, is demonstrated to be the optimal allocation strategy. It eliminates bias precisely while requiring each judge only once per cycle, maintaining the cost of single-judge evaluation. Empirical validation on MT-Bench supports all theoretical predictions.
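The round-robin idea above can be sketched as cycling through the judge pool so that each evaluation unit gets exactly one judge and every judge is used once per cycle. This is a minimal sketch under two assumptions that the abstract does not spell out: that a "unit" is one item to be scored (a scenario or a generation), and that judge bias is additive. The function names are hypothetical.

```python
def cyclic_assignment(n_units, judges):
    """Assign judge i mod J to unit i, so each judge is used once per cycle."""
    return [judges[i % len(judges)] for i in range(n_units)]


def mean_bias(assignment, bias):
    """Average additive judge bias picked up under a given assignment.

    bias: dict mapping judge name -> additive offset that judge applies to scores.
    """
    return sum(bias[j] for j in assignment) / len(assignment)
```

When the number of units is a whole number of cycles, `mean_bias` equals the plain average of the judges' biases. Every evaluated model then receives the same offset, so it cancels in model comparisons, while the total number of judge calls stays at one per unit, the same as single-judge evaluation.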

Title: KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models

Authors: Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, Jinan Xu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.01875
Pdf URL: https://arxiv.org/pdf/2603.01875
Copy Paste: [[2603.01875]] KDFlow: A User-Friendly and Efficient Knowledge Distillation Framework for Large Language Models(https://arxiv.org/abs/2603.01875)
Keywords: language model, llm
Abstract: Knowledge distillation (KD) is an essential technique to compress large language models (LLMs) into smaller ones. However, despite the distinct roles of the student model and the teacher model in KD, most existing frameworks still use a homogeneous training backend (e.g., FSDP and DeepSpeed) for both models, leading to suboptimal training efficiency. In this paper, we present a novel framework for LLM distillation, termed KDFlow, which features a decoupled architecture and employs SGLang for teacher inference. By bridging the training efficiency of FSDP2 and the inference efficiency of SGLang, KDFlow achieves full utilization of both advantages in a unified system. Moreover, instead of transferring full logits across different processes, our framework only transmits the teacher's hidden states using zero-copy data transfer and recomputes the logits on the student side, effectively balancing the communication cost and KD performance. Furthermore, our framework supports both off-policy and on-policy distillation and incorporates KD algorithms for cross-tokenizer KD through highly extensible and user-friendly APIs. Experiments show that KDFlow can achieve 1.44× to 6.36× speedup compared to current KD frameworks, enabling researchers to rapidly prototype and scale LLM distillation with minimal engineering overhead. Code is available at: this https URL
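The communication saving from sending hidden states instead of logits comes from the hidden width being much smaller than the vocabulary size. The following is a minimal pure-Python sketch of that trade-off, assuming the student-side process holds a copy of the teacher's output projection (called `lm_head` here) so that logits can be recomputed locally. The helper names and any sizes used with them are hypothetical and do not reflect KDFlow's actual API.

```python
def payload_elements(num_tokens, width):
    """Number of values that must cross the process boundary."""
    return num_tokens * width


def recompute_logits(hidden_states, lm_head):
    """Recompute logits as hidden_states @ lm_head.

    hidden_states: list of rows, shape [tokens][hidden]
    lm_head:       list of rows, shape [hidden][vocab]
    """
    columns = list(zip(*lm_head))  # iterate over vocabulary entries
    return [
        [sum(h * w for h, w in zip(row, col)) for col in columns]
        for row in hidden_states
    ]
```

For instance, with a hidden width of 4096 and a vocabulary of 128000 entries (illustrative numbers only), `payload_elements` shows roughly a 31-fold reduction in transferred values per token, at the cost of one extra matrix multiplication on the receiving side.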

Title: FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLMs for Everyday Knowledge Across Diverse Languages and Cultures

Authors: Liliia Bogdanova, Shiran Sun, Lifeng Han, Natalia Amat Lefort, Flor Miriam Plaza-del-Arco
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01910
Pdf URL: https://arxiv.org/pdf/2603.01910
Copy Paste: [[2603.01910]] FLANS at SemEval-2026 Task 7: RAG with Open-Sourced Smaller LLMs for Everyday Knowledge Across Diverse Languages and Cultures(https://arxiv.org/abs/2603.01910)
Keywords: llm, prompt, retrieval augmented generation
Abstract: This system paper describes our participation in SemEval-2026 Task 7, "Everyday Knowledge Across Diverse Languages and Cultures". We participated in two subtasks: Track 1, Short Answer Questions (SAQ), and Track 2, Multiple-Choice Questions (MCQ). The method we used is retrieval-augmented generation (RAG) with open-sourced smaller LLMs (OS-sLLMs). To better adapt to this shared task, we created our own culturally aware knowledge bases (CulKBs) by extracting Wikipedia content using keyword lists we prepared. We extracted both culturally aware wiki-text and country-specific wiki-summaries. In addition to the local CulKBs, we also have one system integrating live online search output via DuckDuckGo. Towards better privacy and sustainability, we aimed to deploy smaller LLMs (sLLMs) that are open-sourced on the Ollama platform. We share the prompts we developed using refinement techniques and report the learning curve of such prompts. The tested languages are English, Spanish, and Chinese for both tracks. Our resources and code are shared via this https URL

Title: Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration

Authors: Yinghao Tang, Yupeng Xie, Yingchaojie Feng, Tingfeng Lan, Wei Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01912
Pdf URL: https://arxiv.org/pdf/2603.01912
Copy Paste: [[2603.01912]] Demonstrating ViviDoc: Generating Interactive Documents through Human-Agent Collaboration(https://arxiv.org/abs/2603.01912)
Keywords: llm, agent
Abstract: Interactive articles help readers engage with complex ideas through exploration, yet creating them remains costly, requiring both domain expertise and web development skills. Recent LLM-based agents can automate content creation, but naively applying them yields uncontrollable and unverifiable outputs. We present ViviDoc, a human-agent collaborative system that generates interactive educational documents from a single topic input. ViviDoc introduces a multi-agent pipeline (Planner, Executor, Evaluator) and the Document Specification (DocSpec), a human-readable intermediate representation that decomposes each interactive visualization into State, Render, Transition, and Constraint components. The DocSpec enables educators to review and refine generation plans before code is produced, bridging the gap between pedagogical intent and executable output. Expert evaluation and a user study show that ViviDoc substantially outperforms naive agentic generation and provides an intuitive editing experience. Our project homepage is available at this https URL.

Title: AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth

Authors: Shixiang Song, He Li, Zitong Wang, Boyi Zeng, Feichen Song, Yixuan Wang, Zhiqin John Xu, Ziwei He, Zhouhan Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.01914
Pdf URL: https://arxiv.org/pdf/2603.01914
Copy Paste: [[2603.01914]] AdaPonderLM: Gated Pondering Language Models with Token-Wise Adaptive Depth(https://arxiv.org/abs/2603.01914)
Keywords: language model
Abstract: Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core idea of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train-test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. Our analysis shows the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive computation time behavior in a fully self-supervised setting. Meanwhile, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing AdaPonderLM allocates compute to the right tokens rather than just reducing average depth.
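The "monotonic halting mask" can be read as: once a token's gate fires at some iteration, that token stays halted for all later iterations, whatever the later gates say. The sketch below illustrates that reading with plain lists. The thresholding of gate outputs at 0.5 and the function names are assumptions, since the abstract does not describe how gate outputs are turned into halting decisions.

```python
def monotonic_halting_masks(gate_probs, threshold=0.5):
    """Per-iteration masks of tokens that are still being computed.

    gate_probs[t][i] is the (assumed) halting probability that the
    iteration-t gate outputs for token i. A token halts after the first
    iteration whose gate reaches the threshold and never becomes active again.
    """
    active = [True] * len(gate_probs[0])
    masks = []
    for probs in gate_probs:
        masks.append(list(active))  # tokens processed at this iteration
        active = [a and p < threshold for a, p in zip(active, probs)]
    return masks


def depth_per_token(masks):
    """Number of iterations each token actually ran."""
    return [sum(column) for column in zip(*masks)]
```

With gates `[[0.9, 0.1, 0.2], [0.1, 0.8, 0.3], [0.1, 0.1, 0.1]]`, the three tokens run for 1, 2, and 3 iterations respectively: the first token stays halted at later iterations even though its later gate values fall below the threshold.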

Title: AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations

Authors: Cheng Jiayang, Dongyu Ru, Lin Qiu, Yiyang Li, Xuezhi Cao, Yangqiu Song, Xunliang Cai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.01966
Pdf URL: https://arxiv.org/pdf/2603.01966
Copy Paste: [[2603.01966]] AMemGym: Interactive Memory Benchmarking for Assistants in Long-Horizon Conversations(https://arxiv.org/abs/2603.01966)
Keywords: llm, agent
Abstract: Long-horizon interactions between users and LLM-based assistants necessitate effective memory management, yet current approaches face challenges in training and evaluation of memory. Existing memory benchmarks rely on static, off-policy data as context, limiting evaluation reliability and scalability. To address these gaps, we introduce AMemGym, an interactive environment enabling on-policy evaluation and optimization for memory-driven personalization. AMemGym employs structured data sampling to predefine user profiles, state-dependent questions, and state evolution trajectories, enabling cost-effective generation of high-quality, evaluation-aligned interactions. LLM-simulated users expose latent states through role-play while maintaining structured state consistency. Comprehensive metrics based on structured data guide both assessment and optimization of assistants. Extensive experiments reveal performance gaps in existing memory systems (e.g., RAG, long-context LLMs, and agentic memory) and corresponding reasons. AMemGym not only enables effective selection among competing approaches but also can potentially drive the self-evolution of memory management strategies. By bridging structured state evolution with free-form interactions, our framework provides a scalable, diagnostically rich environment for advancing memory capabilities in conversational agents.

Title: CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production

Authors: Yixin Nie, Lin Guan, Zhongyao Ma, Anchit Gupta, Yipin Zhou, Xiao Li, Zhengping Zhou, Raymond Zeng, Gelin Zhou, Shigan Chu, Ajay Thampi, Wancen Mu, Nathan Shuster, Ketong Wang, Lin Chen, Jason Brewer, Derek Hao Hu, Alexander McCauley, Jason Weston, Sem Park, Na Zhang, Kevin Tang
Subjects: cs.CL, cs.AI, cs.SI
Abstract URL: https://arxiv.org/abs/2603.01973
Pdf URL: https://arxiv.org/pdf/2603.01973
Copy Paste: [[2603.01973]] CharacterFlywheel: Scaling Iterative Improvement of Engaging and Steerable LLMs in Production(https://arxiv.org/abs/2603.01973)
Keywords: language model, llm, chat
Abstract: This report presents CharacterFlywheel, an iterative flywheel process for improving large language models (LLMs) in production social chat applications across Instagram, WhatsApp, and Messenger. Starting from LLaMA 3.1, we refined models across 15 generations using data from both internal and external real-user traffic. Through continuous deployments from July 2024 to April 2025, we conducted controlled 7-day A/B tests showing consistent engagement improvements: 7 of 8 newly deployed models demonstrated positive lift over the baseline, with the strongest performers achieving up to 8.8% improvement in engagement breadth and 19.4% in engagement depth. We also observed substantial gains in steerability, with instruction following increasing from 59.2% to 84.8% and instruction violations decreasing from 26.6% to 5.8%. We detail the CharacterFlywheel process which integrates data curation, reward modeling to estimate and interpolate the landscape of engagement metrics, supervised fine-tuning (SFT), reinforcement learning (RL), and both offline and online evaluation to ensure reliable progress at each optimization step. We also discuss our methods for overfitting prevention and navigating production dynamics at scale. These contributions advance the scientific rigor and understanding of LLMs in social applications serving millions of users.

Title: MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning

Authors: Jiachun Li, Shaoping Huang, Zhuoran Jin, Chenlong Zhang, Pengfei Cao, Yubo Chen, Kang Liu, Jun Zhao
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.02024
Pdf URL: https://arxiv.org/pdf/2603.02024
Copy Paste: [[2603.02024]] MMR-Life: Piecing Together Real-life Scenes for Multimodal Multi-image Reasoning(https://arxiv.org/abs/2603.02024)
Keywords: language model, gpt, llm
Abstract: Recent progress in the reasoning capabilities of multimodal large language models (MLLMs) has empowered them to address more complex tasks such as scientific analysis and mathematical reasoning. Despite their promise, MLLMs' reasoning abilities across different scenarios in real life remain largely unexplored and lack standardized benchmarks for evaluation. To address this gap, we introduce MMR-Life, a comprehensive benchmark designed to evaluate the diverse multimodal multi-image reasoning capabilities of MLLMs across real-life scenarios. MMR-Life consists of 2,646 multiple-choice questions based on 19,108 images primarily sourced from real-world contexts, comprehensively covering seven reasoning types: abductive, analogical, causal, deductive, inductive, spatial, and temporal. Unlike existing reasoning benchmarks, MMR-Life does not rely on domain-specific expertise but instead requires models to integrate information across multiple images and apply diverse reasoning abilities. The evaluation of 37 advanced models highlights the substantial challenge posed by MMR-Life. Even top models like GPT-5 achieve only 58% accuracy and display considerable variance in performance across reasoning types. Moreover, we analyze the reasoning paradigms of existing MLLMs, exploring how factors such as thinking length, reasoning method, and reasoning type affect their performance. In summary, MMR-Life establishes a comprehensive foundation for evaluating, analyzing, and improving the next generation of multimodal reasoning systems.

Title: EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training

Authors: Aleksei Dorkin, Taido Purason, Emil Kalbaliyev, Hele-Andra Kuulmets, Marii Ojastu, Mark Fišel, Tanel Alumäe, Eleri Aedmaa, Krister Kruusmaa, Kairit Sirts
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.02041
Pdf URL: https://arxiv.org/pdf/2603.02041
Copy Paste: [[2603.02041]] EstLLM: Enhancing Estonian Capabilities in Multilingual LLMs via Continued Pretraining and Post-Training(https://arxiv.org/abs/2603.02041)
Keywords: language model, llm, chat
Abstract: Large language models (LLMs) are predominantly trained on English-centric data, resulting in uneven performance for smaller languages. We study whether continued pretraining (CPT) can substantially improve Estonian capabilities in a pretrained multilingual LLM while preserving its English and general reasoning performance. Using Llama 3.1 8B as the main base model, we perform CPT on a mixture that increases Estonian exposure while approximating the original training distribution through English replay and the inclusion of code, mathematics, and instruction-like data. We subsequently apply supervised fine-tuning, preference optimization, and chat vector merging to introduce robust instruction-following behavior. Evaluation on a comprehensive suite of Estonian benchmarks shows consistent gains in linguistic competence, knowledge, reasoning, translation quality, and instruction-following compared to the original base model and its instruction-tuned variant, while maintaining competitive performance on English benchmarks. These findings indicate that CPT, with an appropriately balanced data mixture, together with post-training alignment, can substantially improve single-language capabilities in pretrained multilingual LLMs.

Title: What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies

Authors: Zhenghao Herbert Zhou, William Dai, Maya Viswanathan, Simon Charlow, R. Thomas McCoy, Robert Frank
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02082
Pdf URL: https://arxiv.org/pdf/2603.02082
Copy Paste: [[2603.02082]] What Exactly do Children Receive in Language Acquisition? A Case Study on CHILDES with Automated Detection of Filler-Gap Dependencies(https://arxiv.org/abs/2603.02082)
Keywords: language model
Abstract: Children's acquisition of filler-gap dependencies has been argued by some to depend on innate grammatical knowledge, while others suggest that the distributional evidence available in child-directed speech suffices. Unfortunately, the relevant input is difficult to quantify at scale with fine granularity, making this question difficult to resolve. We present a system that identifies three core filler-gap constructions in spoken English corpora -- matrix wh-questions, embedded wh-questions, and relative clauses -- and further identifies the extraction site (i.e., subject vs. object vs. adjunct). Our approach combines constituency and dependency parsing, leveraging their complementary strengths for construction classification and extraction site identification. We validate the system on human-annotated data and find that it scores well across most categories. Applying the system to 57 English CHILDES corpora, we are able to characterize children's filler-gap input and their filler-gap production trajectories over the course of development, including construction-specific frequencies and extraction-site asymmetries. The resulting fine-grained labels enable future work in both acquisition and computational studies, which we demonstrate with a case study using filtered corpus training with language models.

Title: ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels

Authors: Xiang Zheng, Han Li, Wenjie Luo, Weiqi Zhai, Yiyuan Li, Chuanmiao Yan, Tianyi Tang, Yubo Ma, Kexin Yang, Dayiheng Liu, Hu Wei, Bing Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02097
Pdf URL: https://arxiv.org/pdf/2603.02097
Copy Paste: [[2603.02097]] ClinConsensus: A Consensus-Based Benchmark for Evaluating Chinese Medical LLMs across Difficulty Levels(https://arxiv.org/abs/2603.02097)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly applied to health management, showing promise across disease prevention, clinical decision-making, and long-term care. However, existing medical benchmarks remain largely static and task-isolated, failing to capture the openness, longitudinal structure, and safety-critical complexity of real-world clinical workflows. We introduce ClinConsensus, a Chinese medical benchmark curated, validated, and quality-controlled by clinical experts. ClinConsensus comprises 2500 open-ended cases spanning the full continuum of care--from prevention and intervention to long-term follow-up--covering 36 medical specialties, 12 common clinical task types, and progressively increasing levels of complexity. To enable reliable evaluation of such complex scenarios, we adopt a rubric-based grading protocol and propose the Clinically Applicable Consistency Score (CACS@k). We further introduce a dual-judge evaluation framework, combining a high-capability LLM-as-judge with a distilled, locally deployable judge model trained via supervised fine-tuning, enabling scalable and reproducible evaluation aligned with physician judgment. Using ClinConsensus, we conduct a comprehensive assessment of several leading LLMs and reveal substantial heterogeneity across task themes, care stages, and medical specialties. While top-performing models achieve comparable overall scores, they differ markedly in reasoning, evidence use, and longitudinal follow-up capabilities, and clinically actionable treatment planning remains a key bottleneck. We release ClinConsensus as an extensible benchmark to support the development and evaluation of medical LLMs that are robust, clinically grounded, and ready for real-world deployment.

Title: Recursive Think-Answer Process for LLMs and VLMs

Authors: Byung-Kwan Lee, Youngchae Chee, Yong Man Ro
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02099
Pdf URL: https://arxiv.org/pdf/2603.02099
Copy Paste: [[2603.02099]] Recursive Think-Answer Process for LLMs and VLMs(https://arxiv.org/abs/2603.02099)
Keywords: language model, llm
Abstract: Think-Answer reasoners such as DeepSeek-R1 have made notable progress by leveraging interpretable internal reasoning. However, despite the frequent presence of self-reflective cues like "Oops!", they remain vulnerable to output errors during single-pass inference. To address this limitation, we propose an efficient Recursive Think-Answer Process (R-TAP) that enables models to engage in iterative reasoning cycles and generate more accurate answers, going beyond conventional single-pass approaches. Central to this approach is a confidence generator that evaluates the certainty of model responses and guides subsequent improvements. By incorporating two complementary rewards, the Recursively Confidence Increase Reward and the Final Answer Confidence Reward, we show that R-TAP-enhanced models consistently outperform conventional single-pass methods for both large language models (LLMs) and vision-language models (VLMs). Moreover, by analyzing the frequency of "Oops"-like expressions in model responses, we find that R-TAP-applied models exhibit significantly fewer self-reflective patterns, resulting in more stable and faster inference-time reasoning. We hope R-TAP paves the way for efficient and elaborate methods to refine the reasoning processes of future AI.

Title: LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations

Authors: Veronika Solopova, Viktoria Skorik, Maksym Tereshchenko, Alina Haidun, Ostap Vykhopen
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.02128
Pdf URL: https://arxiv.org/pdf/2603.02128
Copy Paste: [[2603.02128]] LLMs as Strategic Actors: Behavioral Alignment, Risk Calibration, and Argumentation Framing in Geopolitical Simulations(https://arxiv.org/abs/2603.02128)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are increasingly proposed as agents in strategic decision environments, yet their behavior in structured geopolitical simulations remains under-researched. We evaluate six popular state-of-the-art LLMs alongside human results across four real-world crisis simulation scenarios, requiring models to select predefined actions and justify their decisions across multiple rounds. We compare models to humans in action alignment, risk calibration through chosen actions' severity, and argumentative framing grounded in international relations theory. Results show that models approximate human decision patterns in base simulation rounds but diverge over time, displaying distinct behavioural profiles and strategy updates. LLM explanations for chosen actions across all models exhibit a strong normative-cooperative framing centered on stability, coordination, and risk mitigation, with limited adversarial reasoning.

Title: LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards

Authors: Guanzheng Chen, Michael Qizhe Shieh, Lidong Bing
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02146
Pdf URL: https://arxiv.org/pdf/2603.02146
Copy Paste: [[2603.02146]] LongRLVR: Long-Context Reinforcement Learning Requires Verifiable Context Rewards(https://arxiv.org/abs/2603.02146)
Keywords: language model, llm
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios, as its reliance on internal parametric knowledge is ill-suited for tasks requiring contextual grounding--the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model for identifying relevant evidence. We formally prove that the outcome-only reward leads to significant vanishing gradients for the context grounding process, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR to augment the sparse answer reward with a dense and verifiable context reward. This auxiliary signal directly incentivizes the model for selecting the correct grounding information, providing a robust learning gradient that solves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms the standard RLVR across all models and benchmarks, e.g., boosting a 14B model's scores on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at this https URL.
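Sketch (not from the paper): the abstract describes augmenting a sparse final-answer reward with a dense, verifiable context reward. One plausible shape for such a signal is illustrated below, assuming the model outputs the IDs of the context chunks it relied on and that gold evidence IDs are known. The F1 form of the context reward, the `weight` parameter, and both function names are assumptions made for this example; the paper's actual reward definition may differ.

```python
def context_reward(selected, gold):
    """F1 overlap between selected chunk IDs and gold evidence IDs (assumed form)."""
    selected, gold = set(selected), set(gold)
    hits = len(selected & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(selected)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)

def combined_reward(answer_correct, selected, gold, weight=0.5):
    """Sparse answer reward (0 or 1) plus a weighted dense context reward."""
    return float(answer_correct) + weight * context_reward(selected, gold)

# A wrong answer that still found half of the evidence gets partial credit,
# so the learning signal is not zero when the final answer is wrong.
partial = combined_reward(False, selected=[1, 2], gold=[2, 3])
full = combined_reward(True, selected=[2, 3], gold=[2, 3])
```

The point of the sketch is only that a response with the wrong final answer can still receive a nonzero, checkable reward for grounding, which an outcome-only reward cannot provide.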

Title: Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER)

Authors: Miguel Lopez-Duran, Julian Fierrez, Aythami Morales, Daniel DeAlcala, Gonzalo Mancera, Javier Irigoyen, Ruben Tolosana, Oscar Delgado, Francisco Jurado, Alvaro Ortigosa
Subjects: cs.CL, cs.AI, cs.DB
Abstract URL: https://arxiv.org/abs/2603.02150
Pdf URL: https://arxiv.org/pdf/2603.02150
Copy Paste: [[2603.02150]] Zero- and Few-Shot Named-Entity Recognition: Case Study and Dataset in the Crime Domain (CrimeNER)(https://arxiv.org/abs/2603.02150)
Keywords: language model
Abstract: The extraction of critical information from crime-related documents is a crucial task for law enforcement agencies. Named-Entity Recognition (NER) can perform this task by extracting information about the crime, the criminal, or the law enforcement agencies involved. However, there is a considerable lack of adequately annotated data on general real-world crime scenarios. To address this issue, we present CrimeNER, a case study of crime-related zero- and few-shot NER, and a general crime-related Named-Entity Recognition database (CrimeNERdb) consisting of more than 1.5k annotated documents for the NER task extracted from public reports on terrorist attacks and the U.S. Department of Justice's press notes. We define 5 types of coarse crime entity and a total of 22 types of fine-grained entity. We assess the quality of the case study and the annotated data with experiments in zero- and few-shot settings with state-of-the-art NER models as well as generalist and commonly used Large Language Models.

Title: Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale

Authors: Hao Li, Chunjiang Mu, Jianhao Chen, Siyue Ren, Zhiyao Cui, Yiqun Zhang, Lei Bai, Shuyue Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02176
Pdf URL: https://arxiv.org/pdf/2603.02176
Copy Paste: [[2603.02176]] Organizing, Orchestrating, and Benchmarking Agent Skills at Ecosystem Scale(https://arxiv.org/abs/2603.02176)
Keywords: llm, agent
Abstract: The rapid proliferation of Claude agent skills has raised the central question of how to effectively leverage, manage, and scale the agent skill ecosystem. In this paper, we propose AgentSkillOS, the first principled framework for skill selection, orchestration, and ecosystem-level management. AgentSkillOS comprises two stages: (i) Manage Skills, which organizes skills into a capability tree via node-level recursive categorization for efficient discovery; and (ii) Solve Tasks, which retrieves, orchestrates, and executes multiple skills through DAG-based pipelines. To evaluate the agent's ability to invoke skills, we construct a benchmark of 30 artifact-rich tasks across five categories: data computation, document creation, motion video, visual design, and web interaction. We assess the quality of task outputs using LLM-based pairwise evaluation, and the results are aggregated via a Bradley-Terry model to produce unified quality scores. Experiments across three skill ecosystem scales (200 to 200K skills) show that tree-based retrieval effectively approximates oracle skill selection, and that DAG-based orchestration substantially outperforms native flat invocation even when given the identical skill set. These findings confirm that structured composition is the key to unlocking skill potential. Our GitHub repository is available at: this https URL.
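Sketch (not from the paper): the abstract says pairwise judgments are aggregated via a Bradley-Terry model. The following is a minimal illustration of fitting such a model with the standard minorization-maximization update, assuming the judgments are summarized as a win-count matrix. The function name, the iteration count, and the example data are hypothetical, and the sketch assumes every system wins at least one comparison.

```python
import numpy as np

def bradley_terry(wins, iters=200):
    """Estimate Bradley-Terry strengths from pairwise win counts.

    wins[i][j] is the number of comparisons in which system i beat system j.
    Returns strengths normalized to sum to 1 (higher means preferred more often).
    Assumes each system has at least one win, so no strength collapses to zero.
    """
    wins = np.asarray(wins, dtype=float)
    games = wins + wins.T             # comparisons played between each pair
    total_wins = wins.sum(axis=1)
    p = np.full(wins.shape[0], 1.0 / wins.shape[0])
    for _ in range(iters):
        denom = games / (p[:, None] + p[None, :])
        np.fill_diagonal(denom, 0.0)  # a system is never compared with itself
        p = total_wins / denom.sum(axis=1)
        p = p / p.sum()
    return p

# Three hypothetical systems compared six times per pair.
scores = bradley_terry([[0, 4, 5],
                        [2, 0, 4],
                        [1, 2, 0]])
```

Under this model the fitted probability that system i is preferred over system j is `scores[i] / (scores[i] + scores[j])`, which is what lets many pairwise outcomes collapse into one score per system.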

Title: Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training

Authors: Valentin Lacombe, Valentin Quesnel, Damien Sileo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.02208
Pdf URL: https://arxiv.org/pdf/2603.02208
Copy Paste: [[2603.02208]] Reasoning Core: A Scalable Procedural Data Generation Suite for Symbolic Pre-training and Post-Training(https://arxiv.org/abs/2603.02208)
Keywords: language model, gpt
Abstract: Training on verifiable symbolic data is a promising way to expand the reasoning frontier of language models beyond what standard pre-training corpora provide. Yet existing procedural generators often rely on fixed puzzles or templates and do not deliver the distributional breadth needed at scale. We introduce Reasoning Core, a scalable suite that procedurally generates verifiable symbolic reasoning data across core formal domains: PDDL planning over randomized domains, first-order logic with equality, context-free grammar parsing and generation, causal reasoning over random Bayesian networks, and systems of equations. Each task is paired with an external solver for rigorous verification and admits continuous difficulty control for curriculum design. Examples can optionally include solver-derived reasoning traces, enabling supervised training from the earliest pre-training stages, and the same interface provides verifiable reward functions for reinforcement learning. Our experiments show that mixing Reasoning Core data into pre-training improves downstream reasoning while preserving, or slightly improving, language modeling quality. Zero-shot evaluations confirm these tasks challenge frontier models such as GPT-5. The code and data are publicly available under the MIT license.
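Sketch (not from the paper): the abstract describes procedurally generated tasks with a difficulty control and a verifiable reward, with systems of equations as one domain. The following toy generator illustrates that pattern only; it is not the Reasoning Core interface. The function names, the value ranges, and the use of direct substitution in place of an external solver are assumptions made for this example.

```python
import random

def generate_linear_system(n_vars, seed=0):
    """Generate an integer linear system with a known unique solution.

    n_vars acts as a simple difficulty knob. Each diagonal entry is made larger
    than the sum of absolute off-diagonal entries in its row, so the matrix is
    strictly diagonally dominant and therefore invertible.
    """
    rng = random.Random(seed)
    solution = [rng.randint(-5, 5) for _ in range(n_vars)]
    matrix = []
    for i in range(n_vars):
        row = [rng.randint(-2, 2) for _ in range(n_vars)]
        row[i] = sum(abs(v) for j, v in enumerate(row) if j != i) + 1
        matrix.append(row)
    rhs = [sum(a * x for a, x in zip(row, solution)) for row in matrix]
    return matrix, rhs, solution

def reward(matrix, rhs, candidate):
    """Return 1.0 if the candidate satisfies every equation exactly, else 0.0."""
    ok = len(candidate) == len(rhs) and all(
        sum(a * x for a, x in zip(row, candidate)) == b
        for row, b in zip(matrix, rhs)
    )
    return 1.0 if ok else 0.0

matrix, rhs, solution = generate_linear_system(n_vars=3, seed=1)
wrong = [solution[0] + 1] + solution[1:]
```

Because the system has a unique solution, checking a candidate by substitution is equivalent to comparing it with the stored answer, so the same function can serve as an evaluation check or as a binary reward.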