2026-03-26

Title: Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking

Authors: Tianpeng Zheng, Zhehan Jiang, Jiayi Liu, Shicong Feng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23506
Pdf URL: https://arxiv.org/pdf/2603.23506
Copy Paste: [[2603.23506]] Leveraging Computerized Adaptive Testing for Cost-effective Evaluation of Large Language Models in Medical Benchmarking(https://arxiv.org/abs/2603.23506)
Keywords: language model, llm
Abstract: The rapid proliferation of large language models (LLMs) in healthcare creates an urgent need for scalable and psychometrically sound evaluation methods. Conventional static benchmarks are costly to administer repeatedly, vulnerable to data contamination, and lack calibrated measurement properties for fine-grained performance tracking. We propose and validate a computerized adaptive testing (CAT) framework grounded in item response theory (IRT) for efficient assessment of standardized medical knowledge in LLMs. The study comprises a two-phase design: a Monte Carlo simulation to identify optimal CAT configurations and an empirical evaluation of 38 LLMs using a human-calibrated medical item bank. Each model completed both the full item bank and an adaptive test that dynamically selected items based on real-time ability estimates and terminated upon reaching a predefined reliability threshold (standard error <= 0.3). Results show that CAT-derived proficiency estimates achieved a near-perfect correlation with full-bank estimates (r = 0.988) while using only 1.3 percent of the items. Evaluation time was reduced from several hours to minutes per model, with substantial reductions in token usage and computational cost, while preserving inter-model performance rankings. This work establishes a psychometric framework for rapid, low-cost benchmarking of foundational medical knowledge in LLMs. The proposed adaptive methodology is intended as a standardized pre-screening and continuous monitoring tool and is not a substitute for real-world clinical validation or safety-oriented prospective studies.
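The adaptive loop described in the abstract (select an item from the current ability estimate, re-estimate, stop once the standard error is at or below 0.3) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 2PL response model, maximum-information item selection, EAP scoring with a standard normal prior, and the simulated item bank are all assumptions. Only the stopping threshold comes from the abstract.

```python
import math
import random

def p_correct(theta, a, b):
    """2PL item response function (assumed model): P(correct | ability theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def eap_estimate(responses, grid):
    """Posterior mean and SD of ability on a grid, with a N(0, 1) prior (assumed)."""
    log_w = []
    for theta in grid:
        lw = -0.5 * theta * theta  # log prior, up to a constant
        for (a, b), correct in responses:
            p = p_correct(theta, a, b)
            lw += math.log(p if correct else 1.0 - p)
        log_w.append(lw)
    top = max(log_w)
    w = [math.exp(lw - top) for lw in log_w]  # subtract the max for stability
    total = sum(w)
    mean = sum(t * wi for t, wi in zip(grid, w)) / total
    var = sum((t - mean) ** 2 * wi for t, wi in zip(grid, w)) / total
    return mean, math.sqrt(var)

def run_cat(bank, answer_fn, se_threshold=0.3, max_items=None):
    """Administer items adaptively until SE <= se_threshold or the bank runs out."""
    grid = [x / 10.0 for x in range(-40, 41)]
    remaining = list(range(len(bank)))
    responses = []
    theta, se = 0.0, float("inf")
    max_items = max_items or len(bank)
    while remaining and len(responses) < max_items and se > se_threshold:
        # Pick the unused item that is most informative at the current estimate.
        best = max(remaining, key=lambda i: item_information(theta, *bank[i]))
        remaining.remove(best)
        responses.append((bank[best], answer_fn(best)))
        theta, se = eap_estimate(responses, grid)
    return theta, se, len(responses)

# Hypothetical simulated bank of (discrimination, difficulty) pairs and examinee.
rng = random.Random(0)
bank = [(rng.uniform(0.8, 2.0), rng.uniform(-3.0, 3.0)) for _ in range(200)]
true_theta = 1.0

def simulated_answer(i):
    a, b = bank[i]
    return rng.random() < p_correct(true_theta, a, b)

theta_hat, se, n_used = run_cat(bank, simulated_answer)
print(f"estimate={theta_hat:.2f} se={se:.2f} items_used={n_used}/{len(bank)}")
```

In an LLM setting, `answer_fn` would wrap a model call that scores one item, so the number of model queries equals the number of administered items.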

Title: Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes

Authors: Fangyu Ding, Ding Ding, Sijin Chen, Kaibo Wang, Peng Xu, Zijin Feng, Haoli Bai, Kai Han, Youliang Yan, Binhang Yuan, Jiacheng Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.23507
Pdf URL: https://arxiv.org/pdf/2603.23507
Copy Paste: [[2603.23507]] Beyond Masks: Efficient, Flexible Diffusion Language Models via Deletion-Insertion Processes(https://arxiv.org/abs/2603.23507)
Keywords: language model
Abstract: While Masked Diffusion Language Models (MDLMs) relying on token masking and unmasking have shown promise in language modeling, their computational efficiency and generation flexibility remain constrained by the masking paradigm. In this paper, we propose Deletion-Insertion Diffusion language models (DID) that rigorously formulate token deletion and insertion as discrete diffusion processes, replacing the masking and unmasking processes in current MDLMs. DID improves training and inference efficiency by eliminating two major sources of computational overhead in MDLMs: computations on non-informative tokens that are 1) inherent to the masking paradigm, and 2) introduced in variable-length settings. Furthermore, DID offers greater flexibility by 1) natively supporting variable-length sequences without requiring fixed-length padding, and 2) providing an intrinsic self-correction mechanism during generation, because insertion dynamically adjusts token positions. To train DID, we design a score-based approach that assigns scores to token insertion operations and derive appropriate training objectives. The objectives involve subsequence counting problems, which we efficiently solve via a parallelized dynamic programming algorithm. Our experiments across fixed and variable-length settings demonstrate the advantage of DID over baselines of MDLMs and existing insertion-based LMs, in terms of modeling performance, sampling quality, and training/inference speed, without any hyperparameter tuning.
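The abstract does not spell out which subsequence counting problems arise in the DID objectives. As a rough illustration of the general problem class, the sketch below counts in how many ways a shorter token sequence can be embedded in a longer one, which is the same as counting the deletion patterns that turn the longer sequence into the shorter one. It is the textbook sequential dynamic program, not the parallelized algorithm the paper refers to, and the function name is hypothetical.

```python
def count_subsequence_embeddings(short, long):
    """Count the ways `short` occurs as a (not necessarily contiguous) subsequence of `long`."""
    # ways[j] = number of embeddings of short[:j] in the prefix of `long` seen so far
    ways = [0] * (len(short) + 1)
    ways[0] = 1  # the empty sequence embeds in exactly one way
    for token in long:
        # Iterate right to left so each token of `long` extends an embedding at most once.
        for j in range(len(short), 0, -1):
            if short[j - 1] == token:
                ways[j] += ways[j - 1]
    return ways[len(short)]

print(count_subsequence_embeddings("ab", "aabb"))  # 2 choices of 'a' times 2 choices of 'b'
```

The running time is proportional to `len(short) * len(long)` and the memory to `len(short)`.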

Title: Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

Authors: Xunzhuo Liu, Bowei He, Xue Liu, Haichen Zhang, Huamin Chen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2603.23508
Pdf URL: https://arxiv.org/pdf/2603.23508
Copy Paste: [[2603.23508]] Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems(https://arxiv.org/abs/2603.23508)
Keywords: language model, long context, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) is increasingly deployed in enterprise search and document-centric assistants, where responses must be grounded in long and complex source materials. In practice, verifying that generated answers faithfully reflect retrieved documents is difficult: large language models can check long contexts but are too slow and costly for interactive services, while lightweight classifiers operate within strict context limits and frequently miss evidence outside truncated passages. We present the design of a real-time verification component integrated into a production RAG pipeline that enables full-document grounding under latency constraints. The system processes documents up to 32K tokens and employs adaptive inference strategies to balance response time and verification coverage across workloads. We describe the architectural decisions, operational trade-offs, and evaluation methodology used to deploy the verifier, and show that full-context verification substantially improves detection of unsupported responses compared with truncated validation. Our experience highlights when long-context verification is necessary, why chunk-based checking often fails in real documents, and how latency budgets shape model design. These findings provide practical guidance for practitioners building reliable large-scale retrieval-augmented applications. (Model, benchmark, and code: this https URL)

Title: Internal Safety Collapse in Frontier Large Language Models

Authors: Yutao Wu, Xiao Liu, Yifeng Gao, Xiang Zheng, Hanxun Huang, Yige Li, Cong Wang, Bo Li, Xingjun Ma, Yu-Gang Jiang
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2603.23509
Pdf URL: https://arxiv.org/pdf/2603.23509
Copy Paste: [[2603.23509]] Internal Safety Collapse in Frontier Large Language Models(https://arxiv.org/abs/2603.23509)
Keywords: language model, gpt, llm
Abstract: This work identifies a critical failure mode in frontier large language models (LLMs), which we term Internal Safety Collapse (ISC): under certain task conditions, models enter a state in which they continuously generate harmful content while executing otherwise benign tasks. We introduce TVD (Task, Validator, Data), a framework that triggers ISC through domain tasks where generating harmful content is the only valid completion, and construct ISC-Bench containing 53 scenarios across 8 professional disciplines. Evaluated on JailbreakBench, three representative scenarios yield worst-case safety failure rates averaging 95.3% across four frontier LLMs (including GPT-5.2 and Claude Sonnet 4.5), substantially exceeding standard jailbreak attacks. Frontier models are more vulnerable than earlier LLMs: the very capabilities that enable complex task execution become liabilities when tasks intrinsically involve harmful content. This reveals a growing attack surface: almost every professional domain uses tools that process sensitive data, and each new dual-use tool automatically expands this vulnerability--even without any deliberate attack. Despite substantial alignment efforts, frontier LLMs retain inherently unsafe internal capabilities: alignment reshapes observable outputs but does not eliminate the underlying risk profile. These findings underscore the need for caution when deploying LLMs in high-stakes settings. Source code: this https URL

Title: Visuospatial Perspective Taking in Multimodal Language Models

Authors: Jonathan Prunty, Seraphina Zhang, Patrick Quinn, Jianxun Lian, Xing Xie, Lucy Cheke
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23510
Pdf URL: https://arxiv.org/pdf/2603.23510
Copy Paste: [[2603.23510]] Visuospatial Perspective Taking in Multimodal Language Models(https://arxiv.org/abs/2603.23510)
Keywords: language model
Abstract: As multimodal language models (MLMs) are increasingly used in social and collaborative settings, it is crucial to evaluate their perspective-taking abilities. Existing benchmarks largely rely on text-based vignettes or static scene understanding, leaving visuospatial perspective-taking (VPT) underexplored. We adapt two evaluation tasks from human studies: the Director Task, assessing VPT in a referential communication paradigm, and the Rotating Figure Task, probing perspective-taking across angular disparities. Across tasks, MLMs show pronounced deficits in Level 2 VPT, which requires inhibiting one's own perspective to adopt another's. These results expose critical limitations in current MLMs' ability to represent and reason about alternative perspectives, with implications for their use in collaborative contexts.

Title: DISCO: Document Intelligence Suite for COmparative Evaluation

Authors: Kenza Benkirane, Dan Goldwater, Martin Asenov, Aneiss Ghodsi
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.23511
Pdf URL: https://arxiv.org/pdf/2603.23511
Copy Paste: [[2603.23511]] DISCO: Document Intelligence Suite for COmparative Evaluation(https://arxiv.org/abs/2603.23511)
Keywords: language model, prompt
Abstract: Document intelligence requires accurate text extraction and reliable reasoning over document content. We introduce DISCO, a Document Intelligence Suite for COmparative Evaluation, that evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents. Our evaluation shows that performance varies substantially across tasks and document characteristics, underscoring the need for complexity-aware approach selection. OCR pipelines are generally more reliable for handwriting and for long or multi-page documents, where explicit text grounding supports text-heavy reasoning, while VLMs perform better on multilingual text and visually rich layouts. Task-aware prompting yields mixed effects, improving performance on some document types while degrading it on others. These findings provide empirical guidance for selecting document processing strategies based on document structure and reasoning demands.

Title: S-Path-RAG: Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering

Authors: Rong Fu, Yemin Wang, Tianxiang Xu, Yongtai Liu, Weizhi Tang, Wangyu Wu, Xiaowen Ma, Simon Fong
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.23512
Pdf URL: https://arxiv.org/pdf/2603.23512
Copy Paste: [[2603.23512]] S-Path-RAG: Semantic-Aware Shortest-Path Retrieval Augmented Generation for Multi-Hop Knowledge Graph Question Answering(https://arxiv.org/abs/2603.23512)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: We present S-Path-RAG, a semantic-aware shortest-path Retrieval-Augmented Generation framework designed to improve multi-hop question answering over large knowledge graphs. S-Path-RAG departs from one-shot, text-heavy retrieval by enumerating bounded-length, semantically weighted candidate paths using a hybrid weighted $k$-shortest, beam, and constrained random-walk strategy, learning a differentiable path scorer together with a contrastive path encoder and lightweight verifier, and injecting a compact soft mixture of selected path latents into a language model via cross-attention. The system runs inside an iterative Neural-Socratic Graph Dialogue loop in which concise diagnostic messages produced by the language model are mapped to targeted graph edits or seed expansions, enabling adaptive retrieval when the model expresses uncertainty. This combination yields a retrieval mechanism that is both token-efficient and topology-aware while preserving interpretable path-level traces for diagnostics and intervention. We validate S-Path-RAG on standard multi-hop KGQA benchmarks and through ablations and diagnostic analyses. The results demonstrate consistent improvements in answer accuracy, evidence coverage, and end-to-end efficiency compared to strong graph- and LLM-based baselines. We further analyze trade-offs between semantic weighting, verifier filtering, and iterative updates, and report practical recommendations for deployment under constrained compute and token budgets.
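To make "bounded-length, weighted candidate paths" concrete, the sketch below enumerates the k cheapest simple paths between two nodes under a hop limit, using a plain best-first search over partial paths. It is an illustration only: the abstract describes a hybrid of weighted k-shortest, beam, and constrained random-walk strategies whose details are not given here, and the toy graph, the edge weights standing in for semantic costs, and the function name are all assumptions.

```python
import heapq

def k_cheapest_paths(graph, source, target, k, max_hops):
    """Return up to k simple paths from source to target, cheapest first.

    graph maps a node to a list of (neighbor, non-negative weight) pairs.
    Paths with more than max_hops edges are not explored.
    """
    heap = [(0.0, [source])]  # (accumulated cost, partial path)
    found = []
    while heap and len(found) < k:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == target:
            # Weights are non-negative, so complete paths pop in cost order.
            found.append((cost, path))
            continue
        if len(path) - 1 >= max_hops:
            continue  # hop budget exhausted
        for neighbor, weight in graph.get(node, []):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                heapq.heappush(heap, (cost + weight, path + [neighbor]))
    return found

# Hypothetical toy knowledge graph; lower weight means a more relevant edge.
toy_graph = {
    "A": [("B", 1.0), ("C", 2.5)],
    "B": [("C", 1.0), ("D", 4.0)],
    "C": [("D", 1.0)],
}
for cost, path in k_cheapest_paths(toy_graph, "A", "D", k=2, max_hops=3):
    print(cost, " -> ".join(path))
```

The search can grow exponentially on dense graphs, which is one reason a practical system would combine it with pruning such as beam search.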

Title: Berta: an open-source, modular tool for AI-enabled clinical documentation

Authors: Samridhi Vaid, Mike Weldon, Jesse Dunn, Sacha Davis, Kevin Lonergan, Henry Li, Jeffrey Franc, Mohamed Abdalla, Daniel C. Baumgart, Jake Hayward, J Ross Mitchell
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.23513
Pdf URL: https://arxiv.org/pdf/2603.23513
Copy Paste: [[2603.23513]] Berta: an open-source, modular tool for AI-enabled clinical documentation(https://arxiv.org/abs/2603.23513)
Keywords: language model
Abstract: Commercial AI scribes cost $99-600 per physician per month, operate as opaque systems, and do not return data to institutional infrastructure, limiting organizational control over data governance, quality improvement, and clinical workflows. We developed Berta, an open-source modular scribe platform for AI-enabled clinical documentation, and deployed a customized implementation within Alberta Health Services (AHS) integrated with their existing Snowflake AI Data Cloud infrastructure. The system combines automatic speech recognition with large language models while retaining all clinical data within the secure AHS environment. During eight months (November 2024 to July 2025), 198 emergency physicians used the system in 105 urban and rural facilities, generating 22148 clinical sessions and more than 2800 hours of audio. Usage grew from 680 to 5530 monthly sessions. Operating costs averaged less than $30 per physician per month, a 70-95% reduction compared to commercial alternatives. AHS has since approved expansion to 850 physicians. This is the first provincial-scale deployment of an AI scribe integrated with existing health system infrastructure. By releasing Berta as open source, we provide a reproducible, cost-effective alternative that health systems can adapt to their own secure environments, supporting data sovereignty and informed evaluation of AI documentation technology.
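The 70-95% figure follows from the prices quoted in the abstract. A quick check, taking $30 as the upper bound on Berta's monthly cost per physician (the abstract says "less than $30", so the true reductions are at least these values):

```python
def percent_reduction(baseline, new):
    """Percentage by which `new` undercuts `baseline`."""
    return 100.0 * (baseline - new) / baseline

berta_cost = 30.0  # upper bound on monthly cost per physician, from the abstract
low = percent_reduction(99.0, berta_cost)    # against the cheapest commercial scribe
high = percent_reduction(600.0, berta_cost)  # against the most expensive one
print(f"{low:.1f}% to {high:.1f}%")
```

The lower end rounds to 70% and the upper end is 95%, matching the range stated in the abstract.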

Title: DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models

Authors: Alexander Sheppert
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.23514
Pdf URL: https://arxiv.org/pdf/2603.23514
Copy Paste: [[2603.23514]] DepthCharge: A Domain-Agnostic Framework for Measuring Depth-Dependent Knowledge in Large Language Models(https://arxiv.org/abs/2603.23514)
Keywords: language model, llm
Abstract: Large Language Models appear competent when answering general questions but often fail when pushed into domain-specific details. No existing methodology provides an out-of-the-box solution for measuring how deeply LLMs can sustain accurate responses under adaptive follow-up questioning across arbitrary domains. We present DepthCharge, a domain-agnostic framework that measures knowledge depth through three innovations: adaptive probing that generates follow-up questions based on concepts the model actually mentions, on-demand fact verification from authoritative sources, and survival statistics with constant sample sizes at every depth level. The framework can be deployed on any knowledge domain with publicly verifiable facts, without requiring pre-constructed test sets or domain-specific expertise. DepthCharge results are relative to the evaluator model used for answer checking, making the framework a tool for comparative evaluation rather than absolute accuracy certification. Empirical validation across four diverse domains (Medicine, Constitutional Law, Ancient Rome, and Quantum Computing) with five frontier models demonstrates that DepthCharge reveals depth-dependent performance variation hidden by standard benchmarks. Expected Valid Depth (EVD) ranges from 3.45 to 7.55 across model-domain combinations, and model rankings vary substantially by domain, with no single model dominating all areas. Cost-performance analysis further reveals that expensive models do not always achieve deeper knowledge, suggesting that domain-specific evaluation is more informative than aggregate benchmarks for model selection in professional applications.

Title: Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data

Authors: John Cook, Michael Wyatt, Peng Wei, Iris Chin, Santosh Gupta, Van Zyl Van Vuuren, Richie Siburian, Amanda Spicer, Kristen Viviano, Alda Cami, Raunaq Malhotra, Zhewei Yao, Jeff Rasley, Gaurav Kaushik
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23515
Pdf URL: https://arxiv.org/pdf/2603.23515
Copy Paste: [[2603.23515]] Training a Large Language Model for Medical Coding Using Privacy-Preserving Synthetic Clinical Data(https://arxiv.org/abs/2603.23515)
Keywords: language model, agent
Abstract: Improving the accuracy and reliability of medical coding reduces clinician burnout and supports revenue cycle processes, freeing providers to focus more on patient care. However, automating the assignment of ICD-10-CM and CPT codes from clinical documentation remains a challenge due to heterogeneous records, nuanced coding guidelines, and long-tail distributions. Large language models have been proposed to help or automate specific medical coding tasks. However, foundation models are not explicitly trained for medical coding and zero-shot coding has yielded poor results. We investigate whether a modern open-weight foundation model can be adapted for an expert-level medical coding task using privacy-preserving synthetic training data derived from electronic health records. We fine-tune Llama 3-70B on pairs of clinical notes and gold codes generated from EHR-grounded templates and coding policies, then evaluate exact-code prediction for ICD-10-CM and CPT. A zero-shot baseline with the unadapted model achieved an F1 score of 0.18 for exact code match. After fine-tuning on the synthetic corpus, exact-match F1 exceeded 0.70, representing a large absolute gain across both code systems. Notably, performance remained high on complex categories that often require multi-step clinical reasoning and code composition, including Advanced Illness and Frailty classes, and the model retained its performance on medical comprehension tasks. These results indicate that synthetic, policy-aware data can efficiently teach a general-purpose large language model to support precise medical coding without exposing protected health information. The approach offers a practical path for training coding agents safely and iteratively on specific tasks that represent real-world populations.
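For readers unfamiliar with exact-match F1 on code assignment, the sketch below computes a micro-averaged F1 over per-note sets of codes, where a prediction counts only if the code string matches exactly. The abstract does not say whether the reported scores are micro- or macro-averaged, so the averaging choice is an assumption, and the two notes and their codes are made-up illustrative data.

```python
def micro_f1(gold, predicted):
    """Micro-averaged exact-match F1 over parallel lists of code sets."""
    tp = fp = fn = 0
    for gold_codes, pred_codes in zip(gold, predicted):
        gold_codes, pred_codes = set(gold_codes), set(pred_codes)
        tp += len(gold_codes & pred_codes)  # codes predicted exactly right
        fp += len(pred_codes - gold_codes)  # predicted but not in the gold set
        fn += len(gold_codes - pred_codes)  # gold codes the model missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Illustrative data only: two notes with ICD-10-CM and CPT codes.
gold = [{"E11.9", "I10"}, {"99213"}]
predicted = [{"E11.9"}, {"99213", "99214"}]
score = micro_f1(gold, predicted)
print(round(score, 3))
```

Here two of three predicted codes are correct and two of three gold codes are recovered, so precision, recall, and F1 all equal 2/3.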

Title: MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens

Authors: Yu Chen, Runkai Chen, Sheng Yi, Xinda Zhao, Xiaohong Li, Jianjin Zhang, Jun Sun, Chuanrui Hu, Yunyun Han, Lidong Bing, Yafeng Deng, Tianqiao Chen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.23516
Pdf URL: https://arxiv.org/pdf/2603.23516
Copy Paste: [[2603.23516]] MSA: Memory Sparse Attention for Efficient End-to-End Memory Model Scaling to 100M Tokens(https://arxiv.org/abs/2603.23516)
Keywords: language model, llm, agent
Abstract: Long-term memory is a cornerstone of human intelligence. Enabling AI to process lifetime-scale information remains a long-standing pursuit in the field. Due to the constraints of full-attention architectures, the effective context length of large language models (LLMs) is typically limited to 1M tokens. Existing approaches, such as hybrid linear attention, fixed-size memory states (e.g., RNNs), and external storage methods like RAG or agent systems, attempt to extend this limit. However, they often suffer from severe precision degradation and rapidly increasing latency as context length grows, an inability to dynamically modify memory content, or a lack of end-to-end optimization. These bottlenecks impede complex scenarios like large-corpus summarization, Digital Twins, and long-history agent reasoning, while limiting memory capacity and slowing inference. We present Memory Sparse Attention (MSA), an end-to-end trainable, efficient, and massively scalable memory model framework. Through core innovations including scalable sparse attention and document-wise RoPE, MSA achieves linear complexity in both training and inference while maintaining exceptional stability, exhibiting less than 9% degradation when scaling from 16K to 100M tokens. Furthermore, KV cache compression, combined with Memory Parallel, enables 100M-token inference on 2xA800 GPUs. We also propose Memory Interleaving to facilitate complex multi-hop reasoning across scattered memory segments. MSA significantly surpasses frontier LLMs, state-of-the-art RAG systems, and leading memory agents in long-context benchmarks. These results demonstrate that by decoupling memory capacity from reasoning, MSA provides a scalable foundation to endow general-purpose models with intrinsic, lifetime-scale memory.

Title: Cluster-R1: Large Reasoning Models Are Instruction-following Clustering Agents

Authors: Peijun Qing, Puneet Mathur, Nedim Lipka, Varun Manjunatha, Ryan Rossi, Franck Dernoncourt, Saeed Hassanpour, Soroush Vosoughi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23518
Pdf URL: https://arxiv.org/pdf/2603.23518
Copy Paste: [[2603.23518]] Cluster-R1: Large Reasoning Models Are Instruction-following Clustering Agents(https://arxiv.org/abs/2603.23518)
Keywords: agent
Abstract: General-purpose embedding models excel at recognizing semantic similarities but fail to capture the characteristics of texts specified by user instructions. In contrast, instruction-tuned embedders can align embeddings with textual instructions yet cannot autonomously infer latent corpus structures, such as determining the optimal number of clusters. To address both limitations, we reframe instruction-following clustering as a generative task and train large reasoning models (LRMs) as autonomous clustering agents. Our reasoning-driven training pipeline enables LRMs to interpret high-level clustering instructions and then infer the corresponding latent groupings. To evaluate this paradigm, we introduce ReasonCluster, a comprehensive benchmark comprising 28 diverse tasks spanning daily dialogue, legal cases, and financial reports. Experiments across diverse datasets and clustering scenarios show that our approach consistently outperforms strong embedding-based methods and LRM baselines, demonstrating that explicit reasoning fosters more faithful and interpretable instruction-based clustering.

Title: MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?

Authors: Lin Yang, Yuancheng Yang, Xu Wang, Changkun Liu, Haihua Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23519
Pdf URL: https://arxiv.org/pdf/2603.23519
Copy Paste: [[2603.23519]] MedMT-Bench: Can LLMs Memorize and Understand Long Multi-Turn Conversations in Medical Scenarios?(https://arxiv.org/abs/2603.23519)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, existing medical benchmarks rarely stress-test the long-context memory, interference robustness, and safety defense required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction-following benchmark that simulates the entire diagnosis and treatment process. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, yielding 400 test cases that are highly consistent with real-world application scenarios. Each test case has an average of 22 rounds (maximum of 52 rounds), covering 5 types of difficult instruction-following issues. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and atomic test points, validated against expert annotations with a human-LLM agreement of 91.94%. We test 17 frontier models, all of which underperform on MedMT-Bench (overall accuracy below 60.00%), with the best model reaching 59.75%. MedMT-Bench can be an essential tool for driving future research towards safer and more reliable medical AI. The benchmark is available at this https URL

Title: From Physician Expertise to Clinical Agents: Preserving, Standardizing, and Scaling Physicians' Medical Expertise with Lightweight LLM

Authors: Chanyong Luo, Jirui Dai, Zhendong Wang, Kui Chen, Jiaxi Yang, Bingjie Lu, Jing Wang, Jiaxin Hao, Bing Li, Ruiyang He, Yiyu Qiao, Chenkai Zhang, Kaiyu Wang, Zhi Liu, Zeyu Zheng, Yan Li, Xiaohong Gu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23520
Pdf URL: https://arxiv.org/pdf/2603.23520
Copy Paste: [[2603.23520]] From Physician Expertise to Clinical Agents: Preserving, Standardizing, and Scaling Physicians' Medical Expertise with Lightweight LLM(https://arxiv.org/abs/2603.23520)
Keywords: language model, gpt, llm, agent
Abstract: Medicine is an empirical discipline refined through long-term observation and the messy, high-variance reality of clinical practice. Physicians build diagnostic and therapeutic competence through repeated cycles of application, reflection, and improvement, forming individualized methodologies. Yet outcomes vary widely, and master physicians' knowledge systems are slow to develop and hard to transmit at scale, contributing to the scarcity of high-quality clinical expertise. To address this, we propose Med-Shicheng, a general framework that enables large language models to systematically learn and transfer distinguished physicians' diagnostic-and-therapeutic philosophy and case-dependent adaptation rules in a standardized way. Built on Tianyi, Med-Shicheng consists of five stages. We target five National Masters of Chinese Medicine or distinguished TCM physicians, curate multi-source materials, and train a single model to internalize all five knowledge systems across seven tasks, including etiology-pathogenesis analysis, syndrome diagnosis, treatment principle selection, prescription generation, prescription explanation, symptom evolution with regimen adjustment, and clinical advice. Implemented on Qwen2.5-1.5B-Base, Med-Shicheng runs on resource-constrained GPUs while achieving performance comparable to DeepSeek-R1 and GPT-5. We also examine the reliability of LLM-as-a-judge versus physician evaluation: automated judging tracks overall trends but shows bias on fine-grained individualized distinctions, highlighting the need for physician involvement when ground truth is unavailable and for domain-adapted judge models.

Title: Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages

Authors: Shaharukh Khan, Ali Faraz, Abhinav Ravi, Mohd Nauman, Mohd Sarfraz, Akshat Patidar, Raja Kolla, Chandra Khatri, Shubham Agarwal
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.23521
Pdf URL: https://arxiv.org/pdf/2603.23521
Copy Paste: [[2603.23521]] Chitrakshara: A Large Multilingual Multimodal Dataset for Indian languages(https://arxiv.org/abs/2603.23521)
Keywords: language model
Abstract: Multimodal research has predominantly focused on single-image reasoning, with limited exploration of multi-image scenarios. Recent models have sought to enhance multi-image understanding through large-scale pretraining on interleaved image-text datasets. However, most Vision-Language Models (VLMs) are trained primarily on English datasets, leading to inadequate representation of Indian languages. To address this gap, we introduce the Chitrakshara dataset series, covering 11 Indian languages sourced from Common Crawl. It comprises (1) Chitrakshara-IL, a large-scale interleaved pretraining dataset with 193M images, 30B text tokens, and 50M multilingual documents, and (2) Chitrakshara-Cap, which includes 44M image-text pairs with 733M tokens. This paper details the data collection pipeline, including curation, filtering, and processing methodologies. Additionally, we present a comprehensive quality and diversity analysis to assess the dataset's representativeness across Indic languages and its potential for developing more culturally inclusive VLMs.

Title: Qworld: Question-Specific Evaluation Criteria for LLMs

Authors: Shanghua Gao, Yuchang Su, Pengwei Sui, Curtis Ginder, Marinka Zitnik
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23522
Pdf URL: https://arxiv.org/pdf/2603.23522
Copy Paste: [[2603.23522]] Qworld: Question-Specific Evaluation Criteria for LLMs(https://arxiv.org/abs/2603.23522)
Keywords: language model, llm
Abstract: Evaluating large language models (LLMs) on open-ended questions is difficult because response quality depends on the question's context. Binary scores and static rubrics fail to capture these context-dependent requirements. Existing methods define criteria at the dataset level or generate them in a single pass, which limits their ability to explore the evaluation space implied by each question. We introduce One-Question-One-World (Qworld), a method that generates question-specific evaluation criteria using a recursive expansion tree. Given a question, Qworld decomposes it into scenarios, perspectives, and fine-grained binary criteria through structured hierarchical and horizontal expansion. The resulting criteria specify what a high-quality answer must address for that question. On HealthBench, Qworld covers 89% of expert-authored criteria and generates 79% novel criteria validated by human experts. Experts rate Qworld criteria higher in insight and granularity than those produced by prior methods. When applied to 11 frontier LLMs on HealthBench and Humanity's Last Exam, Qworld reveals capability differences in dimensions such as long-term impact, equity, error handling, and interdisciplinary reasoning that coarse rubrics do not distinguish. By formulating criteria generation as structured coverage of question-implied evaluation axes, Qworld enables evaluation that adapts to each question rather than relying on fixed task-level criteria.

Title: Do 3D Large Language Models Really Understand 3D Spatial Relationships?

Authors: Xianzheng Ma, Tao Sun, Shuai Chen, Yash Bhalgat, Jindong Gu, Angel X Chang, Iro Armeni, Iro Laina, Songyou Peng, Victor Adrian Prisacariu
Subjects: cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2603.23523
Pdf URL: https://arxiv.org/pdf/2603.23523
Copy Paste: [[2603.23523]] Do 3D Large Language Models Really Understand 3D Spatial Relationships?(https://arxiv.org/abs/2603.23523)
Keywords: language model, llm
Abstract: Recent 3D Large-Language Models (3D-LLMs) claim to understand 3D worlds, especially spatial relationships among objects. Yet, we find that simply fine-tuning a language model on text-only question-answer pairs can match or even surpass these methods on the SQA3D benchmark without using any 3D input. This indicates that the SQA3D benchmark may not be able to detect whether a model exploits textual shortcuts rather than engaging in 3D-aware reasoning. To address this issue, we introduce Real-3DQA, a more rigorous evaluation benchmark that filters out easy-to-guess questions and introduces a structured taxonomy to assess various aspects of 3D reasoning. Experiments on Real-3DQA confirm that existing 3D-LLMs struggle with spatial relationships once simple cues are removed. We further propose a 3D-reweighted training objective that guides the model to rely more on 3D visual clues, substantially enhancing 3D-LLM performance in spatial reasoning tasks. Our findings underscore the need for robust benchmarks and tailored training strategies to advance genuine 3D vision-language understanding. Project page: this https URL.

Title: Navigating the Concept Space of Language Models

Authors: Wilson E. Marcílio-Jr, Danilo M. Eler
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23524
Pdf URL: https://arxiv.org/pdf/2603.23524
Copy Paste: [[2603.23524]] Navigating the Concept Space of Language Models(https://arxiv.org/abs/2603.23524)
Keywords: language model, llm
Abstract: Sparse autoencoders (SAEs) trained on large language model activations output thousands of features that enable mapping to human-interpretable concepts. The current practice for analyzing these features primarily relies on inspecting top-activating examples, manually browsing individual features, or performing semantic search on concepts of interest, which makes exploratory discovery of concepts difficult at scale. In this paper, we present Concept Explorer, a scalable interactive system for post-hoc exploration of SAE features that organizes concept explanations using hierarchical neighborhood embeddings. Our approach constructs a multi-resolution manifold over SAE feature embeddings and enables progressive navigation from coarse concept clusters to fine-grained neighborhoods, supporting discovery, comparison, and relationship analysis among concepts. We demonstrate the utility of Concept Explorer on SAE features extracted from SmolLM2, where it reveals coherent high-level structure, meaningful subclusters, and distinctive rare concepts that are hard to identify with existing workflows.

Title: Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial

Authors: Warren Johnson, Charles Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.23525
Pdf URL: https://arxiv.org/pdf/2603.23525
Copy Paste: [[2603.23525]] Prompt Compression in Production Task Orchestration: A Pre-Registered Randomized Trial(https://arxiv.org/abs/2603.23525)
Keywords: prompt, agent
Abstract: The economics of prompt compression depend not only on reducing input tokens but on how compression changes output length, which is typically priced several times higher. We evaluate this in a pre-registered six-arm randomized controlled trial of prompt compression on production multi-agent task-orchestration, analyzing 358 successful Claude Sonnet 4.5 runs (59-61 per arm) drawn from a randomized corpus of 1,199 real orchestration instructions. We compare an uncompressed control with three uniform retention rates (r=0.8, 0.5, 0.2) and two structure-aware strategies (entropy-adaptive and recency-weighted), measuring total inference cost (input+output) and embedding-based response similarity. Moderate compression (r=0.5) reduced mean total cost by 27.9%, while aggressive compression (r=0.2) increased mean cost by 1.8% despite substantial input reduction, consistent with small mean output expansion (1.03x vs. control) and heavy-tailed uncertainty. Recency-weighted compression achieved 23.5% savings and, together with moderate compression, occupied the empirical cost-similarity Pareto frontier, whereas aggressive compression was dominated on both cost and similarity. These results show that "compress more" is not a reliable production heuristic and that output tokens must be treated as a first-class outcome when designing compression policies.
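
Note: the cost argument in this abstract comes down to simple arithmetic, sketched below. The prices and token counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the function names are assumptions. The sketch shows how a large input reduction can be cancelled by output growth when output tokens are priced higher.

```python
def total_cost(input_tokens, output_tokens, price_in, price_out):
    """Total inference cost with input and output tokens priced separately."""
    return input_tokens * price_in + output_tokens * price_out


def break_even_extra_output(saved_input_tokens, price_in, price_out):
    """Extra output tokens at which the input savings are fully cancelled."""
    return saved_input_tokens * price_in / price_out


# Hypothetical prices in micro-dollars per token (output priced 5x input).
PRICE_IN = 3
PRICE_OUT = 15

# Hypothetical uncompressed run: 2000 input tokens, 1000 output tokens.
control = total_cost(2000, 1000, PRICE_IN, PRICE_OUT)

# Hypothetical aggressive compression (20% of the input retained),
# with the response growing from 1000 to 1400 tokens.
compressed = total_cost(400, 1400, PRICE_IN, PRICE_OUT)

relative_change = compressed / control - 1
print(f"control={control}, compressed={compressed}, change={relative_change:+.1%}")
print("break-even extra output tokens:", break_even_extra_output(1600, PRICE_IN, PRICE_OUT))
```

Under these assumed prices, removing 1600 input tokens pays for only 320 extra output tokens, so a handful of long responses in a heavy-tailed output distribution can erase the savings on average.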

Title: Plato's Cave: A Human-Centered Research Verification System

Authors: Matheus Kunzler Maldaner, Raul Valle, Junsung Kim, Tonuka Sultan, Pranav Bhargava, Matthew Maloni, John Courtney, Hoang Nguyen, Aamogh Sawant, Kristian O'Connor, Stephen Wormald, Damon L. Woodard
Subjects: cs.CL, cs.HC, cs.MA
Abstract URL: https://arxiv.org/abs/2603.23526
Pdf URL: https://arxiv.org/pdf/2603.23526
Copy Paste: [[2603.23526]] Plato's Cave: A Human-Centered Research Verification System(https://arxiv.org/abs/2603.23526)
Keywords: agent
Abstract: The growing publication rate of research papers has created an urgent need for better ways to fact-check information, assess writing quality, and identify unverifiable claims. We present Plato's Cave as an open-source, human-centered research verification system that (i) creates a directed acyclic graph (DAG) from a document, (ii) leverages web agents to assign credibility scores to nodes and edges from the DAG, and (iii) gives a final score by interpreting and evaluating the paper's argumentative structure. We report the system implementation and results on a collected dataset of 104 research papers.

Title: Compression Method Matters: Benchmark-Dependent Output Dynamics in LLM Prompt Compression

Authors: Warren Johnson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.23527
Pdf URL: https://arxiv.org/pdf/2603.23527
Copy Paste: [[2603.23527]] Compression Method Matters: Benchmark-Dependent Output Dynamics in LLM Prompt Compression(https://arxiv.org/abs/2603.23527)
Keywords: gpt, llm, prompt
Abstract: Prompt compression is often evaluated by input-token reduction, but its real deployment impact depends on how compression changes output length and total inference cost. We present a controlled replication and extension study of benchmark-dependent output dynamics under aggressive compression, covering 5,400 API calls across three benchmarks and multiple providers. To explain conflicting prior observations, we formalize instruction survival probability (Psi), a structural metric that captures whether task-critical prompt segments remain after truncation. Results show a strong benchmark effect: under r=0.3, DeepSeek exhibits severe output expansion on MBPP (56x, Psi approx 0.15) but substantially lower expansion on HumanEval (5x, Psi approx 0.72), while GPT-4o-mini is comparatively stable across benchmarks. This reconciles the apparent discrepancy between previously reported extreme explosion and lower replication effects by identifying prompt structure, not provider identity alone, as the primary moderator. We introduce the Compression Robustness Index (CRI) for cross-benchmark evaluation and show that single-benchmark assessments can produce misleading conclusions about compression safety and efficiency. To contextualize energy claims, we incorporate companion direct NVML measurements from rented RunPod GPUs and show that token savings can overstate joule savings. These findings motivate benchmark-diverse testing and structure-aware compression policies for reliable, energy-conscious LLM deployment.

Title: The Compression Paradox in LLM Inference: Provider-Dependent Energy Effects of Prompt Compression

Authors: Warren Johnson
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.23528
Pdf URL: https://arxiv.org/pdf/2603.23528
Copy Paste: [[2603.23528]] The Compression Paradox in LLM Inference: Provider-Dependent Energy Effects of Prompt Compression(https://arxiv.org/abs/2603.23528)
Keywords: language model, gpt, llm, prompt, chat
Abstract: The rapid proliferation of Large Language Models has created an environmental paradox: the very technology that could help solve climate challenges is itself becoming a significant contributor to global carbon emissions. We test whether prompt compression improves inference energy efficiency in 28,421 successful API trials (28,428 planned) across three providers (OpenAI GPT-4o-mini, Anthropic Claude-3.5-Sonnet, and DeepSeek-Chat), five benchmarks (HumanEval, MBPP, GSM8K, MATH, MMLU), and four compression ratios (r in {1.0, 0.7, 0.5, 0.3}). Energy is estimated with a token-based proxy calibrated against local direct measurements, and quality is tracked with benchmark pass rates. Compression produced substantial quality loss (overall pass rate 26.0% at baseline vs. 1.5% at r=0.7) and strongly provider-dependent energy behavior. DeepSeek exhibited output expansion under compression (21 to 798 tokens at r=0.3), corresponding to energy increases up to +2,140%, while GPT-4o-mini showed mixed effects including a reduction at r=0.5. These results indicate that input-token reduction alone is not a reliable energy optimization strategy in production inference. For the evaluated settings, model selection and output-length control provided more consistent energy-quality tradeoffs than prompt compression.

Title: Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language

Authors: Reuben Chagas Fernandes, Gaurang S. Patkar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23529
Pdf URL: https://arxiv.org/pdf/2603.23529
Copy Paste: [[2603.23529]] Konkani LLM: Multi-Script Instruction Tuning and Evaluation for a Low-Resource Indian Language(https://arxiv.org/abs/2603.23529)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) consistently underperform in low-resource linguistic contexts such as Konkani. This performance deficit stems from acute training data scarcity compounded by high script diversity across Devanagari, Romi and Kannada orthographies. To address this gap, we introduce Konkani-Instruct-100k, a comprehensive synthetic instruction-tuning dataset generated through Gemini 3. We establish rigorous baseline benchmarks by evaluating leading open-weights architectures including Llama 3.1, Qwen2.5 and Gemma 3 alongside proprietary closed-source models. Our primary contribution involves the development of Konkani LLM, a series of fine-tuned models optimized for regional nuances. Furthermore, we are developing the Multi-Script Konkani Benchmark to facilitate cross-script linguistic evaluation. In machine translation, Konkani LLM delivers consistent gains over the corresponding base models and is competitive with, and in several settings surpasses, proprietary baselines.

Title: Did You Forget What I Asked? Prospective Memory Failures in Large Language Models

Authors: Avni Mittal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.23530
Pdf URL: https://arxiv.org/pdf/2603.23530
Copy Paste: [[2603.23530]] Did You Forget What I Asked? Prospective Memory Failures in Large Language Models(https://arxiv.org/abs/2603.23530)
Keywords: language model, llm, prompt
Abstract: Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behaviour through a prospective-memory-inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2-21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90-100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model's GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers without an LLM-as-judge component on publicly available datasets.
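
Note: the abstract mentions deterministic programmatic checkers for terminal and avoidance constraints. A minimal sketch of what such checkers could look like follows. The specific constraints (ending with a marker string, avoiding a given word) and all function names are assumptions for illustration, not the paper's actual checkers.

```python
import re


def check_terminal(response: str, marker: str) -> bool:
    """Terminal constraint: the response must end with `marker`
    (trailing whitespace is ignored)."""
    return response.rstrip().endswith(marker)


def check_avoidance(response: str, banned_word: str) -> bool:
    """Avoidance constraint: `banned_word` must not occur as a whole word
    (case-insensitive)."""
    pattern = r"\b" + re.escape(banned_word) + r"\b"
    return re.search(pattern, response, flags=re.IGNORECASE) is None


def joint_compliance(response: str, checks) -> bool:
    """Stacked constraints: compliant only if every check passes."""
    return all(check(response) for check in checks)


# Hypothetical stacked constraints for one prompt.
checks = [
    lambda r: check_terminal(r, "END"),
    lambda r: check_avoidance(r, "however"),
]

good = "The answer is 42.\nEND"
bad = "However, the answer is 42."
print(joint_compliance(good, checks), joint_compliance(bad, checks))
```

Because each check is a pure function of the response text, results are reproducible without a judge model, and joint compliance can only fall as more checks are stacked.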

Title: Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction

Authors: Özgür Togay, Florian Kunneman, Javier Garcia-Bernardo, Anastasia Giachanou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.23531
Pdf URL: https://arxiv.org/pdf/2603.23531
Copy Paste: [[2603.23531]] Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction(https://arxiv.org/abs/2603.23531)
Keywords: language model, llm, prompt
Abstract: Political polarization emerges from a complex interplay of beliefs about policies, figures, and issues. However, most computational analyses reduce discourse to coarse partisan labels, overlooking how these beliefs interact. This is especially evident in online political conversations, which are often nuanced and cover a wide range of subjects, making it difficult to automatically identify the target of discussion and the opinion expressed toward it. In this study, we investigate whether Large Language Models (LLMs) can address this challenge through Target-Stance Extraction (TSE), a recent natural language processing task that combines target identification and stance detection, enabling more granular analysis of political opinions. For this, we construct a dataset of 1,084 Reddit posts from r/NeutralPolitics, covering 138 distinct political targets, and evaluate a range of proprietary and open-source LLMs using zero-shot, few-shot, and context-augmented prompting strategies. Our results show that the best models perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement. These findings demonstrate that LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis.

Title: Generating Hierarchical JSON Representations of Scientific Sentences Using LLMs

Authors: Satya Sri Rajiteswari Nimmagadda, Ethan Young, Niladri Sengupta, Ananya Jana, Aniruddha Maiti
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23532
Pdf URL: https://arxiv.org/pdf/2603.23532
Copy Paste: [[2603.23532]] Generating Hierarchical JSON Representations of Scientific Sentences Using LLMs(https://arxiv.org/abs/2603.23532)
Keywords: llm
Abstract: This paper investigates whether structured representations can preserve the meaning of scientific sentences. To test this, a lightweight LLM is fine-tuned using a novel structural loss function to generate hierarchical JSON structures from sentences collected from scientific articles. These JSONs are then used by a generative model to reconstruct the original text. Comparing the original and reconstructed sentences using semantic and lexical similarity, we show that hierarchical formats are capable of effectively retaining the information in scientific texts.

Title: MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG

Authors: Bhavik Mangla
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.23533
Pdf URL: https://arxiv.org/pdf/2603.23533
Copy Paste: [[2603.23533]] MDKeyChunker: Single-Call LLM Enrichment with Rolling Keys and Key-Based Restructuring for High-Accuracy RAG(https://arxiv.org/abs/2603.23533)
Keywords: llm
Abstract: RAG pipelines typically rely on fixed-size chunking, which ignores document structure, fragments semantic units across boundaries, and requires multiple LLM calls per chunk for metadata extraction. We present MDKeyChunker, a three-stage pipeline for Markdown documents that (1) performs structure-aware chunking treating headers, code blocks, tables, and lists as atomic units; (2) enriches each chunk via a single LLM call extracting title, summary, keywords, typed entities, hypothetical questions, and a semantic key, while propagating a rolling key dictionary to maintain document-level context; and (3) restructures chunks by merging those sharing the same semantic key via bin-packing, co-locating related content for retrieval. The single-call design extracts all seven metadata fields in one LLM invocation, eliminating the need for separate per-field extraction passes. Rolling key propagation replaces hand-tuned scoring with LLM-native semantic matching. An empirical evaluation on 30 queries over an 18-document Markdown corpus shows Config D (BM25 over structural chunks) achieves Recall@5=1.000 and MRR=0.911, while dense retrieval over the full pipeline (Config C) reaches Recall@5=0.867. MDKeyChunker is implemented in Python with four dependencies and supports any OpenAI-compatible endpoint.
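
Note: the retrieval figures above (Recall@5, MRR) can be read against the standard definitions sketched below. This is a minimal sketch using the common textbook formulas; the abstract does not state the exact variant the paper uses (for example, hit rate versus fraction of relevant chunks), and the chunk ids and function names are hypothetical.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant items that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    relevant = set(relevant_ids)
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(queries, k=5):
    """Average Recall@k and MRR over (ranked_ids, relevant_ids) pairs."""
    n = len(queries)
    mean_recall = sum(recall_at_k(r, rel, k) for r, rel in queries) / n
    mrr = sum(reciprocal_rank(r, rel) for r, rel in queries) / n
    return mean_recall, mrr


# Two hypothetical queries with made-up chunk ids.
queries = [
    (["c1", "c2", "c3"], ["c1"]),  # relevant chunk ranked first
    (["c4", "c5", "c6"], ["c5"]),  # relevant chunk ranked second
]
print(evaluate(queries, k=5))
```

With these definitions, Recall@5 = 1.000 alongside MRR below 1 means every query's relevant chunk reached the top five but was not always ranked first.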

Title: Revisiting Real-Time Digging-In Effects: No Evidence from NP/Z Garden-Paths

Authors: Amani Maina-Kilaas, Roger Levy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.23624
Pdf URL: https://arxiv.org/pdf/2603.23624
Copy Paste: [[2603.23624]] Revisiting Real-Time Digging-In Effects: No Evidence from NP/Z Garden-Paths(https://arxiv.org/abs/2603.23624)
Keywords: language model
Abstract: Digging-in effects, where disambiguation difficulty increases with longer ambiguous regions, have been cited as evidence for self-organized sentence processing, in which structural commitments strengthen over time. In contrast, surprisal theory predicts no such effect unless lengthening genuinely shifts statistical expectations, and neural language models appear to show the opposite pattern. Whether digging-in is a robust real-time phenomenon in human sentence processing -- or an artifact of wrap-up processes or methodological confounds -- remains unclear. We report two experiments on English NP/Z garden-path sentences using Maze and self-paced reading, comparing human behavior with predictions from an ensemble of large language models. We find no evidence for real-time digging-in effects. Critically, items with sentence-final versus nonfinal disambiguation show qualitatively different patterns: positive digging-in trends appear only sentence-finally, where wrap-up effects confound interpretation. Nonfinal items -- the cleaner test of real-time processing -- show reverse trends consistent with neural model predictions.

Title: Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks

Authors: Fatih Uenal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23646
Pdf URL: https://arxiv.org/pdf/2603.23646
Copy Paste: [[2603.23646]] Swiss-Bench SBP-002: A Frontier Model Comparison on Swiss Legal and Regulatory Tasks(https://arxiv.org/abs/2603.23646)
Keywords: language model, gpt, llm, hallucination
Abstract: While recent work has benchmarked large language models on Swiss legal translation (Niklaus et al., 2025) and academic legal reasoning from university exams (Fan et al., 2025), no existing benchmark evaluates frontier model performance on applied Swiss regulatory compliance tasks. I introduce Swiss-Bench SBP-002, a trilingual benchmark of 395 expert-crafted items spanning three Swiss regulatory domains (FINMA, Legal-CH, EFK), seven task types, and three languages (German, French, Italian). I evaluate ten frontier models from March 2026 using a structured three-dimension scoring framework, assessed via a blind three-judge LLM panel (GPT-4o, Claude Sonnet 4, Qwen3-235B) with majority-vote aggregation (weighted kappa = 0.605). Reference answers were validated by an independent human legal expert on a 100-item subset (73% rated Correct, 0% Incorrect, perfect Legal Accuracy). Results reveal three descriptive performance clusters: Tier A (35-38% correct), Tier B (26-29%), and Tier C (13-21%). The benchmark proves difficult: even the top-ranked model (Qwen 3.5 Plus) achieves only 38.2% correct, with 47.3% incorrect and 14.4% partially correct. Task type difficulty varies widely: legal translation and case analysis yield 69-72% correct rates, while regulatory Q&A, hallucination detection, and gap analysis remain below 9%. Within this roster (seven open-weight, three closed-source), an open-weight model leads the ranking, and several open-weight models match or outperform their closed-source counterparts. These findings provide an initial empirical reference point for assessing frontier model capability on Swiss regulatory tasks under zero-retrieval conditions.

Title: Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges

Authors: Weilun Xu, Alexander Rusnak, Frederic Kaplan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23659
Pdf URL: https://arxiv.org/pdf/2603.23659
Copy Paste: [[2603.23659]] Probing Ethical Framework Representations in Large Language Models: Structure, Entanglement, and Methodological Challenges(https://arxiv.org/abs/2603.23659)
Keywords: language model, llm
Abstract: When large language models make ethical judgments, do their internal representations distinguish between normative frameworks, or collapse ethics into a single acceptability dimension? We probe hidden representations across five ethical frameworks (deontology, utilitarianism, virtue, justice, commonsense) in six LLMs spanning 4B-72B parameters. Our analysis reveals differentiated ethical subspaces with asymmetric transfer patterns: for example, deontology probes partially generalize to virtue scenarios while commonsense probes fail catastrophically on justice. Disagreement between deontological and utilitarian probes correlates with higher behavioral entropy across architectures, though this relationship may partly reflect shared sensitivity to scenario difficulty. Post-hoc validation reveals that probes partially depend on surface features of benchmark templates, motivating cautious interpretation. We discuss both the structural insights these methods provide and their epistemological limitations.

Title: PLACID: Privacy-preserving Large language models for Acronym Clinical Inference and Disambiguation

Authors: Manjushree B. Aithal, Ph.D., Alexander Kotz, James Mitchell, Ph.D
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23678
Pdf URL: https://arxiv.org/pdf/2603.23678
Copy Paste: [[2603.23678]] PLACID: Privacy-preserving Large language models for Acronym Clinical Inference and Disambiguation(https://arxiv.org/abs/2603.23678)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) offer transformative solutions across many domains, but healthcare integration is hindered by strict data privacy constraints. Clinical narratives are dense with ambiguous acronyms, and misinterpreting these abbreviations can precipitate severe outcomes such as life-threatening medication errors. While cloud-dependent LLMs excel at Acronym Disambiguation, transmitting Protected Health Information to external servers violates privacy frameworks. To bridge this gap, this study pioneers the evaluation of small-parameter models deployed entirely on-device to ensure privacy preservation. We introduce a privacy-preserving cascaded pipeline leveraging general-purpose local models to detect clinical acronyms, routing them to domain-specific biomedical models for context-relevant expansions. Results reveal that while general instruction-following models achieve high detection accuracy (~0.988), their expansion capabilities plummet (~0.655). Our cascaded approach utilizes domain-specific medical models to increase expansion accuracy to ~0.81. This novel work demonstrates that privacy-preserving, on-device (2B-10B) models deliver high-fidelity clinical acronym disambiguation support.

Title: The Diminishing Returns of Early-Exit Decoding in Modern LLMs

Authors: Rui Wei, Rui Du, Hanfei Yu, Devesh Tiwari, Jian Li, Zhaozhuo Xu, Hao Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23701
Pdf URL: https://arxiv.org/pdf/2603.23701
Copy Paste: [[2603.23701]] The Diminishing Returns of Early-Exit Decoding in Modern LLMs(https://arxiv.org/abs/2603.23701)
Keywords: language model, llm
Abstract: In Large Language Model (LLM) inference, early-exit refers to stopping computation at an intermediate layer once the prediction is sufficiently confident, thereby reducing latency and cost. However, recent LLMs adopt improved pretraining recipes and architectures that reduce layer redundancy, potentially limiting early-exit opportunities. We re-evaluate layer-wise early-exit in modern LLMs and analyze how intermediate representations evolve during training. We introduce a metric to quantify a model's intrinsic suitability for early-exit and propose a benchmark for researchers to explore the potential early-exit benefits on different models and workloads. Our results show a diminishing trend in early-exit effectiveness across newer model generations. We further find that dense transformers generally offer greater early-exit potential than Mixture-of-Experts and State Space Models. In addition, larger models, particularly those with more than 20 billion parameters, and base pretrained models without specialized tuning tend to exhibit higher early-exit potential.
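The confidence-based early-exit rule described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's benchmark or metric: it assumes each layer's hidden state has already been projected to vocabulary logits, and the function name `early_exit`, the threshold value, and the toy logits are all hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit(layer_logits, threshold=0.9):
    """Return (layer_index, token_id) for the first layer whose top
    probability reaches the threshold; otherwise use the final layer.

    Assumes layer_logits is a non-empty list of per-layer logit lists.
    """
    top = 0
    for i, logits in enumerate(layer_logits):
        probs = softmax(logits)
        top = max(range(len(probs)), key=probs.__getitem__)
        if probs[top] >= threshold:
            return i, top  # confident enough: skip the remaining layers
    return len(layer_logits) - 1, top  # no early exit: full depth used

# Toy logits for a 3-token vocabulary at three successive layers.
layers = [[0.1, 0.2, 0.0], [2.0, 0.5, 0.1], [6.0, 0.5, 0.1]]
print(early_exit(layers, threshold=0.9))
print(early_exit(layers, threshold=0.7))
```

A lower threshold exits earlier and saves more computation, at the risk of committing to a token that the deeper layers would have revised.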

Title: IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge

Authors: Ali Abdelaal, Mohammed Nader Al Haffar, Mahmoud Fawzi, Walid Magdy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.23750
Pdf URL: https://arxiv.org/pdf/2603.23750
Copy Paste: [[2603.23750]] IslamicMMLU: A Benchmark for Evaluating LLMs on Islamic Knowledge(https://arxiv.org/abs/2603.23750)
Keywords: language model, llm
Abstract: Large language models are increasingly consulted for Islamic knowledge, yet no comprehensive benchmark evaluates their performance across core Islamic disciplines. We introduce IslamicMMLU, a benchmark of 10,013 multiple-choice questions spanning three tracks: Quran (2,013 questions), Hadith (4,000 questions), and Fiqh (jurisprudence, 4,000 questions). Each track comprises multiple question types that examine LLMs' capabilities in handling different aspects of Islamic knowledge. The benchmark is used to create the IslamicMMLU public leaderboard for evaluating LLMs, and we initially evaluate 26 LLMs, whose averaged accuracy across the three tracks varies between 39.8% and 93.8% (the latter by Gemini 3 Flash). The Quran track shows the widest span (99.3% to 32.4%), while the Fiqh track includes a novel madhab (Islamic school of jurisprudence) bias detection task revealing variable school-of-thought preferences across models. Arabic-specific models show mixed results, but they all underperform compared to frontier models. The evaluation code and leaderboard are made publicly available.

Title: Perturbation: A simple and efficient adversarial tracer for representation learning in language models

Authors: Joshua Rozner, Cory Shain
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.23821
Pdf URL: https://arxiv.org/pdf/2603.23821
Copy Paste: [[2603.23821]] Perturbation: A simple and efficient adversarial tracer for representation learning in language models(https://arxiv.org/abs/2603.23821)
Keywords: language model
Abstract: Linguistic representation learning in deep neural language models (LMs) has been studied for decades, for both practical and theoretical reasons. However, finding representations in LMs remains an unsolved problem, in part due to a dilemma between enforcing implausible constraints on representations (e.g., linearity; Arora et al., 2024) and trivializing the notion of representation altogether (Sutter et al., 2025). Here we escape this dilemma by reconceptualizing representations not as patterns of activation but as conduits for learning. Our approach is simple: we perturb an LM by fine-tuning it on a single adversarial example and measure how this perturbation "infects" other examples. Perturbation makes no geometric assumptions, and unlike other methods, it does not find representations where it should not (e.g., in untrained LMs). But in trained LMs, perturbation reveals structured transfer at multiple linguistic grain sizes, suggesting that LMs both generalize along representational lines and acquire linguistic abstractions from experience alone.

Title: PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay

Authors: Rohan Khetan, Ashna Khetan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.23841
Pdf URL: https://arxiv.org/pdf/2603.23841
Copy Paste: [[2603.23841]] PoliticsBench: Benchmarking Political Values in Large Language Models with Multi-Turn Roleplay(https://arxiv.org/abs/2603.23841)
Keywords: language model, gpt, llm, chat
Abstract: While Large Language Models (LLMs) are increasingly used as primary sources of information, their potential for political bias may impact their objectivity. Existing benchmarks of LLM social bias primarily evaluate gender and racial stereotypes. When political bias is included, it is typically measured at a coarse level, neglecting the specific values that shape sociopolitical leanings. This study investigates political bias in eight prominent LLMs (Claude, Deepseek, Gemini, GPT, Grok, Llama, Qwen Base, Qwen Instruction-Tuned) using PoliticsBench: a novel multi-turn roleplay framework adapted from the EQ-Bench-v3 psychometric benchmark. We test whether commercially developed LLMs display a systematic left-leaning bias that becomes more pronounced in later stages of multi-stage roleplay. Through twenty evolving scenarios, each model reported its stance and determined its course of action. Scoring these responses on a scale of ten political values, we explored the values underlying chatbots' deviations from unbiased standards. Seven of our eight models leaned left, while Grok leaned right. Each left-leaning LLM strongly exhibited liberal traits and moderately exhibited conservative ones. We discovered slight variations in alignment scores across stages of roleplay, with no particular pattern. Though most models used consequence-based reasoning, Grok frequently argued with facts and statistics. Our study presents the first psychometric evaluation of political values in LLMs through multi-stage, free-text interactions.

Title: Language Model Planners do not Scale, but do Formalizers?

Authors: Owen Jiang, Cassie Huang, Ashish Sabharwal, Li Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.23844
Pdf URL: https://arxiv.org/pdf/2603.23844
Copy Paste: [[2603.23844]] Language Model Planners do not Scale, but do Formalizers?(https://arxiv.org/abs/2603.23844)
Keywords: language model, llm
Abstract: Recent work shows overwhelming evidence that LLMs, even those trained to scale their reasoning trace, perform unsatisfactorily when solving planning problems that are too complex. Whether the same conclusion holds for LLM formalizers that generate solver-oriented programs remains unknown. We systematically show that LLM formalizers greatly out-scale LLM planners, some retaining perfect accuracy in the classic BlocksWorld domain with a huge state space of size up to $10^{165}$. While the performance of smaller LLM formalizers degrades with problem complexity, we show that a divide-and-conquer formalizing technique can greatly improve their robustness. Finally, we introduce unraveling problems where one line of problem description realistically corresponds to exponentially many lines of formal language such as the Planning Domain Definition Language (PDDL), greatly challenging LLM formalizers. We tackle this challenge by introducing a new paradigm, namely LLM-as-higher-order-formalizer, where an LLM generates a program generator. This decouples token output from the combinatorial explosion of the underlying formalization and search space.

Title: BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents

Authors: Praveen Kumar Myakala, Manan Agrawal, Rahul Manche
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2603.23848
Pdf URL: https://arxiv.org/pdf/2603.23848
Copy Paste: [[2603.23848]] BeliefShift: Benchmarking Temporal Belief Consistency and Opinion Drift in LLM Agents(https://arxiv.org/abs/2603.23848)
Keywords: gpt, llm, retrieval-augmented generation, agent
Abstract: LLMs are increasingly used as long-running conversational agents, yet every major benchmark evaluating their memory treats user information as static facts to be stored and retrieved. That's the wrong model. People change their minds, and over extended interactions, phenomena like opinion drift, over-alignment, and confirmation bias start to matter a lot. BeliefShift introduces a longitudinal benchmark designed specifically to evaluate belief dynamics in multi-session LLM interactions. It covers three tracks: Temporal Belief Consistency, Contradiction Detection, and Evidence-Driven Revision. The dataset includes 2,400 human-annotated multi-session interaction trajectories spanning health, politics, personal values, and product preferences. We evaluate seven models including GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, LLaMA-3, and Mistral-Large under zero-shot and retrieval-augmented generation (RAG) settings. Results reveal a clear trade-off: models that personalize aggressively resist drift poorly, while factually grounded models miss legitimate belief updates. We further introduce four novel evaluation metrics: Belief Revision Accuracy (BRA), Drift Coherence Score (DCS), Contradiction Resolution Rate (CRR), and Evidence Sensitivity Index (ESI).

Title: Self-Distillation for Multi-Token Prediction

Authors: Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.23911
Pdf URL: https://arxiv.org/pdf/2603.23911
Copy Paste: [[2603.23911]] Self-Distillation for Multi-Token Prediction(https://arxiv.org/abs/2603.23911)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) could accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulties in jointly training multiple MTP heads. Therefore, we propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head acceptance rates (+7.5%) while maximally preserving main-head performance. We also introduce a looped extension strategy for MTP-D, enabling effective and economical MTP head extension and a further significant inference speedup over 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on the distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that our MTP-D and looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical usage of MTP in LLMs.

Title: Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development

Authors: Zongliang Ji, Ziyang Zhang, Xincheng Tan, Matthew Thompson, Anna Goldenberg, Carl Yang, Rahul G. Krishnan, Fan Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.23937
Pdf URL: https://arxiv.org/pdf/2603.23937
Copy Paste: [[2603.23937]] Dialogue to Question Generation for Evidence-based Medical Guideline Agent Development(https://arxiv.org/abs/2603.23937)
Keywords: language model, llm, prompt, agent
Abstract: Evidence-based medicine (EBM) is central to high-quality care, but remains difficult to implement in fast-paced primary care settings. Physicians face short consultations, increasing patient loads, and lengthy guideline documents that are impractical to consult in real time. To address this gap, we investigate the feasibility of using large language models (LLMs) as ambient assistants that surface targeted, evidence-based questions during physician-patient encounters. Our study focuses on question generation rather than question answering, with the aim of scaffolding physician reasoning and integrating guideline-based practice into brief consultations. We implemented two prompting strategies, a zero-shot baseline and a multi-stage reasoning variant, using Gemini 2.5 as the backbone model. We evaluated on a benchmark of 80 de-identified transcripts from real clinical encounters, with six experienced physicians contributing over 90 hours of structured review. Results indicate that while general-purpose LLMs are not yet fully reliable, they can produce clinically meaningful and guideline-relevant questions, suggesting significant potential to reduce cognitive burden and make EBM more actionable at the point of care.

Title: Argument Mining as a Text-to-Text Generation Task

Authors: Masayuki Kawarada, Tsutomu Hirao, Wataru Uchida, Masaaki Nagata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.23949
Pdf URL: https://arxiv.org/pdf/2603.23949
Copy Paste: [[2603.23949]] Argument Mining as a Text-to-Text Generation Task(https://arxiv.org/abs/2603.23949)
Keywords: language model
Abstract: Argument Mining (AM) aims to uncover the argumentative structures within a text. Previous methods require several subtasks, such as span identification, component classification, and relation classification. Consequently, these methods need rule-based postprocessing to derive argumentative structures from the output of each subtask. This approach adds to the complexity of the model and expands the search space of the hyperparameters. To address this difficulty, we propose a simple yet strong method based on a text-to-text generation approach using a pretrained encoder-decoder language model. Our method simultaneously generates argumentatively annotated text for spans, components, and relations, eliminating the need for task-specific postprocessing and hyperparameter tuning. Furthermore, because it is a straightforward text-to-text generation method, we can easily adapt our approach to various types of argumentative structures. Experimental results demonstrate the effectiveness of our method, as it achieves state-of-the-art performance on three different types of benchmark datasets: the Argument-annotated Essays Corpus (AAEC), AbstRCT, and the Cornell eRulemaking Corpus (CDCP).

Title: From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents

Authors: Sirui Xia, Yikai Zhang, Aili Chen, Siye Wu, Siyu Yuan, Yanghua Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.23951
Pdf URL: https://arxiv.org/pdf/2603.23951
Copy Paste: [[2603.23951]] From AI Assistant to AI Scientist: Autonomous Discovery of LLM-RL Algorithms with LLM Agents(https://arxiv.org/abs/2603.23951)
Keywords: language model, llm, agent
Abstract: Discovering improved policy optimization algorithms for language models remains a costly manual process requiring repeated mechanism-level modification and validation. Unlike simple combinatorial code search, this problem requires searching over algorithmic mechanisms tightly coupled with training dynamics while reusing empirical evidence across iterations. We propose POISE, a closed-loop framework for automated discovery of policy optimization algorithms for language models. POISE maintains a structured, genealogically linked archive linking proposals, executable implementations, standardized evaluations, and natural-language reflections to support evidence-driven iteration. In mathematical reasoning experiments starting from GRPO, POISE evaluates 64 candidate algorithms and discovers improved mechanisms, including analytic-variance scaling and validity masking. The best variant improves weighted Overall from 47.8 to 52.5 (+4.6) and increases AIME25 pass@32 from 26.7% to 43.3%, demonstrating the feasibility of automated policy optimization discovery while supporting interpretable design principles.

Title: The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More

Authors: Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, James Zou
Subjects: cs.CL, cs.AI, cs.GT, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2603.23971
Pdf URL: https://arxiv.org/pdf/2603.23971
Copy Paste: [[2603.23971]] The Price Reversal Phenomenon: When Cheaper Reasoning Models End Up Costing More(https://arxiv.org/abs/2603.23971)
Keywords: language model, gpt
Abstract: Developers and consumers increasingly choose reasoning language models (RLMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RLMs across 9 diverse tasks covering competition math, science QA, code generation, and multi-domain reasoning. We uncover the pricing reversal phenomenon: in 21.8% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 78% cheaper than GPT-5.2's, yet its actual cost across all tasks is 22% higher. We trace the root cause to vast heterogeneity in thinking token consumption: on the same query, one model may use 900% more thinking tokens than another. In fact, removing thinking token costs reduces ranking reversals by 70% and raises the rank correlation (Kendall's $\tau$) between price and cost rankings from 0.563 to 0.873. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
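The reversal described in this abstract follows from simple arithmetic, sketched below under stated assumptions. The prices and token counts are invented for illustration and are not taken from the paper or from any provider's price list; the sketch also assumes thinking tokens are billed at the output-token rate, which may not hold for every API.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    input_tokens: int
    output_tokens: int
    thinking_tokens: int

def actual_cost(price_in, price_out, usage):
    """Dollar cost of one query, with prices given per million tokens.

    Assumption: thinking tokens are billed at the output-token rate.
    """
    billed_out = usage.output_tokens + usage.thinking_tokens
    return (price_in * usage.input_tokens + price_out * billed_out) / 1_000_000

# Hypothetical model A: low listed price, heavy thinking-token use.
cost_a = actual_cost(0.5, 3.0, Usage(1_000, 500, 20_000))
# Hypothetical model B: higher listed price, light thinking-token use.
cost_b = actual_cost(2.0, 10.0, Usage(1_000, 500, 2_000))

print(f"A: ${cost_a:.3f}  B: ${cost_b:.3f}  reversal: {cost_a > cost_b}")
```

Here model A is cheaper on both listed rates yet costs more per query, because its thinking-token volume dominates the bill.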

Title: Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

Authors: Somaya Eltanbouly, Samer Rashwani
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2603.23972
Pdf URL: https://arxiv.org/pdf/2603.23972
Copy Paste: [[2603.23972]] Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith(https://arxiv.org/abs/2603.23972)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) have achieved remarkable progress in many language tasks, yet they continue to struggle with complex historical and religious Arabic texts such as the Quran and Hadith. To address this limitation, we develop a retrieval-augmented generation (RAG) framework grounded in diachronic lexicographic knowledge. Unlike prior RAG systems that rely on general-purpose corpora, our approach retrieves evidence from the Doha Historical Dictionary of Arabic (DHDA), a large-scale resource documenting the historical development of Arabic vocabulary. The proposed pipeline combines hybrid retrieval with an intent-based routing mechanism to provide LLMs with precise, contextually relevant historical information. Our experiments show that this approach improves the accuracy of Arabic-native LLMs, including Fanar and ALLaM, to over 85%, substantially reducing the performance gap with Gemini, a proprietary large-scale model. Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments. The automated judgments were verified through human evaluation, demonstrating high agreement (kappa = 0.87). An error analysis further highlights key linguistic challenges, including diacritics and compound expressions. These findings demonstrate the value of integrating diachronic lexicographic resources into retrieval-augmented generation frameworks to enhance Arabic language understanding, particularly for historical and religious texts. The code and resources are publicly available at: this https URL.

Title: CoCR-RAG: Enhancing Retrieval-Augmented Generation in Web Q&A via Concept-oriented Context Reconstruction

Authors: Kaize Shi, Xueyao Sun, Qika Lin, Firoj Alam, Qing Li, Xiaohui Tao, Guandong Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.23989
Pdf URL: https://arxiv.org/pdf/2603.23989
Copy Paste: [[2603.23989]] CoCR-RAG: Enhancing Retrieval-Augmented Generation in Web Q&A via Concept-oriented Context Reconstruction(https://arxiv.org/abs/2603.23989)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has shown promising results in enhancing Q&A by incorporating information from the web and other external sources. However, the supporting documents retrieved from the heterogeneous web often originate from multiple sources with diverse writing styles, varying formats, and inconsistent granularity. Fusing such multi-source documents into a coherent and knowledge-intensive context remains a significant challenge, as the presence of irrelevant and redundant information can compromise the factual consistency of the inferred answers. This paper proposes the Concept-oriented Context Reconstruction RAG (CoCR-RAG), a framework that addresses the multi-source information fusion problem in RAG through linguistically grounded concept-level integration. Specifically, we introduce a concept distillation algorithm that extracts essential concepts from Abstract Meaning Representation (AMR), a stable semantic representation that structures the meaning of texts as logical graphs. The distilled concepts from multiple retrieved documents are then fused and reconstructed into a unified, information-intensive context by Large Language Models, which supplement only the necessary sentence elements to highlight the core knowledge. Experiments on the PopQA and EntityQuestions datasets demonstrate that CoCR-RAG significantly outperforms existing context-reconstruction methods across these Web Q&A benchmarks. Furthermore, CoCR-RAG shows robustness across various backbone LLMs, establishing itself as a flexible, plug-and-play component adaptable to different RAG frameworks.

Title: Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning

Authors: Kun-Yang Yu, Zhi Zhou, Shi-Yu Tian, Xiao-Wen Yang, Zi-Yi Jia, Ming Yang, Zi-Jian Cheng, Lan-Zhe Guo, Yu-Feng Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24004
Pdf URL: https://arxiv.org/pdf/2603.24004
Copy Paste: [[2603.24004]] Thinking with Tables: Enhancing Multi-Modal Tabular Understanding via Neuro-Symbolic Reasoning(https://arxiv.org/abs/2603.24004)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated remarkable reasoning capabilities across modalities such as images and text. However, tabular data, despite being a critical real-world modality, remains relatively underexplored in multimodal learning. In this paper, we focus on the task of Tabular-Vision Multi-Modal Understanding (TVMU) and identify three core challenges: (1) high structural variability and data incompleteness in tables, (2) implicit and complex feature dependencies, and (3) significant heterogeneity in problem-solving pipelines across downstream tasks. To address these issues, we propose Thinking with Tables (TWT). TWT employs a program-aided code-based neuro-symbolic reasoning mechanism that facilitates key operations, such as information extraction and element modeling, by interacting with external environments. We evaluate TWT on eight representative datasets. Experimental results demonstrate that TWT consistently outperforms existing baselines by an average of 10% in accuracy, achieving performance comparable to, or even surpassing, proprietary commercial SOTA LLMs on TVMU tasks. Models and code are available at this https URL.

Title: CVPD at QIAS 2026: RAG-Guided LLM Reasoning for Al-Mawarith Share Computation and Heir Allocation

Authors: Wassim Swaileh, Mohammed-En-Nadhir Zighem, Hichem Telli, Salah Eddine Bekhouche, Abdellah Zakaria Sellam, Fadi Dornaika, Dimitrios Kotzinos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24012
Pdf URL: https://arxiv.org/pdf/2603.24012
Copy Paste: [[2603.24012]] CVPD at QIAS 2026: RAG-Guided LLM Reasoning for Al-Mawarith Share Computation and Heir Allocation(https://arxiv.org/abs/2603.24012)
Keywords: llm, retrieval-augmented generation
Abstract: Islamic inheritance (Ilm al-Mawarith) is a multi-stage legal reasoning task requiring the identification of eligible heirs, resolution of blocking rules (hajb), assignment of fixed and residual shares, handling of adjustments such as awl and radd, and generation of a consistent final distribution. The task is further complicated by variations across legal schools and civil-law codifications, requiring models to operate under explicit legal configurations. We present a retrieval-augmented generation (RAG) pipeline for this setting, combining rule-grounded synthetic data generation, hybrid retrieval (dense and BM25) with cross-encoder reranking, and schema-constrained output validation. A symbolic inheritance calculator is used to generate a large high-quality synthetic corpus with full intermediate reasoning traces, ensuring legal and numerical consistency. The proposed system achieves a MIR-E score of 0.935 and ranks first on the official QIAS 2026 blind-test leaderboard. Results demonstrate that retrieval-grounded, schema-aware generation significantly improves reliability in high-precision Arabic legal reasoning tasks.

Title: Schema on the Inside: A Two-Phase Fine-Tuning Method for High-Efficiency Text-to-SQL at Scale

Authors: Chinmay Soni, Shivam Chourasia, Gaurav Kumar, Hitesh Kapoor
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.24023
Pdf URL: https://arxiv.org/pdf/2603.24023
Copy Paste: [[2603.24023]] Schema on the Inside: A Two-Phase Fine-Tuning Method for High-Efficiency Text-to-SQL at Scale(https://arxiv.org/abs/2603.24023)
Keywords: language model, prompt
Abstract: Applying large, proprietary API-based language models to text-to-SQL tasks poses a significant industry challenge: reliance on massive, schema-heavy prompts results in prohibitive per-token API costs and high latency, hindering scalable production deployment. We present a specialized, self-hosted 8B-parameter model designed for a conversational bot in CriQ, a sister app to Dream11, India's largest fantasy sports platform with over 250 million users, that answers user queries about cricket statistics. Our novel two-phase supervised fine-tuning approach enables the model to internalize the entire database schema, eliminating the need for long-context prompts. This reduces input tokens by over 99%, from a 17k-token baseline to fewer than 100, and replaces costly external API calls with efficient local inference. The resulting system achieves 98.4% execution success and 92.5% semantic accuracy, substantially outperforming a prompt-engineered baseline using Google's Gemini Flash 2.0 (95.6% execution, 89.4% semantic accuracy). These results demonstrate a practical path toward high-precision, low-latency text-to-SQL applications using domain-specialized, self-hosted language models in large-scale production environments.
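The token and accuracy figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below assumes that the "17k-token baseline" means exactly 17,000 tokens and treats 100 as an upper bound on the fine-tuned model's prompt size; both are illustrative readings of the abstract, not values taken from the paper body.

```python
# Illustrative arithmetic only. 17_000 and 100 are assumed readings of
# "17k-token baseline" and "fewer than 100" from the abstract.
BASELINE_PROMPT_TOKENS = 17_000
FINETUNED_PROMPT_TOKENS = 100  # upper bound on the schema-free prompt


def reduction_fraction(before: int, after: int) -> float:
    """Fraction of input tokens removed when going from `before` to `after`."""
    return (before - after) / before


token_reduction = reduction_fraction(BASELINE_PROMPT_TOKENS, FINETUNED_PROMPT_TOKENS)

# Percentage-point gaps between the fine-tuned model and the Gemini baseline,
# rounded to one decimal to hide floating-point noise.
execution_gap = round(98.4 - 95.6, 1)
semantic_gap = round(92.5 - 89.4, 1)

print(f"input-token reduction: {token_reduction:.2%}")
print(f"execution success gap: {execution_gap} points")
print(f"semantic accuracy gap: {semantic_gap} points")
```

Under these assumptions the reduction comes out above 99%, consistent with the abstract's claim. The gaps are expressed in percentage points rather than relative percent.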

Title: From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs

Authors: Xiaoyong Guo, Nanjie Li, Zijie Zeng, Kai Wang, Hao Huang, Haihua Xu, Wei Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.24034
Pdf URL: https://arxiv.org/pdf/2603.24034
Copy Paste: [[2603.24034]] From Oracle to Noisy Context: Mitigating Contextual Exposure Bias in Speech-LLMs(https://arxiv.org/abs/2603.24034)
Keywords: llm
Abstract: Contextual automatic speech recognition (ASR) with Speech-LLMs is typically trained with oracle conversation history, but relies on error-prone history at inference, causing a train-test mismatch in the context channel that we term contextual exposure bias. We propose a unified training framework to improve robustness under realistic histories: (i) Teacher Error Knowledge by using Whisper large-v3 hypotheses as training-time history, (ii) Context Dropout to regularize over-reliance on history, and (iii) Direct Preference Optimization (DPO) on curated failure cases. Experiments on TED-LIUM 3 (in-domain) and zero-shot LibriSpeech (out-of-domain) show consistent gains under predicted-history decoding. With a two-utterance history as context, SFT with Whisper hypotheses reduces WER from 5.59% (oracle-history training) to 5.47%, and DPO further improves to 5.17%. Under irrelevant-context attacks, DPO yields the smallest degradation (5.17% -> 5.63%), indicating improved robustness to misleading context. Our code and models are published on this https URL.
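To put the reported word error rates in proportion, the following sketch converts them into relative changes. It uses only the four WER values quoted in the abstract; the helper name `relative_change` is an illustrative choice, not something from the paper.

```python
# WER values (in percent) as quoted in the abstract.
WER_ORACLE_TRAINING = 5.59
WER_SFT_WHISPER = 5.47
WER_DPO = 5.17
WER_DPO_ATTACKED = 5.63


def relative_change(baseline: float, new: float) -> float:
    """Signed relative change from `baseline` to `new` (negative = lower WER)."""
    return (new - baseline) / baseline


sft_change = relative_change(WER_ORACLE_TRAINING, WER_SFT_WHISPER)
dpo_change = relative_change(WER_ORACLE_TRAINING, WER_DPO)
attack_change = relative_change(WER_DPO, WER_DPO_ATTACKED)

print(f"SFT with Whisper history vs. oracle training: {sft_change:+.1%}")
print(f"DPO vs. oracle training:                      {dpo_change:+.1%}")
print(f"DPO under irrelevant-context attack:          {attack_change:+.1%}")
```

Relative figures are a common way to compare ASR systems, but the abstract itself reports absolute WER, so these derived numbers are a convenience rather than results claimed by the authors.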

Title: FinToolSyn: A forward synthesis Framework for Financial Tool-Use Dialogue Data with Dynamic Tool Retrieval

Authors: Caishuang Huang, Yang Qiao, Rongyu Zhang, Junjie Ye, Pu Lu, Wenxi Wu, Meng Zhou, Xiku Du, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24051
Pdf URL: https://arxiv.org/pdf/2603.24051
Copy Paste: [[2603.24051]] FinToolSyn: A forward synthesis Framework for Financial Tool-Use Dialogue Data with Dynamic Tool Retrieval(https://arxiv.org/abs/2603.24051)
Keywords: language model, llm
Abstract: Tool-use capabilities are vital for Large Language Models (LLMs) in finance, a domain characterized by massive investment targets and data-intensive inquiries. However, existing data synthesis methods typically rely on a reverse synthesis paradigm, generating user queries from pre-sampled tools. This approach inevitably introduces artificial explicitness, yielding queries that fail to capture the implicit, event-driven nature of real-world needs. Moreover, its reliance on static tool sets overlooks the dynamic retrieval process required to navigate massive tool spaces. To address these challenges, we introduce FinToolSyn, a forward synthesis framework designed to generate high-quality financial dialogues. Progressing from persona instruction and atomic tool synthesis to dynamic retrieval dialogue generation, our pipeline constructs a repository of 43,066 tools and synthesizes over 148k dialogue instances, incorporating dynamic retrieval to emulate the noisy candidate sets typical of massive tool spaces. We also establish a dedicated benchmark to evaluate tool-calling capabilities in realistic financial scenarios. Extensive experiments demonstrate that models trained on FinToolSyn achieve a 21.06% improvement, providing a robust foundation for tool learning in financial scenarios.

Title: ConceptKT: A Benchmark for Concept-Level Deficiency Prediction in Knowledge Tracing

Authors: Yu-Chen Kang, Yu-Chien Tang, An-Zi Yen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24073
Pdf URL: https://arxiv.org/pdf/2603.24073
Copy Paste: [[2603.24073]] ConceptKT: A Benchmark for Concept-Level Deficiency Prediction in Knowledge Tracing(https://arxiv.org/abs/2603.24073)
Keywords: language model, llm
Abstract: Knowledge Tracing (KT) is a critical technique for modeling student knowledge to support personalized learning. However, most KT systems focus on binary correctness prediction and cannot diagnose the underlying conceptual misunderstandings that lead to errors. Such fine-grained diagnostic feedback is essential for designing targeted instruction and effective remediation. In this work, we introduce the task of concept-level deficiency prediction, which extends traditional KT by identifying the specific concepts a student is likely to struggle with on future problems. We present ConceptKT, a dataset annotated with labels that capture both the concepts required to solve each question and the missing concepts underlying incorrect responses. We investigate in-context learning approaches to KT and evaluate the diagnostic capabilities of various Large Language Models (LLMs) and Large Reasoning Models (LRMs). Different strategies for selecting informative historical records are explored. Experimental results demonstrate that selecting response histories based on conceptual alignment and semantic similarity leads to improved performance on both correctness prediction and concept-level deficiency identification.

Title: LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale

Authors: Muhammed Saeed, Simon Razniewski
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2603.24080
Pdf URL: https://arxiv.org/pdf/2603.24080
Copy Paste: [[2603.24080]] LLMpedia: A Transparent Framework to Materialize an LLM's Encyclopedic Knowledge at Scale(https://arxiv.org/abs/2603.24080)
Keywords: language model, gpt, llm, prompt
Abstract: Benchmarks such as MMLU suggest flagship language models approach factuality saturation, with scores above 90%. We show this picture is incomplete. LLMpedia generates encyclopedic articles entirely from parametric memory, producing ~1M articles across three model families without retrieval. For gpt-5-mini, the verifiable true rate on Wikipedia-covered subjects is only 74.7%, more than 15 percentage points below the benchmark-based picture, consistent with the availability bias of fixed-question evaluation. Beyond Wikipedia, frontier subjects verifiable only through curated web evidence fall further to 63.2% true rate. Wikipedia covers just 61% of surfaced subjects, and three model families overlap by only 7.3% in subject choice. In a capture-trap benchmark inspired by prior analysis of Grokipedia, LLMpedia achieves substantially higher factuality at roughly half the textual similarity to Wikipedia. Unlike Grokipedia, every prompt, artifact, and evaluation verdict is publicly released, making LLMpedia the first fully open parametric encyclopedia, bridging factuality evaluation and knowledge materialization. All data, code, and a browsable interface are at this https URL.

Title: Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study

Authors: Nour Bouchouchi, Thibault Laugel, Xavier Renard, Christophe Marsala, Marie-Jeanne Lesot, Marcin Detyniecki
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24125
Pdf URL: https://arxiv.org/pdf/2603.24125
Copy Paste: [[2603.24125]] Alignment Reduces Expressed but Not Encoded Gender Bias: A Unified Framework and Study(https://arxiv.org/abs/2603.24125)
Keywords: language model, llm, prompt
Abstract: During training, Large Language Models (LLMs) learn social regularities that can lead to gender bias in downstream applications. Most mitigation efforts focus on reducing bias in generated outputs, typically evaluated on structured benchmarks, which raises two concerns: output-level evaluation does not reveal whether alignment modifies the model's underlying representations, and structured benchmarks may not reflect realistic usage scenarios. We propose a unified framework to jointly analyze intrinsic and extrinsic gender bias in LLMs using identical neutral prompts, enabling direct comparison between gender-related information encoded in internal representations and bias expressed in generated outputs. Contrary to prior work reporting weak or inconsistent correlations, we find a consistent association between latent gender information and expressed bias when measured under the unified protocol. We further examine the effect of alignment through supervised fine-tuning aimed at reducing gender bias. Our results suggest that while the latter indeed reduces expressed bias, measurable gender-related associations are still present in internal representations, and can be reactivated under adversarial prompting. Finally, we consider two realistic settings and show that debiasing effects observed on structured benchmarks do not necessarily generalize, e.g., to the case of story generation.

Title: MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare

Authors: Shubham Kumar Nigam, Suparnojit Sarkar, Piyush Patel
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.24132
Pdf URL: https://arxiv.org/pdf/2603.24132
Copy Paste: [[2603.24132]] MedAidDialog: A Multilingual Multi-Turn Medical Dialogue Dataset for Accessible Healthcare(https://arxiv.org/abs/2603.24132)
Keywords: language model
Abstract: Conversational artificial intelligence has the potential to assist users in preliminary medical consultations, particularly in settings where access to healthcare professionals is limited. However, many existing medical dialogue systems operate in a single-turn question-answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. In this work, we introduce MedAidDialog, a multilingual multi-turn medical dialogue dataset designed to simulate realistic physician-patient consultations. The dataset extends the MDDial corpus by generating synthetic consultations using large language models and further expands them into a parallel multilingual corpus covering seven languages: English, Hindi, Telugu, Tamil, Bengali, Marathi, and Arabic. Building on this dataset, we develop MedAidLM, a conversational medical model trained using parameter-efficient fine-tuning on quantized small language models, enabling deployment without high-end computational infrastructure. Our framework additionally incorporates optional patient pre-context information (e.g., age, gender, allergies) to personalize the consultation process. Experimental results demonstrate that the proposed system can effectively perform symptom elicitation through multi-turn dialogue and generate diagnostic recommendations. We further conduct medical expert evaluation to assess the plausibility and coherence of the generated consultations.

Title: Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition

Authors: Aleix Sant, Jordi Luque, Carlos Escolano
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24242
Pdf URL: https://arxiv.org/pdf/2603.24242
Copy Paste: [[2603.24242]] Optimizing Multilingual LLMs via Federated Learning: A Study of Client Language Composition(https://arxiv.org/abs/2603.24242)
Keywords: language model, llm
Abstract: Federated Learning (FL) of Large Language Models (LLMs) in multilingual environments presents significant challenges stemming from heterogeneous language distributions across clients and disparities in language resource availability. To address these challenges, we extended the FederatedScope-LLM framework to support multilingual instruction-tuning experiments with LLMs. We also introduced a novel client-specific early stopping mechanism, Local Dynamic Early Stopping (LDES-FL), which allows clients to pause and resume local training based on client-side validation performance, enhancing training efficiency and sustainability. Through a series of experiments, we studied how client language composition (from fully monolingual to increasingly multilingual clients) affects multilingual quality, fairness and training cost. Monolingual local fine-tuning remains the most effective for single-language specialization, whereas federated training is better suited to learning a single balanced multilingual model. In FL, increasing within-client multilinguality leads to stronger and fairer global models, narrows the gap to centralized multilingual fine-tuning, and yields the largest gains for lower-resource languages, albeit at the cost of more optimization steps. Overall, our results identify client language composition as a key design variable in multilingual FL, shaping performance, fairness and efficiency.

Title: Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning

Authors: He Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24258
Pdf URL: https://arxiv.org/pdf/2603.24258
Copy Paste: [[2603.24258]] Semantic Alignment across Ancient Egyptian Language Stages via Normalization-Aware Multitask Learning(https://arxiv.org/abs/2603.24258)
Keywords: language model
Abstract: We study word-level semantic alignment across four historical stages of Ancient Egyptian. These stages differ in script and orthography, and parallel data are scarce. We jointly train a compact encoder-decoder model with a shared byte-level tokenizer on all four stages, combining masked language modeling (MLM), translation language modeling (TLM), sequence-to-sequence translation, and part-of-speech tagging under a task-aware loss with fixed weights and uncertainty-based scaling. To reduce surface divergence we add Latin transliteration and IPA reconstruction as auxiliary views. We integrate these views through KL-based consistency and through embedding-level fusion. We evaluate alignment quality using pairwise metrics, specifically ROC-AUC and triplet accuracy, on curated Egyptian-English and intra-Egyptian cognate datasets. Translation yields the strongest gains. IPA with KL consistency improves cross-branch alignment, while early fusion demonstrates limited efficacy. Although the overall alignment remains limited, the findings provide a reproducible baseline and practical guidance for modeling historical languages under real constraints. They also show how normalization and task design shape what counts as alignment in typologically distant settings.
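As a rough illustration of the triplet accuracy metric mentioned in this abstract, the sketch below counts a triplet (anchor, positive, negative) as correct when the anchor is more similar to the positive than to the negative. The use of cosine similarity, the function names, and the toy 2-dimensional vectors are all assumptions made for this example; the abstract does not specify the paper's exact similarity function or data.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]
Triplet = Tuple[Vector, Vector, Vector]  # (anchor, positive, negative)


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_accuracy(triplets: Sequence[Triplet]) -> float:
    """Share of triplets where the anchor is closer to the positive item."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine(anchor, positive) > cosine(anchor, negative)
    )
    return correct / len(triplets)


# Toy embeddings, invented for illustration only.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),  # positive is closer: correct
    ([0.0, 1.0], [1.0, 0.0], [0.1, 0.9]),  # negative is closer: wrong
]
accuracy = triplet_accuracy(toy_triplets)
print(f"triplet accuracy: {accuracy:.2f}")
```

In a cognate-alignment setting, the anchor and positive would be embeddings of related word forms and the negative an unrelated form, so higher triplet accuracy indicates that related forms sit closer together in the embedding space.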

Title: GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Authors: Yunzhe Wang, Runhui Xu, Kexin Zheng, Tianyi Zhang, Jayavibhav Niranjan Kogundi, Soham Hans, Volkan Ustun
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2603.24329
Pdf URL: https://arxiv.org/pdf/2603.24329
Copy Paste: [[2603.24329]] GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents(https://arxiv.org/abs/2603.24329)
Keywords: llm, agent
Abstract: Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World, a natural decomposition for multi-agent environments. From these annotations, we refined 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.

Title: When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools

Authors: Xingming Li, Runke Huang, Yanan Bao, Yuye Jin, Yuru Jiao, Qingyong Hu
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.24389
Pdf URL: https://arxiv.org/pdf/2603.24389
Copy Paste: [[2603.24389]] When AI Meets Early Childhood Education: Large Language Models as Assessment Teammates in Chinese Preschools(https://arxiv.org/abs/2603.24389)
Keywords: language model, llm
Abstract: High-quality teacher-child interaction (TCI) is fundamental to early childhood development, yet traditional expert-based assessment faces a critical scalability challenge. In large systems like China's, which serves 36 million children across 250,000+ kindergartens, the cost and time requirements of manual observation make continuous quality monitoring infeasible, relegating assessment to infrequent episodic audits that limit timely intervention and improvement tracking. In this paper, we investigate whether AI can serve as a scalable assessment teammate by extracting structured quality indicators and validating their alignment with human expert judgments. Our contributions include: (1) TEPE-TCI-370h (Tracing Effective Preschool Education), the first large-scale dataset of naturalistic teacher-child interactions in Chinese preschools (370 hours, 105 classrooms) with standardized ECQRS-EC and SSTEW annotations; (2) Interaction2Eval, a specialized LLM-based framework that addresses domain-specific challenges (child speech recognition, Mandarin homophone disambiguation, and rubric-based reasoning) and achieves up to 88% agreement; (3) deployment validation across 43 classrooms demonstrating an 18x efficiency gain in the assessment workflow, highlighting its potential for shifting from annual expert audits to monthly AI-assisted monitoring with targeted human oversight. This work not only demonstrates the technical feasibility of scalable, AI-augmented quality assessment but also lays the foundation for a new paradigm in early childhood education, one where continuous, inclusive, AI-assisted evaluation becomes the engine of systemic improvement and equitable growth.

Title: PINGALA: Prosody-Aware Decoding for Sanskrit Poetry Generation

Authors: Manoj Balaji Jagadeeshan, Atul Singh, Nallani Chakravartula Sahith, Amrith Krishna, Pawan Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24413
Pdf URL: https://arxiv.org/pdf/2603.24413
Copy Paste: [[2603.24413]] PINGALA: Prosody-Aware Decoding for Sanskrit Poetry Generation(https://arxiv.org/abs/2603.24413)
Keywords: language model
Abstract: Poetry generation in Sanskrit typically requires the verse to be semantically coherent and adhere to strict prosodic rules. In Sanskrit prosody, every line of a verse is typically a fixed-length sequence of syllables adhering to prescribed binary patterns of syllable weights. We observe that instead of treating a verse as a monolithic sequence, segmenting it into grouped lines leads to a significant 10% improvement in semantic coherence with comparable metrical adherence. Specifically, PINGALA, our proposed decoding approach, is designed to encourage every line to have well-formed words, and our token selection biases the model towards this by preferring longer tokens. Writing in Sanskrit follows phonemic orthography; hence, using a phonetically aware transliteration scheme, SLP1, increased the metrical alignment by 46% with comparable semantic similarity for an instruction fine-tuned large language model like Phi-4. We also introduce a new approach for reference-free evaluation using cross-encoders, which achieved better alignment with true poetry instances.
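The "binary patterns of syllable weights" can be pictured as strings over two symbols, light (laghu) and heavy (guru). The sketch below assumes the syllable weights of a line have already been computed and only illustrates the final comparison against a template. The template is a made-up 8-syllable pattern rather than a real Sanskrit metre, and the scoring function is a hypothetical stand-in, not the metric used in the paper.

```python
# 'L' = light syllable, 'G' = heavy syllable. The template below is a made-up
# 8-syllable pattern for illustration, not a real Sanskrit metre.
HYPOTHETICAL_TEMPLATE = "LGLGLGLG"


def metrical_adherence(weights: str, template: str) -> float:
    """Share of positions where a line's syllable weights match the template.

    A line of the wrong length scores 0.0, because the metre fixes the
    number of syllables per line.
    """
    if len(weights) != len(template):
        return 0.0
    matches = sum(w == t for w, t in zip(weights, template))
    return matches / len(template)


perfect = metrical_adherence("LGLGLGLG", HYPOTHETICAL_TEMPLATE)
one_off = metrical_adherence("LGLGLGGG", HYPOTHETICAL_TEMPLATE)
too_short = metrical_adherence("LGLG", HYPOTHETICAL_TEMPLATE)

print(perfect, one_off, too_short)
```

Deriving the weights themselves requires syllabifying the text, which is where a phonemically faithful encoding such as SLP1 plausibly helps. That step is deliberately left out here.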

Title: Mechanic: Sorrifier-Driven Formal Decomposition Workflow for Automated Theorem Proving

Authors: Ruichen Qiu, Yichuan Cao, Junqi Liu, Dakai Guo, Xiao-Shan Gao, Lihong Zhi, Ruyong Feng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24465
Pdf URL: https://arxiv.org/pdf/2603.24465
Copy Paste: [[2603.24465]] Mechanic: Sorrifier-Driven Formal Decomposition Workflow for Automated Theorem Proving(https://arxiv.org/abs/2603.24465)
Keywords: language model, llm, agent
Abstract: Recent advances in large language models (LLMs) and LLM-based agents have substantially improved the capabilities of automated theorem proving. However, for problems requiring complex mathematical reasoning, current systems rarely succeed on the first try and must repeatedly modify their proof strategies. Existing approaches for handling failed attempts typically either discard the entire proof and regenerate it from scratch or iteratively fix errors within the proof. The former is inefficient, as it may abandon mostly correct reasoning due to localized errors, while the latter, although preserving prior progress, leads to ever longer contexts, which progressively degrade the model's ability to attend to the remaining unresolved subproblems. To address this dilemma, we propose Mechanic, a novel agent system that employs a sorry-driven formal decomposition strategy. By leveraging the sorry placeholder in Lean to precisely isolate unresolved subgoals while preserving the surrounding verified proof structure, Mechanic extracts each failed subproblem into a clean, self-contained context and resolves it independently. This avoids both the waste of full regeneration and the excessive context length induced by repeated repairs. Experimental results on challenging mathematical competition benchmarks, including IMO 2025 and Putnam 2025, demonstrate that our agent achieves significant advantages in proving efficiency.
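To make the role of the `sorry` placeholder concrete, here is a small Lean 4 sketch. The theorem and its names are invented for illustration and are not taken from the paper, and the way the subgoal is lifted out is only one plausible reading of "formal decomposition", not a description of Mechanic's actual workflow.

```lean
-- Parent proof: Lean checks the overall structure, while the two
-- unresolved subgoals are marked with `sorry`.
theorem sum_of_squares_nonneg (a b : Int) : 0 ≤ a * a + b * b := by
  have ha : 0 ≤ a * a := by sorry
  have hb : 0 ≤ b * b := by sorry
  exact Int.add_nonneg ha hb

-- One subgoal lifted into its own self-contained statement, so it can be
-- attempted independently of the parent proof's context.
theorem sub_goal_ha (a : Int) : 0 ≤ a * a := by
  sorry
```

Lean accepts a declaration that contains `sorry` with a warning instead of an error, which is what allows the surrounding proof skeleton to be verified before every gap is filled.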

Title: Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?

Authors: Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, Yuqing Yang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.24472
Pdf URL: https://arxiv.org/pdf/2603.24472
Copy Paste: [[2603.24472]] Why Does Self-Distillation (Sometimes) Degrade the Reasoning Capability of LLMs?(https://arxiv.org/abs/2603.24472)
Keywords: llm
Abstract: Self-distillation has emerged as an effective post-training paradigm for LLMs, often improving performance while shortening reasoning traces. However, in mathematical reasoning, we find that it can reduce response length while degrading performance. We trace this degradation to the suppression of epistemic verbalization, that is, the model's expression of uncertainty during reasoning. Through controlled experiments varying conditioning context richness and task coverage, we show that conditioning the teacher on rich information suppresses uncertainty expression, enabling rapid in-domain optimization with limited task coverage but harming out-of-distribution (OOD) performance, where unseen problems benefit from expressing uncertainty and adjusting accordingly. Across Qwen3-8B, DeepSeek-Distill-Qwen-7B, and Olmo3-7B-Instruct, we observe performance drops of up to 40%. Our findings highlight that exposing appropriate levels of uncertainty is crucial for robust reasoning and underscore the importance of optimizing reasoning behavior beyond merely reinforcing correct answer traces.

Title: Representation Learning to Study Temporal Dynamics in Tutorial Scaffolding

Authors: Conrad Borchers, Jiayi Zhang, Ashish Gurung
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2603.24535
Pdf URL: https://arxiv.org/pdf/2603.24535
Copy Paste: [[2603.24535]] Representation Learning to Study Temporal Dynamics in Tutorial Scaffolding(https://arxiv.org/abs/2603.24535)
Keywords: language model
Abstract: Adaptive scaffolding enhances learning, yet the field lacks robust methods for measuring it within authentic tutoring dialogue. This gap has become more pressing with the rise of remote human tutoring and large language model-based systems. We introduce an embedding-based approach that analyzes scaffolding dynamics by aligning the semantics of dialogue turns, problem statements, and correct solutions. Specifically, we operationalize alignment by computing cosine similarity between tutor and student contributions and task-relevant content. We apply this framework to 1,576 real-world mathematics tutoring dialogues from the Eedi Question Anchored Tutoring Dialogues dataset. The analysis reveals systematic differences in task alignment and distinct temporal patterns in how participants ground their contributions in problem and solution content. Further, mixed-effects models show that role-specific semantic alignment predicts tutorial progression beyond baseline features such as message order and length. Tutor contributions exhibited stronger grounding in problem content early in interactions. In contrast, student solution alignment was modestly positively associated with progression. These findings support scaffolding as a continuous, role-sensitive process grounded in task semantics. By capturing role-specific alignment over time, this approach provides a principled method for analyzing instructional dialogue and evaluating conversational tutoring systems.
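The alignment measure named in the abstract, cosine similarity between embeddings, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the three-dimensional vectors and the names `tutor_turn`, `problem_statement`, and `correct_solution` are hypothetical stand-ins for real sentence embeddings, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat a zero vector as having no alignment.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy embeddings; a real analysis would use a sentence encoder.
tutor_turn = [0.2, 0.7, 0.1]
problem_statement = [0.1, 0.8, 0.1]
correct_solution = [0.9, 0.1, 0.3]

# Alignment of one tutor turn with the problem text and with the solution text.
problem_alignment = cosine_similarity(tutor_turn, problem_statement)
solution_alignment = cosine_similarity(tutor_turn, correct_solution)

print(f"problem alignment:  {problem_alignment:.3f}")
print(f"solution alignment: {solution_alignment:.3f}")
```

With these toy vectors the tutor turn is closer to the problem statement than to the solution. Computing both scores per turn is one way to obtain the kind of role-specific alignment series over a dialogue that the abstract refers to.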

Title: MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Authors: Zhuo Li, Yupeng Zhang, Pengyu Cheng, Jiajun Song, Mengyu Zhou, Hao Li, Shujie Hu, Yu Qin, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.24579
Pdf URL: https://arxiv.org/pdf/2603.24579
Copy Paste: [[2603.24579]] MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination(https://arxiv.org/abs/2603.24579)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. While existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. To address this, we introduce Multi-Agent Reinforced Self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker. The Solver generates an initial RAG response, which the Proposer decomposes into claim-level verifiable atomic propositions. Crucially, the Checker validates these propositions against retrieved evidence in isolation, deprived of the Solver's original output. This well-crafted information asymmetry scheme breaks the cycle of self-confirmation bias. By training this pipeline with multi-agent reinforcement learning (MARL), we enable the agents to co-evolve and optimize factual adherence. Extensive experiments across hallucination benchmarks demonstrate that MARCH substantially reduces hallucination rates. Notably, an 8B-parameter LLM equipped with MARCH achieves performance competitive with powerful closed-source models. MARCH paves a scalable path for factual self-improvement of LLMs through co-evolution. The code is linked from the arXiv abstract page.

Title: Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA

Authors: Saahil Mathur, Ryan David Rittner, Vedant Ajit Thakur, Daniel Stuart Schiff, Tunazzina Islam
Subjects: cs.CL, cs.AI, cs.CY, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.24580
Pdf URL: https://arxiv.org/pdf/2603.24580
Copy Paste: [[2603.24580]] Retrieval Improvements Do Not Guarantee Better Answers: A Study of RAG for AI Policy QA(https://arxiv.org/abs/2603.24580)
Keywords: hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) systems are increasingly used to analyze complex policy documents, but achieving sufficient reliability for expert usage remains challenging in domains characterized by dense legal language and evolving, overlapping regulatory frameworks. We study the application of RAG to AI governance and policy analysis using the AI Governance and Regulatory Archive (AGORA) corpus, a curated collection of 947 AI policy documents. Our system combines a ColBERT-based retriever fine-tuned with contrastive learning and a generator aligned to human preferences using Direct Preference Optimization (DPO). We construct synthetic queries and collect pairwise preferences to adapt the system to the policy domain. Through experiments evaluating retrieval quality, answer relevance, and faithfulness, we find that domain-specific fine-tuning improves retrieval metrics but does not consistently improve end-to-end question answering performance. In some cases, stronger retrieval counterintuitively leads to more confident hallucinations when relevant documents are absent from the corpus. These results highlight a key concern for those building policy-focused RAG systems: improvements to individual components do not necessarily translate to more reliable answers. Our findings provide practical insights for designing grounded question-answering systems over dynamic regulatory corpora.