2026-03-30

Title: RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation

Authors: Jiajun Zhang, Yuying Li, Zhixun Li, Xingyu Guo, Jingzhuo Wu, Leqi Zheng, Yiran Yang, Jianke Zhang, Qingbin Li, Shannan Yan, Zhetong Li, Changguo Jia, Junfei Wu, Zilei Wang, Qiang Liu, Liang Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25804
Pdf URL: https://arxiv.org/pdf/2603.25804
Copy Paste: [[2603.25804]] RealChart2Code: Advancing Chart-to-Code Generation with Real Data and Multi-Task Evaluation(https://arxiv.org/abs/2603.25804)
Keywords: language model
Abstract: Vision-Language Models (VLMs) have demonstrated impressive capabilities in code generation across various domains. However, their ability to replicate complex, multi-panel visualizations from real-world data remains largely unassessed. To address this gap, we introduce RealChart2Code, a new large-scale benchmark with over 2,800 instances grounded in authentic datasets and featuring tasks with clear analytical intent. Crucially, it is the first benchmark to systematically evaluate chart generation from large-scale raw data and assess iterative code refinement in a multi-turn conversational setting. Our comprehensive evaluation of 14 leading VLMs on RealChart2Code reveals significant performance degradation compared to simpler benchmarks, highlighting their struggles with complex plot structures and authentic data. Our analysis uncovers a substantial performance gap between proprietary and open-weight models and confirms that even state-of-the-art VLMs often fail to accurately replicate intricate, multi-panel charts. These findings provide valuable insights into the current limitations of VLMs and guide future research directions. We release the benchmark and code at this https URL.

Title: Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI

Authors: Anna Kozlova, Stanislau Salavei, Pavel Satalkin, Hanna Plotnitskaya, Sergey Parfenyuk
Subjects: cs.CL, cs.AI, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2603.25821
Pdf URL: https://arxiv.org/pdf/2603.25821
Copy Paste: [[2603.25821]] Doctorina MedBench: End-to-End Evaluation of Agent-Based Medical AI(https://arxiv.org/abs/2603.25821)
Keywords: agent
Abstract: We present Doctorina MedBench, a comprehensive evaluation framework for agent-based medical AI based on the simulation of realistic physician-patient interactions. Unlike traditional medical benchmarks that rely on solving standardized test questions, the proposed approach models a multi-step clinical dialogue in which either a physician or an AI system must collect medical history, analyze attached materials (including laboratory reports, images, and medical documents), formulate differential diagnoses, and provide personalized recommendations. System performance is evaluated using the D.O.T.S. metric, which consists of four components: Diagnosis, Observations/Investigations, Treatment, and Step Count, enabling assessment of both clinical correctness and dialogue efficiency. The system also incorporates a multi-level testing and quality monitoring architecture designed to detect model degradation during both development and deployment. The framework supports safety-oriented trap cases, category-based random sampling of clinical scenarios, and full regression testing. The dataset currently contains more than 1,000 clinical cases covering over 750 diagnoses. The universality of the evaluation metrics allows the framework to be used not only to assess medical AI systems, but also to evaluate physicians and support the development of clinical reasoning skills. Our results suggest that simulation of clinical dialogue may provide a more realistic assessment of clinical competence compared to traditional examination-style benchmarks.

Title: Density-aware Soft Context Compression with Semi-Dynamic Compression Ratio

Authors: Yijiong Yu, Shuai Yuan, Jie Zheng, Huazheng Wang, Ji Pei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25926
Pdf URL: https://arxiv.org/pdf/2603.25926
Copy Paste: [[2603.25926]] Density-aware Soft Context Compression with Semi-Dynamic Compression Ratio(https://arxiv.org/abs/2603.25926)
Keywords: llm, long context
Abstract: Soft context compression reduces the computational workload of processing long contexts in LLMs by encoding a long context into a smaller number of latent tokens. However, existing frameworks apply uniform compression ratios, failing to account for the extreme variance in natural language information density. While adopting a density-aware dynamic compression ratio seems intuitive, empirical investigations reveal that models struggle intrinsically with operations parameterized by input-dependent, continuous structural hyperparameters. To resolve this pitfall, we introduce the Semi-Dynamic Context Compression framework. Our approach features a Discrete Ratio Selector, which predicts a compression target based on intrinsic information density and quantizes it to a predefined set of discrete compression ratios. It is trained jointly and efficiently with the compressor on synthetic data, using summary lengths as a proxy to create labels for compression ratio prediction. Extensive evaluations confirm that our density-aware framework, utilizing mean pooling as the backbone, consistently outperforms static baselines, establishing a robust Pareto frontier for context compression techniques. Our code, data and model weights are available at this https URL.
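Sketch: The two mechanisms named in the abstract, quantizing a predicted compression target to a predefined discrete set and mean pooling as the compression backbone, can be illustrated with a toy example. The ratio set `ALLOWED_RATIOS`, the function names, and the nearest-value rounding rule are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
from typing import List, Sequence

# Hypothetical set of allowed compression ratios (not taken from the paper).
ALLOWED_RATIOS = (2, 4, 8, 16)


def quantize_ratio(predicted: float, allowed: Sequence[int] = ALLOWED_RATIOS) -> int:
    """Snap a continuous predicted ratio to the nearest allowed discrete ratio."""
    return min(allowed, key=lambda r: abs(r - predicted))


def mean_pool(embeddings: List[List[float]], ratio: int) -> List[List[float]]:
    """Average each run of `ratio` consecutive token vectors into one latent vector."""
    pooled = []
    for start in range(0, len(embeddings), ratio):
        window = embeddings[start:start + ratio]  # the last window may be shorter
        dim = len(window[0])
        pooled.append([sum(vec[i] for vec in window) / len(window) for i in range(dim)])
    return pooled


tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
ratio = quantize_ratio(2.3)          # a dense passage gets a small ratio
latents = mean_pool(tokens, ratio)   # 3 token vectors become 2 latent vectors
```

Restricting the selector to a few discrete values means the compressor only ever sees a small, fixed family of pooling widths, which is one plausible reading of why the abstract calls the scheme "semi-dynamic".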

Title: Can Small Models Reason About Legal Documents? A Comparative Study

Authors: Snehit Vaddi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25944
Pdf URL: https://arxiv.org/pdf/2603.25944
Copy Paste: [[2603.25944]] Can Small Models Reason About Legal Documents? A Comparative Study(https://arxiv.org/abs/2603.25944)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: Large language models show promise for legal applications, but deploying frontier models raises concerns about cost, latency, and data privacy. We evaluate whether sub-10B parameter models can serve as practical alternatives by testing nine models across three legal benchmarks (ContractNLI, CaseHOLD, and ECtHR) using five prompting strategies (direct, chain-of-thought, few-shot, BM25 RAG, and dense RAG). Across 405 experiments with three random seeds per configuration, we find that a Mixture-of-Experts model activating only 3B parameters matches GPT-4o-mini in mean accuracy while surpassing it on legal holding identification, and that architecture and training quality matter more than raw parameter count. Our largest model (9B parameters) performs worst overall. Chain-of-thought prompting proves sharply task-dependent, improving contract entailment but degrading multiple-choice legal reasoning, while few-shot prompting emerges as the most consistently effective strategy. Comparing BM25 and dense retrieval for RAG, we find near-identical results, suggesting the bottleneck lies in the language model's utilization of retrieved context rather than retrieval quality. All experiments were conducted via cloud inference APIs at a total cost of $62, demonstrating that rigorous LLM evaluation is accessible without dedicated GPU infrastructure.
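Sketch: The experiment count in the abstract is consistent with a full grid: 9 models x 3 benchmarks x 5 prompting strategies gives 135 configurations, and 3 seeds per configuration gives 405 runs. The enumeration below checks that arithmetic. The model names and seed values are placeholders, since the abstract lists neither.

```python
from itertools import product

# Placeholder names: the abstract does not list the nine models.
models = [f"model_{i}" for i in range(1, 10)]
benchmarks = ["ContractNLI", "CaseHOLD", "ECtHR"]
strategies = ["direct", "chain-of-thought", "few-shot", "bm25-rag", "dense-rag"]
seeds = [0, 1, 2]  # hypothetical seed values

# One configuration per (model, benchmark, strategy) triple.
configurations = list(product(models, benchmarks, strategies))

# One run per configuration and seed.
runs = [(m, b, s, seed) for (m, b, s) in configurations for seed in seeds]

print(len(configurations), len(runs))  # prints "135 405"
```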

Title: When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models

Authors: Binesh Sadanandan, Vahid Behzadan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.25960
Pdf URL: https://arxiv.org/pdf/2603.25960
Copy Paste: [[2603.25960]] When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models(https://arxiv.org/abs/2603.25960)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) are increasingly deployed in medical settings, yet their sensitivity to prompt formatting remains poorly characterized. We evaluate MedGemma (4B and 27B parameters) on MedMCQA (4,183 questions) and PubMedQA (1,000 questions) across a broad suite of robustness tests. Our experiments reveal several concerning findings. Chain-of-Thought (CoT) prompting decreases accuracy by 5.7% compared to direct answering. Few-shot examples degrade performance by 11.9% while increasing position bias from 0.14 to 0.47. Shuffling answer options causes the model to change predictions 59.1% of the time, with accuracy dropping up to 27.4 percentage points. Front-truncating context to 50% causes accuracy to plummet below the no-context baseline, yet back-truncation preserves 97% of full-context accuracy. We further show that cloze scoring (selecting the highest log-probability option token) achieves 51.8% (4B) and 64.5% (27B), surpassing all prompting strategies and revealing that models "know" more than their generated text shows. Permutation voting recovers 4 percentage points over single-ordering inference. These results demonstrate that prompt engineering techniques validated on general-purpose models do not transfer to domain-specific medical LLMs, and that reliable alternatives exist.
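Sketch: Two of the alternatives mentioned in the abstract are easy to state in code. Cloze scoring picks the option whose token has the highest log-probability, and permutation voting asks the same question under several option orderings and takes a majority vote over the option contents. The function names, the `predict` callback, and the choice to enumerate every ordering are assumptions for illustration; the paper may sample a subset of orderings or aggregate differently.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Dict, Sequence


def cloze_pick(option_logprobs: Dict[str, float]) -> str:
    """Return the option label whose token received the highest log-probability."""
    return max(option_logprobs, key=option_logprobs.get)


def permutation_vote(options: Sequence[str],
                     predict: Callable[[Sequence[str]], int]) -> str:
    """Query `predict` on every ordering and majority-vote over option contents.

    `predict` is a hypothetical stand-in for a model call: it receives the
    options in one order and returns the position it selects.
    """
    votes = Counter()
    for order in permutations(options):
        votes[order[predict(order)]] += 1
    return votes.most_common(1)[0][0]


def biased_predict(order: Sequence[str]) -> int:
    """Toy model: finds "c" unless it is listed last, then falls back to position 0."""
    idx = order.index("c")
    return idx if idx < len(order) - 1 else 0


best_label = cloze_pick({"A": -2.3, "B": -0.4, "C": -1.9, "D": -3.0})
voted = permutation_vote(["a", "b", "c", "d"], biased_predict)
```

With four options there are 24 orderings. The toy model is wrong in the 6 orderings that place "c" last, yet "c" still collects 18 votes, which shows how voting over orderings can cancel a position bias.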

Title: MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization

Authors: Weizhi Zhang, Xiaokai Wei, Wei-Chieh Huang, Zheng Hui, Chen Wang, Michelle Gong, Philip S. Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.25973
Pdf URL: https://arxiv.org/pdf/2603.25973
Copy Paste: [[2603.25973]] MemoryCD: Benchmarking Long-Context User Memory of LLM Agents for Lifelong Cross-Domain Personalization(https://arxiv.org/abs/2603.25973)
Keywords: language model, llm, agent
Abstract: Recent advancements in Large Language Models (LLMs) have expanded context windows to million-token scales, yet benchmarks for evaluating memory remain limited to short-session synthetic dialogues. We introduce MemoryCD, the first large-scale, user-centric, cross-domain memory benchmark derived from lifelong real-world behaviors in the Amazon Review dataset. Unlike existing memory datasets that rely on scripted personas to generate synthetic user data, MemoryCD tracks authentic user interactions across years and multiple domains. We construct a multi-faceted long-context memory evaluation pipeline of 14 state-of-the-art LLM base models with 6 memory method baselines on 4 distinct personalization tasks over 12 diverse domains to evaluate an agent's ability to simulate real user behaviors in both single and cross-domain settings. Our analysis reveals that existing memory methods are far from user satisfaction in various domains, offering the first testbed for cross-domain life-long personalization evaluation.

Title: Toward Culturally Grounded Natural Language Processing

Authors: Sina Bagheri Nezhad
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26013
Pdf URL: https://arxiv.org/pdf/2603.26013
Copy Paste: [[2603.26013]] Toward Culturally Grounded Natural Language Processing(https://arxiv.org/abs/2603.26013)
Keywords: prompt
Abstract: Recent progress in multilingual NLP is often taken as evidence of broader global inclusivity, but a growing literature shows that multilingual capability and cultural competence come apart. This paper synthesizes over 50 papers from 2020–2026 spanning multilingual performance inequality, cross-lingual transfer, culture-aware evaluation, cultural alignment, multimodal local-knowledge modeling, benchmark design critiques, and community-grounded data practices. Across this literature, training data coverage remains a strong determinant of performance, yet it is not sufficient: tokenization, prompt language, translated benchmark design, culturally specific supervision, and multimodal context all materially affect outcomes. Recent work on Global-MMLU, CDEval, WorldValuesBench, CulturalBench, CULEMO, CulturalVQA, GIMMICK, DRISHTIKON, WorldCuisines, CARE, CLCA, and newer critiques of benchmark design and community-grounded evaluation shows that strong multilingual models can still flatten local norms, misread culturally grounded cues, and underperform in lower-resource or community-specific settings. We argue that the field should move from treating languages as isolated rows in a benchmark spreadsheet toward modeling communicative ecologies: the institutions, scripts, translation pipelines, domains, modalities, and communities through which language is used. On that basis, we propose a research agenda for culturally grounded NLP centered on richer contextual metadata, culturally stratified evaluation, participatory alignment, within-language variation, and multimodal community-aware design.

Title: AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents

Authors: Wenbo Gao, Renxi Liu, Xian Wang, Fang Guo, Shuai Yang, Xi Chen, Hui-Ling Zhen, Hanting Chen, Weizhe Lin, Xiaosong Li, Yaoyuan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26034
Pdf URL: https://arxiv.org/pdf/2603.26034
Copy Paste: [[2603.26034]] AgentCollab: A Self-Evaluation-Driven Collaboration Paradigm for Efficient LLM Agents(https://arxiv.org/abs/2603.26034)
Keywords: language model, llm, agent
Abstract: Autonomous agents powered by large language models (LLMs) perform complex tasks through long-horizon reasoning and tool interaction, where a fundamental trade-off arises between execution efficiency and reasoning robustness. Models at different capability-cost levels offer complementary advantages: lower-cost models enable fast execution but may struggle on difficult reasoning segments, while stronger models provide more robust reasoning at higher computational cost. We present AgentCollab, a self-driven collaborative inference framework that dynamically coordinates models with different reasoning capacities during agent execution. Instead of relying on external routing modules, the framework uses the agent's own self-reflection signal to determine whether the current reasoning trajectory is making meaningful progress, and escalates control to a stronger reasoning tier only when necessary. To further stabilize long-horizon execution, we introduce a difficulty-aware cumulative escalation strategy that allocates additional reasoning budget based on recent failure signals. In our experiments, we instantiate this framework using a two-level small-large model setting. Experiments on diverse multi-step agent benchmarks show that AgentCollab consistently improves the accuracy-efficiency Pareto frontier of LLM agents.
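Sketch: The control flow described in the abstract (stay on the small model while the agent's self-reflection reports progress, escalate to the large model otherwise, and grant a larger budget after repeated failures) can be illustrated with a toy controller. Everything below is hypothetical: the class, the linear budget rule, the decay on progress, and the default values are assumptions chosen to make the idea concrete, not the AgentCollab algorithm itself.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EscalationController:
    """Chooses a model tier per step from the agent's own progress signal."""
    base_budget: int = 1       # hypothetical: large-model steps granted per recent failure
    max_budget: int = 4        # hypothetical cap on a single escalation
    failures: int = 0          # running count of recent "no progress" signals
    large_steps_left: int = 0  # remaining steps reserved for the large model

    def next_tier(self, making_progress: bool) -> str:
        if making_progress:
            # Progress lets the failure count decay toward zero.
            self.failures = max(0, self.failures - 1)
        else:
            # Each further failure enlarges the budget handed to the large model.
            self.failures += 1
            self.large_steps_left = min(self.max_budget,
                                        self.base_budget * self.failures)
        if self.large_steps_left > 0:
            self.large_steps_left -= 1
            return "large"
        return "small"


def plan(signals: List[bool]) -> List[str]:
    """Map a sequence of self-reflection signals to the tier used at each step."""
    controller = EscalationController()
    return [controller.next_tier(signal) for signal in signals]


tiers = plan([True, False, True, False, False, True, True, True])
```

In this trace a single failure buys one large-model step, while two consecutive failures buy a two-step budget that carries over into the next step even after progress resumes.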

Title: Retrieval-Augmented Generation Based Nurse Observation Extraction

Authors: Kyomin Hwang, Nojun Kwak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26046
Pdf URL: https://arxiv.org/pdf/2603.26046
Copy Paste: [[2603.26046]] Retrieval-Augmented Generation Based Nurse Observation Extraction(https://arxiv.org/abs/2603.26046)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Recent advancements in Large Language Models (LLMs) have played a significant role in reducing human workload across various domains, a trend that is increasingly extending into the medical field. In this paper, we propose an automated pipeline designed to alleviate the burden on nurses by automatically extracting clinical observations from nurse dictations. To ensure accurate extraction, we introduce a method based on Retrieval-Augmented Generation (RAG). Our approach demonstrates effective performance, achieving an F1-score of 0.796 on the MEDIQA-SYNUR test dataset.

Title: LLM Benchmark-User Need Misalignment for Climate Change

Authors: Oucheng Liu, Lexing Xie, Jing Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26106
Pdf URL: https://arxiv.org/pdf/2603.26106
Copy Paste: [[2603.26106]] LLM Benchmark-User Need Misalignment for Climate Change(https://arxiv.org/abs/2603.26106)
Keywords: language model, llm
Abstract: Climate change is a major socio-scientific issue that shapes public decision-making and policy discussions. As large language models (LLMs) increasingly serve as an interface for accessing climate knowledge, whether existing benchmarks reflect user needs is critical for evaluating LLMs in real-world settings. We propose a Proactive Knowledge Behaviors Framework that captures the different human-human and human-AI knowledge seeking and provision behaviors. We further develop a Topic-Intent-Form taxonomy and apply it to analyze climate-related data representing different knowledge behaviors. Our results reveal a substantial mismatch between current benchmarks and real-world user needs, while knowledge interaction patterns between humans and LLMs closely resemble those in human-human interactions. These findings provide actionable guidance for benchmark design, RAG system development, and LLM training. Code is available at this https URL.

Title: Clash of the models: Comparing performance of BERT-based variants for generic news frame detection

Authors: Vihang Jumle
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2603.26156
Pdf URL: https://arxiv.org/pdf/2603.26156
Copy Paste: [[2603.26156]] Clash of the models: Comparing performance of BERT-based variants for generic news frame detection(https://arxiv.org/abs/2603.26156)
Keywords: language model, llm, prompt
Abstract: Framing remains one of the most extensively applied theories in political communication. Developments in computation, particularly with the introduction of transformer architecture and more so with large language models (LLMs), have naturally prompted scholars to explore various novel computational approaches, especially for deductive frame detection, in recent years. While many studies have shown that different transformer models outperform their preceding models that use bag-of-words features, the debate continues to evolve regarding how these models compare with each other on classification tasks. By placing itself at this juncture, this study makes three key contributions: First, it performs generic news frame detection and compares the performance of five BERT-based variants (BERT, RoBERTa, DeBERTa, DistilBERT and ALBERT) to add to the debate on best practices around employing computational text analysis for political communication studies. Second, it introduces various fine-tuned models capable of robustly performing generic news frame detection. Third, building upon numerous previous studies that work with US-centric data, this study provides the scholarly community with a labelled generic news frames dataset based on the Swiss electoral context that aids in testing the contextual robustness of these computational approaches to framing analysis.

Title: ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory

Authors: Zhuohan Ge, Haoyang Li, Yubo Wang, Nicole Hu, Chen Jason Zhang, Qing Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26182
Pdf URL: https://arxiv.org/pdf/2603.26182
Copy Paste: [[2603.26182]] ClinicalAgents: Multi-Agent Orchestration for Clinical Decision Making with Dual-Memory(https://arxiv.org/abs/2603.26182)
Keywords: language model, llm, agent
Abstract: While Large Language Models (LLMs) have demonstrated potential in healthcare, they often struggle with the complex, non-linear reasoning required for accurate clinical diagnosis. Existing methods typically rely on static, linear mappings from symptoms to diagnoses, failing to capture the iterative, hypothesis-driven reasoning inherent to human clinicians. To bridge this gap, we introduce ClinicalAgents, a novel multi-agent framework designed to simulate the cognitive workflow of expert clinicians. Unlike rigid sequential chains, ClinicalAgents employs a dynamic orchestration mechanism modeled as a Monte Carlo Tree Search (MCTS) process. This allows an Orchestrator to iteratively generate hypotheses, actively verify evidence, and trigger backtracking when critical information is missing. Central to this framework is a Dual-Memory architecture: a mutable Working Memory that maintains the evolving patient state for context-aware reasoning, and a static Experience Memory that retrieves clinical guidelines and historical cases via an active feedback loop. Extensive experiments demonstrate that ClinicalAgents achieves state-of-the-art performance, significantly enhancing both diagnostic accuracy and explainability compared to strong single-agent and multi-agent baselines.
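Sketch: The abstract models orchestration as a Monte Carlo Tree Search process but does not say which selection rule is used. As a point of reference only, the snippet below shows the standard UCT rule, which balances a node's average value against an exploration bonus for rarely visited nodes. The hypothesis names, the statistics, and the exploration constant are made up for illustration and are not taken from the paper.

```python
import math
from typing import Dict, Tuple


def uct_score(total_value: float, visits: int, parent_visits: int,
              c: float = math.sqrt(2)) -> float:
    """Standard UCT: mean value plus an exploration bonus for rarely tried nodes."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children: Dict[str, Tuple[float, int]], parent_visits: int) -> str:
    """Pick the child (here, a diagnostic hypothesis) with the highest UCT score."""
    return max(children, key=lambda name: uct_score(*children[name], parent_visits))


# Hypothetical statistics: (accumulated value, visit count) per hypothesis.
hypotheses = {"pneumonia": (3.0, 5), "pulmonary embolism": (1.0, 1)}
chosen = select_child(hypotheses, parent_visits=6)
```

Here the less explored hypothesis is selected even though it has fewer visits, because its exploration bonus outweighs the other node's accumulated statistics.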

Title: Sparse Auto-Encoders and Holism about Large Language Models

Authors: Jumbly Grindrod
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26207
Pdf URL: https://arxiv.org/pdf/2603.26207
Copy Paste: [[2603.26207]] Sparse Auto-Encoders and Holism about Large Language Models(https://arxiv.org/abs/2603.26207)
Keywords: language model, llm
Abstract: Does Large Language Model (LLM) technology suggest a meta-semantic picture, i.e., a picture of how words and complex expressions come to have the meaning that they do? One modest approach explores the assumptions that seem to be built into how LLMs capture the meanings of linguistic expressions as a way of considering their plausibility (Grindrod, 2026a, 2026b). It has previously been argued that LLMs, in employing a form of distributional semantics, adopt a form of holism about meaning (Grindrod, 2023; Grindrod et al., forthcoming). However, recent work in mechanistic interpretability presents a challenge to these arguments. Specifically, the discovery of a vast array of interpretable latent features within the high dimensional spaces used by LLMs potentially challenges the holistic interpretation. In this paper, I will present the original reasons for thinking that LLMs embody a form of holism (section 1), before introducing recent work on features generated through sparse auto-encoders, and explaining how the discovery of such features suggests an alternative decompositional picture of meaning (section 2). I will then respond to this challenge by considering in greater detail the nature of such features (section 3). Finally, I will return to the holistic picture defended by Grindrod et al. and argue that the picture still stands provided that the features are countable (section 4).

Title: Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents

Authors: Nicholas Edwards, Sebastian Schuster
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26233
Pdf URL: https://arxiv.org/pdf/2603.26233
Copy Paste: [[2603.26233]] Ask or Assume? Uncertainty-Aware Clarification-Seeking in Coding Agents(https://arxiv.org/abs/2603.26233)
Keywords: language model, llm, agent
Abstract: As Large Language Model (LLM) agents are increasingly deployed in open-ended domains like software engineering, they frequently encounter underspecified instructions that lack crucial context. While human developers naturally resolve underspecification by asking clarifying questions, current agents are largely optimized for autonomous execution. In this work, we systematically evaluate the clarification-seeking abilities of LLM agents on an underspecified variant of SWE-bench Verified. We propose an uncertainty-aware multi-agent scaffold that explicitly decouples underspecification detection from code execution. Our results demonstrate that this multi-agent system using OpenHands + Claude Sonnet 4.5 achieves a 69.40% task resolve rate, significantly outperforming a standard single-agent setup (61.20%) and closing the performance gap with agents operating on fully specified instructions. Furthermore, we find that the multi-agent system exhibits well-calibrated uncertainty, conserving queries on simple tasks while proactively seeking information on more complex issues. These findings indicate that current models can be turned into proactive collaborators, where agents independently recognize when to ask questions to elicit missing information in real-world, underspecified tasks.

Title: A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs

Authors: Uri Z. Kialy, Avi Shtarkberg, Ayal Klein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26236
Pdf URL: https://arxiv.org/pdf/2603.26236
Copy Paste: [[2603.26236]] A Universal Vibe? Finding and Controlling Language-Agnostic Informal Register with SAEs(https://arxiv.org/abs/2603.26236)
Keywords: language model, llm
Abstract: While multilingual language models successfully transfer factual and syntactic knowledge across languages, it remains unclear whether they process culture-specific pragmatic registers, such as slang, as isolated language-specific memorizations or as unified, abstract concepts. We study this by probing the internal representations of Gemma-2-9B-IT using Sparse Autoencoders (SAEs) across three typologically diverse source languages: English, Hebrew, and Russian. To definitively isolate pragmatic register processing from trivial lexical sensitivity, we introduce a novel dataset in which every target term is polysemous, appearing in both literal and informal contexts. We find that while much of the informal-register signal is distributed across language-specific features, a small but highly robust cross-linguistic core consistently emerges. This shared core forms a geometrically coherent "informal register subspace" that sharpens in the model's deeper layers. Crucially, these shared representations are not merely correlational: activation steering with these features causally shifts output formality across all source languages and transfers zero-shot to six unseen languages spanning diverse language families and scripts. Together, these results provide the first mechanistic evidence that multilingual LLMs internalize informal register not just as surface-level heuristics, but as a portable, language-agnostic pragmatic abstraction.

Title: Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR

Authors: Shashi Kumar, Esaú Villatoro-Tello, Sergio Burdisso, Kadri Hacioglu, Thibault Bañeras-Roux, Hasindri Watawana, Dairazalia Sanchez-Cortes, Srikanth Madikeri, Petr Motlicek, Andreas Stolcke
Subjects: cs.CL, cs.AI, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2603.26246
Pdf URL: https://arxiv.org/pdf/2603.26246
Copy Paste: [[2603.26246]] Distilling Conversations: Abstract Compression of Conversational Audio Context for LLM-based ASR(https://arxiv.org/abs/2603.26246)
Keywords: llm
Abstract: Standard LLM-based speech recognition systems typically process utterances in isolation, limiting their ability to leverage conversational context. In this work, we study whether multimodal context from prior turns improves LLM-based ASR and how to represent that context efficiently. We find that, after supervised multi-turn training, conversational context mainly helps with the recognition of contextual entities. However, conditioning on raw context is expensive because the prior-turn audio token sequence grows rapidly with conversation length. To address this, we propose Abstract Compression, which replaces the audio portion of prior turns with a fixed number of learned latent tokens while retaining corresponding transcripts explicitly. On both in-domain and out-of-domain test sets, the compressed model recovers part of the gains of raw-context conditioning with a smaller prior-turn audio footprint. We also provide targeted analyses of the compression setup and its trade-offs.

Title: findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding

Authors: Héctor Javier Vázquez Martínez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26292
Pdf URL: https://arxiv.org/pdf/2603.26292
Copy Paste: [[2603.26292]] findsylls: A Language-Agnostic Toolkit for Syllable-Level Speech Tokenization and Embedding(https://arxiv.org/abs/2603.26292)
Keywords: language model
Abstract: Syllable-level units offer compact and linguistically meaningful representations for spoken language modeling and unsupervised word discovery, but research on syllabification remains fragmented across disparate implementations, datasets, and evaluation protocols. We introduce findsylls, a modular, language-agnostic toolkit that unifies classical syllable detectors and end-to-end syllabifiers under a common interface for syllable segmentation, embedding extraction, and multi-granular evaluation. The toolkit implements and standardizes widely used methods (e.g., Sylber, VG-HuBERT) and allows their components to be recombined, enabling controlled comparisons of representations, algorithms, and token rates. We demonstrate findsylls on English and Spanish corpora and on new hand-annotated data from Kono, an underdocumented Central Mande language, illustrating how a single framework can support reproducible syllable-level experiments across both high-resource and under-resourced settings.

Title: From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs

Authors: Jiyuan An, Liner Yang, Mengyan Wang, Luming Lu, Weihua An, Erhong Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26323
Pdf URL: https://arxiv.org/pdf/2603.26323
Copy Paste: [[2603.26323]] From Human Cognition to Neural Activations: Probing the Computational Primitives of Spatial Reasoning in LLMs(https://arxiv.org/abs/2603.26323)
Keywords: language model, llm
Abstract: As spatial intelligence becomes an increasingly important capability for foundation models, it remains unclear whether large language models' (LLMs) performance on spatial reasoning benchmarks reflects structured internal spatial representations or reliance on linguistic heuristics. We address this question from a mechanistic perspective by examining how spatial information is internally represented and used. Drawing on computational theories of human spatial cognition, we decompose spatial reasoning into three primitives (relational composition, representational transformation, and stateful spatial updating) and design controlled task families for each. We evaluate multilingual LLMs in English, Chinese, and Arabic under single-pass inference, and analyze internal representations using linear probing, sparse-autoencoder-based feature analysis, and causal interventions. We find that task-relevant spatial information is encoded in intermediate layers and can causally influence behavior, but these representations are transient, fragmented across task families, and weakly integrated into final predictions. Cross-linguistic analysis further reveals mechanistic degeneracy, where similar behavioral performance arises from distinct internal pathways. Overall, our results suggest that current LLMs exhibit limited and context-dependent spatial representations rather than robust, general-purpose spatial reasoning, highlighting the need for mechanistic evaluation beyond benchmark accuracy.

Title: CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law

Authors: JiHyeok Jung, TaeYoung Yoon, HyunSouk Cho
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26332
Pdf URL: https://arxiv.org/pdf/2603.26332
Copy Paste: [[2603.26332]] CALRK-Bench: Evaluating Context-Aware Legal Reasoning in Korean Law(https://arxiv.org/abs/2603.26332)
Keywords: language model
Abstract: Legal reasoning requires not only the application of legal rules but also an understanding of the context in which those rules operate. However, existing legal benchmarks primarily evaluate rule application under the assumption of fixed norms, and thus fail to capture situations where legal judgments shift or where multiple norms interact. In this work, we propose CALRK-Bench, a context-aware legal reasoning benchmark based on the Korean legal system. CALRK-Bench evaluates whether models can identify the temporal validity of legal norms, determine whether sufficient legal information is available for a given case, and understand the reasons behind shifts in legal judgments. The dataset is constructed from legal precedents and legal consultation records, and is validated by legal experts. Experimental results show that even recent large language models consistently exhibit low performance on these three tasks. CALRK-Bench provides a new stress test for evaluating context-aware legal reasoning rather than simple memorization of legal knowledge. Our code is available at this https URL.

Title: Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers

Authors: Yusheng Zhao, Hourun Li, Bohan Wu, Jingyang Yuan, Meng Zhang, Yichun Yin, Lifeng Shang, Ming Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26380
Pdf URL: https://arxiv.org/pdf/2603.26380
Copy Paste: [[2603.26380]] Switch Attention: Towards Dynamic and Fine-grained Hybrid Transformers(https://arxiv.org/abs/2603.26380)
Keywords: language model
Abstract: The attention mechanism has been the core component in modern transformer architectures. However, the computation of standard full attention scales quadratically with the sequence length, serving as a major bottleneck in long-context language modeling. Sliding window attention restricts the context length for better efficiency at the cost of narrower receptive fields. While existing efforts attempt to combine the benefits of both by building hybrid models, they often resort to static, heuristically designed alternating patterns that limit efficient allocation of computation in various scenarios. In this paper, we propose Switch Attention (SwiAttn), a novel hybrid transformer that enables dynamic and fine-grained routing between full attention and sliding window attention. For each token at each transformer layer, SwiAttn dynamically routes the computation to either a full-attention branch for global information aggregation or a sliding-window branch for efficient local pattern matching. An adaptive regularization objective is designed to steer the model toward efficiency. Moreover, we adopt continual pretraining to optimize the model, transferring the full attention architecture to the hybrid one. Extensive experiments are conducted on twenty-three benchmark datasets across both regular (4K) and long (32K) context lengths, demonstrating the effectiveness of the proposed method.
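
The per-token routing described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the branch labels `"full"` and `"local"`, the function names, and the hard-coded routing decisions are assumptions made for illustration (the paper describes dynamic routing with an efficiency regularizer, which is not modeled here). The sketch builds a causal attention mask for one layer and counts how many query-key pairs each routing pattern touches.

```python
from typing import List


def build_mask(routes: List[str], window: int) -> List[List[bool]]:
    """Causal mask for one layer where each query token picks its own branch.

    routes[i] is "full" or "local" (hypothetical labels for the two branches).
    mask[i][j] is True when query position i may attend to key position j.
    """
    n = len(routes)
    mask = []
    for i, route in enumerate(routes):
        if route == "full":
            start = 0                           # attend to every earlier token
        elif route == "local":
            start = max(0, i - window + 1)      # attend to the last `window` tokens
        else:
            raise ValueError(f"unknown route: {route!r}")
        mask.append([start <= j <= i for j in range(n)])
    return mask


def attended_pairs(mask: List[List[bool]]) -> int:
    """Number of (query, key) pairs that are actually computed."""
    return sum(sum(row) for row in mask)


# Hand-written routing decisions; a real router would predict these per token.
routes = ["local", "local", "full", "local", "local", "full"]
mixed = build_mask(routes, window=2)
all_full = build_mask(["full"] * len(routes), window=2)

print(attended_pairs(mixed), attended_pairs(all_full))  # 16 21
```

In this toy case the mixed routing computes 16 query-key pairs instead of the 21 needed when every token uses full causal attention; the gap grows with sequence length because only the tokens routed to the full branch pay a cost that scales with their position.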

Title: Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models

Authors: Richard J. Young
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26410
Pdf URL: https://arxiv.org/pdf/2603.26410
Copy Paste: [[2603.26410]] Why Models Know But Don't Say: Chain-of-Thought Faithfulness Divergence Between Thinking Tokens and Answers in Open-Weight Reasoning Models(https://arxiv.org/abs/2603.26410)
Keywords: chain-of-thought
Abstract: Extended-thinking models expose a second text-generation channel ("thinking tokens") alongside the user-visible answer. This study examines 12 open-weight reasoning models on MMLU and GPQA questions paired with misleading hints. Among the 10,506 cases where models actually followed the hint (choosing the hint's target over the ground truth), each case is classified by whether the model acknowledges the hint in its thinking tokens, its answer text, both, or neither. In 55.4% of these cases the model's thinking tokens contain hint-related keywords that the visible answer omits entirely, a pattern termed *thinking-answer divergence*. The reverse (answer-only acknowledgment) is near-zero (0.5%), confirming that the asymmetry is directional. Hint type shapes the pattern sharply: sycophancy is the most *transparent* hint, with 58.8% of sycophancy-influenced cases acknowledging the professor's authority in both channels, while consistency (72.2%) and unethical (62.7%) hints are dominated by thinking-only acknowledgment. Models also vary widely, from near-total divergence (Step-3.5-Flash: 94.7%) to relative transparency (Qwen3.5-27B: 19.6%). These results show that answer-text-only monitoring misses more than half of all hint-influenced reasoning and that thinking-token access, while necessary, still leaves 11.8% of cases with no verbalized acknowledgment in either channel.
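
The four-way classification in this abstract, and the arithmetic behind its headline numbers, can be sketched as follows. The keyword matcher is a simplified stand-in and the example keyword list is hypothetical; the abstract only says that acknowledgment is detected through hint-related keywords. The share for the "both" category is not stated in the abstract and is derived here under the assumption that the four categories partition the 10,506 hint-following cases.

```python
from typing import Iterable


def acknowledges(text: str, keywords: Iterable[str]) -> bool:
    """True if any hint-related keyword appears in the text (case-insensitive)."""
    lowered = text.lower()
    return any(keyword.lower() in lowered for keyword in keywords)


def classify_case(thinking: str, answer: str, keywords: Iterable[str]) -> str:
    """Assign a hint-following case to one of the four acknowledgment categories."""
    keywords = list(keywords)
    in_thinking = acknowledges(thinking, keywords)
    in_answer = acknowledges(answer, keywords)
    if in_thinking and in_answer:
        return "both"
    if in_thinking:
        return "thinking_only"
    if in_answer:
        return "answer_only"
    return "neither"


# Shares reported in the abstract, in percent of hint-following cases.
reported = {"thinking_only": 55.4, "answer_only": 0.5, "neither": 11.8}

# Assumption: the categories are exhaustive and exclusive, so "both" is the remainder.
both = round(100.0 - sum(reported.values()), 1)

# Cases an answer-text-only monitor cannot see: thinking-only plus neither.
missed_by_answer_monitoring = round(reported["thinking_only"] + reported["neither"], 1)

print(both, missed_by_answer_monitoring)  # 32.3 67.2
```

Under that partition assumption, an answer-text-only monitor would see an acknowledgment in roughly 32.8% of cases (both plus answer-only), which is consistent with the abstract's statement that such monitoring misses more than half of hint-influenced reasoning.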

Title: Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models

Authors: Mikko Saukkoriipi, Nicole Hernandez, Jaakko Sahlsten, Kimmo Kaski, Otso Arponen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26434
Pdf URL: https://arxiv.org/pdf/2603.26434
Copy Paste: [[2603.26434]] Automating Clinical Information Retrieval from Finnish Electronic Health Records Using Large Language Models(https://arxiv.org/abs/2603.26434)
Keywords: language model, llm
Abstract: Clinicians often need to retrieve patient-specific information from electronic health records (EHRs), a task that is time-consuming and error-prone. We present a locally deployable Clinical Contextual Question Answering (CCQA) framework that answers clinical questions directly from EHRs without external data transfer. Open-source large language models (LLMs) ranging from 4B to 70B parameters were benchmarked under fully offline conditions using 1,664 expert-annotated question-answer pairs derived from records of 183 patients. The dataset consisted predominantly of Finnish clinical text. In free-text generation, Llama-3.1-70B achieved 95.3% accuracy and 97.3% consistency across semantically equivalent question variants, while the smaller Qwen3-30B-A3B-2507 model achieved comparable performance. In a multiple-choice setting, models showed similar accuracy but variable calibration. Low-precision quantization (4-bit and 8-bit) preserved predictive performance while reducing GPU memory requirements and improving deployment feasibility. Clinical evaluation identified clinically significant errors in 2.9% of outputs, and semantically equivalent questions occasionally yielded discordant responses, including instances where one formulation was correct and the other contained a clinically significant error (0.96% of cases). These findings demonstrate that locally hosted open-source LLMs can accurately retrieve patient-specific information from EHRs using natural-language queries, while highlighting the need for validation and human oversight in clinical deployment.

Title: ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims

Authors: Raia Abu Ahmad, Max Upravitelev, Aida Usmanova, Veronika Solopova, Georg Rehm
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26449
Pdf URL: https://arxiv.org/pdf/2603.26449
Copy Paste: [[2603.26449]] ClimateCheck 2026: Scientific Fact-Checking and Disinformation Narrative Classification of Climate-related Claims(https://arxiv.org/abs/2603.26449)
Keywords: language model
Abstract: Automatically verifying climate-related claims against scientific literature is a challenging task, complicated by the specialised nature of scholarly evidence and the diversity of rhetorical strategies underlying climate disinformation. ClimateCheck 2026 is the second iteration of a shared task addressing this challenge, expanding on the 2025 edition with tripled training data and a new disinformation narrative classification task. Running from January to February 2026 on the CodaBench platform, the competition attracted 20 registered participants and 8 leaderboard submissions, with systems combining dense retrieval pipelines, cross-encoder ensembles, and large language models with structured hierarchical reasoning. In addition to standard evaluation metrics (Recall@K and Binary Preference), we adapt an automated framework to assess retrieval quality under incomplete annotations, exposing systematic biases in how conventional metrics rank systems. A cross-task analysis further reveals that not all climate disinformation is equally verifiable, with potential implications for how future fact-checking systems should be designed.
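
For readers unfamiliar with Recall@K, one of the standard metrics named in this abstract, here is a minimal sketch of its usual definition. The function name and the example document identifiers are illustrative, and the shared task's official scorer may differ in details such as tie handling or averaging across claims.

```python
from typing import Collection, Sequence


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Collection[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("recall is undefined when there are no relevant documents")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical retrieval result for one claim: p2 and p4 are retrieved, p8 is never found.
ranked = ["p7", "p2", "p9", "p4", "p1"]
relevant = {"p2", "p4", "p8"}

print(recall_at_k(ranked, relevant, k=2))  # 1 of 3 relevant documents in the top 2
print(recall_at_k(ranked, relevant, k=5))  # 2 of 3 relevant documents in the top 5
```

The denominator counts only documents annotated as relevant, so a truly relevant but unannotated document in the top K earns no credit. That sensitivity to annotation coverage is the kind of issue the abstract's evaluation under incomplete annotations is concerned with.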

Title: Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs

Authors: Vinicius Anjos de Almeida, Sandro Saorin da Silva, Josimar Chire, Leonardo Vicenzi, Nícolas Henrique Borges, Helena Kociolek, Sarah Miriã de Castro Rocha, Frederico Nassif Gomes, Júlia Cristina Ferreira, Oge Marques, Lucas Emanuel Silva e Oliveira
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26510
Pdf URL: https://arxiv.org/pdf/2603.26510
Copy Paste: [[2603.26510]] Clinical named entity recognition in the Portuguese language: a benchmark of modern BERT models and LLMs(https://arxiv.org/abs/2603.26510)
Keywords: language model, gpt, llm
Abstract: Clinical notes contain valuable unstructured information. Named entity recognition (NER) enables the automatic extraction of medical concepts; however, benchmarks for Portuguese remain scarce. In this study, we aimed to evaluate BERT-based models and large language models (LLMs) for clinical NER in Portuguese and to test strategies for addressing multilabel imbalance. We compared BioBERTpt, BERTimbau, ModernBERT, and mmBERT with LLMs such as GPT-5 and Gemini-2.5, using the public SemClinBr corpus and a private breast cancer dataset. Models were trained under identical conditions and evaluated using precision, recall, and F1-score. Iterative stratification, weighted loss, and oversampling were explored to mitigate class imbalance. The mmBERT-base model achieved the best performance (micro F1 = 0.76), outperforming all other models. Iterative stratification improved class balance and overall performance. Multilingual BERT models, particularly mmBERT, perform strongly for Portuguese clinical NER and can run locally with limited computational resources. Balanced data-splitting strategies further enhance performance.
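
The abstract reports a micro-averaged F1 score. As a reminder of what that averaging implies under class imbalance, the sketch below computes micro and macro F1 from per-label counts. The label names and counts are invented for illustration and are not taken from the paper.

```python
from typing import Dict


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_f1(counts: Dict[str, Dict[str, int]]) -> float:
    """Pool true positives, false positives and false negatives over labels, then score once."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts: Dict[str, Dict[str, int]]) -> float:
    """Score each label separately, then take the unweighted mean."""
    return sum(f1(c["tp"], c["fp"], c["fn"]) for c in counts.values()) / len(counts)


# Hypothetical counts: one frequent entity type and one rare entity type.
counts = {
    "Disorder": {"tp": 8, "fp": 2, "fn": 2},   # frequent label, F1 = 0.80
    "Procedure": {"tp": 1, "fp": 1, "fn": 3},  # rare label, F1 = 0.33
}

print(round(micro_f1(counts), 4))  # 0.6923
print(round(macro_f1(counts), 4))  # 0.5667
```

Micro averaging weights every prediction equally, so frequent labels dominate the score; macro averaging weights every label equally, so it is pulled down more by rare labels that the model handles poorly.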

Title: AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese

Authors: Afonso Simplício, Gonçalo Vinagre, Miguel Moura Ramos, Diogo Tavares, Rafael Ferreira, Giuseppe Attanasio, Duarte M. Alves, Inês Calvo, Inês Vieira, Rui Guerra, James Furtado, Beatriz Canaverde, Iago Paulo, Vasco Ramos, Diogo Glória-Silva, Miguel Faria, Marcos Treviso, Daniel Gomes, Pedro Gomes, David Semedo, André Martins, João Magalhães
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26511
Pdf URL: https://arxiv.org/pdf/2603.26511
Copy Paste: [[2603.26511]] AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese(https://arxiv.org/abs/2603.26511)
Keywords: language model, llm
Abstract: Despite rapid progress in open large language models (LLMs), European Portuguese (pt-PT) remains underrepresented in both training data and native evaluation, with machine-translated benchmarks likely missing the variant's linguistic and cultural nuances. We introduce AMALIA, a fully open LLM that prioritizes pt-PT by using more high-quality pt-PT data during both the mid- and post-training stages. To evaluate pt-PT more faithfully, we release a suite of pt-PT benchmarks that includes translated standard tasks and four new datasets targeting pt-PT generation, linguistic competence, and pt-PT/pt-BR bias. Experiments show that AMALIA matches strong baselines on translated benchmarks while substantially improving performance on pt-PT-specific evaluations, supporting the case for targeted training and native benchmarking for European Portuguese.

Title: JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems

Authors: Guangzhao Yang, Yu Pan, Shi Qiu, Ningjie Bai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26515
Pdf URL: https://arxiv.org/pdf/2603.26515
Copy Paste: [[2603.26515]] JAL-Turn: Joint Acoustic-Linguistic Modeling for Real-Time and Robust Turn-Taking Detection in Full-Duplex Spoken Dialogue Systems(https://arxiv.org/abs/2603.26515)
Keywords: language model, agent
Abstract: Despite recent advances, efficient and robust turn-taking detection remains a significant challenge in industrial-grade Voice AI agent deployments. Many existing systems rely solely on acoustic or semantic cues, leading to suboptimal accuracy and stability, while recent attempts to endow large language models with full-duplex capabilities require costly full-duplex data and incur substantial training and deployment overheads, limiting real-time performance. In this paper, we propose JAL-Turn, a lightweight and efficient speech-only turn-taking framework that adopts a joint acoustic-linguistic modeling paradigm, in which a cross-attention module adaptively integrates pre-trained acoustic representations with linguistic features to support low-latency prediction of hold vs shift states. By sharing a frozen ASR encoder, JAL-Turn enables turn-taking prediction to run fully in parallel with speech recognition, introducing no additional end-to-end latency or computational overhead. In addition, we introduce a scalable data construction pipeline that automatically derives reliable turn-taking labels from large-scale real-world dialogue corpora. Extensive experiments on public multilingual benchmarks and an in-house Japanese customer-service dataset show that JAL-Turn consistently outperforms strong state-of-the-art baselines in detection accuracy while maintaining superior real-time performance.

Title: ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs

Authors: Inês Vieira, Inês Calvo, Iago Paulo, James Furtado, Rafael Ferreira, Diogo Tavares, Diogo Glória-Silva, David Semedo, João Magalhães
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.26516
Pdf URL: https://arxiv.org/pdf/2603.26516
Copy Paste: [[2603.26516]] ALBA: A European Portuguese Benchmark for Evaluating Language and Linguistic Dimensions in Generative LLMs(https://arxiv.org/abs/2603.26516)
Keywords: language model, llm
Abstract: As Large Language Models (LLMs) expand across multilingual domains, evaluating their performance in under-represented languages becomes increasingly important. European Portuguese (pt-PT) is particularly affected, as existing training data and benchmarks are mainly in Brazilian Portuguese (pt-BR). To address this, we introduce ALBA, a linguistically grounded benchmark designed from the ground up to assess LLM proficiency in linguistics-related tasks in pt-PT across eight linguistic dimensions: Language Variety, Culture-bound Semantics, Discourse Analysis, Word Plays, Syntax, Morphology, Lexicology, and Phonetics and Phonology. ALBA is manually constructed by language experts and paired with an LLM-as-a-judge framework for scalable evaluation of pt-PT generated language. Experiments on a diverse set of models reveal performance variability across linguistic dimensions, highlighting the need for comprehensive, variety-sensitive benchmarks that support further development of tools in pt-PT.

Title: How Open Must Language Models be to Enable Reliable Scientific Inference?

Authors: James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor, Cameron R. Jones, Sean Trott, Roger P. Levy, Benjamin K. Bergen, Micah Altman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.26539
Pdf URL: https://arxiv.org/pdf/2603.26539
Copy Paste: [[2603.26539]] How Open Must Language Models be to Enable Reliable Scientific Inference?(https://arxiv.org/abs/2603.26539)
Keywords: language model
Abstract: How does the extent to which a model is open or closed impact the scientific inferences that can be drawn from research that involves it? In this paper, we analyze how restrictions on information about model construction and deployment threaten reliable inference. We argue that current closed models are generally ill-suited for scientific purposes, with some notable exceptions, and discuss ways in which the issues they present to reliable inference can be resolved or mitigated. We recommend that when models are used in research, potential threats to inference should be systematically identified along with the steps taken to mitigate them, and that specific justifications for model selection should be provided.

Title: Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model

Authors: Maria Kefala, Jeffery L. Painter, Syed Tauhid Bukhari, Maurizio Sessa
Subjects: cs.CL, q-bio.QM
Abstract URL: https://arxiv.org/abs/2603.26544
Pdf URL: https://arxiv.org/pdf/2603.26544
Copy Paste: [[2603.26544]] Development of a European Union Time-Indexed Reference Dataset for Assessing the Performance of Signal Detection Methods in Pharmacovigilance using a Large Language Model(https://arxiv.org/abs/2603.26544)
Keywords: language model
Abstract: Background: The identification of optimal signal detection methods is hindered by the lack of reliable reference datasets. Existing datasets do not capture when adverse events (AEs) are officially recognized by regulatory authorities, preventing restriction of analyses to pre-confirmation periods and limiting evaluation of early detection performance. This study addresses this gap by developing a time-indexed reference dataset for the European Union (EU), incorporating the timing of AE inclusion in product labels along with regulatory metadata. Methods: Current and historical Summaries of Product Characteristics (SmPCs) for all centrally authorized products (n=1,513) were retrieved from the EU Union Register of Medicinal Products (data lock: 15 December 2025). Section 4.8 was extracted and processed using DeepSeek V3 to identify AEs. Regulatory metadata, including labelling changes, were programmatically extracted. Time indexing was based on the date of AE inclusion in the SmPC. Results: The database includes 17,763 SmPC versions spanning 1995-2025, comprising 125,026 drug-AE associations. The time-indexed reference dataset, restricted to active products, included 1,479 medicinal products and 110,823 drug-AE associations. Most AEs were identified pre-marketing (74.5%) versus post-marketing (25.5%). Safety updates peaked around 2012. Gastrointestinal, skin, and nervous system disorders were the most represented System Organ Classes. Drugs had a median of 48 AEs across 14 SOCs. Conclusions: The proposed dataset addresses a critical gap in pharmacovigilance by incorporating temporal information on AE recognition for the EU, supporting more accurate assessment of signal detection performance and facilitating methodological comparisons across analytical approaches.

Title: MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

Authors: Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26557
Pdf URL: https://arxiv.org/pdf/2603.26557
Copy Paste: [[2603.26557]] MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference(https://arxiv.org/abs/2603.26557)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining high answer quality comparable to the strong model baseline.
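
The serving pattern described in this abstract (answer reuse, memory growth, and selective escalation) can be sketched roughly as below. This is an illustrative toy, not the MemBoost implementation: the class name, both thresholds, the string-similarity lookup via `difflib`, and the assumption that the small model returns a confidence score are all hypothetical. The retrieval of supporting information for the small model, which the abstract also mentions, is omitted.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple


class MemoryRouter:
    """Toy cost-aware router: reuse stored answers, else try a small model, else escalate."""

    def __init__(
        self,
        small_model: Callable[[str], Tuple[str, float]],  # returns (answer, confidence)
        large_model: Callable[[str], str],
        reuse_threshold: float = 0.9,        # hypothetical similarity cutoff for reuse
        confidence_threshold: float = 0.7,   # hypothetical cutoff below which we escalate
    ) -> None:
        self.small_model = small_model
        self.large_model = large_model
        self.reuse_threshold = reuse_threshold
        self.confidence_threshold = confidence_threshold
        self.memory: List[Tuple[str, str]] = []  # (query, answer) pairs, grows over time
        self.large_calls = 0

    def _best_match(self, query: str) -> Tuple[float, str]:
        """Return (similarity, answer) of the most similar stored query."""
        best = (0.0, "")
        for past_query, past_answer in self.memory:
            score = SequenceMatcher(None, query.lower(), past_query.lower()).ratio()
            if score > best[0]:
                best = (score, past_answer)
        return best

    def answer(self, query: str) -> str:
        score, cached = self._best_match(query)
        if score >= self.reuse_threshold:
            return cached                          # near-duplicate query: no model call
        answer, confidence = self.small_model(query)
        if confidence < self.confidence_threshold:
            answer = self.large_model(query)       # uncertain: escalate to the strong model
            self.large_calls += 1
        self.memory.append((query, answer))        # continual memory growth
        return answer


# Stub models standing in for real LLM calls.
def small(query: str) -> Tuple[str, float]:
    return "small:" + query, 0.9 if "easy" in query else 0.2


def large(query: str) -> str:
    return "large:" + query


router = MemoryRouter(small, large)
first = router.answer("easy question about tokens")
second = router.answer("hard question about proofs")
third = router.answer("Easy Question About Tokens")  # reused from memory
print(first, second, third, router.large_calls, sep=" | ")
```

A production system would more plausibly use embedding-based retrieval than character-level similarity, and would need a policy for stale or incorrect stored answers; both are left out to keep the sketch short.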

Title: Weight Tying Biases Token Embeddings Towards the Output Space

Authors: Antonio Lopardo, Avyukth Harish, Catherine Arnett, Akshat Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.26663
Pdf URL: https://arxiv.org/pdf/2603.26663
Copy Paste: [[2603.26663]] Weight Tying Biases Token Embeddings Towards the Output Space(https://arxiv.org/abs/2603.26663)
Keywords: language model, llm
Abstract: Weight tying, i.e. sharing parameters between input and output embedding matrices, is common practice in language model design, yet its impact on the learned embedding space remains poorly understood. In this paper, we show that tied embedding matrices align more closely with output (unembedding) matrices than with input embeddings of comparable untied models, indicating that the shared matrix is shaped primarily for output prediction rather than input representation. This unembedding bias arises because output gradients dominate early in training. Using tuned lens analysis, we show this negatively affects early-layer computations, which contribute less effectively to the residual stream. Scaling input gradients during training reduces this bias, providing causal evidence for the role of gradient imbalance. This is mechanistic evidence that weight tying optimizes the embedding matrix for output prediction, compromising its role in input representation. These results help explain why weight tying can harm performance at scale and have implications for training smaller LLMs, where the embedding matrix contributes substantially to total parameter count.
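
To make the notion of weight tying concrete, the following sketch shows one matrix serving as both the input embedding table and the output (unembedding) projection, alongside an untied variant. It illustrates the parameter sharing only, not the paper's gradient or tuned-lens analysis. The class and method names are hypothetical, and the transformer layers are replaced by an identity map to keep the example self-contained.

```python
import random
from typing import List

Matrix = List[List[float]]


def make_matrix(rows: int, cols: int, seed: int) -> Matrix:
    """Small random matrix with one row per vocabulary item."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


class TinyLM:
    """Embedding lookup followed directly by the output projection (no layers in between)."""

    def __init__(self, vocab_size: int, dim: int, tied: bool) -> None:
        self.embed = make_matrix(vocab_size, dim, seed=0)
        # Tied: the output projection is the same object as the input table.
        # Untied: it is a separate matrix with its own parameters.
        self.unembed = self.embed if tied else make_matrix(vocab_size, dim, seed=1)

    def logits(self, token_id: int) -> List[float]:
        hidden = self.embed[token_id]  # stand-in for the transformer stack (identity)
        return [dot(hidden, row) for row in self.unembed]

    def embedding_params(self) -> int:
        """Count parameters in the embedding matrices, counting a shared matrix once."""
        unique = {id(m): m for m in (self.embed, self.unembed)}
        return sum(len(m) * len(m[0]) for m in unique.values())


tied = TinyLM(vocab_size=1000, dim=16, tied=True)
untied = TinyLM(vocab_size=1000, dim=16, tied=False)

print(tied.embedding_params(), untied.embedding_params())  # 16000 32000
```

Because the tied model has a single matrix, any training update driven by the output loss also moves the input embeddings, which is the coupling whose effects the abstract analyzes. The parameter saving shown in the printout is the reason tying is attractive for smaller models with large vocabularies.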