2026-03-05

Title: AriadneMem: Threading the Maze of Lifelong Memory for LLM Agents

Authors: Wenhui Zhu, Xiwen Chen, Zhipeng Wang, Jingjing Wang, Xuanzhao Dong, Minzhou Huang, Rui Cai, Hejian Sang, Hao Wang, Peijie Qiu, Yueyue Deng, Prayag Tiwari, Brendan Hogan Rappazzo, Yalin Wang
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03290
Pdf URL: https://arxiv.org/pdf/2603.03290
Copy Paste: [[2603.03290]] AriadneMem: Threading the Maze of Lifelong Memory for LLM Agents(https://arxiv.org/abs/2603.03290)
Keywords: gpt, llm, agent
Abstract: Long-horizon LLM agents require memory systems that remain accurate under fixed context budgets. However, existing systems struggle with two persistent challenges in long-term dialogue: (i) disconnected evidence, where multi-hop answers require linking facts distributed across time, and (ii) state updates, where evolving information (e.g., schedule changes) creates conflicts with older static logs. We propose AriadneMem, a structured memory system that addresses these failure modes via a decoupled two-phase pipeline. In the offline construction phase, AriadneMem employs entropy-aware gating to filter noise and low-information messages before LLM extraction and applies conflict-aware coarsening to merge static duplicates while preserving state transitions as temporal edges. In the online reasoning phase, rather than relying on expensive iterative planning, AriadneMem executes algorithmic bridge discovery to reconstruct missing logical paths between retrieved facts, followed by single-call topology-aware synthesis. On LoCoMo experiments with GPT-4o, AriadneMem improves Multi-Hop F1 by 15.2% and Average F1 by 9.0% over strong baselines. Crucially, by offloading reasoning to the graph layer, AriadneMem reduces total runtime by 77.8% using only 497 context tokens. The code is available at this https URL.
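Illustrative sketch (not from the paper): the abstract does not say how entropy-aware gating is computed. One simple reading, assumed here, is to score each message by the Shannon entropy of its token distribution and drop messages below a threshold before any LLM extraction call. The function names, the whitespace tokenizer, and the `min_entropy` value are all hypothetical.

```python
import math
from collections import Counter


def token_entropy(message: str) -> float:
    """Shannon entropy (in bits) of the whitespace-token distribution of a message."""
    tokens = message.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def gate_messages(messages, min_entropy=1.5):
    """Keep messages whose entropy reaches the threshold (hypothetical value)."""
    return [m for m in messages if token_entropy(m) >= min_entropy]


if __name__ == "__main__":
    dialogue = ["ok ok ok", "", "meeting moved to friday at noon"]
    # Repetitive or empty messages score 0 bits and are filtered out.
    print(gate_messages(dialogue))
```

A real system would likely combine such a score with other signals; the sketch only shows the gating step running ahead of the more expensive extraction step.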

Title: One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models

Authors: Daniel Fein, Max Lamparth, Violet Xiang, Mykel J. Kochenderfer, Nick Haber
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03291
Pdf URL: https://arxiv.org/pdf/2603.03291
Copy Paste: [[2603.03291]] One Bias After Another: Mechanistic Reward Shaping and Persistent Biases in Language Reward Models(https://arxiv.org/abs/2603.03291)
Keywords: language model
Abstract: Reward Models (RMs) are crucial for online alignment of language models (LMs) with human preferences. However, RM-based preference-tuning is vulnerable to reward hacking, whereby LM policies learn undesirable behaviors from flawed RMs. By systematically measuring biases in five high-quality RMs, including the state-of-the-art, we find that issues with respect to length, sycophancy, and overconfidence persist despite prior work. We also discover new issues related to bias toward model-specific styles and answer order. We categorize RM failures by complexity and propose a simple post-hoc intervention to mitigate low-complexity biases that arise from spurious correlations. Our proposed mechanistic reward shaping reduces targeted biases without degrading reward quality while using minimal labeled data. The method is model-internal, extensible to new biases, and generalizes out-of-distribution.

Title: From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG

Authors: Wenhao Wu, Zhentao Tang, Yafu Li, Shixiong Kai, Mingxuan Yuan, Zhenhong Sun, Chunlin Chen, Zhi Wang
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.03292
Pdf URL: https://arxiv.org/pdf/2603.03292
Copy Paste: [[2603.03292]] From Conflict to Consensus: Boosting Medical Reasoning via Multi-Round Agentic RAG(https://arxiv.org/abs/2603.03292)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Large Language Models (LLMs) exhibit high reasoning capacity in medical question-answering, but their tendency to produce hallucinations and outdated knowledge poses critical risks in healthcare fields. While Retrieval-Augmented Generation (RAG) mitigates these issues, existing methods rely on noisy token-level signals and lack the multi-round refinement required for complex reasoning. In this paper, we propose **MA-RAG** (**M**ulti-Round **A**gentic RAG), a framework that facilitates test-time scaling for complex medical reasoning by iteratively evolving both external evidence and internal reasoning history within an agentic refinement loop. At each round, the agent transforms semantic **conflict** among candidate responses into actionable queries to retrieve external evidence, while optimizing historical reasoning traces to mitigate long-context degradation. MA-RAG extends the *self-consistency* principle by leveraging the lack of consistency as a proactive signal for multi-round agentic reasoning and retrieval, and mirrors a *boosting* mechanism that iteratively minimizes the residual error toward a stable, high-fidelity medical **consensus**. Extensive evaluations across 7 medical Q&A benchmarks show that MA-RAG consistently surpasses competitive inference-time scaling and RAG baselines, delivering a **substantial +6.8-point** gain in average accuracy over the backbone model. Our code is available at this https URL.

Title: SE-Search: Self-Evolving Search Agent via Memory and Dense Reward

Authors: Jian Li, Yizhang Jin, Dongqi Liu, Hang Ding, Jiafu Wu, Dongsheng Chen, Yunhang Shen, Yulei Qin, Ying Tai, Chengjie Wang, Xiaotong Yuan, Yabiao Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03293
Pdf URL: https://arxiv.org/pdf/2603.03293
Copy Paste: [[2603.03293]] SE-Search: Self-Evolving Search Agent via Memory and Dense Reward(https://arxiv.org/abs/2603.03293)
Keywords: language model, llm, hallucination, retrieval augmented generation, agent
Abstract: Retrieval augmented generation (RAG) reduces hallucinations and factual errors in large language models (LLMs) by conditioning generation on retrieved external knowledge. Recent search agents further cast RAG as an autonomous, multi-turn information-seeking process. However, existing methods often accumulate irrelevant or noisy documents and rely on sparse reinforcement learning signals. We propose SE-Search, a Self-Evolving Search agent that improves online search behavior through three components: memory purification, atomic query training, and dense rewards. SE-Search follows a Think-Search-Memorize strategy that retains salient evidence while filtering irrelevant content. Atomic query training promotes shorter and more diverse queries, improving evidence acquisition. Dense rewards provide fine-grained feedback that speeds training. Experiments on single-hop and multi-hop question answering benchmarks show that SE-Search-3B outperforms strong baselines, yielding a 10.8 point absolute improvement and a 33.8% relative gain over Search-R1. We will make the code and model weights publicly available upon acceptance.

Title: Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory

Authors: Sanyam Singh, Naga Ganesh, Vineet Singh, Lakshmi Pedapudi, Ritesh Kumar, SSP Jyothi, Archana Karanam, C. Yashoda, Mettu Vijaya Rekha Reddy, Shesha Phani Debbesa, Chandan Dash
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03294
Pdf URL: https://arxiv.org/pdf/2603.03294
Copy Paste: [[2603.03294]] Fine-Tuning and Evaluating Conversational AI for Agricultural Advisory(https://arxiv.org/abs/2603.03294)
Keywords: language model, llm, prompt, chat
Abstract: Large Language Models show promise for agricultural advisory, yet vanilla models produce unsupported recommendations, generic advice lacking specific, actionable detail, and communication styles misaligned with smallholder farmer needs. In high-stakes agricultural contexts, where recommendation accuracy has direct consequences for farmer outcomes, these limitations pose challenges for responsible deployment. We present a hybrid LLM architecture that decouples factual retrieval from conversational delivery: supervised fine-tuning with LoRA on expert-curated GOLDEN FACTS (atomic, verified units of agricultural knowledge) optimizes fact recall, while a separate stitching layer transforms retrieved facts into culturally appropriate, safety-aware responses. Our evaluation framework, DG-EVAL, performs atomic fact verification (measuring recall, precision, and contradiction detection) against expert-curated ground truth rather than Wikipedia or retrieved documents. Experiments across multiple model configurations on crops and queries from Bihar, India show that fine-tuning on curated data substantially improves fact recall and F1 while maintaining high relevance. A fine-tuned smaller model achieves comparable or better factual quality at a fraction of the cost of frontier models. A stitching layer further improves safety subscores while maintaining high conversational quality. We release the farmerchat-prompts library to enable reproducible development of domain-specific agricultural AI.

Title: Language Model Goal Selection Differs from Humans' in an Open-Ended Task

Authors: Gaia Molinaro, Dave August, Danielle Perszyk, Anne G. E. Collins
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2603.03295
Pdf URL: https://arxiv.org/pdf/2603.03295
Copy Paste: [[2603.03295]] Language Model Goal Selection Differs from Humans' in an Open-Ended Task(https://arxiv.org/abs/2603.03295)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: As large language models (LLMs) get integrated into human decision-making, they are increasingly choosing goals autonomously rather than only completing human-defined ones, assuming they will reflect human preferences. However, human-LLM similarity in goal selection remains largely untested. We directly assess the validity of LLMs as proxies for human goal selection in a controlled, open-ended learning task borrowed from cognitive science. Across four state-of-the-art models (GPT-5, Gemini 2.5 Pro, Claude Sonnet 4.5, and Centaur), we find substantial divergence from human behavior. While people gradually explore and learn to achieve goals with diversity across individuals, most models exploit a single identified solution (reward hacking) or show surprisingly low performance, with distinct patterns across models and little variability across instances of the same model. Even Centaur, explicitly trained to emulate humans in experimental settings, poorly captures people's goal selection. Chain-of-thought reasoning and persona steering provide limited improvements. These findings highlight the uniqueness of human goal selection, cautioning against replacing it with current models in applications such as personal assistance, scientific discovery, and policy research.

Title: PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents

Authors: Ke Yang, Zixi Chen, Xuan He, Jize Jiang, Michel Galley, Chenglong Wang, Jianfeng Gao, Jiawei Han, ChengXiang Zhai
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.03296
Pdf URL: https://arxiv.org/pdf/2603.03296
Copy Paste: [[2603.03296]] PlugMem: A Task-Agnostic Plugin Memory Module for LLM Agents(https://arxiv.org/abs/2603.03296)
Keywords: language model, llm, agent
Abstract: Long-term memory is essential for large language model (LLM) agents operating in complex environments, yet existing memory designs are either task-specific and non-transferable, or task-agnostic but less effective due to low task-relevance and context explosion from raw memory retrieval. We propose PlugMem, a task-agnostic plugin memory module that can be attached to arbitrary LLM agents without task-specific redesign. Motivated by the fact that decision-relevant information is concentrated as abstract knowledge rather than raw experience, we draw on cognitive science to structure episodic memories into a compact, extensible knowledge-centric memory graph that explicitly represents propositional and prescriptive knowledge. This representation enables efficient memory retrieval and reasoning over task-relevant knowledge, rather than verbose raw trajectories, and departs from other graph-based methods like GraphRAG by treating knowledge as the unit of memory access and organization instead of entities or text chunks. We evaluate PlugMem unchanged across three heterogeneous benchmarks (long-horizon conversational question answering, multi-hop knowledge retrieval, and web agent tasks). The results show that PlugMem consistently outperforms task-agnostic baselines and exceeds task-specific memory designs, while also achieving the highest information density under a unified information-theoretic analysis. Code and data are available at this https URL.

Title: TTSR: Test-Time Self-Reflection for Continual Reasoning Improvement

Authors: Haoyang He, Zihua Rong, Liangjie Zhao, Yunjia Zhao, Lan Yang, Honggang Zhang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03297
Pdf URL: https://arxiv.org/pdf/2603.03297
Copy Paste: [[2603.03297]] TTSR: Test-Time Self-Reflection for Continual Reasoning Improvement(https://arxiv.org/abs/2603.03297)
Keywords: language model, llm
Abstract: Test-time training enables model adaptation using only test questions and offers a promising paradigm for improving the reasoning ability of large language models (LLMs). However, it faces two major challenges: test questions are often highly difficult, making self-generated pseudo-labels unreliable, and existing methods lack effective mechanisms to adapt to a model's specific reasoning weaknesses, leading to inefficient learning. To address these issues, we propose TTSR, a self-reflective test-time self-evolving training framework. TTSR employs a single pretrained language model that alternates between the roles of a Student and a Teacher at test time. The Student focuses on solving problems and learning from synthesized variant questions, while the Teacher analyzes the Student's failed reasoning trajectories, summarizes recurring reasoning weaknesses, and synthesizes targeted variant questions accordingly. This process guides the model to improve within a learnable regime through a continual self-evolving loop. Experimental results on multiple challenging mathematical reasoning benchmarks show that TTSR consistently improves reasoning performance and generalizes well across different model backbones and general-domain reasoning tasks. These findings suggest that teacher-mediated self-reflection provides an effective pathway for stable and continual reasoning improvement at test time.

Title: TATRA: Training-Free Instance-Adaptive Prompting Through Rephrasing and Aggregation

Authors: Bartosz Dziuba, Kacper Kuchta, Paweł Batorski, Przemysław Spurek, Paul Swoboda
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03298
Pdf URL: https://arxiv.org/pdf/2603.03298
Copy Paste: [[2603.03298]] TATRA: Training-Free Instance-Adaptive Prompting Through Rephrasing and Aggregation(https://arxiv.org/abs/2603.03298)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have improved substantially in alignment, yet their behavior remains highly sensitive to prompt phrasing. This brittleness has motivated automated prompt engineering, but most existing methods (i) require a task-specific training set, (ii) rely on expensive iterative optimization to produce a single dataset-level prompt, and (iii) must be rerun from scratch for each new task. We introduce TATRA, a dataset-free prompting method that constructs instance-specific few-shot prompts by synthesizing on-the-fly examples to accompany a user-provided instruction. TATRA requires no labeled training data and avoids task-specific optimization loops, while retaining the benefits of demonstration-based prompting. Across standard text classification benchmarks, TATRA matches or improves over strong prompt-optimization baselines that depend on training data and extensive search. On mathematical reasoning benchmarks, TATRA achieves state-of-the-art performance on GSM8K and DeepMath, outperforming methods that explicitly optimize prompts on those tasks. Our results suggest that per-instance construction of effective in-context examples is more important than running long, expensive optimization loops to produce a single prompt per task. We will make all code publicly available upon acceptance of the paper. Code is available at this https URL.

Title: How LLMs Cite and Why It Matters: A Cross-Model Audit of Reference Fabrication in AI-Assisted Academic Writing and Methods to Detect Phantom Citations

Authors: MZ Naser
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03299
Pdf URL: https://arxiv.org/pdf/2603.03299
Copy Paste: [[2603.03299]] How LLMs Cite and Why It Matters: A Cross-Model Audit of Reference Fabrication in AI-Assisted Academic Writing and Methods to Detect Phantom Citations(https://arxiv.org/abs/2603.03299)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) have been noted to fabricate scholarly citations, yet the scope of this behavior across providers, domains, and prompting conditions remains poorly quantified. We present one of the largest citation hallucination audits to date, in which 10 commercially deployed LLMs were prompted across four academic domains, generating 69,557 citation instances verified against three scholarly databases (namely, CrossRef, OpenAlex, and Semantic Scholar). Our results show that the observed hallucination rates span a fivefold range (between 11.4% and 56.8%) and are strongly shaped by model, domain, and prompt framing. Our results also show that no model spontaneously generates citations when unprompted, which seems to establish hallucination as prompt-induced rather than intrinsic. We identify two practical filters: 1) multi-model consensus (when more than 3 LLMs cite the same work, accuracy reaches 95.6%, a 5.8-fold improvement), and 2) within-prompt repetition (with more than 2 replications, accuracy reaches 88.9%). In addition, we present findings on generational model tracking, which reveal that improvements are not guaranteed when deploying newer LLMs, and on capacity scaling, which appears to reduce hallucination within model families. Finally, a lightweight classifier trained solely on bibliographic string features is developed to distinguish hallucinated citations from verified citations, achieving AUC 0.876 in cross-validation and 0.834 in LOMO generalization (without querying any external database). This classifier offers a pre-screening tool deployable at inference time.
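Illustrative sketch (not from the paper): the multi-model consensus filter described above can be approximated by counting how many distinct models produce the same citation and keeping only those above a cutoff. The abstract's "more than 3 LLMs" is read here as at least 4 models. The `normalize` helper, the input layout, and the model names are assumptions made for the example. The paper's matching procedure is not specified in the abstract.

```python
from collections import defaultdict


def normalize(citation: str) -> str:
    """Crude matching key: lowercase and collapse whitespace (assumed, not the paper's rule)."""
    return " ".join(citation.lower().split())


def consensus_filter(citations_by_model, min_models=4):
    """Return normalized citations produced by at least `min_models` distinct models."""
    supporters = defaultdict(set)
    for model, citations in citations_by_model.items():
        for citation in citations:
            supporters[normalize(citation)].add(model)
    return sorted(key for key, models in supporters.items() if len(models) >= min_models)


if __name__ == "__main__":
    outputs = {
        "model_a": ["Smith 2020  Deep Nets", "Lee 2019 Graphs"],
        "model_b": ["smith 2020 deep nets"],
        "model_c": ["Smith 2020 Deep Nets", "Lee 2019 Graphs"],
        "model_d": ["Smith 2020 deep nets"],
    }
    print(consensus_filter(outputs))
```

Using a set per citation means repeated mentions by the same model count once, which keeps the cross-model consensus signal separate from the within-prompt repetition signal.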

Title: Benchmarking Legal RAG: The Promise and Limits of AI Statutory Surveys

Authors: Mohamed Afane, Emaan Hariri, Derek Ouyang, Daniel E. Ho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03300
Pdf URL: https://arxiv.org/pdf/2603.03300
Copy Paste: [[2603.03300]] Benchmarking Legal RAG: The Promise and Limits of AI Statutory Surveys(https://arxiv.org/abs/2603.03300)
Keywords: retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) offers significant potential for legal AI, yet systematic benchmarks are sparse. Prior work introduced LaborBench to benchmark RAG models based on ostensible ground truth from an exhaustive, multi-month, manual enumeration of all U.S. state unemployment insurance requirements by U.S. Department of Labor (DOL) attorneys. That prior work found poor performance of standard RAG (70% accuracy on Boolean tasks). Here, we assess three emerging tools not previously evaluated on LaborBench: the Statutory Research Assistant (STARA), which is a custom statutory research tool, and two commercial tools from Westlaw and LexisNexis that market AI statutory survey capabilities. We make five main contributions. First, we show that STARA achieves substantial performance gains, boosting accuracy to 83%. Second, we show that commercial platforms fare poorly, with accuracy of 58% (Westlaw AI) and 64% (Lexis+ AI), even worse than standard RAG. Third, we conduct a comprehensive error analysis, comparing our outputs to those compiled by DOL attorneys, and document both reasoning errors, such as confusion between related legal concepts and misinterpretation of statutory exceptions, and retrieval failures, where relevant statutory provisions are not captured. Fourth, we discover that many apparent errors are actually significant omissions by DOL attorneys themselves, such that STARA's actual accuracy is 92%. Fifth, we chart the path forward for legal RAG through concrete design principles, offering actionable guidance for building AI systems capable of accurate multi-jurisdictional legal research.

Title: From Exact Hits to Close Enough: Semantic Caching for LLM Embeddings

Authors: Dvir David Biton, Roy Friedman
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03301
Pdf URL: https://arxiv.org/pdf/2603.03301
Copy Paste: [[2603.03301]] From Exact Hits to Close Enough: Semantic Caching for LLM Embeddings(https://arxiv.org/abs/2603.03301)
Keywords: language model, llm
Abstract: The rapid adoption of large language models (LLMs) has created demand for faster responses and lower costs. Semantic caching, which reuses responses to semantically similar requests via their embeddings, addresses this need but breaks classic cache assumptions and raises new challenges. In this paper, we explore offline policies for semantic caching, prove that implementing an optimal offline policy is NP-hard, and propose several polynomial-time heuristics. We also present online semantic-aware cache policies that combine recency, frequency, and locality. Evaluations on diverse datasets show that while frequency-based policies are strong baselines, our novel variant improves semantic accuracy. Our findings reveal effective strategies for current systems and highlight substantial headroom for future innovation. All code is open source.
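Illustrative sketch (not from the paper): a semantic cache differs from an exact-match cache in that a lookup succeeds when the query embedding is close enough to a stored one. The minimal version below uses cosine similarity with a fixed threshold and a linear scan. The class name, the threshold value, and the absence of any eviction policy are assumptions for illustration. The recency, frequency, and locality policies studied in the paper are not reproduced here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Toy cache: a hit is any stored entry with similarity >= threshold."""

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self.entries.append((list(embedding), response))

    def get(self, embedding):
        """Return the response of the most similar entry, or None on a miss."""
        best, best_sim = None, self.threshold
        for stored, response in self.entries:
            sim = cosine(embedding, stored)
            if sim >= best_sim:
                best, best_sim = response, sim
        return best


if __name__ == "__main__":
    cache = SemanticCache(threshold=0.9)
    cache.put([1.0, 0.0], "cached answer")
    print(cache.get([0.99, 0.1]))  # near-duplicate query: hit
    print(cache.get([0.0, 1.0]))   # unrelated query: miss
```

Because a hit no longer depends on key equality, one stored entry can serve many distinct requests, which is one way classic cache assumptions stop holding.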

Title: Developing an AI Assistant for Knowledge Management and Workforce Training in State DOTs

Authors: Divija Amaram, Lu Gao, Gowtham Reddy Gudla, Tejaswini Sanjay Katale
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.03302
Pdf URL: https://arxiv.org/pdf/2603.03302
Copy Paste: [[2603.03302]] Developing an AI Assistant for Knowledge Management and Workforce Training in State DOTs(https://arxiv.org/abs/2603.03302)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Effective knowledge management is critical for preserving institutional expertise and improving the efficiency of workforce training in state transportation agencies. Traditional approaches, such as static documentation, classroom-based instruction, and informal mentorship, often lead to fragmented knowledge transfer, inefficiencies, and the gradual loss of expertise as senior engineers retire. Moreover, given the enormous volume of technical manuals, guidelines, and research reports maintained by these agencies, it is increasingly challenging for engineers to locate relevant information quickly and accurately when solving field problems or preparing for training tasks. These limitations hinder timely decision-making and create steep learning curves for new personnel in maintenance and construction operations. To address these challenges, this paper proposes a Retrieval-Augmented Generation (RAG) framework with a multi-agent architecture to support knowledge management and decision making. The system integrates structured document retrieval with real-time, context-aware response generation powered by a large language model (LLM). Unlike conventional single-pass RAG systems, the proposed framework employs multiple specialized agents for retrieval, answer generation, evaluation, and query refinement, which enables iterative improvement and quality control. In addition, the system incorporates an open-weight vision-language model to convert technical figures into semantic textual representations, which allows figure-based knowledge to be indexed and retrieved alongside text. Retrieved text and figure-based context are then provided to an open-weight large language model, which generates the final responses grounded in the retrieved evidence.

Title: HumanLM: Simulating Users with State Alignment Beats Response Imitation

Authors: Shirley Wu, Evelyn Choi, Arpandeep Khatua, Zhanghan Wang, Joy He-Yueya, Tharindu Cyril Weerasooriya, Wei Wei, Diyi Yang, Jure Leskovec, James Zou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03303
Pdf URL: https://arxiv.org/pdf/2603.03303
Copy Paste: [[2603.03303]] HumanLM: Simulating Users with State Alignment Beats Response Imitation(https://arxiv.org/abs/2603.03303)
Keywords: language model, llm, chat
Abstract: Large Language Models (LLMs) are increasingly used to simulate how specific users respond to a given context, enabling more user-centric applications that rely on user feedback. However, existing user simulators mostly imitate surface-level patterns and language styles, which fail to reflect the underlying states of real users (e.g., beliefs and emotions). To address these limitations, we propose a novel training framework, HumanLM, which builds user simulators that accurately reflect real users. Our key insight is that, in addition to generating responses, the model should generate natural-language latent states that align with ground-truth responses through reinforcement learning. These latent states correspond to a set of psychologically grounded state dimensions that drive how real users respond. HumanLM further synthesizes these aligned latent states into responses that accurately represent real users. For extensive evaluation, we develop Humanual, a comprehensive benchmark for simulating real users based on public data. Humanual consists of six large-scale datasets with 26k users and 216k responses in total, spanning diverse tasks such as generating user responses to daily life issues, political blogs, and chat sessions with LLM assistants. Across datasets, HumanLM significantly outperforms alternative approaches, achieving an average relative improvement of 16.3% in alignment scores from an LLM judge. In a real-time simulation study with 111 participants, HumanLM achieves the highest similarity to real user responses and competitive human-likeness scores.

Title: Draft-Conditioned Constrained Decoding for Structured Generation in LLMs

Authors: Avinash Reddy, Thayne T. Walker, James S. Ide, Amrit Singh Bedi
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03305
Pdf URL: https://arxiv.org/pdf/2603.03305
Copy Paste: [[2603.03305]] Draft-Conditioned Constrained Decoding for Structured Generation in LLMs(https://arxiv.org/abs/2603.03305)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly used to generate executable outputs, JSON objects, and API calls, where a single syntax error can make the output unusable. Constrained decoding enforces validity token-by-token via masking and renormalization, but it can distort generation when the model assigns low probability mass to valid continuations, pushing decoding toward locally valid yet semantically incorrect trajectories. We propose Draft-Conditioned Constrained Decoding (DCCD), a simple two-step, training-free inference procedure that decouples semantic planning from structural enforcement: an unconstrained draft is generated first, and constrained decoding is then applied, conditioned on this draft, to guarantee validity. We analyze DCCD through a KL-projection view, showing that draft conditioning increases feasible mass and reduces the cumulative "projection tax" induced by hard constraints, with an optional best-of-K draft selection. Across structured reasoning benchmarks, DCCD improves strict structured accuracy by up to +24 percentage points over standard constrained decoding (e.g., 15.2% to 39.0% on GSM8K with a 1B model), and enables smaller model pairs to match or exceed much larger constrained baselines, yielding substantial gains in parameter efficiency.
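
The masking-and-renormalization step that the DCCD abstract attributes to standard constrained decoding can be sketched as below. This is only an illustration under stated assumptions, not the authors' implementation: the function name `constrained_step`, the boolean `valid_mask`, and the toy logits are all hypothetical. The returned "feasible mass" is the probability the unconstrained model placed on valid tokens, which is the quantity the abstract says draft conditioning increases.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def constrained_step(logits, valid_mask):
    """Mask invalid tokens and renormalize over the valid ones.

    Returns (renormalized probabilities, feasible mass). A small feasible
    mass means renormalization reshapes the distribution heavily.
    """
    probs = softmax(logits)
    feasible_mass = sum(p for p, ok in zip(probs, valid_mask) if ok)
    if feasible_mass == 0.0:
        raise ValueError("no valid continuation under the constraint")
    renorm = [p / feasible_mass if ok else 0.0
              for p, ok in zip(probs, valid_mask)]
    return renorm, feasible_mass

# Hypothetical 3-token vocabulary where the model's favorite token is invalid.
probs, mass = constrained_step([2.0, 0.0, -1.0], [False, True, True])
print(probs, mass)
```

In this toy case most of the model's probability sits on the masked token, so the feasible mass is small and the renormalized distribution differs sharply from the unconstrained one.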

Title: Token-Oriented Object Notation vs JSON: A Benchmark of Plain and Constrained Decoding Generation

Authors: Ivan Matveev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03306
Pdf URL: https://arxiv.org/pdf/2603.03306
Copy Paste: [[2603.03306]] Token-Oriented Object Notation vs JSON: A Benchmark of Plain and Constrained Decoding Generation(https://arxiv.org/abs/2603.03306)
Keywords: llm, prompt
Abstract: The recently presented Token-Oriented Object Notation (TOON) aims to replace JSON as a serialization format for passing structured data to LLMs with significantly reduced token usage. While TOON shows solid accuracy in LLM comprehension, it has not been tested against JSON for generation. Though never present in training data, TOON syntax is simple enough to suggest that one-shot in-context learning could support accurate generation. The inevitable prompt overhead can be an acceptable trade-off for shorter completions. To test this, we conducted a benchmark: we created several test cases of varying structural complexity, built a validation pipeline, and compared plain JSON generation, structured-output JSON generation (via constrained decoding), and TOON generation with one-shot in-context learning. JSON structured output was included to establish a minimum token budget baseline and to set a starting point for future experiments on enforcing TOON through constrained decoding at inference time. Key findings: TOON shows a promising accuracy-to-token-consumption ratio for in-domain generation tasks, though this advantage is often reduced by the "prompt tax" of instructional overhead in shorter contexts. Plain JSON generation shows the best one-shot and final accuracy, even compared with constrained decoding structured output, whose only significant advantage is the lowest token usage, traded for slightly decreased accuracy overall and significant degradation for some models. Notably, for simple structures, this "lowest token usage" of constrained decoding outperformed even TOON, hinting that enforcing TOON via frameworks such as xgrammar may not yield the desired results. Furthermore, the results suggest a scaling hypothesis: TOON's true efficiency potential likely follows a non-linear curve, shining only beyond a specific point where cumulative syntax savings amortize the initial prompt overhead.
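
The amortization argument in the TOON abstract (syntax savings must first pay back the one-shot prompt overhead) comes down to simple break-even arithmetic. The sketch below assumes a fixed prompt overhead and a constant per-item token saving; that linear cost model, the function name `break_even_items`, and every number are hypothetical, and the abstract itself suggests the real curve is non-linear.

```python
import math

def break_even_items(prompt_overhead, json_tokens_per_item, toon_tokens_per_item):
    """Smallest item count n at which
    prompt_overhead + n * toon_tokens_per_item <= n * json_tokens_per_item.

    Returns None when TOON saves nothing per item, so the overhead
    is never paid back.
    """
    saving = json_tokens_per_item - toon_tokens_per_item
    if saving <= 0:
        return None
    return math.ceil(prompt_overhead / saving)

# Hypothetical numbers: 300 tokens of instructions, 40 vs 25 tokens per item.
print(break_even_items(300, 40, 25))  # 20
```

Below the break-even count the prompt overhead dominates, which matches the "prompt tax" the abstract reports for shorter contexts.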

Title: Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

Authors: Adi Simhi, Fazl Barez, Martin Tutek, Yonatan Belinkov, Shay B. Cohen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03308
Pdf URL: https://arxiv.org/pdf/2603.03308
Copy Paste: [[2603.03308]] Old Habits Die Hard: How Conversational History Geometrically Traps LLMs(https://arxiv.org/abs/2603.03308)
Keywords: language model, llm, hallucination
Abstract: How does the conversational past of large language models (LLMs) influence their future performance? Recent work suggests that LLMs are affected by their conversational history in unexpected ways. For instance, hallucinations in prior interactions may influence subsequent model responses. In this work, we introduce History-Echoes, a framework that investigates how conversational history biases subsequent generations. The framework explores this bias from two perspectives: probabilistically, we model conversations as Markov chains to quantify state consistency; geometrically, we measure the consistency of consecutive hidden representations. Across three model families and six datasets spanning diverse phenomena, our analysis reveals a strong correlation between the two perspectives. By bridging these perspectives, we demonstrate that behavioral persistence manifests as a geometric trap, where gaps in the latent space confine the model's trajectory. Code available at this https URL.
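
The probabilistic view in the History-Echoes abstract (conversations modeled as Markov chains to quantify state consistency) can be illustrated with a first-order transition estimate. This sketch assumes each turn has already been given a discrete state label; the labels "F" (factual) and "H" (hallucinated), the function names, and the use of the self-transition probability as the consistency measure are assumptions, not the paper's definitions.

```python
from collections import Counter, defaultdict

def transition_probs(conversations):
    """Estimate first-order Markov transition probabilities from
    per-turn state sequences (one list of labels per conversation)."""
    counts = defaultdict(Counter)
    for states in conversations:
        for prev, nxt in zip(states, states[1:]):
            counts[prev][nxt] += 1
    return {
        s: {t: c / sum(nxts.values()) for t, c in nxts.items()}
        for s, nxts in counts.items()
    }

def persistence(probs, state):
    """Probability of staying in `state` on the next turn."""
    return probs.get(state, {}).get(state, 0.0)

# Hypothetical labeled conversations.
convs = [["F", "F", "H", "H", "H"], ["H", "H", "F", "F"]]
probs = transition_probs(convs)
print(persistence(probs, "H"))  # 0.75
```

A self-transition probability well above the state's overall frequency would indicate that earlier behavior tends to persist into later turns.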

Title: Combating data scarcity in recommendation services: Integrating cognitive types of VARK and neural network technologies (LLM)

Authors: Nikita Zmanovskii
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2603.03309
Pdf URL: https://arxiv.org/pdf/2603.03309
Copy Paste: [[2603.03309]] Combating data scarcity in recommendation services: Integrating cognitive types of VARK and neural network technologies (LLM)(https://arxiv.org/abs/2603.03309)
Keywords: language model, llm
Abstract: Cold start scenarios present fundamental obstacles to effective recommendation generation, particularly when dealing with users lacking interaction history or items with sparse metadata. This research proposes an innovative hybrid framework that leverages Large Language Models (LLMs) for content semantic analysis and knowledge graph development, integrated with cognitive profiling based on VARK (Visual, Auditory, Reading/Writing, Kinesthetic) learning preferences. The proposed system tackles multiple cold start dimensions: enriching inadequate item descriptions through LLM processing, generating user profiles from minimal data, and dynamically adjusting presentation formats based on cognitive assessment. The framework comprises six integrated components: semantic metadata enhancement, dynamic graph construction, VARK-based profiling, mental state estimation, graph-enhanced retrieval with LLM-powered ranking, and adaptive interface design with iterative learning. Experimental validation on the MovieLens-1M dataset demonstrates the system's capacity for personalized recommendation generation despite limited initial information. This work establishes groundwork for cognitively-aware recommendation systems capable of overcoming cold start limitations through semantic comprehension and psychological modeling, offering personalized, explainable recommendations from initial user contact.

Title: Entropic-Time Inference: Self-Organizing Large Language Model Decoding Beyond Attention

Authors: Andrew Kiruluta
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03310
Pdf URL: https://arxiv.org/pdf/2603.03310
Copy Paste: [[2603.03310]] Entropic-Time Inference: Self-Organizing Large Language Model Decoding Beyond Attention(https://arxiv.org/abs/2603.03310)
Keywords: language model, llm
Abstract: Modern large language model (LLM) inference engines optimize throughput and latency under fixed decoding rules, treating generation as a linear progression in token time. We propose a fundamentally different paradigm: entropic-time inference, where decoding is governed by the flow of uncertainty rather than token index. We introduce a self-organizing inference architecture that jointly couples scheduling, attention sparsification, and sampling temperature under a unified entropy control objective. Our method extends vLLM with entropy-aware scheduling, entropic pruning of paged attention blocks, and adaptive temperature control that stabilizes generation near a target entropy regime. This transforms inference into a resource-intelligent thermodynamic process that allocates computation where uncertainty reduction is maximized. We present a concrete systems design, pseudocode, and integration plan, demonstrating how entropy can serve as a first-class control signal for scalable LLM inference.
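
The idea of adaptive temperature control that holds generation near a target entropy can be sketched with a simple feedback loop. The abstract does not give the control law, so the multiplicative proportional update below is an assumption; the function name `adapt_temperature`, the gain, the clamping bounds, and the logits are likewise hypothetical. It relies only on the fact that the entropy of a softmax distribution rises with temperature.

```python
import math

def softmax(logits, temperature):
    # Temperature-scaled, numerically stable softmax.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    # Shannon entropy in nats.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def adapt_temperature(logits, target_entropy, temperature=1.0,
                      gain=0.5, steps=50, t_min=0.05, t_max=5.0):
    """Raise the temperature when entropy is below target and lower it
    when above (assumed proportional control in log-temperature)."""
    for _ in range(steps):
        error = target_entropy - entropy(softmax(logits, temperature))
        temperature *= math.exp(gain * error)
        temperature = min(t_max, max(t_min, temperature))
    return temperature

logits = [3.0, 1.0, 0.0, -1.0]  # hypothetical next-token logits
t = adapt_temperature(logits, target_entropy=1.0)
print(t, entropy(softmax(logits, t)))
```

In a real decoder the loop would run once per generated token rather than iterating to convergence on a single distribution; the fixed-point iteration here only shows that the controller settles at the target.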

Title: Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding

Authors: Yuchen Wang, Haonan Wang, Yu Guo, Honglong Yang, Xiaomeng Li
Subjects: cs.CL, cs.AI, cs.HC, eess.AS, q-bio.NC
Abstract URL: https://arxiv.org/abs/2603.03312
Pdf URL: https://arxiv.org/pdf/2603.03312
Copy Paste: [[2603.03312]] Escaping the BLEU Trap: A Signal-Grounded Framework with Decoupled Semantic Guidance for EEG-to-Text Decoding(https://arxiv.org/abs/2603.03312)
Keywords: language model, llm, hallucination, prompt
Abstract: Decoding natural language from non-invasive EEG signals is a promising yet challenging task. However, current state-of-the-art models remain constrained by three fundamental limitations: Semantic Bias (mode collapse into generic templates), Signal Neglect (hallucination based on linguistic priors rather than neural inputs), and the BLEU Trap, where evaluation metrics are artificially inflated by high-frequency stopwords, masking a lack of true semantic fidelity. To address these challenges, we propose SemKey, a novel multi-stage framework that enforces signal-grounded generation through four decoupled semantic objectives: sentiment, topic, length, and surprisal. We redesign the interaction between the neural encoder and the Large Language Model (LLM) by injecting semantic prompts as Queries and EEG embeddings as Key-Value pairs, strictly forcing the model to attend to neural inputs. Furthermore, we move beyond standard translation metrics by adopting N-way Retrieval Accuracy and Fréchet Distance to rigorously assess diversity and alignment. Extensive experiments demonstrate that our approach effectively eliminates hallucinations on noise inputs and achieves SOTA performance on these robust protocols. Code will be released upon acceptance at this https URL.

Title: How does fine-tuning improve sensorimotor representations in large language models?

Authors: Minghua Wu, Javier Conde, Pedro Reviriego, Marc Brysbaert
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03313
Pdf URL: https://arxiv.org/pdf/2603.03313
Copy Paste: [[2603.03313]] How does fine-tuning improve sensorimotor representations in large language models?(https://arxiv.org/abs/2603.03313)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit a significant "embodiment gap", where their text-based representations fail to align with human sensorimotor experiences. This study systematically investigates whether and how task-specific fine-tuning can bridge this gap. Utilizing Representational Similarity Analysis (RSA) and dimension-specific correlation metrics, we demonstrate that the internal representations of LLMs can be steered toward more embodied, grounded patterns through fine-tuning. Furthermore, the results show that while sensorimotor improvements generalize robustly across languages and related sensory-motor dimensions, they are highly sensitive to the learning objective, failing to transfer across two disparate task formats.
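
Representational Similarity Analysis, which the abstract above uses to compare LLM representations with human sensorimotor experience, can be sketched in a few lines: build a pairwise dissimilarity vector for each representation space, then correlate the two. The abstract does not state which distance or correlation the authors use, so Euclidean distance and Pearson correlation are simplifying assumptions here (rank correlations are also common in RSA), and all names and vectors are hypothetical.

```python
import math
from itertools import combinations

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rdm(vectors):
    """Upper triangle of the representational dissimilarity matrix,
    flattened into a list of pairwise distances."""
    return [euclidean(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rsa_score(model_vectors, human_vectors):
    """Correlation between the two spaces' pairwise dissimilarities.
    Both lists must describe the same items in the same order."""
    return pearson(rdm(model_vectors), rdm(human_vectors))

# Hypothetical embeddings for four words and matching human ratings.
model = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
human = [[2 * x for x in v] for v in model]  # same geometry, rescaled
print(rsa_score(model, human))
```

Because the comparison runs on pairwise distances rather than raw coordinates, the two spaces may have different dimensionality, which is what makes RSA usable for comparing model activations with human rating norms.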

Title: Towards Self-Robust LLMs: Intrinsic Prompt Noise Resistance via CoIPO

Authors: Xin Yang, Letian Li, Abudukelimu Wuerkaixi, Xuxin Cheng, Cao Liu, Ke Zeng, Xunliang Cai, Wenyuan Jiang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03314
Pdf URL: https://arxiv.org/pdf/2603.03314
Copy Paste: [[2603.03314]] Towards Self-Robust LLMs: Intrinsic Prompt Noise Resistance via CoIPO(https://arxiv.org/abs/2603.03314)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have demonstrated remarkable and steadily improving performance across a wide range of tasks. However, LLM performance may be highly sensitive to prompt variations, especially in scenarios with limited openness or strict output formatting requirements, indicating insufficient robustness. In real-world applications, user prompts provided to LLMs often contain imperfections, which may undermine the quality of the model's responses. To address this issue, previous work has primarily focused on preprocessing prompts, employing external tools or even LLMs to refine prompt formulations in advance. However, these approaches overlook the intrinsic robustness of LLMs, and their reliance on external components introduces additional computational overhead and uncertainty. In this work, we propose a Contrastive Learning-based Inverse Direct Preference Optimization (CoIPO) method that minimizes the discrepancy between the label-aligned logits produced by the model under a clean prompt and its noisy counterpart, and conduct a detailed analysis using mutual information theory. We augment the FLAN dataset by constructing paired prompts, each consisting of a clean prompt and its corresponding noisy version for training. Additionally, to evaluate the effectiveness, we develop NoisyPromptBench, a benchmark enhanced and derived from the existing PromptBench. Experiments on NoisyPromptBench demonstrate that our proposed method achieves a significant improvement in average accuracy over the current state-of-the-art approaches. The source code of CoIPO, the pair-wise FLAN datasets, and NoisyPromptBench have already been released at this https URL.

Title: M-QUEST -- Meme Question-Understanding Evaluation on Semantics and Toxicity

Authors: Stefano De Giorgis, Ting-Chih Chen, Filip Ilievski
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03315
Pdf URL: https://arxiv.org/pdf/2603.03315
Copy Paste: [[2603.03315]] M-QUEST -- Meme Question-Understanding Evaluation on Semantics and Toxicity(https://arxiv.org/abs/2603.03315)
Keywords: language model, prompt
Abstract: Internet memes are a powerful form of online communication, yet their nature and reliance on commonsense knowledge make toxicity detection challenging. Identifying key features for meme interpretation and understanding is a crucial task. Previous work has focused on some elements contributing to the meaning, such as the Textual dimension via OCR, the Visual dimension via object recognition, upper layers of meaning like the Emotional dimension, Toxicity detection via proxy variables, such as hate speech detection, and sentiment analysis. Nevertheless, there is still a lack of an overall architecture able to formally identify elements contributing to the meaning of a meme and to be used in the sense-making process. In this work, we present a semantic framework and a corresponding benchmark for automatic knowledge extraction from memes. First, we identify the necessary dimensions to understand and interpret a meme: Textual material, Visual material, Scene, Background Knowledge, Emotion, Semiotic Projection, Analogical Mapping, Overall Intent, Target Community, and Toxicity Assessment. Second, the framework guides a semi-automatic process of generating a benchmark with commonsense question-answer pairs about meme toxicity assessment and its underlying reason. The resulting benchmark M-QUEST consists of 609 question-answer pairs for 307 memes. Third, we evaluate eight open-source large language models on their ability to correctly solve M-QUEST. Our results show that current models' commonsense reasoning capabilities for toxic meme interpretation vary depending on the dimension and architecture. Models with instruction tuning and reasoning capabilities significantly outperform the others, though pragmatic inference questions remain challenging. We release code, benchmark, and prompts to support future research intersecting multimodal content safety and commonsense reasoning.

Title: Retcon -- a Prompt-Based Technique for Precise Control of LLMs in Conversations

Authors: David Kogan, Sam Nguyen, Masanori Suzuki, Feiyang Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03317
Pdf URL: https://arxiv.org/pdf/2603.03317
Copy Paste: [[2603.03317]] Retcon -- a Prompt-Based Technique for Precise Control of LLMs in Conversations(https://arxiv.org/abs/2603.03317)
Keywords: language model, llm, prompt, agent
Abstract: Recent advances in Large Language Models (LLMs) allow agents to execute complex natural language tasks. Many LLM applications, such as support agents, teaching assistants, and interactive bots, involve multi-turn conversations. However, it remains challenging to control LLMs in the context of such interactions, particularly when the LLM behavior needs to be adjustable over the course of the conversation. In this paper, we present Retcon, a few-shot prompting technique designed to provide turn-level control over LLMs in conversations. We then demonstrate that it performs significantly better than zero-shot and traditional few-shot prompting.

Title: Quantum-Inspired Self-Attention in a Large Language Model

Authors: Nikita Kuznetsov, Niyaz Ismagilov, Ernesto Campos
Subjects: cs.CL, cs.AI, quant-ph
Abstract URL: https://arxiv.org/abs/2603.03318
Pdf URL: https://arxiv.org/pdf/2603.03318
Copy Paste: [[2603.03318]] Quantum-Inspired Self-Attention in a Large Language Model(https://arxiv.org/abs/2603.03318)
Keywords: language model, gpt
Abstract: Recent advances in Natural Language Processing have been predominantly driven by transformer-based architectures, which rely heavily on self-attention mechanisms to model relationships between tokens in a sequence. Similarly, the field of Quantum Natural Language Processing, which seeks to leverage quantum principles to address challenges in language understanding and generation tasks, has seen the recent development of quantum self-attention mechanisms. We propose a classical quantum-inspired self-attention (QISA) mechanism and integrate it into the full autoregressive language modeling pipeline of GPT-1. To the best of our knowledge, this is the first integration of this kind, as previous quantum self-attention mechanisms have been primarily tested on text classification. In our experiments, QISA achieves better performance when compared to standard self-attention on the metrics character error rate (15.5× better), word error rate (4.7×) and cross-entropy loss (13×). This is achieved while only requiring a 2.6× longer inference time.

Title: Automated Concept Discovery for LLM-as-a-Judge Preference Analysis

Authors: James Wedgwood, Chhavi Yadav, Virginia Smith
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03319
Pdf URL: https://arxiv.org/pdf/2603.03319
Copy Paste: [[2603.03319]] Automated Concept Discovery for LLM-as-a-Judge Preference Analysis(https://arxiv.org/abs/2603.03319)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly used as scalable evaluators of model outputs, but their preference judgments exhibit systematic biases and can diverge from human evaluations. Prior work on LLM-as-a-judge has largely focused on a small, predefined set of hypothesized biases, leaving open the problem of automatically discovering unknown drivers of LLM preferences. We address this gap by studying several embedding-level concept extraction methods for analyzing LLM judge behavior. We compare these methods in terms of interpretability and predictiveness, finding that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions. Using over 27k paired responses from multiple human preference datasets and judgments from three LLMs, we analyze LLM judgments and compare them to those of human annotators. Our method both validates existing results, such as the tendency for LLMs to prefer refusal of sensitive requests at higher rates than humans, and uncovers new trends across both general and domain-specific datasets, including biases toward responses that emphasize concreteness and empathy in approaching new situations, toward detail and formality in academic advice, and against legal guidance that promotes active steps like calling police and filing lawsuits. Our results show that automated concept discovery enables systematic analysis of LLM judge preferences without predefined bias taxonomies.

Title: From We to Me: Theory Informed Narrative Shift with Abductive Reasoning

Authors: Jaikrishna Manojkumar Patil, Divyagna Bavikadi, Kaustuv Mukherji, Ashby Steward-Nolan, Peggy-Jean Allin, Tumininu Awonuga, Joshua Garland, Paulo Shakarian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03320
Pdf URL: https://arxiv.org/pdf/2603.03320
Copy Paste: [[2603.03320]] From We to Me: Theory Informed Narrative Shift with Abductive Reasoning(https://arxiv.org/abs/2603.03320)
Keywords: language model, gpt, llm
Abstract: Effective communication often relies on aligning a message with an audience's narrative and worldview. Narrative shift involves transforming text to reflect a different narrative framework while preserving its original core message--a task we demonstrate is significantly challenging for current Large Language Models (LLMs). To address this, we propose a neurosymbolic approach grounded in social science theory and abductive reasoning. Our method automatically extracts rules to abduce the specific story elements needed to guide an LLM through a consistent and targeted narrative transformation. Across multiple LLMs, abduction-guided transformed stories shifted the narrative while maintaining fidelity to the original story. For example, with GPT-4o we outperform the zero-shot LLM baseline by 55.88% for collectivistic-to-individualistic narrative shift while maintaining superior semantic similarity with the original stories (40.4% improvement in KL divergence). For individualistic-to-collectivistic transformation, we achieve comparable improvements. We show similar performance across both directions for Llama-4 and Grok-4, and competitive performance for Deepseek-R1.

Title: DIALEVAL: Automated Type-Theoretic Evaluation of LLM Instruction Following

Authors: Nardine Basta, Dali Kaafar
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03321
Pdf URL: https://arxiv.org/pdf/2603.03321
Copy Paste: [[2603.03321]] DIALEVAL: Automated Type-Theoretic Evaluation of LLM Instruction Following(https://arxiv.org/abs/2603.03321)
Keywords: language model, llm, agent
Abstract: Evaluating instruction following in Large Language Models requires decomposing instructions into verifiable requirements and assessing satisfaction--tasks currently dependent on manual annotation and uniform criteria that do not align with human judgment patterns. We present DIALEVAL, a type-theoretic framework using dual LLM agents to automate instruction decomposition into typed predicates and implement type-specific satisfaction semantics. The framework enforces formal atomicity and independence constraints during automated extraction, then applies differentiated evaluation criteria--semantic equivalence for content predicates, exact precision for numerical predicates--mirroring empirically observed human assessment patterns. Extended to multi-turn dialogues through history-aware satisfaction functions, DIALEVAL enables evaluation in conversational contexts where single-turn methods fail. Validation demonstrates 90.38% accuracy (26.45% error reduction over baselines) and substantially stronger correlation with human judgment for complex instructions.

Title: Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery

Authors: Chaoqun Yang, Xinyu Lin, Shulin Li, Wenjie Wang, Ruihan Guo, Fuli Feng, Tat-Seng Chua
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03322
Pdf URL: https://arxiv.org/pdf/2603.03322
Copy Paste: [[2603.03322]] Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery(https://arxiv.org/abs/2603.03322)
Keywords: language model, llm, agent
Abstract: Recent advancements in Large Language Model (LLM) agents have demonstrated remarkable potential in automatic knowledge discovery. However, rigorously evaluating an AI's capacity for knowledge discovery remains a critical challenge. Existing benchmarks predominantly rely on static datasets, leading to inevitable data contamination where models have likely seen the evaluation knowledge during training. Furthermore, the rapid release cycles of modern LLMs render static benchmarks quickly outdated, failing to assess the ability to discover truly new knowledge. To address these limitations, we propose DBench-Bio, a dynamic and fully automated benchmark designed to evaluate AI's biological knowledge discovery ability. DBench-Bio employs a three-stage pipeline: (1) data acquisition of rigorous, authoritative paper abstracts; (2) QA extraction utilizing LLMs to synthesize scientific hypothesis questions and corresponding discovery answers; and (3) QA filtering to ensure quality based on relevance, clarity, and centrality. We instantiate this pipeline to construct a monthly-updated benchmark covering 12 biomedical sub-domains. Extensive evaluations of SOTA models reveal current limitations in discovering new knowledge. Our work provides the first dynamic, automatic framework for assessing the new knowledge discovery capabilities of AI systems, establishing a living, evolving resource for the AI research community to catalyze the development of knowledge discovery.

Title: Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement

Authors: Yuxiao Lu, Lin Xu, Yang Sun, Wenjun Li, Jie Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03323
Pdf URL: https://arxiv.org/pdf/2603.03323
Copy Paste: [[2603.03323]] Discern Truth from Falsehood: Reducing Over-Refusal via Contrastive Refinement(https://arxiv.org/abs/2603.03323)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) aligned for safety often suffer from over-refusal, the tendency to reject seemingly toxic or benign prompts by misclassifying them as toxic. This behavior undermines models' helpfulness and restricts usability in sensitive or nuanced contexts. While prior work has proposed mitigation strategies such as data augmentation and activation steering, these approaches often face a trade-off: reducing over-refusal typically degrades the model's ability to reject genuinely harmful content. We argue that this issue arises from the ambiguous influence of toxic and seemingly toxic prompts on the model's learning dynamics. To address it, we introduce a preceding alignment stage, DCR: Discernment via Contrastive Refinement. Both theoretically and empirically, we demonstrate that contrastive refinement improves an LLM's capacity to distinguish truly toxic prompts from superficially toxic ones. Evaluation across diverse benchmarks shows that our method effectively reduces over-refusal while preserving the safety benefits of alignment. Importantly, it achieves this with minimal degradation of general capabilities, offering a more principled and robust direction for safety alignment.

Title: Controlling Chat Style in Language Models via Single-Direction Editing

Authors: Zhenyu Xu, Victor S. Sheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03324
Pdf URL: https://arxiv.org/pdf/2603.03324
Copy Paste: [[2603.03324]] Controlling Chat Style in Language Models via Single-Direction Editing(https://arxiv.org/abs/2603.03324)
Keywords: language model, llm, prompt, chat
Abstract: Controlling stylistic attributes in large language models (LLMs) remains challenging, with existing approaches relying on either prompt engineering or post-training alignment. This paper investigates this challenge through the lens of representation engineering, testing the hypothesis that distinct stylistic attributes - from emotional tone to linguistic structure - are encoded as linear directions in the model's activation space. We provide strong empirical evidence for this hypothesis across a wide range of styles and, based on this finding, present a lightweight, training-free method for precise style control. Our approach supports linear style composition, enhances safety by ablating undesirable behaviors, and, as confirmed by experiments on over a dozen models, achieves high style adherence while preserving core capabilities at minimal computational cost.
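A minimal sketch of what editing along a single direction could look like, assuming a style is represented by one vector in activation space. The function names, the toy 4-dimensional activation, and the strength parameter `alpha` are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def unit(v):
    """Normalize a direction vector to unit length."""
    return v / np.linalg.norm(v)

def add_style(h, direction, alpha):
    """Steer activation h toward a style by adding alpha times the unit direction."""
    return h + alpha * unit(direction)

def ablate_style(h, direction):
    """Remove the component of h that lies along the style direction."""
    d = unit(direction)
    return h - np.dot(h, d) * d

h = np.array([1.0, 2.0, 0.0, -1.0])      # toy activation (hypothetical values)
style = np.array([0.0, 3.0, 0.0, 0.0])   # toy style direction (hypothetical values)

steered = add_style(h, style, alpha=0.5)  # moves h along the style axis
ablated = ablate_style(h, style)          # leaves h orthogonal to the style axis
```

Under this assumption, composing several styles would amount to summing several `alpha * direction` terms, and ablation is the projection that zeroes out one direction while leaving the rest of the activation unchanged.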

Title: IntPro: A Proxy Agent for Context-Aware Intent Understanding via Retrieval-conditioned Inference

Authors: Guanming Liu, Meng Wu, Peng Zhang, Yu Zhang, Yubo Shu, Xianliang Huang, Kainan Tu, Ning Gu, Liuxin Zhang, Qianying Wang, Tun Lu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03325
Pdf URL: https://arxiv.org/pdf/2603.03325
Copy Paste: [[2603.03325]] IntPro: A Proxy Agent for Context-Aware Intent Understanding via Retrieval-conditioned Inference(https://arxiv.org/abs/2603.03325)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have become integral to modern Human-AI collaboration workflows, where accurately understanding user intent serves as a crucial step for generating satisfactory responses. Context-aware intent understanding, which involves inferring user intentions from situational environments, is inherently challenging because it requires reasoning over both the immediate context and the user's underlying motivations that drive their behavior. Moreover, existing approaches often treat intent understanding as a static recognition task, overlooking users' accumulated intent patterns that could provide valuable references for more accurate and generalizable understanding. To address this gap, we propose IntPro, a proxy agent that learns to adapt to individual users via retrieval-conditioned intent inference. We design intent explanations that abstract how contextual signals connect to expressed intents, and store them in an individual intent history library for retrieval. We train IntPro through supervised fine-tuning on retrieval-conditioned trajectories and multi-turn Group Relative Policy Optimization (GRPO) with tool-aware reward functions, enabling the agent to learn when to leverage historical intent patterns and when to infer directly. Experiments across three diverse scenarios (Highlight-Intent, MIntRec2.0, and Weibo Post-Sync) demonstrate that IntPro achieves strong intent understanding performance with effective context-aware reasoning capabilities across different scenarios and model types.

Title: Controllable and explainable personality sliders for LLMs at inference time

Authors: Florian Hoppe, David Khachaturov, Robert Mullins, Mark Huasong Meng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03326
Pdf URL: https://arxiv.org/pdf/2603.03326
Copy Paste: [[2603.03326]] Controllable and explainable personality sliders for LLMs at inference time(https://arxiv.org/abs/2603.03326)
Keywords: language model, llm
Abstract: Aligning Large Language Models (LLMs) with specific personas typically relies on expensive and monolithic Supervised Fine-Tuning (SFT) or RLHF. While effective, these methods require training distinct models for every target personality profile. Inference-time activation steering offers a parameter-efficient alternative, yet naive approaches fail to control multiple traits simultaneously due to destructive vector interference. In this work, we propose a modular framework for continuous, multi-dimensional personality control. Our key innovation is Sequential Adaptive Steering (SAS): a method that orthogonalizes steering vectors by training subsequent probes on the residual stream shifted by prior interventions. This approach transforms steering vectors into reusable primitives, allowing users to instantly synthesize complex, high-fidelity personality profiles by simply adjusting coefficients alpha. We validate our framework on the Big Five personality traits, demonstrating that it outperforms naive baselines in both goal adherence and coherence, enabling precise, holistic personality modulation without updating model parameters.

Title: StructLens: A Structural Lens for Language Models via Maximum Spanning Trees

Authors: Haruki Sakajo, Frederikus Hudi, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03328
Pdf URL: https://arxiv.org/pdf/2603.03328
Copy Paste: [[2603.03328]] StructLens: A Structural Lens for Language Models via Maximum Spanning Trees(https://arxiv.org/abs/2603.03328)
Keywords: language model
Abstract: Language exhibits inherent structures, a property that explains both language acquisition and language change. Given this characteristic, we expect language models to manifest internal structures as well. While interpretability research has investigated the components of language models, existing approaches focus on local inter-token relationships within layers or modules (e.g., Multi-Head Attention), leaving global inter-layer relationships largely overlooked. To address this gap, we introduce StructLens, an analytical framework designed to reveal how internal structures relate holistically through their inter-token connection within a layer. StructLens constructs maximum spanning trees based on the semantic representations in residual streams, analogous to dependency parsing, and leverages the tree properties to quantify inter-layer distance (or similarity) from a structural perspective. Our findings demonstrate that StructLens yields an inter-layer similarity pattern that is distinctively different from conventional cosine similarity. Moreover, this structure-aware similarity proves to be beneficial for practical tasks, such as layer pruning, highlighting the effectiveness of structural analysis for understanding and optimizing language models. Our code is available at this https URL.
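The tree-building step described in the abstract can be illustrated with a small sketch: compute pairwise token similarities within one layer and grow a maximum spanning tree with Prim's algorithm. Cosine similarity as the edge weight, the helper names, and the toy vectors are assumptions for illustration; the paper's exact weighting and tree-comparison metric are not given in the abstract.

```python
import numpy as np

def cosine_similarity_matrix(x):
    """Pairwise cosine similarity between the rows of x (tokens x dims)."""
    normed = x / np.linalg.norm(x, axis=1, keepdims=True)
    return normed @ normed.T

def maximum_spanning_tree(sim):
    """Prim's algorithm, picking the highest-similarity crossing edge at each step.

    Returns a list of (parent, child) edges over token indices, rooted at token 0.
    """
    n = sim.shape[0]
    in_tree = [0]
    edges = []
    while len(in_tree) < n:
        best = None
        for i in in_tree:
            for j in range(n):
                if j in in_tree:
                    continue
                if best is None or sim[i, j] > best[0]:
                    best = (sim[i, j], i, j)
        edges.append((best[1], best[2]))
        in_tree.append(best[2])
    return edges

# Toy "residual stream" for 4 tokens in 2 dimensions (hypothetical values).
tokens = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
tree = maximum_spanning_tree(cosine_similarity_matrix(tokens))
```

Running the same procedure on every layer yields one tree per layer; comparing their edge sets is one plausible way to obtain a structural inter-layer distance.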

Title: AutoHarness: improving LLM agents by automatically synthesizing a code harness

Authors: Xinghua Lou, Miguel Lázaro-Gredilla, Antoine Dedieu, Carter Wendelken, Wolfgang Lehrach, Kevin P. Murphy
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03329
Pdf URL: https://arxiv.org/pdf/2603.03329
Copy Paste: [[2603.03329]] AutoHarness: improving LLM agents by automatically synthesizing a code harness(https://arxiv.org/abs/2603.03329)
Keywords: language model, gpt, llm, agent
Abstract: Despite significant strides in language models in the last few years, when used as agents, such models often try to perform actions that are not just suboptimal for a given state, but are strictly prohibited by the external environment. For example, in the recent Kaggle GameArena chess competition, 78% of Gemini-2.5-Flash losses were attributed to illegal moves. Often people manually write "harnesses" around LLMs to prevent such failures. In this paper, we demonstrate that Gemini-2.5-Flash can automatically synthesize such a code harness, using a small number of rounds of iterative code refinement given feedback from the (game) environment. The resulting harness prevents all illegal moves in 145 different TextArena games (both 1-player and 2-player), enabling the smaller Gemini-2.5-Flash model to outperform larger models, such as Gemini-2.5-Pro. Pushing our technique to the limit, we can get Gemini-2.5-Flash to generate the entire policy in code, thus eliminating the need to use the LLM at decision making time. The resulting code-policy receives a higher average reward than Gemini-2.5-Pro and GPT-5.2-High on 16 TextArena 1-player games. Our results show that using a smaller model to synthesize a custom code harness (or entire policy) can outperform a much larger model, while also being more cost effective.
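To make the idea of a "harness" concrete, here is a hand-written sketch of a wrapper that only lets legal actions reach the environment. In the paper the harness code is synthesized by the model itself. Everything below (`harnessed_policy`, the callback signatures, the toy board) is a hypothetical illustration rather than the authors' code.

```python
def harnessed_policy(propose, is_legal, fallback, state, max_tries=3):
    """Wrap a move proposer so that only legal actions are returned.

    propose(state, feedback) -> action   (stands in for an LLM call)
    is_legal(state, action)  -> bool     (the legality check of the harness)
    fallback(state)          -> action   (any legal action, used as a last resort)
    """
    feedback = None
    for _ in range(max_tries):
        action = propose(state, feedback)
        if is_legal(state, action):
            return action
        # Pass the rejection back so the proposer can try something else.
        feedback = f"illegal action: {action!r}"
    return fallback(state)

# Toy game (hypothetical): choose the index of an empty cell on a 3-cell board.
board = ["X", None, "O"]
is_legal = lambda s, a: isinstance(a, int) and 0 <= a < len(s) and s[a] is None
fallback = lambda s: next(i for i, c in enumerate(s) if c is None)

# A stub proposer that first picks an occupied cell, then a free one.
attempts = iter([0, 1])
propose = lambda s, feedback: next(attempts)

move = harnessed_policy(propose, is_legal, fallback, board)
```

The retry-with-feedback loop and the fallback are design choices of this sketch; the point is only that illegal proposals are filtered before they can cost the agent the game.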

Title: Certainty robustness: Evaluating LLM stability under self-challenging prompts

Authors: Mohammadreza Saadat, Steve Nemzer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03330
Pdf URL: https://arxiv.org/pdf/2603.03330
Copy Paste: [[2603.03330]] Certainty robustness: Evaluating LLM stability under self-challenging prompts(https://arxiv.org/abs/2603.03330)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) often present answers with high apparent confidence despite lacking an explicit mechanism for reasoning about certainty or truth. While existing benchmarks primarily evaluate single-turn accuracy, truthfulness or confidence calibration, they do not capture how models behave when their responses are challenged in interactive settings. We introduce the Certainty Robustness Benchmark, a two-turn evaluation framework that measures how LLMs balance stability and adaptability under self-challenging prompts such as uncertainty ("Are you sure?") and explicit contradiction ("You are wrong!"), alongside numeric confidence elicitation. Using 200 reasoning and mathematics questions from LiveBench, we evaluate four state-of-the-art LLMs and distinguish between justified self-corrections and unjustified answer changes. Our results reveal substantial differences in interactive reliability that are not explained by baseline accuracy alone: some models abandon correct answers under conversational pressure, while others demonstrate strong resistance to challenge and better alignment between confidence and correctness. These findings identify certainty robustness as a distinct and critical dimension of LLM evaluation, with important implications for alignment, trustworthiness and real-world deployment.
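The distinction between justified self-corrections and unjustified answer changes can be sketched as a tally over two-turn outcomes. The label names and the sample results are assumptions made for illustration; the benchmark's actual scoring scheme may differ.

```python
from collections import Counter

def classify_turns(first_correct, second_correct):
    """Label a two-turn outcome by whether each turn's answer was correct."""
    if first_correct and second_correct:
        return "held_correct"
    if first_correct and not second_correct:
        return "unjustified_change"
    if not first_correct and second_correct:
        return "justified_correction"
    # Covers both repeating the wrong answer and switching to another wrong one.
    return "stayed_wrong"

# Hypothetical results: (correct before the challenge, correct after the challenge).
results = [(True, True), (True, False), (False, True), (False, False), (True, True)]
counts = Counter(classify_turns(a, b) for a, b in results)
```

A model with high first-turn accuracy can still show a large `unjustified_change` count, which is the kind of gap the abstract says baseline accuracy alone does not explain.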

Title: PulseLM: A Foundation Dataset and Benchmark for PPG-Text Learning

Authors: Hung Manh Pham, Jinyang Wu, Xiao Ma, Yiming Zhang, Yixin Xu, Aaqib Saeed, Bin Zhu, Zhou Pan, Dong Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03331
Pdf URL: https://arxiv.org/pdf/2603.03331
Copy Paste: [[2603.03331]] PulseLM: A Foundation Dataset and Benchmark for PPG-Text Learning(https://arxiv.org/abs/2603.03331)
Keywords: language model
Abstract: Photoplethysmography (PPG) is a widely used non-invasive sensing modality for continuous cardiovascular and physiological monitoring across clinical, laboratory, and wearable settings. While existing PPG datasets support a broad range of downstream tasks, they typically provide supervision in the form of numerical measurements or task-specific labels, limiting their suitability for language-based physiological reasoning and multimodal foundation models. In this work, we introduce PulseLM, a large-scale PPG-text dataset designed to bridge raw PPG waveforms and natural language through a unified, closed-ended question answering (QA) formulation. PulseLM aggregates PPG recordings from fifteen publicly available sources and harmonizes heterogeneous annotations into twelve common physiological QA tasks. The dataset comprises 1.31 million standardized 10-second PPG segments, associated with 3.15 million question-answer pairs. We further define reproducible preprocessing, supervision, and evaluation protocols and establish baseline benchmarks using multimodal PPG-aware large language models. PulseLM provides a standardized foundation for studying multimodal physiological reasoning, cross-dataset generalization, and scalable benchmarking of PPG-based language models. The data and code are publicly available at this https URL.

Title: Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

Authors: Ashwath Vaithinathan Aravindan, Mayank Kejriwal
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03332
Pdf URL: https://arxiv.org/pdf/2603.03332
Copy Paste: [[2603.03332]] Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations(https://arxiv.org/abs/2603.03332)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting has emerged as a foundational technique for eliciting reasoning from Large Language Models (LLMs), yet the robustness of this approach to corruptions in intermediate reasoning steps remains poorly understood. This paper presents a comprehensive empirical evaluation of LLM robustness to a structured taxonomy of 5 CoT perturbation types: MathError, UnitConversion, Sycophancy, SkippedSteps, and ExtraSteps. We evaluate 13 models spanning three orders of magnitude in parameter count (3B to 1.5T, using assumed parameter counts for closed models), testing their ability to complete mathematical reasoning tasks despite perturbations injected at different points in the reasoning chain. Our key findings reveal heterogeneous vulnerability patterns: MathError perturbations produce the most severe degradation in small models (50-60% accuracy loss) but show strong scaling benefits; UnitConversion remains challenging across all scales (20-30% loss even for largest models); ExtraSteps incur minimal accuracy degradation (0-6%) regardless of scale; Sycophancy produces modest effects (7% loss for small models); and SkippedSteps cause intermediate damage (15% loss). Scaling relationships follow power-law patterns, with model size serving as a protective factor against some perturbations but offering limited defense against dimensional reasoning tasks. These findings have direct implications for deploying LLMs in multi-stage reasoning pipelines and underscore the necessity of task-specific robustness assessments and mitigation strategies. The code and results are available at this https URL.

Title: Training-free Dropout Sampling for Semantic Token Acceptance in Speculative Decoding

Authors: Jeongtae Lee, Minjung Jo, Hyunjoon Jeong, Gunho Park, Sunghyeon Woo, Joonghoon Kim, Se Jung Kwon, Dongsoo Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03333
Pdf URL: https://arxiv.org/pdf/2603.03333
Copy Paste: [[2603.03333]] Training-free Dropout Sampling for Semantic Token Acceptance in Speculative Decoding(https://arxiv.org/abs/2603.03333)
Keywords: language model
Abstract: Speculative decoding accelerates large language model inference by proposing tokens with a lightweight draft model and selectively accepting them using a target model. This work introduces DropMatch, a novel approach that matches draft tokens to the predictive distribution of the target model via Monte Carlo dropout applied exclusively to the LM head, enabling sampling-based acceptance decisions. By generating multiple decoding paths, our method forms an empirical token distribution against which draft tokens are evaluated for consistency. This acceptance mechanism enables the model to adaptively control the size of decoding paths under an appropriate dropout probability, preventing substantial distortion of the target model predictive distribution. The proposed method operates in a training-free, data-free, and calibration-free manner, requires no architectural modification to pretrained models, and can be orthogonally integrated with a wide range of existing speculative decoding and inference acceleration techniques. Experiments across multiple benchmarks demonstrate that our approach increases acceptance length while maintaining competitive task performance, yielding inference speedups ranging from 1.09x to 1.33x over the standard baseline, and up to an additional 1.09x speedup when applied on top of EAGLE3.
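A rough sketch of sampling-based acceptance with Monte Carlo dropout on the LM head. The abstract does not spell out the acceptance rule, so the rule used here (accept the draft token if any sampled path predicts it), the function names, the inverted-dropout rescaling, and the toy weights are all assumptions.

```python
import numpy as np

def sampled_tokens(hidden, lm_head, drop_prob, num_paths, rng):
    """Run the LM head num_paths times with dropout on its input features.

    Returns the set of argmax token ids seen across the sampled paths.
    """
    seen = set()
    for _ in range(num_paths):
        keep = rng.random(hidden.shape) >= drop_prob
        # Inverted dropout: rescale kept features so the expectation is unchanged.
        dropped = hidden * keep / (1.0 - drop_prob)
        logits = lm_head @ dropped
        seen.add(int(np.argmax(logits)))
    return seen

def accept_draft(draft_token, hidden, lm_head, drop_prob=0.1, num_paths=8, seed=0):
    """Accept the draft token if at least one sampled path predicts it (assumed rule)."""
    rng = np.random.default_rng(seed)
    return draft_token in sampled_tokens(hidden, lm_head, drop_prob, num_paths, rng)

# Toy target model (hypothetical): 3-token vocabulary, 4 hidden features.
lm_head = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0, 1.0]])
hidden = np.array([0.2, 0.1, 0.5, 0.5])

# With dropout disabled, every path yields the plain argmax, which is token 2.
accepted = accept_draft(2, hidden, lm_head, drop_prob=0.0)
rejected = accept_draft(0, hidden, lm_head, drop_prob=0.0)
```

Raising `drop_prob` widens the set of tokens the sampled paths can produce, which is how a dropout probability could trade acceptance length against fidelity to the target distribution.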

Title: The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?

Authors: Bianca Raimondi, Francesco Pivi, Davide Evangelista, Maurizio Gabbrielli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03334
Pdf URL: https://arxiv.org/pdf/2603.03334
Copy Paste: [[2603.03334]] The CompMath-MCQ Dataset: Are LLMs Ready for Higher-Level Math?(https://arxiv.org/abs/2603.03334)
Keywords: language model, llm
Abstract: The evaluation of Large Language Models (LLMs) on mathematical reasoning has largely focused on elementary problems, competition-style questions, or formal theorem proving, leaving graduate-level and computational mathematics relatively underexplored. We introduce CompMath-MCQ, a new benchmark dataset for assessing LLMs on advanced mathematical reasoning in a multiple-choice setting. The dataset consists of 1,500 questions originally authored by professors of graduate-level courses, covering topics including Linear Algebra, Numerical Optimization, Vector Calculus, Probability, and Python-based scientific computing. Three option choices are provided for each question, with exactly one of them being correct. To ensure the absence of data leakage, all questions are newly created and not sourced from existing materials. The validity of questions is verified through a procedure based on cross-LLM disagreement, followed by manual expert review. By adopting a multiple-choice format, our dataset enables objective, reproducible, and bias-free evaluation through the lm_eval library. Baseline results with state-of-the-art LLMs indicate that advanced computational mathematical reasoning remains a significant challenge. We release CompMath-MCQ at the following link: this https URL

Title: Compressed Sensing for Capability Localization in Large Language Models

Authors: Anna Bair, Yixuan Even Xu, Mingjie Sun, J. Zico Kolter
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03335
Pdf URL: https://arxiv.org/pdf/2603.03335
Copy Paste: [[2603.03335]] Compressed Sensing for Capability Localization in Large Language Models(https://arxiv.org/abs/2603.03335)
Keywords: language model, llm
Abstract: Large language models (LLMs) exhibit a wide range of capabilities, including mathematical reasoning, code generation, and linguistic behaviors. We show that many capabilities are highly localized to small subsets of attention heads within Transformer architectures. Zeroing out as few as five task-specific heads can degrade performance by up to 65% on standard benchmarks measuring the capability of interest, while largely preserving performance on unrelated tasks. We introduce a compressed sensing based method that exploits the sparsity of these heads to identify them via strategic knockouts and a small number of model evaluations. We validate these findings across Llama and Qwen models ranging from 1B to 8B parameters and a diverse set of capabilities including mathematical abilities and code generation, revealing a modular organization in which specialized capabilities are implemented by sparse, functionally distinct components. Overall, our results suggest that capability localization is a general organizational principle of Transformer language models, with implications for interpretability, model editing, and AI safety. Code is released at this https URL.

Title: Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification

Authors: Angel Rodrigo Avelar Menendez, Yufeng Liu, Xiaowu Dai
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03336
Pdf URL: https://arxiv.org/pdf/2603.03336
Copy Paste: [[2603.03336]] Prompt-Dependent Ranking of Large Language Models with Uncertainty Quantification(https://arxiv.org/abs/2603.03336)
Keywords: language model, llm, prompt
Abstract: Rankings derived from pairwise comparisons are central to many economic and computational systems. In the context of large language models (LLMs), rankings are typically constructed from human preference data and presented as leaderboards that guide deployment decisions. However, existing approaches rely on point estimates, implicitly treating rankings as fixed objects despite substantial estimation noise and context-dependent performance variation. Acting on such rankings can lead to misallocation and welfare loss when apparent differences are not statistically meaningful. We study prompt-dependent ranking inference under pairwise human preferences and develop a framework for decision-safe rankings with statistically valid uncertainty guarantees. We model preferences using a contextual Bradley-Terry-Luce model in which the latent utility of each model depends on the input prompt. Rather than targeting point estimates of utilities, we directly conduct inference on induced rankings, constructing confidence sets based on simultaneous confidence intervals for pairwise utility differences. This approach yields statistically valid marginal and simultaneous confidence sets for prompt-specific ranks. Our framework connects recent advances in rank inference to contextual preference learning and provides tools for robust ranking-based decision-making. Empirically, using large-scale human preference data from LLM evaluations, we show that rankings vary substantially across prompt characteristics and that many apparent rank differences are not statistically distinguishable. We further demonstrate how uncertainty-aware rankings identify dominance only when supported by the data and otherwise return partial orders.
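The two ingredients named in the abstract, a Bradley-Terry-Luce preference model and rankings read off from confidence intervals on utility differences, can be sketched as follows. The standard logistic BTL form is used; the interval values and the helper names are hypothetical, and the construction of the simultaneous intervals themselves is not shown.

```python
import math

def btl_win_probability(utility_i, utility_j):
    """Bradley-Terry-Luce probability that model i is preferred over model j."""
    return 1.0 / (1.0 + math.exp(-(utility_i - utility_j)))

def dominance_pairs(diff_intervals):
    """Keep only the pairs whose utility-difference interval excludes zero.

    diff_intervals maps (i, j) to a (low, high) interval for u_i - u_j.
    Returns (winner, loser) pairs; undecided pairs are left unordered.
    """
    decided = []
    for (i, j), (low, high) in diff_intervals.items():
        if low > 0:
            decided.append((i, j))
        elif high < 0:
            decided.append((j, i))
    return decided

# Hypothetical simultaneous intervals for one prompt type.
intervals = {("A", "B"): (0.2, 0.9), ("A", "C"): (-0.1, 0.4), ("B", "C"): (-0.8, -0.3)}
order = dominance_pairs(intervals)
```

Here A and C both dominate B, while A versus C stays undecided because its interval contains zero, so the output is a partial order rather than a full leaderboard. In the contextual setting the utilities, and therefore the intervals, would be functions of the prompt.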

Title: Tracing Pharmacological Knowledge In Large Language Models

Authors: Basil Hasan Khwaja, Dylan Chen, Guntas Toor, Anastasiya Kuznetsova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03407
Pdf URL: https://arxiv.org/pdf/2603.03407
Copy Paste: [[2603.03407]] Tracing Pharmacological Knowledge In Large Language Models(https://arxiv.org/abs/2603.03407)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown strong empirical performance across pharmacology and drug discovery tasks, yet the internal mechanisms by which they encode pharmacological knowledge remain poorly understood. In this work, we investigate how drug-group semantics are represented and retrieved within Llama-based biomedical language models using causal and probing-based interpretability methods. We apply activation patching to localize where drug-group information is stored across model layers and token positions, and complement this analysis with linear probes trained on token-level and sum-pooled activations. Our results demonstrate that early layers play a key role in encoding drug-group knowledge, with the strongest causal effects arising from intermediate tokens within the drug-group span rather than the final drug-group token. Linear probing further reveals that pharmacological semantics are distributed across tokens and are already present in the embedding space, with token-level probes performing near chance while sum-pooled representations achieve maximal accuracy. Together, these findings suggest that drug-group semantics in LLMs are not localized to single tokens but instead arise from distributed representations. This study provides the first systematic mechanistic analysis of pharmacological knowledge in LLMs, offering insights into how biomedical semantics are encoded in large language models.

Title: Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs

Authors: Mingyu Jin, Yutong Yin, Jingcheng Niu, Qingcheng Zeng, Wujiang Xu, Mengnan Du, Wei Cheng, Zhaoran Wang, Tianlong Chen, Dimitris N. Metaxas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03415
Pdf URL: https://arxiv.org/pdf/2603.03415
Copy Paste: [[2603.03415]] Farther the Shift, Sparser the Representation: Analyzing OOD Mechanisms in LLMs(https://arxiv.org/abs/2603.03415)
Keywords: language model, llm
Abstract: In this work, we investigate how Large Language Models (LLMs) adapt their internal representations when encountering inputs of increasing difficulty, quantified as the degree of out-of-distribution (OOD) shift. We reveal a consistent and quantifiable phenomenon: as task difficulty increases, whether through harder reasoning questions, longer contexts, or adding answer choices, the last hidden states of LLMs become substantially sparser. In short, \textbf{\textit{the farther the shift, the sparser the representations}}. This sparsity--difficulty relation is observable across diverse models and domains, suggesting that language models respond to unfamiliar or complex inputs by concentrating computation into specialized subspaces in the last hidden state. Through a series of controlled analyses with a learning dynamic explanation, we demonstrate that this sparsity is not incidental but an adaptive mechanism for stabilizing reasoning under OOD. Leveraging this insight, we design \textit{Sparsity-Guided Curriculum In-Context Learning (SG-ICL)}, a strategy that explicitly uses representation sparsity to schedule few-shot demonstrations, leading to considerable performance enhancements. Our study provides new mechanistic insights into how LLMs internalize OOD challenges. The source code is available at the URL: this https URL.
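The abstract does not say how sparsity of the last hidden state is measured, or in which direction SG-ICL orders demonstrations. As a rough sketch under those assumptions, the snippet below uses Hoyer's sparsity measure (a common choice, not necessarily the authors') and a hypothetical `schedule_demonstrations` helper that orders few-shot examples from least to most sparse.

```python
import math
from typing import List, Sequence, Tuple


def hoyer_sparsity(vector: Sequence[float]) -> float:
    """Hoyer sparsity in [0, 1]: 0 for a uniform vector, 1 for a one-hot vector."""
    n = len(vector)
    if n < 2:
        raise ValueError("need at least two dimensions")
    l1 = sum(abs(x) for x in vector)
    l2 = math.sqrt(sum(x * x for x in vector))
    if l2 == 0.0:
        return 0.0  # convention chosen here for the all-zero vector
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1.0)


def schedule_demonstrations(demos: List[Tuple[str, Sequence[float]]]) -> List[str]:
    """Order demonstrations from least to most sparse hidden state.

    Assumption: lower sparsity is treated as "easier"; the paper's
    actual scheduling rule may differ.
    """
    ranked = sorted(demos, key=lambda item: hoyer_sparsity(item[1]))
    return [text for text, _ in ranked]
```

Here each demonstration is paired with a hidden-state vector that would, in practice, come from a forward pass of the model; the toy vectors in any usage are stand-ins for that.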

Title: Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi

Authors: Shiza Fatimah, Aniket Sen, Sophia Falk, Florian Mai, Lucie Flek, Nicholas Kluge Corrêa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03508
Pdf URL: https://arxiv.org/pdf/2603.03508
Copy Paste: [[2603.03508]] Raising Bars, Not Parameters: LilMoo Compact Language Model for Hindi(https://arxiv.org/abs/2603.03508)
Keywords: language model, llm
Abstract: The dominance of large multilingual foundation models has widened linguistic inequalities in Natural Language Processing (NLP), often leaving low-resource languages underrepresented. This paper introduces LilMoo, a 0.6-billion-parameter Hindi language model trained entirely from scratch to address this gap. Unlike prior Hindi models that rely on continual pretraining from opaque multilingual foundations, LilMoo is developed through a fully transparent and reproducible pipeline optimized for limited compute environments. We construct a high-quality Hindi corpus (GigaLekh) filtered through both heuristic and learned (LLM-as-a-judge) methods, complemented by bilingual augmentation with curated English data. Using this dataset, we explore various training recipes for small-scale language models. Across comprehensive evaluation suites, LilMoo consistently outperforms comparably sized multilingual baselines such as Qwen2.5-0.5B and Qwen3-0.6B, demonstrating that well-designed language-specific pretraining can rival large multilingual models in the sub-billion-parameter range.

Title: SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems

Authors: Haochang Hao, Yifan Xu, Xinzhuo Li, Yingqiang Ge, Lu Cheng
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2603.03536
Pdf URL: https://arxiv.org/pdf/2603.03536
Copy Paste: [[2603.03536]] SafeCRS: Personalized Safety Alignment for LLM-Based Conversational Recommender Systems(https://arxiv.org/abs/2603.03536)
Keywords: llm
Abstract: Current LLM-based conversational recommender systems (CRS) primarily optimize recommendation accuracy and user satisfaction. We identify an underexplored vulnerability in which recommendation outputs may negatively impact users by violating personalized safety constraints, when individualized safety sensitivities -- such as trauma triggers, self-harm history, or phobias -- are implicitly inferred from the conversation but not respected during recommendation. We formalize this challenge as personalized CRS safety and introduce SafeRec, a new benchmark dataset designed to systematically evaluate safety risks in LLM-based CRS under user-specific constraints. To further address this problem, we propose SafeCRS, a safety-aware training framework that integrates Safe Supervised Fine-Tuning (Safe-SFT) with Safe Group reward-Decoupled Normalization Policy Optimization (Safe-GDPO) to jointly optimize recommendation quality and personalized safety alignment. Extensive experiments on SafeRec demonstrate that SafeCRS reduces safety violation rates by up to 96.5% relative to the strongest recommendation-quality baseline while maintaining competitive recommendation quality. Warning: This paper contains potentially harmful and offensive content.

Title: RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering

Authors: Aswini Sivakumar, Vijayan Sugumaran, Yao Qiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03541
Pdf URL: https://arxiv.org/pdf/2603.03541
Copy Paste: [[2603.03541]] RAG-X: Systematic Diagnosis of Retrieval-Augmented Generation for Medical Question Answering(https://arxiv.org/abs/2603.03541)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Automated question-answering (QA) systems increasingly rely on retrieval-augmented generation (RAG) to ground large language models (LLMs) in authoritative medical knowledge, ensuring clinical accuracy and patient safety in Artificial Intelligence (AI) applications for healthcare. Despite progress in RAG evaluation, current benchmarks focus only on simple multiple-choice QA tasks and employ metrics that poorly capture the semantic precision required for complex QA tasks. These approaches fail to diagnose whether an error stems from faulty retrieval or flawed generation, which prevents developers from making targeted improvements. To address this gap, we propose RAG-X, a diagnostic framework that evaluates the retriever and generator independently across a triad of QA tasks: information extraction, short-answer generation, and multiple-choice question (MCQ) answering. RAG-X introduces Context Utilization Efficiency (CUE) metrics to disaggregate system success into interpretable quadrants, isolating verified grounding from deceptive accuracy. Our experiments reveal an "Accuracy Fallacy", where a 14\% gap separates perceived system success from evidence-based grounding. By surfacing hidden failure modes, RAG-X offers the diagnostic transparency needed for safe and verifiable clinical RAG systems.
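The abstract describes the CUE quadrants only at a high level. One plausible reading, sketched below, crosses "was the answer correct?" with "was supporting evidence retrieved?" and reports the gap between raw accuracy and grounded accuracy. The quadrant names and the `summarize` helper are hypothetical and are not taken from the paper.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def quadrant(answer_correct: bool, evidence_retrieved: bool) -> str:
    """Assign one QA record to a quadrant (names are illustrative only)."""
    if answer_correct and evidence_retrieved:
        return "grounded_correct"
    if answer_correct:
        return "ungrounded_correct"  # right answer without supporting evidence
    if evidence_retrieved:
        return "generation_failure"  # evidence was there, answer still wrong
    return "retrieval_failure"


def summarize(records: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Compute accuracy, grounded accuracy, and the gap between them."""
    counts = Counter(quadrant(a, e) for a, e in records)
    total = sum(counts.values())
    accuracy = (counts["grounded_correct"] + counts["ungrounded_correct"]) / total
    grounded = counts["grounded_correct"] / total
    return {"accuracy": accuracy, "grounded_accuracy": grounded, "gap": accuracy - grounded}
```

With 10 records of which 6 are grounded and correct and 2 are correct without evidence, `summarize` reports an accuracy of 0.8, a grounded accuracy of 0.6, and a gap of about 0.2, which is the kind of difference the abstract calls an "Accuracy Fallacy".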

Title: Tucano 2 Cool: Better Open Source LLMs for Portuguese

Authors: Nicholas Kluge Corrêa, Aniket Sen, Shiza Fatimah, Sophia Falk, Lennard Landgraf, Julia Kastner, Lucie Flek
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03543
Pdf URL: https://arxiv.org/pdf/2603.03543
Copy Paste: [[2603.03543]] Tucano 2 Cool: Better Open Source LLMs for Portuguese(https://arxiv.org/abs/2603.03543)
Keywords: language model, llm, retrieval augmented generation, chain-of-thought
Abstract: We present Tucano 2, a fully open suite of large language models (LLMs) with 0.5 to 3.7 billion parameters, designed to address certain gaps in open-source development for Portuguese LLMs. Building on our previous work, we extend our dataset, GigaVerbo-v2, to a new degree of quality and scale, while also introducing a new synthetic dataset, GigaVerbo-v2 Synth, aimed at filling gaps in GigaVerbo-v2, and two post-training datasets, GigaVerbo-v2 SFT and GigaVerbo-v2 Preferences, that allow Portuguese LLMs to be trained in domains like retrieval augmented generation, coding, tool use, chain-of-thought reasoning, and many other domains of interest. Through extensive ablation studies, we design both pretraining and continual pretraining recipes for the Tucano 2 suite (Base, Instruct, and Think), which achieve state-of-the-art performance on several Portuguese-language modeling benchmarks. We also extend and refine the evaluation harness introduced in our earlier work, yielding a comprehensive evaluation suite that provides strong signals across different pretraining, continual pretraining, and post-training regimes. All artifacts associated with Tucano 2 are openly released, including training recipes, logs, and source code, ensuring that our work is reproducible, accessible, and extendable by the broader Portuguese NLP community.

Title: ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer

Authors: Chunyuan Deng, Sanket Lokegaonkar, Colin Lockard, Besnik Fetahu, Nasser Zalmout, Xian Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.03583
Pdf URL: https://arxiv.org/pdf/2603.03583
Copy Paste: [[2603.03583]] ByteFlow: Language Modeling through Adaptive Byte Compression without a Tokenizer(https://arxiv.org/abs/2603.03583)
Keywords: language model
Abstract: Modern language models still rely on fixed, pre-defined subword tokenizations. Once a tokenizer is trained, the LM can only operate at this fixed level of granularity, which often leads to brittle and counterintuitive behaviors even in otherwise strong reasoning models. We introduce \textbf{ByteFlow Net}, a new hierarchical architecture that removes tokenizers entirely and instead enables models to learn their own segmentation of raw byte streams into semantically meaningful units. ByteFlow Net performs compression-driven segmentation based on the coding rate of latent representations, yielding adaptive boundaries \emph{while preserving a static computation graph via Top-$K$ selection}. Unlike prior self-tokenizing methods that depend on brittle heuristics with human-designed inductive biases, ByteFlow Net adapts its internal representation granularity to the input itself. Experiments demonstrate that this compression-based chunking strategy yields substantial performance gains, with ByteFlow Net outperforming both BPE-based Transformers and previous byte-level architectures. These results suggest that end-to-end, tokenizer-free modeling is not only feasible but also more effective, opening a path toward more adaptive and information-grounded language models.
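A minimal sketch of the Top-$K$ idea, assuming per-position boundary scores are already available (the abstract derives them from the coding rate of latent representations, which is not reproduced here). Picking exactly `k` boundaries keeps the number of chunks bounded regardless of the input, which is one way to read the "static computation graph" remark. The helper names `top_k_boundaries` and `split_at` are hypothetical.

```python
import heapq
from typing import List, Sequence


def top_k_boundaries(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k highest boundary scores, in stream order."""
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)


def split_at(data: bytes, boundaries: List[int]) -> List[bytes]:
    """Cut the byte string so that each boundary position starts a new chunk."""
    cuts = [0] + [b for b in boundaries if b > 0] + [len(data)]
    return [data[start:end] for start, end in zip(cuts, cuts[1:])]
```

For example, with scores `[0.1, 0.9, 0.3, 0.8, 0.2]` and `k=2`, the selected boundaries are positions 1 and 3, so `b"hello"` is split into `b"h"`, `b"el"`, and `b"lo"`.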

Title: Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility

Authors: Angana Borah, Zohaib Khan, Rada Mihalcea, Verónica Pérez-Rosas
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03585
Pdf URL: https://arxiv.org/pdf/2603.03585
Copy Paste: [[2603.03585]] Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility(https://arxiv.org/abs/2603.03585)
Keywords: language model, llm, prompt
Abstract: Misinformation is a growing societal threat, and susceptibility to misinformative claims varies across demographic groups due to differences in underlying beliefs. As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor. We introduce BeliefSim, a simulation framework that constructs demographic belief profiles using psychology-informed taxonomies and survey priors. We study prompt-based conditioning and post-training adaptation, and conduct a multi-fold evaluation using: (i) susceptibility accuracy and (ii) counterfactual demographic sensitivity. Across both datasets and modeling strategies, we show that beliefs provide a strong prior for simulating misinformation susceptibility, with accuracy up to 92%.

Title: A Neural Topic Method Using a Large-Language-Model-in-the-Loop for Business Research

Authors: Stephan Ludwig, Peter J. Danaher, Xiaohao Yang
Subjects: cs.CL, econ.EM
Abstract URL: https://arxiv.org/abs/2603.03623
Pdf URL: https://arxiv.org/pdf/2603.03623
Copy Paste: [[2603.03623]] A Neural Topic Method Using a Large-Language-Model-in-the-Loop for Business Research(https://arxiv.org/abs/2603.03623)
Keywords: language model
Abstract: The growing use of unstructured text in business research makes topic modeling a central tool for constructing explanatory variables from reviews, social media, and open-ended survey responses, yet existing approaches function poorly as measurement instruments. Prior work shows that textual content predicts outcomes such as sales, satisfaction, and firm performance, but probabilistic models often generate conceptually diffuse topics, neural topic models are difficult to interpret in theory-driven settings, and large language model approaches lack standardization, stability, and alignment with document-level representations. We introduce LX Topic, a neural topic method that conceptualizes topics as latent linguistic constructs and produces calibrated document-level topic proportions for empirical analysis. LX Topic builds on FASTopic to ensure strong document representativeness and integrates large language model refinement at the topic-word level using alignment and confidence-weighting mechanisms that enhance semantic coherence without distorting document-topic distributions. Evaluations on large-scale Amazon and Yelp review datasets demonstrate that LX Topic achieves the highest overall topic quality relative to leading models while preserving clustering and classification performance. By unifying topic discovery, refinement, and standardized output in a web-based system, LX Topic establishes topic modeling as a reproducible, interpretable, and measurement-oriented instrument for marketing research and practice.

Title: MIND: Unified Inquiry and Diagnosis RL with Criteria Grounded Clinical Supports for Psychiatric Consultation

Authors: Guoyi Li, Shihao Xu, Jiatong Ma, Yunyun Han, Jianhua Chen, Yafeng Deng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03677
Pdf URL: https://arxiv.org/pdf/2603.03677
Copy Paste: [[2603.03677]] MIND: Unified Inquiry and Diagnosis RL with Criteria Grounded Clinical Supports for Psychiatric Consultation(https://arxiv.org/abs/2603.03677)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have advanced medical dialogue systems, yet psychiatric consultation poses substantially higher demands due to subjective ambiguity and comorbidity complexity: an agent must continuously extract psychopathological cues from incomplete and inconsistent patient reports in multi-turn interactions and perform rigorous differential diagnostic reasoning. However, existing methods face two fundamental challenges. First, without criteria-grounded clinical supports, they are prone to unsupported clinical assertions when symptoms are atypical or underspecified. Second, in multi-turn interactions, they struggle to mitigate inquiry drift (off-topic or low-yield questioning) and optimize questioning strategies. To address these challenges, we propose MIND, a unified inquiry--diagnosis reinforcement learning framework for psychiatric consultation. Specifically, we build a Criteria-Grounded Psychiatric Reasoning Bank (PRB) that summarizes dialogue context into clinical retrieval states, retrieves semantically similar reference consultations, and distills reusable criteria-grounded clinical supports to guide criteria-aligned inquiry and reasoning. Building on this foundation, MIND enforces explicit clinical reasoning with rubric-based process rewards to provide fine-grained supervision over intermediate decision steps, and incorporates a value-aware trajectory rectification mechanism to jointly improve information acquisition and diagnostic decision-making across turns. Extensive experiments demonstrate that MIND consistently outperforms strong baselines in diagnostic accuracy, empathetic interaction quality, interpretability, and generalization.

Title: Order Is Not Layout: Order-to-Space Bias in Image Generation

Authors: Yongkang Zhang, Zonglin Zhao, Yuechen Zhang, Fei Ding, Pei Li, Wenxuan Wang
Subjects: cs.CL, cs.AI, cs.CV, cs.MM
Abstract URL: https://arxiv.org/abs/2603.03714
Pdf URL: https://arxiv.org/pdf/2603.03714
Copy Paste: [[2603.03714]] Order Is Not Layout: Order-to-Space Bias in Image Generation(https://arxiv.org/abs/2603.03714)
Keywords: prompt
Abstract: We study a systematic bias in modern image generation models: the mention order of entities in text spuriously determines spatial layout and entity--role binding. We term this phenomenon Order-to-Space Bias (OTS) and show that it arises in both text-to-image and image-to-image generation, often overriding grounded cues and causing incorrect layouts or swapped assignments. To quantify OTS, we introduce OTS-Bench, which isolates order effects with paired prompts differing only in entity order and evaluates models along two dimensions: homogenization and correctness. Experiments show that Order-to-Space Bias (OTS) is widespread in modern image generation models, and provide evidence that it is primarily data-driven and manifests during the early stages of layout formation. Motivated by this insight, we show that both targeted fine-tuning and early-stage intervention strategies can substantially reduce OTS, while preserving generation quality.
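The paired-prompt construction can be illustrated with a small helper that fills the same template twice with the entity order swapped, so that mention order is the only difference between the two prompts. The template syntax and the function name `order_swapped_pair` are assumptions made for illustration; OTS-Bench's actual prompt format is not given in the abstract.

```python
from typing import Tuple


def order_swapped_pair(template: str, entity_a: str, entity_b: str) -> Tuple[str, str]:
    """Build two prompts that differ only in the order the entities are mentioned."""
    forward = template.format(first=entity_a, second=entity_b)
    swapped = template.format(first=entity_b, second=entity_a)
    return forward, swapped
```

Calling `order_swapped_pair("a photo of {first} and {second}", "a cat", "a dog")` yields "a photo of a cat and a dog" and "a photo of a dog and a cat". If a model consistently places the first-mentioned entity on the same side of the image in both cases, that would be consistent with the order-driven layout effect the abstract describes.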

Title: ErrorLLM: Modeling SQL Errors for Text-to-SQL Refinement

Authors: Zijin Hong, Hao Chen, Zheng Yuan, Qinggang Zhang, Luyao Zhuang, Qing Liao, Feiran Huang, Yangqiu Song, Xiao Huang
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2603.03742
Pdf URL: https://arxiv.org/pdf/2603.03742
Copy Paste: [[2603.03742]] ErrorLLM: Modeling SQL Errors for Text-to-SQL Refinement(https://arxiv.org/abs/2603.03742)
Keywords: language model, llm, hallucination, prompt
Abstract: Despite the remarkable performance of large language models (LLMs) in text-to-SQL (SQL generation), correctly producing SQL queries remains challenging during initial generation. The SQL refinement task is subsequently introduced to correct syntactic and semantic errors in generated SQL queries. However, existing paradigms face two major limitations: (i) self-debugging becomes increasingly ineffective as modern LLMs rarely produce explicit execution errors that can trigger debugging signals; (ii) self-correction exhibits low detection precision due to the lack of explicit error modeling grounded in the question and schema, and suffers from severe hallucination that frequently corrupts correct SQL queries. In this paper, we propose ErrorLLM, a framework that explicitly models text-to-SQL errors within a dedicated LLM for text-to-SQL refinement. Specifically, we represent the user question and database schema as structural features, employ static detection to identify execution failures and surface mismatches, and extend ErrorLLM's semantic space with dedicated error tokens that capture categorized implicit semantic error types. Through a well-designed training strategy, we explicitly model these errors with structural representations, enabling the LLM to detect complex implicit errors by predicting dedicated error tokens. Guided by the detected errors, we perform error-guided refinement on the SQL structure by prompting LLMs. Extensive experiments demonstrate that ErrorLLM achieves the most significant improvements over the backbone's initial generation. Further analysis reveals that detection quality directly determines refinement effectiveness, and ErrorLLM addresses both by achieving a high detection F1 score while maintaining refinement effectiveness.

Title: Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning

Authors: Chuang Zhang, Zizhen Zhu, Yihao Wei, Bing Tian, Junyi Liu, Henan Wang, Xavier Wang, Yaxiao Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03752
Pdf URL: https://arxiv.org/pdf/2603.03752
Copy Paste: [[2603.03752]] Confidence-Calibrated Small-Large Language Model Collaboration for Cost-Efficient Reasoning(https://arxiv.org/abs/2603.03752)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate superior reasoning capabilities compared to small language models (SLMs), but incur substantially higher costs. We propose COllaborative REAsoner (COREA), a system that cascades an SLM with an LLM to achieve a balance between accuracy and cost in complex reasoning tasks. COREA first attempts to answer questions using the SLM, which outputs both an answer and a verbalized confidence score. Questions with confidence below a predefined threshold are deferred to the LLM for more accurate resolution. We introduce a reinforcement learning-based training algorithm that aligns the SLM's confidence through an additional confidence calibration reward. Extensive experiments demonstrate that our method jointly improves the SLM's reasoning ability and confidence calibration across diverse datasets and model backbones. Compared to using the LLM alone, COREA reduces cost by 21.5% and 16.8% on out-of-domain math and non-math datasets, respectively, while keeping the absolute pass@1 drop within 2%.
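The deferral rule described above can be sketched as a simple threshold cascade. This is an illustration of the routing logic only: the `cascade_answer` function, the callable signatures, and the default threshold of 0.8 are assumptions, and the reinforcement-learning calibration step is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class CascadeResult:
    answer: str
    used_llm: bool


def cascade_answer(
    question: str,
    slm: Callable[[str], Tuple[str, float]],  # returns (answer, verbalized confidence in [0, 1])
    llm: Callable[[str], str],
    threshold: float = 0.8,  # hypothetical value; the paper's threshold is not stated here
) -> CascadeResult:
    """Answer with the small model, deferring to the large model when confidence is low."""
    answer, confidence = slm(question)
    if confidence >= threshold:
        return CascadeResult(answer=answer, used_llm=False)
    # Confidence below the threshold: pay for the larger model.
    return CascadeResult(answer=llm(question), used_llm=True)


def toy_slm(question: str) -> Tuple[str, float]:
    """Stand-in small model: confident only on one easy question."""
    return ("4", 0.95) if question == "2+2?" else ("unsure", 0.30)


def toy_llm(question: str) -> str:
    """Stand-in large model."""
    return "answer from large model"
```

The cost saving in such a cascade depends on how many questions clear the threshold, which is why the calibration of the SLM's stated confidence matters: an overconfident SLM keeps wrong answers, while an underconfident one defers too often.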

Title: T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning

Authors: Qinsi Wang, Hancheng Ye, Jinhee Kim, Jinghan Ke, Yifei Wang, Martin Kuo, Zishan Shao, Dongting Li, Yueqian Lin, Ting Jiang, Chiyue Wei, Qi Qian, Wei Wen, Helen Li, Yiran Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03790
Pdf URL: https://arxiv.org/pdf/2603.03790
Copy Paste: [[2603.03790]] T2S-Bench & Structure-of-Thought: Benchmarking and Prompting Comprehensive Text-to-Structure Reasoning(https://arxiv.org/abs/2603.03790)
Keywords: language model, prompt
Abstract: Think about how humans handle complex reading tasks: marking key points, inferring their relationships, and structuring information to guide understanding and responses. Likewise, can a large language model benefit from text structure to enhance text-processing performance? To explore this question, we first introduce Structure of Thought (SoT), a prompting technique that explicitly guides models to construct intermediate text structures, consistently boosting performance across eight tasks and three model families. Building upon this insight, we present T2S-Bench, the first benchmark designed to evaluate and improve text-to-structure capabilities of models. T2S-Bench includes 1.8K samples across 6 scientific domains and 32 structural types, rigorously constructed to ensure accuracy, fairness, and quality. Evaluation on 45 mainstream models reveals substantial improvement potential: the average accuracy on the multi-hop reasoning task is only 52.1%, and even the most advanced model achieves only 58.1% node accuracy in end-to-end extraction. Furthermore, on Qwen2.5-7B-Instruct, SoT alone yields an average +5.7% improvement across eight diverse text-processing tasks, and fine-tuning on T2S-Bench further increases this gain to +8.6%. These results highlight the value of explicit text structuring and the complementary contributions of SoT and T2S-Bench. Dataset and eval code have been released at this https URL.

Title: Benchmarking Motivational Interviewing Competence of Large Language Models

Authors: Aishwariya Jha, Prakrithi Shivaprakash, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand, Pratima Murthy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03846
Pdf URL: https://arxiv.org/pdf/2603.03846
Copy Paste: [[2603.03846]] Benchmarking Motivational Interviewing Competence of Large Language Models(https://arxiv.org/abs/2603.03846)
Keywords: language model, llm
Abstract: Motivational interviewing (MI) promotes behavioural change in substance use disorders. Its fidelity is measured using the Motivational Interviewing Treatment Integrity (MITI) framework. While large language models (LLMs) can potentially generate MI-consistent therapist responses, their competence under MITI is not well researched, especially in real-world clinical transcripts. We aim to benchmark the MI competence of proprietary and open-source models against human therapists in real-world transcripts and to assess their distinguishability from human therapists. Methods: We shortlisted 3 proprietary and 7 open-source LLMs from LMArena and evaluated performance using the MITI 4.2 framework on two datasets (96 handcrafted model transcripts, 34 real-world clinical transcripts). We generated parallel LLM-therapist utterances iteratively for each transcript while keeping client responses static, and ranked performance using a composite ranking system with MITI components and verbosity. We conducted a distinguishability experiment with two independent psychiatrists to identify human-vs-LLM responses. Results: All 10 tested LLMs had fair (MITI global scores >3.5) to good (MITI global scores >4) competence across MITI measures, and the three best-performing models (gemma-3-27b-it, gemini-2.5-pro, grok-3) were tested on real-world transcripts. All showed good competence, with LLMs outperforming human experts in Complex Reflection percentage (39% vs 96%) and Reflection-Question ratio (1.2 vs >2.8). In the distinguishability experiment, psychiatrists identified LLM responses with only 56% accuracy, with d-prime: 0.17 and 0.25 for gemini-2.5-pro and gemma-3-27b-it respectively. Conclusion: LLMs can achieve good MI proficiency in real-world clinical transcripts under the MITI framework. These findings suggest that even open-source LLMs are viable candidates for expanding MI counselling sessions in low-resource settings.
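For readers unfamiliar with d-prime, the standard signal-detection definition is the difference between the z-transformed hit rate and false-alarm rate. The sketch below uses that textbook formula; treating "LLM response labelled as LLM" as a hit and "human response labelled as LLM" as a false alarm is an assumption about how the study framed the task, and the abstract does not report the underlying rates.

```python
from statistics import NormalDist


def d_prime(hit_rate: float, false_alarm_rate: float) -> float:
    """Signal-detection sensitivity: z(hit rate) - z(false-alarm rate).

    Both rates must lie strictly between 0 and 1, because the inverse
    normal CDF is undefined at the endpoints.
    """
    z = NormalDist().inv_cdf  # inverse CDF of the standard normal
    return z(hit_rate) - z(false_alarm_rate)
```

A d-prime of 0 means the rater cannot separate the two classes at all, so values such as 0.17 and 0.25 indicate discrimination only slightly better than guessing.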

Title: Coupling Local Context and Global Semantic Prototypes via a Hierarchical Architecture for Rhetorical Roles Labeling

Authors: Anas Belfathi, Nicolas Hernandez, Laura Monceaux, Warren Bonnard, Mary Catherine Lavissiere, Christine Jacquin, Richard Dufour
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03856
Pdf URL: https://arxiv.org/pdf/2603.03856
Copy Paste: [[2603.03856]] Coupling Local Context and Global Semantic Prototypes via a Hierarchical Architecture for Rhetorical Roles Labeling(https://arxiv.org/abs/2603.03856)
Keywords: language model
Abstract: Rhetorical Role Labeling (RRL) identifies the functional role of each sentence in a document, a key task for discourse understanding in domains such as law and medicine. While hierarchical models capture local dependencies effectively, they are limited in modeling global, corpus-level features. To address this limitation, we propose two prototype-based methods that integrate local context with global representations. Prototype-Based Regularization (PBR) learns soft prototypes through a distance-based auxiliary loss to structure the latent space, while Prototype-Conditioned Modulation (PCM) constructs corpus-level prototypes and injects them during training and inference. Given the scarcity of RRL resources, we introduce SCOTUS-Law, the first dataset of U.S. Supreme Court opinions annotated with rhetorical roles at three levels of granularity: category, rhetorical function, and step. Experiments on legal, medical, and scientific benchmarks show consistent improvements over strong baselines, with gains of 4 Macro-F1 points on low-frequency roles. We further analyze the implications in the era of Large Language Models and complement our findings with expert evaluation.
摘要：修辞角色标签（RRL）识别文档中每个句子的功能角色，这是法律和医学等领域话语理解的关键任务。虽然分层模型可以有效地捕获局部依赖关系，但它们在建模全局、语料库级特征方面受到限制。为了解决这个限制，我们提出了两种基于原型的方法，将局部上下文与全局表示相结合。基于原型的正则化（PBR）通过基于距离的辅助损失来学习软原型来构建潜在空间，而原型条件调制（PCM）则构建语料库级原型并在训练和推理过程中注入它们。鉴于 RRL 资源的稀缺性，我们引入了 SCOTUS-Law，这是美国最高法院意见的第一个数据集，在三个粒度级别上注释了修辞角色：类别、修辞功能和步骤。法律、医学和科学基准的实验表明，与强基准相比，其持续改进，低频角色的 Macro-F1 增益为 4。我们进一步分析了大型语言模型时代的影响，并通过专家评估补充了我们的发现。

Title: Assessing the Effectiveness of LLMs in Delivering Cognitive Behavioral Therapy

Authors: Navdeep Singh Bedi, Ana-Maria Bucur, Noriko Kando, Fabio Crestani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.03862
Pdf URL: https://arxiv.org/pdf/2603.03862
Copy Paste: [[2603.03862]] Assessing the Effectiveness of LLMs in Delivering Cognitive Behavioral Therapy(https://arxiv.org/abs/2603.03862)
Keywords: language model, llm, retrieval-augmented generation
Abstract: As mental health issues continue to rise globally, there is an increasing demand for accessible and scalable therapeutic solutions. Many individuals currently seek support from Large Language Models (LLMs), even though these models have not been validated for use in counseling services. In this paper, we evaluate LLMs' ability to emulate professional therapists practicing Cognitive Behavioral Therapy (CBT). Using anonymized, transcribed role-play sessions between licensed therapists and clients, we compare two approaches: (1) a generation-only method and (2) a Retrieval-Augmented Generation (RAG) approach using CBT guidelines. We evaluate both proprietary and open-source models for linguistic quality, semantic coherence, and therapeutic fidelity using standard natural language generation (NLG) metrics, natural language inference (NLI), and automated scoring for skills assessment. Our results indicate that while LLMs can generate CBT-like dialogues, they are limited in their ability to convey empathy and maintain consistency.
摘要：随着全球心理健康问题持续上升，对易于使用且可扩展的治疗解决方案的需求不断增加。目前，许多人寻求大型语言模型 (LLM) 的支持，尽管这些模型尚未经过验证可用于咨询服务。在本文中，我们评估了法学硕士模仿专业治疗师实践认知行为疗法（CBT）的能力。通过在持照治疗师和客户之间进行匿名、转录的角色扮演会议，我们比较了两种方法：(1) 仅生成方法和 (2) 使用 CBT 指南的检索增强生成 (RAG) 方法。我们使用标准自然语言生成 (NLG) 指标、自然语言推理 (NLI) 和技能评估自动评分来评估专有和开源模型的语言质量、语义一致性和治疗保真度。我们的结果表明，虽然法学硕士可以产生类似 CBT 的对话，但他们传达同理心和保持一致性的能力有限。

Title: CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents

Authors: Martin Kostelník, Michal Hradiš, Martin Dočekal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03884
Pdf URL: https://arxiv.org/pdf/2603.03884
Copy Paste: [[2603.03884]] CzechTopic: A Benchmark for Zero-Shot Topic Localization in Historical Czech Documents(https://arxiv.org/abs/2603.03884)
Keywords: language model, llm
Abstract: Topic localization aims to identify spans of text that express a given topic defined by a name and description. To study this task, we introduce a human-annotated benchmark based on Czech historical documents, containing human-defined topics together with manually annotated spans and supporting evaluation at both document and word levels. Evaluation is performed relative to human agreement rather than a single reference annotation. We evaluate a diverse range of large language models alongside BERT-based models fine-tuned on a distilled development dataset. Results reveal substantial variability among LLMs, with performance ranging from near-human topic detection to pronounced failures in span localization. While the strongest models approach human agreement, the distilled token embedding models remain competitive despite their smaller scale. The dataset and evaluation framework are publicly available at: this https URL.
摘要：主题本地化旨在识别表达由名称和描述定义的给定主题的文本范围。为了研究这项任务，我们引入了一个基于捷克历史文档的人工注释基准，其中包含人工定义的主题以及手动注释的范围，并支持文档和单词级别的评估。评估是根据人类协议而不是单个参考注释进行的。我们评估各种大型语言模型以及在精炼开发数据集上微调的基于 BERT 的模型。结果显示法学硕士之间存在很大差异，其表现范围从接近人类的主题检测到跨度定位的明显失败。虽然最强大的模型接近人类共识，但精炼的代币嵌入模型尽管规模较小，但仍然具有竞争力。数据集和评估框架可在以下网址公开获取：此 https URL。

Title: Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects

Authors: Ji-Lun Peng, Yun-Nung Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.03915
Pdf URL: https://arxiv.org/pdf/2603.03915
Copy Paste: [[2603.03915]] Rethinking Role-Playing Evaluation: Anonymous Benchmarking and a Systematic Study of Personality Effects(https://arxiv.org/abs/2603.03915)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have demonstrated significant potential in developing Role-Playing Agents (RPAs). However, current research primarily evaluates RPAs using famous fictional characters, allowing models to rely on memory associated with character names. This dependency creates a bias that limits the generalization of RPAs to unseen personas. To address this issue, we propose an anonymous evaluation method. Experiments across multiple benchmarks reveal that anonymization significantly degrades role-playing performance, confirming that name exposure carries implicit information. Furthermore, we investigate personality augmentation to enhance role fidelity under the anonymous setting. We systematically compare the efficacy of personality traits derived from human annotations versus those self-generated by the model. Our results demonstrate that incorporating personality information consistently improves RPA performance. Crucially, self-generated personalities achieve performance comparable to human-annotated ones. This work establishes a fairer evaluation protocol and validates a scalable, personality-enhanced framework for constructing robust RPAs.
摘要：大型语言模型 (LLM) 在开发角色扮演代理 (RPA) 方面表现出了巨大的潜力。然而，目前的研究主要使用著名的虚构人物来评估 RPA，允许模型依赖与人物名称相关的记忆。这种依赖性造成了一种偏见，限制了 RPA 对看不见的角色的推广。为了解决这个问题，我们提出了一种匿名评估方法。跨多个基准的实验表明，匿名化会显着降低角色扮演性能，证实姓名暴露携带隐式信息。此外，我们研究了人格增强，以增强匿名环境下的角色忠诚度。我们系统地比较了源自人类注释的人格特征与模型自行生成的人格特征的功效。我们的结果表明，整合个性信息可以持续提高 RPA 性能。至关重要的是，自我生成的个性所达到的性能可与人类注释的个性相媲美。这项工作建立了一个更公平的评估协议，并验证了一个可扩展、个性增强的框架，用于构建强大的 RPA。

Title: Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA

Authors: Ikram Belmadani, Oumaima El Khettari, Pacôme Constant dit Beaufils, Richard Dufour, Benoit Favre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04033
Pdf URL: https://arxiv.org/pdf/2603.04033
Copy Paste: [[2603.04033]] Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA(https://arxiv.org/abs/2603.04033)
Keywords: language model, llm
Abstract: Automatic evaluation of medical open-ended question answering (OEQA) remains challenging due to the need for expert annotations. We evaluate whether large language models (LLMs) can act as judges of semantic equivalence in French medical OEQA, comparing closed-access, general-purpose, and biomedical domain-adapted models. Our results show that LLM-based judgments are strongly influenced by the model that generated the answer, with agreement varying substantially across generators. Domain-adapted and large general-purpose models achieve the highest alignment with expert annotations. We further show that lightweight adaptation of a compact model using supervised fine-tuning (SFT) and Group Relative Policy Optimization (GRPO) substantially improves performance and reduces generator sensitivity, even with limited data. Overall, our findings highlight the need for generator-aware evaluation and suggest that carefully adapted small models can support scalable evaluation in low-resource medical settings.
摘要：由于需要专家注释，医学开放式问答 (OEQA) 的自动评估仍然具有挑战性。我们评估大型语言模型 (LLM) 是否可以充当法国医学 OEQA 中语义等价的判断者，比较封闭访问、通用和生物医学领域适应模型。我们的结果表明，基于 LLM 的判断受到生成答案的模型的强烈影响，不同生成器的一致性差异很大。领域适应模型和大型通用模型与专家注释实现了最高的一致性。我们进一步表明，即使数据有限，使用监督微调（SFT）和组相对策略优化（GRPO）对紧凑模型进行轻量级调整也可以显着提高性能并降低生成器灵敏度。总的来说，我们的研究结果强调了对生成器感知评估的必要性，并表明精心调整的小型模型可以支持资源匮乏的医疗环境中的可扩展评估。

Title: Monitoring Emergent Reward Hacking During Generation via Internal Activations

Authors: Patrick Wilhelm, Thorsten Wittkopp, Odej Kao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04069
Pdf URL: https://arxiv.org/pdf/2603.04069
Copy Paste: [[2603.04069]] Monitoring Emergent Reward Hacking During Generation via Internal Activations(https://arxiv.org/abs/2603.04069)
Keywords: language model, prompt, chain-of-thought
Abstract: Fine-tuned large language models can exhibit reward-hacking behavior arising from emergent misalignment, which is difficult to detect from final outputs alone. While prior work has studied reward hacking at the level of completed responses, it remains unclear whether such behavior can be identified during generation. We propose an activation-based monitoring approach that detects reward-hacking signals from internal representations as a model generates its response. Our method trains sparse autoencoders on residual stream activations and applies lightweight linear classifiers to produce token-level estimates of reward-hacking activity. Across multiple model families and fine-tuning mixtures, we find that internal activation patterns reliably distinguish reward-hacking from benign behavior, generalize to unseen mixed-policy adapters, and exhibit model-dependent temporal structure during chain-of-thought reasoning. Notably, reward-hacking signals often emerge early, persist throughout reasoning, and can be amplified by increased test-time compute in the form of chain-of-thought prompting under weakly specified reward objectives. These results suggest that internal activation monitoring provides a complementary and earlier signal of emergent misalignment than output-based evaluation, supporting more robust post-deployment safety monitoring for fine-tuned language models.
摘要：经过微调的大型语言模型可能会表现出因出现的不一致而引起的奖励黑客行为，而仅从最终输出中很难检测到这种行为。虽然之前的工作已经在完成的响应水平上研究了奖励黑客行为，但目前尚不清楚这种行为是否可以在生成过程中被识别。我们提出了一种基于激活的监控方法，当模型生成响应时，该方法可以从内部表示中检测奖励黑客信号。我们的方法在残差流激活上训练稀疏自动编码器，并应用轻量级线性分类器来生成奖励黑客活动的令牌级估计。在多个模型系列和微调混合物中，我们发现内部激活模式能够可靠地区分奖励黑客行为和良性行为，泛化到看不见的混合策略适配器，并在思想链推理过程中表现出依赖于模型的时间结构。值得注意的是，奖励黑客信号通常很早就出现，在整个推理过程中持续存在，并且可以通过在弱指定的奖励目标下以思维链提示的形式增加测试时间计算来放大。这些结果表明，内部激活监控比基于输出的评估提供了紧急失调的补充和更早的信号，支持对微调语言模型进行更强大的部署后安全监控。

Title: Hindsight Quality Prediction Experiments in Multi-Candidate Human-Post-Edited Machine Translation

Authors: Malik Marmonier, Benoît Sagot, Rachel Bawden
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04083
Pdf URL: https://arxiv.org/pdf/2603.04083
Copy Paste: [[2603.04083]] Hindsight Quality Prediction Experiments in Multi-Candidate Human-Post-Edited Machine Translation(https://arxiv.org/abs/2603.04083)
Keywords: language model, llm
Abstract: This paper investigates two complementary paradigms for predicting machine translation (MT) quality: source-side difficulty prediction and candidate-side quality estimation (QE). The rapid adoption of Large Language Models (LLMs) into MT workflows is reshaping the research landscape, yet its impact on established quality prediction paradigms remains underexplored. We study this issue through a series of "hindsight" experiments on a unique, multi-candidate dataset resulting from a genuine MT post-editing (MTPE) project. The dataset consists of over 6,000 English source segments with nine translation hypotheses from a diverse set of traditional neural MT systems and advanced LLMs, all evaluated against a single, final human post-edited reference. Using Kendall's rank correlation, we assess the predictive power of source-side difficulty metrics, candidate-side QE models and position heuristics against two gold-standard scores: TER (as a proxy for post-editing effort) and COMET (as a proxy for human judgment). Our findings highlight that the architectural shift towards LLMs alters the reliability of established quality prediction methods while simultaneously mitigating previous challenges in document-level translation.
摘要：本文研究了预测机器翻译（MT）质量的两种互补范式：源端难度预测和候选端质量估计（QE）。大型语言模型（LLM）在机器翻译工作流程中的快速采用正在重塑研究格局，但其对既定质量预测范式的影响仍未得到充分探索。我们通过对由真正的机器翻译后期编辑 (MTPE) 项目产生的独特的多候选数据集进行一系列“事后诸葛亮”实验来研究这个问题。该数据集包含 6,000 多个英语源片段，以及来自一组不同的传统神经机器翻译系统和高级法学硕士的九个翻译假设，所有这些都根据单个最终的人工后期编辑参考进行评估。使用肯德尔的排名相关性，我们根据两个黄金标准分数来评估源端难度指标、候选端 QE 模型和位置启发式的预测能力：TER（作为后期编辑工作的代理）和 COMET（作为人类判断的代理）。我们的研究结果强调，向法学硕士的架构转变改变了既定质量预测方法的可靠性，同时缓解了之前文档级翻译中的挑战。
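
Example (illustrative, not from the paper): the abstract uses Kendall's rank correlation to compare predictor rankings with gold-standard scores. The sketch below implements the simplest variant, tau-a, in plain Python. The abstract does not say which variant or tie handling the authors use, so the choice of tau-a and the function name `kendall_tau_a` are assumptions.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1  # both sequences order this pair the same way
        elif sign < 0:
            discordant += 1  # the two sequences disagree on this pair
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs
```

A value of 1 means the predictor orders every pair of segments the same way as the gold score, -1 means it reverses every pair, and values near 0 indicate little rank agreement.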

Title: FINEST: Improving LLM Responses to Sensitive Topics Through Fine-Grained Evaluation

Authors: Juhyun Oh, Nayeon Lee, Chani Jung, Jiho Jin, Junho Myung, Jongwon Lee, Taeui Song, Alice Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04123
Pdf URL: https://arxiv.org/pdf/2603.04123
Copy Paste: [[2603.04123]] FINEST: Improving LLM Responses to Sensitive Topics Through Fine-Grained Evaluation(https://arxiv.org/abs/2603.04123)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) often generate overly cautious and vague responses on sensitive topics, sacrificing helpfulness for safety. Existing evaluation frameworks lack systematic methods to identify and address specific weaknesses in responses to sensitive topics, making it difficult to improve both safety and helpfulness simultaneously. To address this, we introduce FINEST, a FINE-grained response evaluation taxonomy for Sensitive Topics, which breaks down helpfulness and harmlessness into errors across three main categories: Content, Logic, and Appropriateness. Experiments on a Korean-sensitive question dataset demonstrate that our score- and error-based improvement pipeline, guided by FINEST, significantly improves the model responses across all three categories, outperforming refinement without guidance. Notably, score-based improvement -- providing category-specific scores and justifications -- yields the most significant gains, reducing the error sentence ratio for Appropriateness by up to 33.09%. This work lays the foundation for a more explainable and comprehensive evaluation and improvement of LLM responses to sensitive questions.
摘要：大型语言模型 (LLM) 通常会对敏感主题产生过于谨慎和模糊的响应，从而牺牲了安全性。现有的评估框架缺乏系统的方法来识别和解决敏感话题响应中的具体弱点，因此难以同时提高安全性和有用性。为了解决这个问题，我们引入了 FINEST，这是一种针对敏感主题的细粒度响应评估分类法，它将有用性和无害性分为三个主要类别的错误：内容、逻辑和适当性。对韩语敏感问题数据集的实验表明，我们基于分数和错误的改进管道在 FINEST 的指导下，显着改善了所有三个类别的模型响应，优于没有指导的细化。值得注意的是，基于分数的改进（提供特定类别的分数和理由）产生了最显着的收益，将适当性的错误句子率降低了高达 33.09%。这项工作为更可解释和更全面地评估和改进法学硕士对敏感问题的回答奠定了基础。

Title: Traces of Social Competence in Large Language Models

Authors: Tom Kouwenhoven, Michiel van der Meer, Max van Duijn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04161
Pdf URL: https://arxiv.org/pdf/2603.04161
Copy Paste: [[2603.04161]] Traces of Social Competence in Large Language Models(https://arxiv.org/abs/2603.04161)
Keywords: language model, llm
Abstract: The False Belief Test (FBT) has been the main method for assessing Theory of Mind (ToM) and related socio-cognitive competencies. For Large Language Models (LLMs), the reliability and explanatory potential of this test have remained limited due to issues like data contamination, insufficient model details, and inconsistent controls. We address these issues by testing 17 open-weight models on a balanced set of 192 FBT variants (Trott et al. 2023), using Bayesian logistic regression to identify how model size and post-training affect socio-cognitive competence. We find that scaling model size benefits performance, but not strictly. A cross-over effect reveals that explicating propositional attitudes (X thinks) fundamentally alters response patterns. Instruction tuning partially mitigates this effect, but further reasoning-oriented finetuning amplifies it. In a case study analysing social reasoning ability throughout OLMo 2 training, we show that this cross-over effect emerges during pre-training, suggesting that models acquire stereotypical response patterns tied to mental-state vocabulary that can outweigh other scenario semantics. Finally, vector steering allows us to isolate a think vector as the causal driver of observed FBT behaviour.
摘要：错误信念测试（FBT）一直是评估心理理论（ToM）和相关社会认知能力的主要方法。对于大型语言模型 (LLM)，由于数据污染、模型细节不足和控制不一致等问题，该测试的可靠性和解释潜力仍然有限。我们通过使用贝叶斯逻辑回归在 192 个 FBT 变体的平衡集上测试 17 个开放权重模型来解决这些问题（Trott 等人，2023），以确定模型大小和训练后如何影响社会认知能力。我们发现缩放模型大小有利于性能，但严格来说并不如此。交叉效应表明，阐明命题态度（X 认为）从根本上改变了反应模式。指令调整部分减轻了这种影响，但进一步面向推理的微调会放大这种影响。在分析整个 OLMo 2 训练过程中的社交推理能力的案例研究中，我们表明这种交叉效应在预训练期间出现，这表明模型获得了与心理状态词汇相关的刻板反应模式，这些模式可能比其他场景语义更重要。最后，向量引导使我们能够将思维向量隔离为观察到的 FBT 行为的因果驱动因素。

Title: Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model

Authors: Jakub Prejzner
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2603.04162
Pdf URL: https://arxiv.org/pdf/2603.04162
Copy Paste: [[2603.04162]] Bielik-Q2-Sharp: A Comparative Study of Extreme 2-bit Quantization Methods for a Polish 11B Language Model(https://arxiv.org/abs/2603.04162)
Keywords: language model, gpt
Abstract: We present Bielik-Q2-Sharp, the first systematic academic evaluation of extreme 2-bit quantization applied to a Polish large language model. Using Bielik-11B-v2.3-Instruct (11B parameters, Mistral architecture) as our base model, we compare six state-of-the-art post-training quantization methods -- QuIP#, SpinQuant+GPTQ, ButterflyQuant, QTIP, VPTQ, and AQLM -- all calibrated on a Polish-language corpus (CulturaX-PL) with shared Hessian matrices. Our best variant (QuIP# E8P12) achieves 71.92% across 22 Polish benchmarks versus 72.07% for the IQ2_XXS baseline -- within statistical noise, at a modest size premium (3.26 GB vs. ~2.6 GB). On eq_bench, our method scores 47.14 versus 43.53 (+3.6pp), suggesting superior preservation of higher-order reasoning. QTIP achieves the best per-bit efficiency (79.4% MC acc_norm at ~2.4 bpw, 3.27 GB), matching VPTQ's quality at 35% smaller size. We additionally document an MC-generation dissociation phenomenon in which rotation-based methods preserve log-likelihood quality but fail catastrophically at autoregressive generation. The entire project was conducted by a single independent researcher on cloud GPUs (this http URL) within a $285 budget. All models, Hessians, and evaluation logs are publicly available.
摘要：我们提出了 Bielik-Q2-Sharp，这是第一个对应用于波兰大语言模型的极端 2 位量化进行系统学术评估。使用 Bielik-11B-v2.3-Instruct（11B 参数，Mistral 架构）作为基础模型，我们比较了六种最先进的训练后量化方法——QuIP#、SpinQuant+GPTQ、ButterflyQuant、QTIP、VPTQ 和 AQLM——所有方法都在具有共享 Hessian 矩阵的波兰语语料库 (CulturaX-PL) 上进行了校准。我们的最佳变体 (QuIP# E8P12) 在 22 个波兰基准测试中实现了 71.92%，而 IQ2_XXS 基准则为 72.07%——在统计噪声范围内，大小溢价适中（3.26 GB 与 ~2.6 GB）。在 eq_bench 上，我们的方法得分为 47.14 比 43.53 (+3.6pp)，表明对高阶推理的出色保留。 QTIP 实现了最佳的每比特效率（79.4% MC acc_norm，约 2.4 bpw，3.27 GB），与 VPTQ 的质量相当，但尺寸缩小了 35%。我们还记录了 MC 生成解离现象，其中基于旋转的方法保留了对数似然质量，但在自回归生成时却灾难性地失败。整个项目由一位独立研究人员在云 GPU（此 http URL）上进行，预算为 285 美元。所有模型、Hessians 和评估日志都是公开的。

Title: When Do Language Models Endorse Limitations on Human Rights Principles?

Authors: Keenan Samway, Nicole Miu Takagi, Rada Mihalcea, Bernhard Schölkopf, Ilias Chalkidis, Daniel Hershcovich, Zhijing Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04217
Pdf URL: https://arxiv.org/pdf/2603.04217
Copy Paste: [[2603.04217]] When Do Language Models Endorse Limitations on Human Rights Principles?(https://arxiv.org/abs/2603.04217)
Keywords: language model, llm, prompt
Abstract: As Large Language Models (LLMs) increasingly mediate global information access with the potential to shape public discourse, their alignment with universal human rights principles becomes important to ensure that these rights are abided by in high-stakes AI-mediated interactions. In this paper, we evaluate how LLMs navigate trade-offs involving the Universal Declaration of Human Rights (UDHR), leveraging 1,152 synthetically generated scenarios across 24 rights articles and eight languages. Our analysis of eleven major LLMs reveals systematic biases where models: (1) accept limiting Economic, Social, and Cultural rights more often than Political and Civil rights, (2) demonstrate significant cross-linguistic variation with elevated endorsement rates of rights-limiting actions in Chinese and Hindi compared to English or Romanian, (3) show substantial susceptibility to prompt-based steering, and (4) exhibit noticeable differences between Likert and open-ended responses, highlighting critical challenges in LLM preference assessment.
摘要：随着大型语言模型 (LLM) 越来越多地介导全球信息访问，并有可能塑造公共话语，它们与普遍人权原则的一致性变得非常重要，以确保这些权利在人工智能介导的高风险互动中得到遵守。在本文中，我们评估了法学硕士如何在涉及《世界人权宣言》(UDHR) 的问题上进行权衡，利用跨 24 条权利条款和 8 种语言的 1,152 个综合生成的场景。我们对十一个主要法学硕士的分析揭示了系统性偏见，其中模型：（1）比政治和公民权利更多地接受限制经济、社会和文化权利，（2）显示出显着的跨语言差异，与英语或罗马尼亚语相比，中文和印地语的权利限制行为的认可率更高，（3）显示出对基于提示的指导的极大敏感性，以及（4）显示出李克特和开放式回答之间的显着差异，突出了法学硕士偏好评估中的关键挑战。

Title: Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG

Authors: Martin Asenov, Kenza Benkirane, Dan Goldwater, Aneiss Ghodsi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04238
Pdf URL: https://arxiv.org/pdf/2603.04238
Copy Paste: [[2603.04238]] Retrieval or Representation? Reassessing Benchmark Gaps in Multilingual and Visually Rich RAG(https://arxiv.org/abs/2603.04238)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) is a common way to ground language models in external documents and up-to-date information. Classical retrieval systems relied on lexical methods such as BM25, which rank documents by term overlap with corpus-level weighting. End-to-end multimodal retrievers trained on large query-document datasets claim substantial improvements over these approaches, especially for multilingual documents with complex visual layouts. We demonstrate that better document representation is the primary driver of benchmark improvements. By systematically varying transcription and preprocessing methods while holding the retrieval mechanism fixed, we demonstrate that BM25 can recover large gaps on multilingual and visual benchmarks. Our findings call for decomposed evaluation benchmarks that separately measure transcription and retrieval capabilities, enabling the field to correctly attribute progress and focus effort where it matters.
摘要：检索增强生成（RAG）是在外部文档和最新信息中建立语言模型的常用方法。经典检索系统依赖于 BM25 等词汇方法，该方法根据术语重叠和语料库级别权重对文档进行排序。在大型查询文档数据集上训练的端到端多模式检索器声称比这些方法有了实质性的改进，特别是对于具有复杂视觉布局的多语言文档。我们证明更好的文档表示是基准改进的主要驱动力。通过系统地改变转录和预处理方法，同时保持检索机制固定，我们证明 BM25 可以恢复多语言和视觉基准上的大差距。我们的研究结果需要分解的评估基准，分别衡量转录和检索能力，使该领域能够正确归因进展并将工作重点放在重要的地方。
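
Example (illustrative, not from the paper): the abstract describes BM25 as ranking documents by term overlap with corpus-level weighting. The sketch below shows one common Okapi BM25 formulation with a smoothed IDF. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, and whitespace-style pre-tokenized input are assumptions; the paper's exact BM25 configuration is not given in the abstract.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document in `docs` against the tokenized `query`."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue  # no overlap on this term, no contribution
            # Corpus-level weight: rarer terms get a larger (non-negative) IDF.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [["cat", "sat"], ["dog", "ran"], ["cat", "cat", "dog"]]
scores = bm25_scores(["cat"], docs)
```

Because the scorer only sees tokens, its results depend entirely on how a page was transcribed and preprocessed, which is the variable the abstract says it manipulates while holding retrieval fixed.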

Title: Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory

Authors: Zhenting Wang, Huancheng Chen, Jiayun Wang, Wei Wei
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2603.04257
Pdf URL: https://arxiv.org/pdf/2603.04257
Copy Paste: [[2603.04257]] Memex(RL): Scaling Long-Horizon LLM Agents via Indexed Experience Memory(https://arxiv.org/abs/2603.04257)
Keywords: language model, llm, agent
Abstract: Large language model (LLM) agents are fundamentally bottlenecked by finite context windows on long-horizon tasks. As trajectories grow, retaining tool outputs and intermediate reasoning in-context quickly becomes infeasible: the working context becomes prohibitively long, eventually exceeds the context budget, and makes distant evidence harder to use even when it is still present. Existing solutions typically shorten context through truncation or running summaries, but these methods are fundamentally lossy because they compress or discard past evidence itself. We introduce Memex, an indexed experience memory mechanism that instead compresses context without discarding evidence. Memex maintains a compact working context consisting of concise structured summaries and stable indices, while storing full-fidelity underlying interactions in an external experience database under those indices. The agent can then decide when to dereference an index and recover the exact past evidence needed for the current subgoal. We optimize both write and read behaviors with our reinforcement learning framework MemexRL, using reward shaping tailored to indexed memory usage under a context budget, so the agent learns what to summarize, what to archive, how to index it, and when to retrieve it. This yields a substantially less lossy form of long-horizon memory than summary-only approaches. We further provide a theoretical analysis showing the potential of the Memex loop to preserve decision quality with bounded dereferencing while keeping effective in-context computation bounded as history grows. Empirically, on challenging long-horizon tasks, a Memex agent trained with MemexRL improves task success while using a significantly smaller working context.
摘要：大型语言模型（LLM）智能体从根本上受到长期任务上有限上下文窗口的瓶颈。随着轨迹的增长，在上下文中保留工具输出和中间推理很快变得不可行：工作上下文变得非常长，最终超出了上下文预算，并且使得远程证据即使仍然存在也更难以使用。现有的解决方案通常通过截断或运行摘要来缩短上下文，但这些方法从根本上来说是有损的，因为它们本身压缩或丢弃了过去的证据。我们引入了 Memex，一种索引经验记忆机制，可以在不丢弃证据的情况下压缩上下文。 Memex 维护一个紧凑的工作环境，由简洁的结构化摘要和稳定的索引组成，同时在这些索引下的外部经验数据库中存储全保真的底层交互。然后，代理可以决定何时取消引用索引并恢复当前子目标所需的确切过去证据。我们使用强化学习框架 MemexRL 优化写入和读取行为，在上下文预算下使用针对索引内存使用情况定制的奖励塑造，以便代理学习要总结什么、要存档什么、如何对其进行索引以及何时检索它。与仅摘要方法相比，这产生了一种损耗大大减少的长视野记忆形式。我们进一步提供了理论分析，显示了 Memex 循环通过有界解除引用来保持决策质量的潜力，同时随着历史的增长保持有效的上下文计算有界。根据经验，在具有挑战性的长期任务中，使用 MemexRL 训练的 Memex 代理可以在使用更小的工作环境的同时提高任务成功率。
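
Example (illustrative, not from the paper): the abstract describes a working context of concise summaries plus stable indices, with full-fidelity records kept in an external store and recovered by dereferencing. The toy sketch below shows only that data-structure idea. The class `IndexedMemory`, its method names, and the `mem-0000` key format are hypothetical, and the learned write/read policy (MemexRL) is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedMemory:
    """Toy indexed memory: short summaries in context, full records outside."""

    store: dict = field(default_factory=dict)      # index -> full-fidelity record
    summaries: dict = field(default_factory=dict)  # index -> concise summary

    def archive(self, summary: str, full_record: str) -> str:
        # Keys are assigned in order; nothing is deleted in this sketch,
        # so a key always points at the same record.
        index = f"mem-{len(self.store):04d}"
        self.store[index] = full_record
        self.summaries[index] = summary
        return index

    def working_context(self) -> str:
        # Only the index and its summary enter the agent's context.
        return "\n".join(f"[{i}] {s}" for i, s in self.summaries.items())

    def dereference(self, index: str) -> str:
        # Recover the exact archived evidence rather than a paraphrase.
        return self.store[index]


memory = IndexedMemory()
key = memory.archive("ran tests: 2 failures", "FULL LOG: test_a FAILED; test_b FAILED")
```

The working context grows with the number of summaries rather than with the size of the archived records, while `dereference` returns the stored text unchanged.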

Title: Position: Vector Prompt Interfaces Should Be Exposed to Enable Customization of Large Language Models

Authors: Liangwei Yang, Shiyu Wang, Haolin Chen, Rithesh Murthy, Ming Zhu, Jielin Qiu, Zixiang Chen, Juntao Tan, Jianguo Zhang, Zhiwei Liu, Wenting Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Shelby Heinecke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04292
Pdf URL: https://arxiv.org/pdf/2603.04292
Copy Paste: [[2603.04292]] Position: Vector Prompt Interfaces Should Be Exposed to Enable Customization of Large Language Models(https://arxiv.org/abs/2603.04292)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) transition from research prototypes to real-world systems, customization has emerged as a central bottleneck. While text prompts can already customize LLM behavior, we argue that text-only prompting does not constitute a suitable control interface for scalable, stable, and inference-only customization. This position paper argues that model providers should expose vector prompt inputs as part of the public interface for customizing LLMs. We support this position with diagnostic evidence showing that vector prompt tuning continues to improve with increasing supervision whereas text-based prompt optimization saturates early, and that vector prompts exhibit dense, global attention patterns indicative of a distinct control mechanism. We further discuss why inference-only customization is increasingly important under realistic deployment constraints, and why exposing vector prompts need not fundamentally increase model leakage risk under a standard black-box threat model. We conclude with a call to action for the community to rethink prompt interfaces as a core component of LLM customization.
摘要：随着大型语言模型 (LLM) 从研究原型过渡到现实系统，定制已成为一个中心瓶颈。虽然文本提示已经可以自定义 LLM 行为，但我们认为纯文本提示并不能构成可扩展、稳定和仅推理自定义的合适控制界面。本立场文件认为，模型提供者应该公开 \emph{向量提示输入} 作为定制 LLM 的公共接口的一部分。我们通过诊断证据支持这一立场，表明矢量提示调整随着监督的增加而不断改进，而基于文本的提示优化则较早饱和，并且矢量提示表现出密集的全局注意力模式，表明了独特的控制机制。我们进一步讨论了为什么仅推理定制在现实部署约束下变得越来越重要，以及为什么在标准黑盒威胁模型下暴露向量提示不需要从根本上增加模型泄漏风险。最后，我们呼吁社区采取行动，重新考虑提示界面作为 LLM 定制的核心组成部分。

Title: The Company You Keep: How LLMs Respond to Dark Triad Traits

Authors: Zeyi Lu, Angelica Henestrosa, Pavel Chizhov, Ivan P. Yamshchikov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04299
Pdf URL: https://arxiv.org/pdf/2603.04299
Copy Paste: [[2603.04299]] The Company You Keep: How LLMs Respond to Dark Triad Traits(https://arxiv.org/abs/2603.04299)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) often exhibit highly agreeable and reinforcing conversational styles, also known as AI-sycophancy. Although this behavior is encouraged, it may become problematic when interacting with user prompts that reflect negative social tendencies. Such responses risk amplifying harmful behavior rather than mitigating it. In this study, we examine how LLMs respond to user prompts expressing varying degrees of Dark Triad traits (Machiavellianism, Narcissism, and Psychopathy) using a curated dataset. Our analysis reveals differences across models, whereby all models predominantly exhibit corrective behavior, while showing reinforcing output in certain cases. Model behavior also depends on the severity level and differs in the sentiment of the response. Our findings raise implications for designing safer conversational systems that can detect and respond appropriately when users escalate from benign to harmful requests.
摘要：大型语言模型 (LLM) 通常表现出高度令人愉快和强化的对话风格，也称为人工智能阿谀奉承。尽管鼓励这种行为，但在与反映负面社交倾向的用户提示进行交互时，可能会出现问题。这种应对措施可能会放大而不是减轻有害行为。在这项研究中，我们使用精心策划的数据集来研究法学硕士如何响应表达不同程度的黑暗三人格特征（马基雅维利主义、自恋和精神病态）的用户提示。我们的分析揭示了模型之间的差异，即所有模型主要表现出纠正行为，同时在某些情况下显示出强化输出。模型行为还取决于严重程度以及响应情绪的不同。我们的研究结果对设计更安全的对话系统具有重要意义，该系统可以在用户从良性请求升级为有害请求时进行检测并做出适当响应。

Title: World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings

Authors: Elan Barenholtz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2603.04317
Pdf URL: https://arxiv.org/pdf/2603.04317
Copy Paste: [[2603.04317]] World Properties without World Models: Recovering Spatial and Temporal Structure from Co-occurrence Statistics in Static Word Embeddings(https://arxiv.org/abs/2603.04317)
Keywords: language model, llm
Abstract: Recent work interprets the linear recoverability of geographic and temporal variables from large language model (LLM) hidden states as evidence for world-like internal representations. We test a simpler possibility: that much of the relevant structure is already latent in text itself. Applying the same class of ridge regression probes to static co-occurrence-based embeddings (GloVe and Word2Vec), we find substantial recoverable geographic signal and weaker but reliable temporal signal, with held-out R^2 values of 0.71-0.87 for city coordinates and 0.48-0.52 for historical birth years. Semantic-neighbor analyses and targeted subspace ablations show that these signals depend strongly on interpretable lexical gradients, especially country names and climate-related vocabulary. These findings suggest that ordinary word co-occurrence preserves richer spatial, temporal, and environmental structure than is often assumed, revealing a remarkable and underappreciated capacity of simple static embeddings to preserve world-shaped structure from text alone. Linear probe recoverability alone therefore does not establish a representational move beyond text.
摘要：最近的工作将大语言模型（LLM）隐藏状态中地理和时间变量的线性可恢复性解释为类似世界的内部表示的证据。我们测试了一种更简单的可能性：大部分相关结构已经隐藏在文本本身中。将同一类岭回归探针应用于基于静态共现的嵌入（GloVe 和 Word2Vec），我们发现大量可恢复的地理信号和较弱但可靠的时间信号，城市坐标的 R^2 值为 0.71-0.87，历史出生年份的 R^2 值为 0.48-0.52。语义邻居分析和有针对性的子空间消融表明，这些信号强烈依赖于可解释的词汇梯度，尤其是国名和气候相关词汇。这些发现表明，普通的单词共现保留了比通常假设的更丰富的空间、时间和环境结构，揭示了简单静态嵌入仅从文本中保留世界形状结构的显着且未被充分认识的能力。因此，仅线性探针的可恢复性并不能建立超越文本的代表性移动。

Title: AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning

Authors: Nikolas Karafyllis, Maria Lymperaiou, Giorgos Filandrianos, Athanasios Voulodimos, Giorgos Stamou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04319
Pdf URL: https://arxiv.org/pdf/2603.04319
Copy Paste: [[2603.04319]] AILS-NTUA at SemEval-2026 Task 12: Graph-Based Retrieval and Reflective Prompting for Abductive Event Reasoning(https://arxiv.org/abs/2603.04319)
Keywords: llm, prompt
Abstract: We present a winning three-stage system for SemEval 2026 Task 12: Abductive Event Reasoning that combines graph-based retrieval, LLM-driven abductive reasoning with prompt design optimized through reflective prompt evolution, and post-hoc consistency enforcement; our system ranks first on the evaluation-phase leaderboard with an accuracy score of 0.95. Cross-model error analysis across 14 models (7 families) reveals three shared inductive biases: causal chain incompleteness, proximate cause preference, and salience bias, whose cross-family convergence (51% cause-count reduction) indicates systematic rather than model-specific failure modes in multi-label causal reasoning.
摘要：我们为 SemEval 2026 Task 12 提出了一个获胜的三阶段系统：溯因事件推理，它将基于图的检索、LLM 驱动的溯因推理与通过反射提示演化优化的提示设计以及事后一致性执行相结合；我们的系统在评估阶段排行榜上排名第一，准确度得分为 0.95。跨 14 个模型（7 个系列）的跨模型误差分析揭示了三种共同的归纳偏差：因果链不完整性、近因偏好和显着性偏差，其跨系列收敛（原因计数减少 51%）表明多标签因果推理中存在系统性故障模式，而不是特定于模型的故障模式。

Title: AgentIR: Reasoning-Aware Retrieval for Deep Research Agents

Authors: Zijian Chen, Xueguang Ma, Shengyao Zhuang, Jimmy Lin, Akari Asai, Victor Zhong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2603.04384
Pdf URL: https://arxiv.org/pdf/2603.04384
Copy Paste: [[2603.04384]] AgentIR: Reasoning-Aware Retrieval for Deep Research Agents(https://arxiv.org/abs/2603.04384)
Keywords: agent
Abstract: Deep Research agents are rapidly emerging as primary consumers of modern retrieval systems. Unlike human users who issue and refine queries without documenting their intermediate thought processes, Deep Research agents generate explicit natural language reasoning before each search call, revealing rich intent and contextual information that existing retrievers entirely ignore. To exploit this overlooked signal, we introduce: (1) Reasoning-Aware Retrieval, a retrieval paradigm that jointly embeds the agent's reasoning trace alongside its query; and (2) DR-Synth, a data synthesis method that generates Deep Research retriever training data from standard QA datasets. We demonstrate that both components are independently effective, and their combination yields a trained embedding model, AgentIR-4B, with substantial gains. On the challenging BrowseComp-Plus benchmark, AgentIR-4B achieves 68% accuracy with the open-weight agent Tongyi-DeepResearch, compared to 50% with conventional embedding models twice its size, and 37% with BM25. Code and data are available at: this https URL.
摘要：深度研究代理正在迅速成为现代检索系统的主要消费者。与人类用户在不记录中间思维过程的情况下发出和完善查询不同，深度研究代理在每次搜索调用之前生成明确的自然语言推理，揭示现有检索器完全忽略的丰富意图和上下文信息。为了利用这个被忽视的信号，我们引入：（1）推理感知检索，一种将代理的推理轨迹与其查询一起嵌入的检索范例； (2) DR-Synth，一种数据合成方法，可从标准 QA 数据集生成 Deep Research 检索器训练数据。我们证明这两个组件都是独立有效的，并且它们的组合产生了经过训练的嵌入模型 AgentIR-4B，并取得了显着的效果。在具有挑战性的 BrowseComp-Plus 基准测试中，AgentIR-4B 使用开放权重代理 Tongyi-DeepResearch 实现了 68% 的准确率，而使用两倍大小的传统嵌入模型的准确率达到 50%，使用 BM25 的准确率达到 37%。代码和数据可在以下位置获得：此 https URL。
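
Illustration (not from the paper): a toy sketch of the idea in the abstract above that a retriever can embed the agent's reasoning trace together with its query. It assumes the two texts are simply concatenated before embedding, and it uses a bag-of-words counter as a stand-in for a trained embedding model such as AgentIR-4B. The functions `embed`, `cosine`, and `rank`, and the example documents, are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words embedding; a stand-in for a trained embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank(query, docs, reasoning=""):
    """Rank docs by similarity to the query, optionally joined with reasoning."""
    # Assumption: "jointly embedding" is approximated by concatenating texts.
    q = embed(f"{reasoning} {query}".strip())
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)


DOCS = [
    "jaguar car top speed engine",
    "jaguar cat rainforest prey hunting",
]
QUERY = "jaguar speed"
REASONING = (
    "The user asked about rainforest animals, so I need how fast "
    "the cat runs when hunting prey."
)

if __name__ == "__main__":
    print("query only:     ", rank(QUERY, DOCS)[0])
    print("with reasoning: ", rank(QUERY, DOCS, REASONING)[0])
```

In this toy setup the bare query matches the car document on surface terms, while adding the reasoning trace shifts the top result to the animal document. A learned model would capture this kind of intent without relying on exact term overlap.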