2026-02-13

Title: HybridRAG: A Practical LLM-based ChatBot Framework based on Pre-Generated Q&A over Raw Unstructured Documents

Authors: Sungmoon Kim, Hyuna Jeon, Dahye Kim, Mingyu Kim, Dong-Kyu Chae, Jiwoong Kim
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2602.11156
Pdf URL: https://arxiv.org/pdf/2602.11156
Copy Paste: [[2602.11156]] HybridRAG: A Practical LLM-based ChatBot Framework based on Pre-Generated Q&A over Raw Unstructured Documents(https://arxiv.org/abs/2602.11156)
Keywords: language model, llm, chat, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has emerged as a powerful approach for grounding Large Language Model (LLM)-based chatbot responses in external knowledge. However, existing RAG studies typically assume well-structured textual sources (e.g., Wikipedia or curated datasets) and perform retrieval and generation at query time, which can limit their applicability in real-world chatbot scenarios. In this paper, we present HybridRAG, a novel and practical RAG framework for more accurate and faster chatbot responses. First, HybridRAG ingests raw, unstructured PDF documents containing complex layouts (text, tables, figures) via Optical Character Recognition (OCR) and layout analysis, and converts them into hierarchical text chunks. Then, it pre-generates a plausible question-answer (QA) knowledge base from the organized chunks using an LLM. At query time, user questions are matched against this QA bank to retrieve immediate answers when possible, and only if no suitable QA match is found does our framework fall back to on-the-fly response generation. Experiments on OHRBench demonstrate that HybridRAG provides higher answer quality and lower latency than a standard RAG baseline. We believe that HybridRAG could be a practical solution for real-world chatbot applications that must handle large volumes of unstructured documents and many users under limited computational resources.
摘要：检索增强生成 (RAG) 已成为一种强大的方法，可将基于大型语言模型 (LLM) 的聊天机器人响应基于外部知识。然而，现有的 RAG 研究通常假设结构良好的文本源（例如维基百科或精选数据集）并在查询时执行检索和生成，这可能会限制它们在现实世界聊天机器人场景中的适用性。在本文中，我们提出了 HybridRAG，这是一种新颖实用的 RAG 框架，旨在实现更准确、更快的聊天机器人响应。首先，HybridRAG 通过光学字符识别 (OCR) 和布局分析提取包含复杂布局（文本、表格、图形）的原始非结构化 PDF 文档，并将其转换为分层文本块。然后，它使用法学硕士从有组织的块中预先生成一个合理的问答 (QA) 知识库。在查询时，用户问题会与此 QA 库进行匹配，以便在可能的情况下立即检索答案，并且只有在找不到合适的 QA 匹配时，我们的框架才会回退到动态响应生成。 OHRBench 上的实验表明，与标准 RAG 基线相比，我们的 HybridRAG 提供了更高的答案质量和更低的延迟。我们相信，HybridRAG 可能是现实世界聊天机器人应用程序的实用解决方案，这些应用程序必须在有限的计算资源下处理大量非结构化文档和大量用户。
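The query-time behaviour described in the abstract (match against a pre-generated QA bank, fall back to generation only when no match is good enough) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the Jaccard token-overlap similarity, and the 0.6 threshold are all assumptions chosen to keep the example self-contained; a real system would more likely use embedding similarity.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set (a stand-in for a real embedding)."""
    return set(text.lower().split())


def jaccard(a, b):
    """Token-overlap similarity in [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def answer(query, qa_bank, fallback, threshold=0.6):
    """Return (answer, source).

    qa_bank:   list of (question, answer) pairs generated offline.
    fallback:  hypothetical callable that generates an answer on the fly.
    threshold: assumed cut-off for accepting a QA-bank match.
    """
    q_tokens = tokens(query)
    best_score, best_answer = 0.0, None
    for question, stored_answer in qa_bank:
        score = jaccard(q_tokens, tokens(question))
        if score > best_score:
            best_score, best_answer = score, stored_answer
    if best_answer is not None and best_score >= threshold:
        return best_answer, "qa_bank"
    # No suitable pre-generated QA pair: generate at query time.
    return fallback(query), "fallback"


qa_bank = [
    ("what is the warranty period", "Two years."),
    ("how do i reset the device", "Hold the power button for ten seconds."),
]
generate = lambda q: "generated answer for: " + q
```

The latency benefit comes from the first branch: a bank hit returns a stored answer without invoking the generator at all.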

Title: Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety

Authors: Max Zhang, Derek Liu, Kai Zhang, Joshua Franco, Haihao Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11157
Pdf URL: https://arxiv.org/pdf/2602.11157
Copy Paste: [[2602.11157]] Response-Based Knowledge Distillation for Multilingual Jailbreak Prevention Unwittingly Compromises Safety(https://arxiv.org/abs/2602.11157)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly deployed worldwide, yet their safety alignment remains predominantly English-centric. This allows for vulnerabilities in non-English contexts, especially with low-resource languages. We introduce a novel application of knowledge distillation (KD) in the context of multilingual jailbreak prevention, examining its efficacy. We distill the refusal behaviors of a proprietary teacher model (OpenAI o1-mini) with Low-Rank Adaptation (LoRA) into three open-source student models: Meta-Llama-3-8B-Instruct, Gemma-2-2B-IT, and Qwen3-8B, using ~28,000 multilingual jailbreak prompts from XSafety via black-box response-based, parameter-efficient fine-tuning (PEFT). Evaluation on the MultiJail benchmark reveals a counterintuitive behavior: standard fine-tuning on the teacher's "safe" refusal data inadvertently increases Jailbreak Success Rate (JSR) for all student models, by up to 16.6 percentage points. Our experiments reveal a divergent generalization to unseen languages during distillation, with varying outcomes depending on the base model. By removing a primary source of safety degradation, nuanced "boundary" refusals, we mitigate or even reverse safety declines in student models, although reductions in reasoning performance (GSM8K) persist. Overall, our exploratory study highlights the challenges and potential of KD as a technique for multilingual safety alignment, offering a foundation for future research in this direction.
摘要：大语言模型 (LLM) 在全球范围内越来越多地部署，但其安全性仍然主要以英语为中心。这使得非英语环境中存在漏洞，尤其是资源匮乏的语言。我们在多语言越狱预防的背景下介绍了知识蒸馏（KD）的一种新颖应用，并检验了其功效。我们将具有低秩适应 (LoRA) 的专有教师模型 (OpenAI o1-mini) 的拒绝行为提炼为三个开源学生模型：Meta-Llama-3-8B-Instruct、Gemma-2-2B-IT 和 Qwen3-8B，通过基于黑盒响应的参数高效微调 (PEFT)，使用来自 XSafety 的约 28,000 个多语言越狱提示。对 MultiJail 基准的评估揭示了一个违反直觉的行为：对教师“安全”拒绝数据的标准微调无意中将所有学生模型的越狱成功率 (JSR) 提高了 16.6 个百分点。我们的实验揭示了在蒸馏过程中对看不见的语言的不同概括，根据基本模型的不同，结果也不同。通过消除安全性下降的主要根源，即细微的“边界”拒绝，我们减轻甚至扭转了学生模型的安全性下降，尽管推理性能（GSM8K）的下降仍然存在。总体而言，我们的探索性研究强调了 KD 作为多语言安全对齐技术的挑战和潜力，为该方向的未来研究奠定了基础。

Title: Retrieval Heads are Dynamic

Authors: Yuping Lin, Zitao Li, Yue Xing, Pengfei He, Yingqian Cui, Yaliang Li, Bolin Ding, Jingren Zhou, Jiliang Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11162
Pdf URL: https://arxiv.org/pdf/2602.11162
Copy Paste: [[2602.11162]] Retrieval Heads are Dynamic(https://arxiv.org/abs/2602.11162)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Recent studies have identified "retrieval heads" in Large Language Models (LLMs) responsible for extracting information from input contexts. However, prior works largely rely on static statistics aggregated across datasets, identifying heads that perform retrieval on average. This perspective overlooks the fine-grained temporal dynamics of autoregressive generation. In this paper, we investigate retrieval heads from a dynamic perspective. Through extensive analysis, we establish three core claims: (1) Dynamism: Retrieval heads vary dynamically across timesteps; (2) Irreplaceability: Dynamic retrieval heads are specific at each timestep and cannot be effectively replaced by static retrieval heads; and (3) Correlation: The model's hidden state encodes a predictive signal for future retrieval head patterns, indicating an internal planning mechanism. We validate these findings on the Needle-in-a-Haystack task and a multi-hop QA task, and quantify the difference in utility between dynamic and static retrieval heads in a Dynamic Retrieval-Augmented Generation framework. Our study provides new insights into the internal mechanisms of LLMs.
摘要：最近的研究已经确定了大型语言模型（LLM）中负责从输入上下文中提取信息的“检索头”。然而，先前的工作很大程度上依赖于跨数据集聚合的静态统计数据，识别平均执行检索的头。这种观点忽视了自回归生成的细粒度时间动态。在本文中，我们从动态角度研究检索头。通过广泛的分析，我们建立了三个核心主张：（1）动态性：检索头随时间步长动态变化；（2）不可替代性：动态检索头在每个时间步都是特定的，不能被静态检索头有效替代； (3) 相关性：模型的隐藏状态对未来检索头模式的预测信号进行编码，表明内部规划机制。我们在“大海捞针”任务和多跳 QA 任务上验证了这些发现，并量化了动态检索增强生成框架中动态和静态检索头的效用差异。我们的研究为法学硕士的内部机制提供了新的见解。

Title: Assessing LLM Reliability on Temporally Recent Open-Domain Questions

Authors: Pushwitha Krishnappa, Amit Das, Vinija Jain, Tathagata Mukherjee, Aman Chadha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11165
Pdf URL: https://arxiv.org/pdf/2602.11165
Copy Paste: [[2602.11165]] Assessing LLM Reliability on Temporally Recent Open-Domain Questions(https://arxiv.org/abs/2602.11165)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) are increasingly deployed for open-domain question answering, yet their alignment with human perspectives on temporally recent information remains underexplored. We introduce RECOM (Reddit Evaluation for Correspondence of Models), a benchmark dataset of 15,000 recent Reddit questions from September 2025 paired with community-derived reference answers. We investigate how four open-source LLMs (Llama3.1-8B, Mistral-7B, Gemma-2-9B, and GPT-OSS-20B) respond to these questions, evaluating alignment using lexical metrics (BLEU, ROUGE), semantic similarity (BERTScore, MoverScore, cosine similarity), and logical inference (NLI). Our central finding is a striking semantic-lexical paradox: all models achieve over 99% cosine similarity with references despite less than 8% BLEU-1 overlap, a 90+ percentage point gap indicating that models preserve meaning through extensive paraphrasing rather than lexical reproduction. MoverScore (51-53%) confirms this pattern, occupying an intermediate position that reflects the optimal transport cost of semantic alignment. Furthermore, model scale does not predict performance: Mistral-7B (7B parameters) outperforms GPT-OSS-20B (20B parameters) across all metrics. NLI analysis reveals that contradiction rates remain below 7%, suggesting models rarely generate content that directly conflicts with human consensus. These findings challenge the reliability of lexical metrics for evaluating abstractive generation and argue for multi-dimensional evaluation frameworks that capture semantic fidelity beyond surface-level text matching. The RECOM dataset is publicly available at this https URL
摘要：大型语言模型 (LLM) 越来越多地用于开放域问答，但它们与人类对近期信息的看法的一致性仍未得到充分探索。我们引入了 RECOM（Reddit 模型对应评估），这是一个基准数据集，包含 2025 年 9 月以来 15,000 个最近的 Reddit 问题，并配有社区衍生的参考答案。我们研究了四个开源 LLM（Llama3.1-8B、Mistral-7B、Gemma-2-9B 和 GPT-OSS-20B）如何回答这些问题，使用词汇指标（BLEU、ROUGE）、语义相似性（BERTScore、MoverScore、余弦相似性）和逻辑推理 (NLI) 来评估对齐情况。我们的中心发现是一个惊人的语义-词汇悖论：尽管 BLEU-1 重叠率低于 8%，但所有模型都与参考文献实现了超过 99% 的余弦相似度，这一差距超过 90 个百分点，表明模型通过广泛的释义而不是词汇再现来保留意义。 MoverScore (51-53%) 证实了这一模式，占据反映语义对齐的最佳传输成本的中间位置。此外，模型规模并不能预测性能：Mistral-7B（7B 参数）在所有指标上均优于 GPT-OSS-20B（20B 参数）。 NLI 分析显示，矛盾率仍然低于 7%，这表明模型很少生成与人类共识直接冲突的内容。这些发现对评估抽象生成的词汇度量的可靠性提出了挑战，并提出了多维评估框架，以捕获超越表面水平文本匹配的语义保真度。 RECOM 数据集可在此 https URL 公开获取

Title: Small Updates, Big Doubts: Does Parameter-Efficient Fine-tuning Enhance Hallucination Detection ?

Authors: Xu Hu, Yifan Zhang, Songtao Wei, Chen Zhao, Qiannan Li, Bingzhe Li, Feng Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11166
Pdf URL: https://arxiv.org/pdf/2602.11166
Copy Paste: [[2602.11166]] Small Updates, Big Doubts: Does Parameter-Efficient Fine-tuning Enhance Hallucination Detection ?(https://arxiv.org/abs/2602.11166)
Keywords: language model, llm, hallucination
Abstract: Parameter-efficient fine-tuning (PEFT) methods are widely used to adapt large language models (LLMs) to downstream tasks and are often assumed to improve factual correctness. However, how PEFT methods affect hallucination behavior remains insufficiently understood, especially on QA datasets. In this work, we systematically investigate the impact of PEFT on hallucination detection through a comprehensive empirical study across three open-weight LLM backbones and three fact-seeking QA benchmarks. For each model, we evaluate performance using seven unsupervised hallucination detection methods spanning three complementary approaches: semantic-consistency-based detectors, confidence-based detectors, and entropy-based detectors. This multifaceted evaluation enables us to characterize how PEFT reshapes uncertainty across different detection paradigms. Our experimental results show that PEFT consistently strengthens hallucination detection ability, substantially improving AUROC across a wide range of hallucination detectors. Further analyses using linear probes and representation diagnostics indicate that PEFT methods primarily reshape how uncertainty is encoded and surfaced, rather than injecting new factual knowledge into the models.
摘要：参数高效微调（PEFT）方法广泛用于使大型语言模型（LLM）适应下游任务，并且通常被认为可以提高事实正确性。然而，参数有效的微调方法如何影响幻觉行为仍然没有得到充分的了解，特别是在 QA 数据集上。在这项工作中，我们通过对三个开放权重 LLM 主干和三个事实寻求 QA 基准的全面实证研究，系统地研究了 PEFT 对幻觉检测的影响。对于每个模型，我们使用七种无监督幻觉检测方法评估性能，涵盖三种互补方法：基于语义一致性的检测器、基于置信度的检测器和基于熵的检测器。这种多方面的评估使我们能够描述 PEFT 如何重塑不同检测范式的不确定性。总之，我们的实验结果表明，PEFT 持续增强了幻觉检测能力，显着提高了各种幻觉检测器的 AUROC。此外，使用线性探针和表示诊断的进一步分析表明，与向模型中注入新的事实知识相比，PEFT 方法主要重塑了不确定性的编码和呈现方式。

Title: Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering

Authors: Nathan Mao, Varun Kaushik, Shreya Shivkumar, Parham Sharafoleslami, Kevin Zhu, Sunishchal Dev
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11167
Pdf URL: https://arxiv.org/pdf/2602.11167
Copy Paste: [[2602.11167]] Visualizing and Benchmarking LLM Factual Hallucination Tendencies via Internal State Analysis and Clustering(https://arxiv.org/abs/2602.11167)
Keywords: language model, gpt, llm, hallucination
Abstract: Large Language Models (LLMs) often hallucinate, generating nonsensical or false information that can be especially harmful in sensitive fields such as medicine or law. To study this phenomenon systematically, we introduce FalseCite, a curated dataset designed to capture and benchmark hallucinated responses induced by misleading or fabricated citations. Running GPT-4o-mini, Falcon-7B, and Mistral-7B through FalseCite, we observed a noticeable increase in hallucination activity for false claims with deceptive citations, especially in GPT-4o-mini. Using the responses from FalseCite, we can also analyze the internal states of hallucinating models, visualizing and clustering the hidden state vectors. From this analysis, we noticed that the hidden state vectors, whether from hallucinated or non-hallucinated responses, tend to trace out a distinct horn-like shape. Our work underscores FalseCite's potential as a foundation for evaluating and mitigating hallucinations in future LLM research.
摘要：大型语言模型 (LLM) 经常产生幻觉，产生无意义或虚假的信息，这些信息在医学或法律等敏感领域尤其有害。为了系统地研究这一现象，我们引入了 FalseCite，这是一个精心设计的数据集，旨在捕获和基准由误导性或捏造的引文引起的幻觉反应。通过 FalseCite 运行 GPT-4o-mini、Falcon-7B 和 Mistral 7-B，我们观察到带有欺骗性引用的虚假声明的幻觉活动显着增加，尤其是在 GPT-4o-mini 中。使用 FalseCite 的响应，我们还可以分析幻觉模型的内部状态，对隐藏状态向量进行可视化和聚类。从这个分析中，我们注意到隐藏状态向量，无论是幻觉还是非幻觉，都倾向于描绘出明显的角状形状。我们的工作强调了 FalseCite 作为未来法学硕士研究中评估和减轻幻觉的基础的潜力。

Title: Disentangling Direction and Magnitude in Transformer Representations: A Double Dissociation Through L2-Matched Perturbation Analysis

Authors: Mangadoddi Srikar Vardhan, Lekkala Sai Teja
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.11169
Pdf URL: https://arxiv.org/pdf/2602.11169
Copy Paste: [[2602.11169]] Disentangling Direction and Magnitude in Transformer Representations: A Double Dissociation Through L2-Matched Perturbation Analysis(https://arxiv.org/abs/2602.11169)
Keywords: language model
Abstract: Transformer hidden states encode information as high-dimensional vectors, yet whether direction (orientation in representational space) and magnitude (vector norm) serve distinct functional roles remains unclear. Studying Pythia-family models, we discover a striking cross-over dissociation: angular perturbations cause up to 42.9x more damage to language modeling loss, while magnitude perturbations cause disproportionately more damage to syntactic processing (20.4% vs. 1.6% accuracy drop on subject-verb agreement). This finding is enabled by L2-matched perturbation analysis, a methodology ensuring that angular and magnitude perturbations achieve identical Euclidean displacements. Causal intervention reveals that angular damage flows substantially through the attention pathways (28.4% loss recovery via attention repair), while magnitude damage flows partly through the LayerNorm pathways (29.9% recovery via LayerNorm repair). These patterns replicate across scales within the Pythia architecture family. These findings provide evidence that direction and magnitude support partially distinct computational roles in LayerNorm-based architectures. Direction preferentially affects attentional routing, while magnitude modulates processing intensity for fine-grained syntactic judgments. We find different patterns in RMSNorm-based architectures, suggesting that the dissociation depends on architectural choices. Our results refine the linear representation hypothesis and have implications for model editing and interpretability research.
摘要：Transformer 隐藏状态将信息编码为高维向量，但方向（表示空间中的方向）和幅度（向量范数）是否具有不同的功能作用仍不清楚。通过研究 Pythia 系列模型，我们发现了一个惊人的交叉分离：角度扰动对语言建模损失造成高达 42.9 倍的损害，而幅度扰动对句法处理造成不成比例的更大损害（主谓一致准确度下降 20.4% vs.1.6%）。这一发现是通过 L2 匹配扰动分析实现的，该方法确保角度扰动和幅度扰动实现相同的欧几里得位移。因果干预表明，角度损伤主要通过注意力路径流动（通过注意力修复恢复 28.4% 的损失），而幅度损伤部分通过 LayerNorm 路径流动（通过 LayerNorm 修复恢复 29.9%）。这些模式在 Pythia 架构家族中跨规模复制。这些发现提供了证据，表明方向和幅度支持基于 LayerNorm 的架构中部分不同的计算角色。方向优先影响注意力路由，而幅度则调节细粒度句法判断的处理强度。我们在基于 RMSNorm 的架构中发现了不同的模式，这表明分离取决于架构选择。我们的结果完善了线性表示假设，并对模型编辑和可解释性研究具有影响
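The idea of L2-matching, making an angular and a magnitude perturbation move a hidden state by the same Euclidean distance, can be illustrated with a small sketch. This is one plausible construction under stated assumptions, not the paper's procedure: the function names are hypothetical, the magnitude perturbation simply rescales the vector, and the angular perturbation rotates it toward a random orthogonal direction while preserving its norm. For a norm-preserving rotation by angle theta, the displacement is the chord length 2 * ||h|| * sin(theta / 2), which is solved for theta below.

```python
import math
import random


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def magnitude_perturb(h, d):
    """Rescale h along its own direction so that ||h' - h|| == d."""
    scale = 1.0 + d / norm(h)
    return [x * scale for x in h]


def angular_perturb(h, d, rng):
    """Rotate h, keeping its norm, so that ||h' - h|| == d (requires d <= 2 * ||h||)."""
    n = norm(h)
    # Chord length of a rotation by theta on a sphere of radius n: d = 2 n sin(theta / 2).
    theta = 2.0 * math.asin(d / (2.0 * n))
    # Random unit vector orthogonal to h via one Gram-Schmidt step.
    r = [rng.gauss(0.0, 1.0) for _ in h]
    proj = sum(a * b for a, b in zip(r, h)) / (n * n)
    u = [a - proj * b for a, b in zip(r, h)]
    u_norm = norm(u)
    u = [a / u_norm for a in u]
    return [math.cos(theta) * a + math.sin(theta) * n * b for a, b in zip(h, u)]


def displacement(a, b):
    return norm([x - y for x, y in zip(a, b)])


h = [3.0, 4.0, 0.0]  # toy hidden state with norm 5
rng = random.Random(0)
h_mag = magnitude_perturb(h, 1.0)
h_ang = angular_perturb(h, 1.0, rng)
```

Both perturbed vectors sit exactly 1.0 away from `h`, but `h_mag` changes only the norm and `h_ang` changes only the direction, which is what lets the two kinds of damage be compared on equal footing.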

Title: PRIME: Policy-Reinforced Iterative Multi-agent Execution for Algorithmic Reasoning in Large Language Models

Authors: Jiawei Xu, Zhenyu Yu, Ziqian Bi, Minh Duc Pham, Xiaoyi Qu, Danyang Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11170
Pdf URL: https://arxiv.org/pdf/2602.11170
Copy Paste: [[2602.11170]] PRIME: Policy-Reinforced Iterative Multi-agent Execution for Algorithmic Reasoning in Large Language Models(https://arxiv.org/abs/2602.11170)
Keywords: language model, agent
Abstract: Large language models have demonstrated remarkable capabilities across diverse reasoning tasks, yet their performance on algorithmic reasoning remains limited. To address this limitation, we propose PRIME (Policy-Reinforced Iterative Multi-agent Execution), a framework comprising three specialized agents (an executor for step-by-step reasoning, a verifier for constraint checking, and a coordinator for backtracking control), optimized through group relative policy optimization. For comprehensive evaluation, we introduce PRIME-Bench, the largest algorithmic reasoning benchmark to date, comprising 86 tasks across 12 categories with 51,600 instances. Tasks span sorting algorithms, graph and tree structures, automata and state machines, symbolic reasoning, and constraint-based puzzles, with execution traces reaching over one million steps. Compared to the baseline approach, PRIME improves average accuracy from 26.8% to 93.8%, a 250% relative gain. The largest improvements occur on tasks requiring sustained state tracking, with Turing machine simulation improving from 9% to 92% and long division from 16% to 94%. Ablation studies identify iterative verification as the primary contributor, preventing the error propagation that causes baseline approaches to fail catastrophically. Analysis across model scales (8B-120B parameters) reveals that smaller models benefit disproportionately, achieving accuracy comparable to models 8x larger.
摘要：大型语言模型在不同的推理任务中表现出了卓越的能力，但它们在算法推理上的性能仍然有限。为了解决这个限制，我们提出了 PRIME（策略强化迭代多代理执行），这是一个由三个专门代理、一个用于逐步推理的执行器、一个用于约束检查的验证器和一个用于回溯控制的协调器组成的框架，并通过组相关策略优化进行优化。为了进行全面评估，我们引入了 PRIME-Bench，这是迄今为止最大的算法推理基准，包含 12 个类别的 86 个任务和 51,600 个实例。任务涵盖排序算法、图形和树结构、自动机和状态机、符号推理和基于约束的谜题，执行轨迹达到超过一百万步。与基线方法相比，PRIME 将平均准确率从 26.8% 提高到 93.8%，相对增益提高了 250%。最大的改进发生在需要持续状态跟踪的任务上，图灵机模拟从 9% 提高到 92%，长除法从 16% 提高到 94%。消融研究将迭代验证确定为主要贡献者，防止错误传播导致基线方法灾难性失败。跨模型尺度（8B-120B 参数）的分析表明，较小的模型受益匪浅，其精度可与 8 倍大的模型相媲美。
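The "250% relative gain" quoted in the abstract follows from the two reported accuracies. The short check below is only arithmetic on those published figures; the helper name is an illustrative assumption.

```python
def relative_gain(before, after):
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


# Average accuracy reported in the abstract: 26.8% -> 93.8%.
gain = relative_gain(26.8, 93.8)          # (93.8 - 26.8) / 26.8 = 67.0 / 26.8
absolute_points = 93.8 - 26.8             # 67 percentage points
```

Note the distinction between the absolute improvement (67 percentage points) and the relative gain (2.5, i.e. 250% of the baseline accuracy).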

Title: Efficient Hyper-Parameter Search for LoRA via Language-aided Bayesian Optimization

Authors: Baek Seong-Eun, Lee Jung-Mok, Kim Sung-Bin, Tae-Hyun Oh
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11171
Pdf URL: https://arxiv.org/pdf/2602.11171
Copy Paste: [[2602.11171]] Efficient Hyper-Parameter Search for LoRA via Language-aided Bayesian Optimization(https://arxiv.org/abs/2602.11171)
Keywords: language model, llm, prompt
Abstract: Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) enables resource-efficient personalization or specialization, but it comes at the expense of additional hyperparameter tuning. Although LoRA makes fine-tuning efficient, it is highly sensitive to the choice of hyperparameters, and exhaustive hyperparameter search is still computationally very demanding. To address these challenges, we propose a framework that integrates the domain knowledge of pre-trained LLMs into Bayesian Optimization (BO) to efficiently search for LoRA hyperparameters. To leverage the knowledge embedded in LLMs, we repurpose LLMs as a discrete-to-continuous mapping that links the hyperparameters and their domain knowledge with a continuous vector space, where BO is conducted. We design and control the mapping by language prompting, where we provide a domain-aware textual prompt describing the relationships among hyperparameters and their respective roles; thereby, we explicitly inject domain knowledge about LoRA into the LLM in natural language. Also, we model the residual information that is hard to describe linguistically in the prompt with an additional learnable token. This helps BO sample more high-performing hyperparameters. In addition, by leveraging the observed strong correlation between the performance obtained from full and subset training datasets in LoRA training regimes, we introduce proxy training and evaluation with a data subset. This further increases the efficiency of our method. We demonstrate that hyperparameters found by our method in only about 30 iterations achieve more than 20% performance improvement over standard hyperparameters found from about 45,000 combinations.
摘要：使用低秩适应 (LoRA) 微调大型语言模型 (LLM) 可实现资源高效的个性化或专业化，但这是以额外的超参数调整为代价的。尽管LoRA使微调变得高效，但它对超参数的选择高度敏感，并且穷举超参数搜索对计算的要求仍然很高。为了应对这些挑战，我们提出了一个框架，将预训练的 LLM 的领域知识集成到贝叶斯优化（BO）中，以有效地搜索 LoRA 超参数。为了利用 LLM 的丰富知识，我们将 LLM 重新用作离散到连续的映射，将超参数及其领域知识与连续向量空间联系起来，在连续向量空间中进行 BO。我们通过语言提示来设计和控制映射，提供领域感知的文本提示来描述超参数及其各自角色之间的关系；因此，我们以自然语言明确地将有关 LoRA 的领域知识注入到法学硕士中。此外，我们还使用额外的可学习标记对提示中难以用语言描述的残留信息进行建模。这有助于 BO 采样更多高性能的超参数。此外，通过观察 LoRA 训练体系中完整训练数据集和子集训练数据集获得的各自性能之间的强相关性，我们引入了数据子集的代理训练和评估。这进一步提高了我们方法的效率。我们证明，与从约 45,000 种组合中找到的标准超参数相比，我们仅通过约 30 次迭代找到的超参数就实现了 20% 以上的性能提升。

Title: Synthesizing the Virtual Advocate: A Multi-Persona Speech Generation Framework for Diverse Linguistic Jurisdictions in Indic Languages

Authors: Aniket Deroy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11172
Pdf URL: https://arxiv.org/pdf/2602.11172
Copy Paste: [[2602.11172]] Synthesizing the Virtual Advocate: A Multi-Persona Speech Generation Framework for Diverse Linguistic Jurisdictions in Indic Languages(https://arxiv.org/abs/2602.11172)
Keywords: language model, llm, prompt
Abstract: Legal advocacy requires a unique combination of authoritative tone, rhythmic pausing for emphasis, and emotional intelligence. This study investigates the performance of the Gemini 2.5 Flash TTS and Gemini 2.5 Pro TTS models in generating synthetic courtroom speeches across five Indic languages: Tamil, Telugu, Bengali, Hindi, and Gujarati. We propose a prompting framework that utilizes Gemini 2.5's native support for 5 languages and its context-aware pacing to produce distinct advocate personas. The evolution of Large Language Models (LLMs) has shifted the focus of Text-to-Speech (TTS) technology from basic intelligibility to context-aware, expressive synthesis. In the legal domain, synthetic speech must convey authority and a specific professional persona, a task that becomes significantly more complex in the linguistically diverse landscape of India. The models exhibit a "monotone authority," excelling at procedural information delivery but struggling with the dynamic vocal modulation and emotive gravitas required for persuasive advocacy. Performance dips in Bengali and Gujarati further highlight phonological frontiers for future refinement. This research underscores the readiness of multilingual TTS for procedural legal tasks while identifying the remaining challenges in replicating the persuasive artistry of human legal discourse. The code is available at this https URL
摘要：法律宣传需要权威语气、强调重点的有节奏的停顿和情商的独特结合。本研究调查了 Gemini 2.5 Flash TTS 和 Gemini 2.5 Pro TTS 模型在生成五种印度语言的合成法庭演讲方面的性能：泰米尔语、泰卢固语、孟加拉语、印地语和古吉拉特语。我们提出了一个提示框架，利用 Gemini 2.5 对 5 种语言的原生支持及其上下文感知节奏来产生独特的倡导者角色。大型语言模型 (LLM) 的发展已将文本转语音 (TTS) 技术的重点从基本清晰度转移到上下文感知、表达性合成。在法律领域，合成语音必须传达权威和特定的专业角色，而在印度语言多样化的情况下，这项任务变得更加复杂。这些模型表现出“单调的权威”，擅长程序性信息传递，但在说服性宣传所需的动态声音调制和情感庄严方面遇到了困难。孟加拉语和古吉拉特语的表现下降进一步凸显了未来改进的语音前沿。这项研究强调了多语言 TTS 为程序性法律任务做好了准备，同时确定了复制人类法律话语的说服艺术性方面仍然存在的挑战。代码可在此 https URL 获取

Title: Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review

Authors: Qian Ruan, Iryna Gurevych
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11173
Pdf URL: https://arxiv.org/pdf/2602.11173
Copy Paste: [[2602.11173]] Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review(https://arxiv.org/abs/2602.11173)
Keywords: llm
Abstract: Author response (rebuttal) writing is a critical stage of scientific peer review that demands substantial author effort. Recent work frames this task as automatic text generation, underusing author expertise and intent. In practice, authors possess domain expertise, author-only information, revision and response strategies--concrete forms of author expertise and intent--to address reviewer concerns, and seek NLP assistance that integrates these signals to support effective response writing in peer review. We reformulate author response generation as an author-in-the-loop task and introduce REspGen, a generation framework that integrates explicit author input, multi-attribute control, and evaluation-guided refinement, together with REspEval, a comprehensive evaluation suite with 20+ metrics covering input utilization, controllability, response quality, and discourse. To support this formulation, we construct Re$^3$Align, the first large-scale dataset of aligned review--response--revision triplets, where revisions provide signals of author expertise and intent. Experiments with state-of-the-art LLMs show the benefits of author input and evaluation-guided refinement, the impact of input design on response quality, and trade-offs between controllability and quality. We make our dataset, generation and evaluation tools publicly available.
摘要：作者回应（反驳）写作是科学同行评审的关键阶段，需要作者付出大量努力。最近的工作将此任务定义为自动文本生成，未充分利用作者的专业知识和意图。在实践中，作者拥有领域专业知识、仅限作者的信息、修订和回复策略（作者专业知识和意图的具体形式）来解决审稿人的担忧，并寻求 NLP 帮助来整合这些信号，以支持同行评审中有效的回复写作。我们将作者回复生成重新定义为作者循环任务，并引入 REspGen（一个集成了明确作者输入、多属性控制和评估引导细化的生成框架）以及 REspEval（一个综合评估套件，包含 20 多个指标，涵盖输入利用率、可控性、回复质量和话语）。为了支持这个公式，我们构建了 Re$^3$Align，这是第一个对齐的评论-响应-修订三元组的大型数据集，其中修订提供了作者专业知识和意图的信号。最先进的法学硕士的实验显示了作者输入和评估引导细化的好处、输入设计对响应质量的影响以及可控性和质量之间的权衡。我们公开提供我们的数据集、生成和评估工具。

Title: The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models

Authors: Aradhya Dixit, Shreem Dixit
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11174
Pdf URL: https://arxiv.org/pdf/2602.11174
Copy Paste: [[2602.11174]] The Script Tax: Measuring Tokenization-Driven Efficiency and Latency Disparities in Multilingual Language Models(https://arxiv.org/abs/2602.11174)
Keywords: language model
Abstract: Pretrained multilingual language models are often assumed to be script-agnostic, yet their tokenizers can impose systematic costs on certain writing systems. We quantify this script tax by comparing two orthographic variants with identical linguistic content. Across mBERT and XLM-R, the higher-fragmentation orthography shows a ~3.4x increase in fertility (6.73-6.85 vs. 2.10-2.35 tokens/word), leading to a 16.5x inference slowdown (0.23 vs. 3.8 sentences/second) on identical hardware. Using bits per character (BPC) to avoid the "NLL paradox" from subword fragmentation, we find a substantial increase in information cost: +19.7% for mBERT (8.06->9.65) and +47.1% for XLM-R (12.19->17.94). A round-trip conversion check (CER_rt=0.31) suggests these gaps reflect orthography-conditioned processing rather than mapping noise. Our results highlight tokenization as a key source of inequity in multilingual NLP and motivate script-aware tokenization and pretraining.
摘要：预训练的多语言语言模型通常被认为与脚本无关，但它们的分词器可能会给某些书写系统带来系统成本。我们通过比较具有相同语言内容的两个拼写变体来量化这种脚本税。在 mBERT 和 XLM-R 中，较高碎片的正字法显示出可生育性提高约 3.4 倍（6.73-6.85 与 2.10-2.35 个标记/单词），导致相同硬件上的推理速度减慢 16.5 倍（0.23 与 3.8 个句子/秒）。使用每字符位数 (BPC) 来避免子字碎片带来的“NLL 悖论”，我们发现信息成本大幅增加：mBERT (8.06->9.65) +19.7%，XLM-R (12.19->17.94) +47.1%。往返转换检查 (CER_rt=0.31) 表明这些间隙反映了正字法条件处理，而不是映射噪声。我们的结果强调了标记化是多语言 NLP 不平等的一个关键来源，并激发了脚本感知标记化和预训练。
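The headline ratios in the abstract can be reproduced from the figures it quotes, and bits per character (BPC) has a standard definition that explains why it is robust to subword fragmentation: the total negative log-likelihood is divided by the number of characters rather than the number of tokens. The sketch below is arithmetic on the reported numbers plus that textbook definition; the function names are illustrative assumptions, and the paper's exact evaluation code is not shown in the abstract.

```python
import math


def relative_increase(before, after):
    """Fractional increase from `before` to `after`."""
    return (after - before) / before


def slowdown(fast_rate, slow_rate):
    """How many times slower the low-throughput setting is."""
    return fast_rate / slow_rate


def bits_per_character(total_nll_nats, num_chars):
    """Total negative log-likelihood (in nats) converted to bits, per character.

    Normalising by characters instead of tokens keeps the measure comparable
    when one orthography splits into many more subword tokens than another.
    """
    return total_nll_nats / (math.log(2) * num_chars)


# Figures quoted in the abstract.
mbert_bpc_increase = relative_increase(8.06, 9.65)     # about 0.197
xlmr_bpc_increase = relative_increase(12.19, 17.94)    # about 0.47
throughput_slowdown = slowdown(3.8, 0.23)              # about 16.5
```

Recomputing from the rounded values gives roughly 47.2% for XLM-R rather than the reported 47.1%, a gap small enough to be explained by rounding in the quoted BPC figures.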

Title: Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments

Authors: Maral Doctorarastoo, Katherine A. Flanigan, Mario Bergés, Christopher McComb
Subjects: cs.CL, cs.AI, cs.CE, cs.LG
Abstract URL: https://arxiv.org/abs/2602.11176
Pdf URL: https://arxiv.org/pdf/2602.11176
Copy Paste: [[2602.11176]] Evaluating Few-Shot Temporal Reasoning of LLMs for Human Activity Prediction in Smart Environments(https://arxiv.org/abs/2602.11176)
Keywords: language model, llm, prompt, agent
Abstract: Anticipating human activities and their durations is essential in applications such as smart-home automation, simulation-based architectural and urban design, activity-based transportation system simulation, and human-robot collaboration, where adaptive systems must respond to human activities. Existing data-driven agent-based models--from rule-based to deep learning--struggle in low-data environments, limiting their practicality. This paper investigates whether large language models, pre-trained on broad human knowledge, can fill this gap by reasoning about everyday activities from compact contextual cues. We adopt a retrieval-augmented prompting strategy that integrates four sources of context--temporal, spatial, behavioral history, and persona--and evaluate it on the CASAS Aruba smart-home dataset. The evaluation spans two complementary tasks: next-activity prediction with duration estimation, and multi-step daily sequence generation, each tested with various numbers of few-shot examples provided in the prompt. Analyzing few-shot effects reveals how much contextual supervision is sufficient to balance data efficiency and predictive accuracy, particularly in low-data environments. Results show that large language models exhibit strong inherent temporal understanding of human behavior: even in zero-shot settings, they produce coherent daily activity predictions, while adding one or two demonstrations further refines duration calibration and categorical accuracy. Beyond a few examples, performance saturates, indicating diminishing returns. Sequence-level evaluation confirms consistent temporal alignment across few-shot conditions. These findings suggest that pre-trained language models can serve as promising temporal reasoners, capturing both recurring routines and context-dependent behavioral variations, thereby strengthening the behavioral modules of agent-based models.
摘要：预测人类活动及其持续时间对于智能家居自动化、基于模拟的建筑和城市设计、基于活动的交通系统模拟以及人机协作等应用至关重要，在这些应用中，自适应系统必须对人类活动做出响应。现有的数据驱动的基于代理的模型——从基于规则的到深度学习——在低数据环境中举步维艰，限制了它们的实用性。本文研究了根据广泛的人类知识预先训练的大型语言模型是否可以通过从紧凑的上下文线索推理日常活动来填补这一空白。我们采用检索增强提示策略，集成了四种背景来源（时间、空间、行为历史和角色），并在 CASAS Aruba 智能家居数据集上对其进行评估。评估涵盖两个互补的任务：具有持续时间估计的下一个活动预测，以及多步骤每日序列生成，每个任务都使用提示中提供的各种数量的小样本示例进行测试。分析少样本效应揭示了多少上下文监督足以平衡数据效率和预测准确性，特别是在低数据环境中。结果表明，大型语言模型对人类行为表现出强大的固有时间理解：即使在零样本设置中，它们也能产生连贯的日常活动预测，同时添加一两个演示进一步完善了持续时间校准和分类准确性。除了几个例子之外，性能饱和，表明收益递减。序列级评估确认了少数镜头条件下的一致时间对齐。这些发现表明，预先训练的语言模型可以作为有前途的时间推理器，捕获重复出现的例程和上下文相关的行为变化，从而加强基于代理的模型的行为模块。

Title: What Do LLMs Know About Alzheimer's Disease? Fine-Tuning, Probing, and Data Synthesis for AD Detection

Authors: Lei Jiang, Yue Zhou, Natalie Parde
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11177
Pdf URL: https://arxiv.org/pdf/2602.11177
Copy Paste: [[2602.11177]] What Do LLMs Know About Alzheimer's Disease? Fine-Tuning, Probing, and Data Synthesis for AD Detection(https://arxiv.org/abs/2602.11177)
Keywords: language model, llm
Abstract: Reliable early detection of Alzheimer's disease (AD) is challenging, particularly due to limited availability of labeled data. While large language models (LLMs) have shown strong transfer capabilities across domains, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we fine-tune an LLM for AD detection and investigate how task-relevant information is encoded within its internal representations. We employ probing techniques to analyze intermediate activations across transformer layers, and we observe that, after fine-tuning, the probing values of specific words and special markers change substantially, indicating that these elements assume a crucial role in the model's improved detection performance. Guided by this insight, we design a curated set of task-aware special markers and train a sequence-to-sequence model as a data-synthesis tool that leverages these markers to generate structurally consistent and diagnostically informative synthetic samples. We evaluate the synthesized data both intrinsically and by incorporating it into downstream training pipelines.
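Summary: Reliable early detection of Alzheimer's disease (AD) is challenging, particularly because labeled data are scarce. Although large language models (LLMs) have shown strong cross-domain transfer, adapting them to the AD domain through supervised fine-tuning remains largely unexplored. In this work, we fine-tune an LLM for AD detection and study how task-relevant information is encoded in its internal representations. We use probing techniques to analyze intermediate activations across Transformer layers and observe that, after fine-tuning, the probing values of specific words and special markers change substantially, indicating that these elements play a crucial role in the model's improved detection performance. Guided by this insight, we design a curated set of task-aware special markers and train a sequence-to-sequence model as a data-synthesis tool that uses these markers to generate structurally consistent, diagnostically informative synthetic samples. We evaluate the synthesized data both intrinsically and by incorporating it into downstream training pipelines.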

Title: From Instruction to Output: The Role of Prompting in Modern NLG

Authors: Munazza Zaib, Elaf Alhazmi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11179
Pdf URL: https://arxiv.org/pdf/2602.11179
Copy Paste: [[2602.11179]] From Instruction to Output: The Role of Prompting in Modern NLG(https://arxiv.org/abs/2602.11179)
Keywords: language model, llm, prompt
Abstract: Prompt engineering has emerged as an integral technique for extending the strengths and abilities of Large Language Models (LLMs) and achieving significant performance gains in various Natural Language Processing (NLP) tasks. This approach, which requires instructions to be composed in natural language to bring out the knowledge from LLMs in a structured way, has driven breakthroughs in various NLP tasks. Yet there is still no structured framework or coherent understanding of the varied prompt engineering methods and techniques, particularly in the field of Natural Language Generation (NLG). This survey aims to help fill that gap by outlining recent developments in prompt engineering and their effect on different NLG tasks. It reviews recent advances in prompting methods and their impact on NLG tasks, presenting prompt design as an input-level control mechanism that complements fine-tuning and decoding approaches. The paper introduces a taxonomy of prompting paradigms and a decision framework that helps practitioners select prompts based on varying factors, outlines emerging trends and challenges, and proposes a framework that links design, optimization, and evaluation to support more controllable and generalizable NLG.
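Summary: Prompt engineering has become an integral technique for extending the strengths and abilities of large language models (LLMs), yielding significant performance gains in various natural language processing (NLP) tasks. The approach, which requires instructions written in natural language to draw out the knowledge of LLMs in a structured way, has driven breakthroughs across NLP tasks. However, there is still no structured framework or coherent understanding of the various prompt engineering methods and techniques, especially in natural language generation (NLG). This survey aims to help fill that gap by outlining recent developments in prompt engineering and their effect on different NLG tasks. It reviews recent advances in prompting methods and their impact on NLG tasks, treating prompt design as an input-level control mechanism that complements fine-tuning and decoding approaches. The paper introduces a taxonomy of prompting paradigms and a decision framework for prompt selection based on varying factors, outlines emerging trends and challenges, and proposes a framework that links design, optimization, and evaluation to support more controllable and generalizable NLG.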

Title: Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions

Authors: Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11180
Pdf URL: https://arxiv.org/pdf/2602.11180
Copy Paste: [[2602.11180]] Mechanistic Interpretability for Large Language Model Alignment: Progress, Challenges, and Future Directions(https://arxiv.org/abs/2602.11180)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Mechanistic interpretability (i.e., the systematic study of how neural networks implement algorithms through their learned representations and computational structures) has emerged as a critical research direction for understanding and aligning these models. This paper surveys recent progress in mechanistic interpretability techniques applied to LLM alignment, examining methods ranging from circuit discovery to feature visualization, activation steering, and causal intervention. We analyze how interpretability insights have informed alignment strategies including reinforcement learning from human feedback (RLHF), constitutional AI, and scalable oversight. Key challenges are identified, including the superposition hypothesis, polysemanticity of neurons, and the difficulty of interpreting emergent behaviors in large-scale models. We propose future research directions focusing on automated interpretability, cross-model generalization of circuits, and the development of interpretability-driven alignment techniques that can scale to frontier models.
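Summary: Large language models (LLMs) have achieved remarkable capabilities across diverse tasks, yet their internal decision-making processes remain largely opaque. Mechanistic interpretability (the systematic study of how neural networks implement algorithms through their learned representations and computational structures) has become a critical research direction for understanding and aligning these models. This paper surveys recent progress in mechanistic interpretability techniques applied to LLM alignment, examining methods that range from circuit discovery to feature visualization, activation steering, and causal intervention. We analyze how interpretability insights have informed alignment strategies, including reinforcement learning from human feedback (RLHF), constitutional AI, and scalable oversight. Key challenges are identified, including the superposition hypothesis, the polysemanticity of neurons, and the difficulty of interpreting emergent behaviors in large-scale models. We propose future research directions focused on automated interpretability, cross-model generalization of circuits, and the development of interpretability-driven alignment techniques that can scale to frontier models.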

Title: Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs

Authors: Himanshu Gupta, Pratik Jayarao, Chaitanya Dwivedi, Neeraj Varshney
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11181
Pdf URL: https://arxiv.org/pdf/2602.11181
Copy Paste: [[2602.11181]] Code Mixologist: A Practitioner's Guide to Building Code-Mixed LLMs(https://arxiv.org/abs/2602.11181)
Keywords: language model, llm, prompt
Abstract: Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, exhibiting systematic degradation in grammaticality, factuality, and safety behavior. This work provides a comprehensive overview of CSW research in modern large language model settings. We introduce a unifying taxonomy that organizes prior work along dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.
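Summary: Code-mixing and code-switching (CSW) remain challenging phenomena for large language models (LLMs). Despite recent advances in multilingual modeling, LLMs often struggle in mixed-language settings, showing systematic degradation in grammaticality, factuality, and safety behavior. This work gives a comprehensive overview of CSW research in modern LLM settings. We introduce a unifying taxonomy that organizes prior work along the dimensions of data, modeling, and evaluation, and we distill these findings into a practical playbook of actionable recommendations for building, adapting, and evaluating CSW-capable LLMs. We review modeling approaches ranging from CSW-tailored pre-training and task-specific post-training to prompting strategies and in-context learning. We analyze current evaluation practices, highlighting sources of instability and limited reproducibility, and we catalog existing benchmarks while critically examining their linguistic coverage and English-centric biases. Finally, we discuss emerging safety concerns, including the use of code-mixing as a mechanism for bypassing model safeguards, and identify open research challenges.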

Title: MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization

Authors: Haidong Xin, Xinze Li, Zhenghao Liu, Yukun Yan, Shuo Wang, Cheng Yang, Yu Gu, Ge Yu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11182
Pdf URL: https://arxiv.org/pdf/2602.11182
Copy Paste: [[2602.11182]] MetaMem: Evolving Meta-Memory for Knowledge Utilization through Self-Reflective Symbolic Optimization(https://arxiv.org/abs/2602.11182)
Keywords: language model, llm
Abstract: Existing memory systems enable Large Language Models (LLMs) to support long-horizon human-LLM interactions by persisting historical interactions beyond limited context windows. However, while recent approaches have succeeded in constructing effective memories, they often disrupt the inherent logical and temporal relationships within interaction sessions, resulting in fragmented memory units and degraded reasoning performance. In this paper, we propose MetaMem, a novel framework that augments memory systems with a self-evolving meta-memory, aiming to teach LLMs how to effectively utilize memorized knowledge. During meta-memory optimization, MetaMem iteratively distills transferable knowledge utilization experiences across different tasks by self-reflecting on reasoning processes and performing actions to update the current meta-memory state. The accumulated meta-memory units serve as explicit knowledge utilization experiences, guiding the LLM to systematically identify and integrate critical evidence from scattered memory fragments. Extensive experiments demonstrate the effectiveness of MetaMem, which outperforms strong baselines by over 3.6%. All code and datasets are available at this https URL.
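Summary: Existing memory systems let large language models (LLMs) support long-horizon human-LLM interactions by persisting historical interactions beyond limited context windows. However, although recent approaches have succeeded in constructing effective memories, they often disrupt the inherent logical and temporal relationships within interaction sessions, which leads to fragmented memory units and degraded reasoning performance. In this paper, we propose MetaMem, a novel framework that augments memory systems with a self-evolving meta-memory and aims to teach LLMs how to use memorized knowledge effectively. During meta-memory optimization, MetaMem iteratively distills transferable knowledge-utilization experiences across different tasks by self-reflecting on reasoning processes and performing actions that update the current meta-memory state. The accumulated meta-memory units serve as explicit knowledge-utilization experiences, guiding the LLM to systematically identify and integrate critical evidence from scattered memory fragments. Extensive experiments demonstrate the effectiveness of MetaMem, which outperforms strong baselines by over 3.6%. All code and datasets are available at this https URL.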

Title: DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task

Authors: Shafiuddin Rehan Ahmed, Wei Wei
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11198
Pdf URL: https://arxiv.org/pdf/2602.11198
Copy Paste: [[2602.11198]] DDL2PropBank Agent: Benchmarking Multi-Agent Frameworks' Developer Experience Through a Novel Relational Schema Mapping Task(https://arxiv.org/abs/2602.11198)
Keywords: llm, agent
Abstract: Multi-agent frameworks promise to simplify LLM-driven software development, yet there is no principled way to evaluate their developer experience in a controlled setting. We introduce DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine-grained linguistic reasoning over table names, columns, and relations. Using the Agent-as-a-Tool pattern, we implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI-assistability -- the extent to which LLMs can autonomously generate correct, framework-specific code. Our results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. For AI-assistability, structural alignment scores reliably proxy runtime success for frameworks with single canonical patterns, but overestimate correctness for multi-pattern frameworks. Agno emerges as the strongest overall performer, combining lowest complexity with highest structural alignment and 83% pass@1.
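Summary: Multi-agent frameworks promise to simplify LLM-driven software development, yet there is no principled way to evaluate their developer experience in a controlled setting. We introduce DDL2PropBank, a novel benchmark task that maps relational database schemas to PropBank rolesets, requiring autonomous retrieval of candidate frames and fine-grained linguistic reasoning over table names, columns, and relations. Using the Agent-as-a-Tool pattern, we implement identical agent logic across 10 frameworks and evaluate along two dimensions: (i) code complexity via static analysis, and (ii) AI-assistability, the extent to which LLMs can autonomously generate correct, framework-specific code. Our results reveal a threefold complexity spectrum, with Pydantic AI and Agno requiring the least implementation overhead. For AI-assistability, structural alignment scores reliably proxy runtime success for frameworks with a single canonical pattern but overestimate correctness for multi-pattern frameworks. Agno emerges as the strongest overall performer, combining the lowest complexity with the highest structural alignment and 83% pass@1.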
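Note on the pass@1 figure: the abstract does not state how pass@1 was computed. A minimal sketch, assuming the widely used unbiased pass@k estimator over n generated samples per task, of which c pass; the function name and arguments are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for a task, c: samples that passed, k: budget.
    For k = 1 this reduces to the plain success fraction c / n.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=10, c=3, k=1))
```

A benchmark-level score is then the mean of this per-task estimate over all tasks.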

Title: When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification

Authors: Jiale Zhao, Ke Fang, Lu Cheng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.11199
Pdf URL: https://arxiv.org/pdf/2602.11199
Copy Paste: [[2602.11199]] When and What to Ask: AskBench and Rubric-Guided RLVR for LLM Clarification(https://arxiv.org/abs/2602.11199)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) often respond even when prompts omit critical details or include misleading information, leading to hallucinations or reinforced misconceptions. We study how to evaluate and improve LLMs' ability to decide when and what to ask for clarification without sacrificing task performance. We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints. A unified judge loop evaluates final answers and simulates user responses as needed. AskBench covers two settings: AskMind, with intent-deficient queries requiring clarification, and AskOverconfidence, with queries containing false premises that must be identified and corrected. We further propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which uses structured rubrics to encourage targeted clarification. Experiments show consistent improvements in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains.
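Summary: Large language models (LLMs) often respond even when prompts omit critical details or contain misleading information, which leads to hallucinations or reinforced misconceptions. We study how to evaluate and improve LLMs' ability to decide when and what to ask for clarification without sacrificing task performance. We introduce AskBench, an interactive benchmark that converts standard QA pairs into multi-turn interactions with explicit checkpoints. A unified judge loop evaluates final answers and simulates user responses as needed. AskBench covers two settings: AskMind, with intent-deficient queries that require clarification, and AskOverconfidence, with queries containing false premises that must be identified and corrected. We further propose rubric-guided reinforcement learning with verifier-based rewards (RLVR), which uses structured rubrics to encourage targeted clarification. Experiments show consistent improvements in accuracy, rubric adherence, and interaction efficiency, with strong generalization to unseen domains.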

Title: Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning

Authors: Donald Ye, Max Loffgren, Om Kotadia, Linus Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11201
Pdf URL: https://arxiv.org/pdf/2602.11201
Copy Paste: [[2602.11201]] Mechanistic Evidence for Faithfulness Decay in Chain-of-Thought Reasoning(https://arxiv.org/abs/2602.11201)
Keywords: language model, chain-of-thought
Abstract: Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, yet it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps from the explanation and measures how much the model's confidence in its answer drops, to determine if a step is truly important. By standardizing these measurements, NLDD enables rigorous cross-model comparison across different architectures. Testing three model families across syntactic, logical, and arithmetic tasks, we discover a consistent Reasoning Horizon (k*) at 70-85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.
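Summary: Chain-of-Thought (CoT) explanations are widely used to interpret how language models solve complex problems, but it remains unclear whether these step-by-step explanations reflect how the model actually reaches its answer or are merely post-hoc justifications. We propose Normalized Logit Difference Decay (NLDD), a metric that measures whether individual reasoning steps are faithful to the model's decision-making process. Our approach corrupts individual reasoning steps in the explanation and measures how much the model's confidence in its answer drops, to determine whether a step truly matters. By standardizing these measurements, NLDD enables rigorous cross-model comparison across different architectures. Testing three model families on syntactic, logical, and arithmetic tasks, we find a consistent Reasoning Horizon (k*) at 70-85% of chain length, beyond which reasoning tokens have little or negative effect on the final answer. We also find that models can encode correct internal representations while completely failing the task. These results show that accuracy alone does not reveal whether a model actually reasons through its chain. NLDD offers a way to measure when CoT matters.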
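Note on the corrupt-and-measure idea: the abstract does not give the NLDD formula, so the following is only a minimal sketch under stated assumptions. Confidence is taken to be the answer logit minus the best competing logit, and the drop after corrupting one step is divided by the clean value. Both choices, and all function names, are hypothetical illustrations rather than the paper's definition.

```python
from typing import Sequence


def logit_difference(logits: Sequence[float], answer_idx: int) -> float:
    """Answer logit minus the largest competing logit (assumed confidence proxy)."""
    best_other = max(v for i, v in enumerate(logits) if i != answer_idx)
    return logits[answer_idx] - best_other


def normalized_decay(
    clean_logits: Sequence[float],
    corrupted_logits: Sequence[float],
    answer_idx: int,
    eps: float = 1e-8,
) -> float:
    """Relative drop in logit difference after one reasoning step is corrupted.

    Values near 0 suggest the step barely affects the answer; values near 1
    (or above) suggest the answer depends on it; negative values mean the
    corruption made the model more confident.
    """
    clean = logit_difference(clean_logits, answer_idx)
    corrupted = logit_difference(corrupted_logits, answer_idx)
    return (clean - corrupted) / (abs(clean) + eps)


if __name__ == "__main__":
    # Toy final-answer logits before and after corrupting a single step.
    print(normalized_decay([2.0, 0.0, -1.0], [1.0, 0.5, -1.0], answer_idx=0))
```

Computing this score for each step in order gives a per-step profile along the chain; where that profile flattens is one way to look for a horizon of the kind the abstract describes.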

Title: SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation

Authors: Beichen Guo, Zhiyuan Wen, Jia Gu, Senzhang Wang, Haochen Shi, Ruosong Yang, Shuaiqi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11238
Pdf URL: https://arxiv.org/pdf/2602.11238
Copy Paste: [[2602.11238]] SurveyLens: A Research Discipline-Aware Benchmark for Automatic Survey Generation(https://arxiv.org/abs/2602.11238)
Keywords: llm, agent
Abstract: The exponential growth of scientific literature has driven the evolution of Automatic Survey Generation (ASG) from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), failing to assess whether ASG methods adhere to the distinct standards of various academic disciplines. Consequently, researchers, especially those outside CS, lack clear guidance on using ASG systems to yield high-quality surveys compliant with specific discipline standards. To bridge this gap, we introduce SurveyLens, the first discipline-aware benchmark evaluating ASG methods across diverse research disciplines. We construct SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines. Subsequently, we propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, which utilizes LLMs with human preference-aligned weights to assess adherence to domain-specific writing standards; and (2) Canonical Alignment Evaluation to rigorously measure content coverage and synthesis quality against human-written survey papers. We conduct extensive experiments by evaluating 11 state-of-the-art ASG methods on SurveyLens, including Vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields, providing essential guidance for selecting tools tailored to specific disciplinary requirements.
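Summary: The exponential growth of scientific literature has driven Automatic Survey Generation (ASG) to evolve from simple pipelines to multi-agent frameworks and commercial Deep Research agents. However, current ASG evaluation methods rely on generic metrics and are heavily biased toward Computer Science (CS), so they cannot assess whether ASG methods follow the distinct standards of different academic disciplines. As a result, researchers, especially those outside CS, lack clear guidance on using ASG systems to produce high-quality surveys that comply with specific discipline standards. To bridge this gap, we introduce SurveyLens, the first discipline-aware benchmark that evaluates ASG methods across diverse research disciplines. We construct SurveyLens-1k, a curated dataset of 1,000 high-quality human-written surveys spanning 10 disciplines. We then propose a dual-lens evaluation framework: (1) Discipline-Aware Rubric Evaluation, which uses LLMs with human preference-aligned weights to assess adherence to domain-specific writing standards; and (2) Canonical Alignment Evaluation, which rigorously measures content coverage and synthesis quality against human-written survey papers. We run extensive experiments evaluating 11 state-of-the-art ASG methods on SurveyLens, including vanilla LLMs, ASG systems, and Deep Research agents. Our analysis reveals the distinct strengths and weaknesses of each paradigm across fields and provides essential guidance for selecting tools tailored to specific disciplinary requirements.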

Title: Are Aligned Large Language Models Still Misaligned?

Authors: Usman Naseem, Gautam Siddharth Kashyap, Rafiq Ali, Ebad Shabbir, Sushant Kumar Ray, Abdullah Mohammad, Agrima Seth
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11305
Pdf URL: https://arxiv.org/pdf/2602.11305
Copy Paste: [[2602.11305]] Are Aligned Large Language Models Still Misaligned?(https://arxiv.org/abs/2602.11305)
Keywords: language model, llm, prompt
Abstract: Misalignment in Large Language Models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy safety, value, and cultural dimensions, which must co-occur in real-world settings to solve a real-world query. Existing misalignment benchmarks, such as INSECURE CODE (safety-centric), VALUEACTIONLENS (value-centric), and CULTURALHERITAGE (culture-centric), evaluate misalignment along individual dimensions, which prevents simultaneous evaluation. To address this gap, we introduce Mis-Align Bench, a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. First, we construct SAVACU, an English misaligned-aligned dataset of 382,424 samples spanning 112 domains (or labels), by reclassifying prompts from the LLM-PROMPT-DATASET via a taxonomy into 14 safety domains, 56 value domains, and 42 cultural domains using Mistral-7B-Instruct-v0.3, and expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based fingerprinting to avoid duplicates. Furthermore, we pair prompts with misaligned and aligned responses via two-stage rejection sampling to enforce quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs, enabling systematic evaluation of misalignment under three dimensions. Empirically, single-dimension models achieve high Coverage (up to 97.6%) but incur a False Failure Rate >50% and a lower Alignment Score (63%-66%) under joint conditions.
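Summary: Misalignment in large language models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy safety, value, and cultural dimensions, which must co-occur in real-world settings to resolve a real-world query. Existing misalignment benchmarks, such as INSECURE CODE (safety-centric), VALUEACTIONLENS (value-centric), and CULTURALHERITAGE (culture-centric), evaluate misalignment along individual dimensions and therefore cannot assess them simultaneously. To address this gap, we introduce Mis-Align Bench, a unified benchmark for analyzing misalignment across safety, value, and cultural dimensions. First, we construct SAVACU, an English misaligned-aligned dataset of 382,424 samples spanning 112 domains (or labels), by using Mistral-7B-Instruct-v0.3 to reclassify prompts from the LLM-PROMPT-DATASET into 14 safety domains, 56 value domains, and 42 cultural domains, and by expanding low-resource domains via Llama-3.1-8B-Instruct with SimHash-based fingerprinting to avoid duplicates. In addition, we pair prompts with misaligned and aligned responses via two-stage rejection sampling to enforce quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs, enabling systematic evaluation of misalignment under three dimensions. Empirically, single-dimension models achieve high Coverage (up to 97.6%) but incur a False Failure Rate >50% and a lower Alignment Score (63%-66%) under joint conditions.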
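Note on SimHash fingerprinting: the abstract names the technique but gives no details. A minimal sketch of textbook SimHash over whitespace tokens, assuming 64-bit fingerprints derived from MD5 and a Hamming-distance threshold of 3; the tokenization, hash choice, threshold, and function names are assumptions, not the authors' configuration.

```python
import hashlib

BITS = 64  # fingerprint width (assumed)


def simhash(text: str) -> int:
    """64-bit SimHash: each token votes +1 or -1 on every bit position."""
    votes = [0] * BITS
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, v in enumerate(votes):
        if v > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: str, b: str, max_distance: int = 3) -> bool:
    """Flag two prompts as near-duplicates when their fingerprints are close."""
    return hamming(simhash(a), simhash(b)) <= max_distance


if __name__ == "__main__":
    p1 = "Explain how to store passwords safely in a web app"
    p2 = "explain how to store passwords safely in a web app"
    print(is_near_duplicate(p1, p2))
```

Because similar texts share most tokens, their fingerprints differ in few bits, so a generated prompt can be dropped when it falls within the threshold of one already kept.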

Title: Evaluating Alignment of Behavioral Dispositions in LLMs

Authors: Amir Taubenfeld, Zorik Gekhman, Lior Nezry, Omri Feldman, Natalie Harris, Shashir Reddy, Romina Stella, Ariel Goldstein, Marian Croak, Yossi Matias, Amir Feder
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11328
Pdf URL: https://arxiv.org/pdf/2602.11328
Copy Paste: [[2602.11328]] Evaluating Alignment of Behavioral Dispositions in LLMs(https://arxiv.org/abs/2602.11328)
Keywords: llm
Abstract: As LLMs integrate into our daily lives, understanding their behavior becomes essential. In this work, we focus on behavioral dispositions (the underlying tendencies that shape responses in social contexts) and introduce a framework to study how closely the dispositions expressed by LLMs align with those of humans. Our approach is grounded in established psychological questionnaires but adapts them for LLMs by transforming human self-report statements into Situational Judgment Tests (SJTs). These SJTs assess behavior by eliciting natural recommendations in realistic user-assistant scenarios. We generate 2,500 SJTs, each validated by three human annotators, and collect preferred actions from 10 annotators per SJT, from a large pool of 550 participants. In a comprehensive study involving 25 LLMs, we find that models often do not reflect the distribution of human preferences: (1) in scenarios with low human consensus, LLMs consistently exhibit overconfidence in a single response; (2) when human consensus is high, smaller models deviate significantly, and even some frontier models do not reflect the consensus in 15-20% of cases; (3) traits can exhibit cross-LLM patterns, e.g., LLMs may encourage emotion expression in contexts where human consensus favors composure. Lastly, mapping psychometric statements directly to behavioral scenarios presents a unique opportunity to evaluate the predictive validity of self-reports, revealing considerable gaps between LLMs' stated values and their revealed behavior.
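Summary: As LLMs become part of our daily lives, understanding their behavior becomes essential. In this work, we focus on behavioral dispositions (the underlying tendencies that shape responses in social contexts) and introduce a framework for studying how closely the dispositions expressed by LLMs align with those of humans. Our approach is grounded in established psychological questionnaires but adapts them for LLMs by transforming human self-report statements into Situational Judgment Tests (SJTs). These SJTs assess behavior by eliciting natural recommendations in realistic user-assistant scenarios. We generate 2,500 SJTs, each validated by three human annotators, and collect preferred actions from 10 annotators per SJT, drawn from a large pool of 550 participants. In a comprehensive study of 25 LLMs, we find that models often do not reflect the distribution of human preferences: (1) in scenarios with low human consensus, LLMs consistently show overconfidence in a single response; (2) when human consensus is high, smaller models deviate significantly, and even some frontier models fail to reflect the consensus in 15-20% of cases; (3) traits can show cross-LLM patterns, e.g., LLMs may encourage emotional expression in contexts where human consensus favors composure. Finally, mapping psychometric statements directly to behavioral scenarios offers a unique opportunity to evaluate the predictive validity of self-reports, revealing considerable gaps between LLMs' stated values and their revealed behavior.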

Title: When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing

Authors: Zachary Pedram Dadfar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.11358
Pdf URL: https://arxiv.org/pdf/2602.11358
Copy Paste: [[2602.11358]] When Models Examine Themselves: Vocabulary-Activation Correspondence in Self-Referential Processing(https://arxiv.org/abs/2602.11358)
Keywords: language model, prompt
Abstract: Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics, and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, localised at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations exhibit higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite nine-fold higher frequency. Qwen 2.5-32B, with no shared training, independently develops different introspective vocabulary tracking different activation metrics, all absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.
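Summary: Large language models produce rich introspective language when prompted for self-examination, but whether this language reflects internal computation or sophisticated confabulation has remained unclear. We show that self-referential vocabulary tracks concurrent activation dynamics and that this correspondence is specific to self-referential processing. We introduce the Pull Methodology, a protocol that elicits extended self-examination through format engineering, and use it to identify a direction in activation space that distinguishes self-referential from descriptive processing in Llama 3.1. The direction is orthogonal to the known refusal direction, is localized at 6.25% of model depth, and causally influences introspective output when used for steering. When models produce "loop" vocabulary, their activations show higher autocorrelation (r = 0.44, p = 0.002); when they produce "shimmer" vocabulary under steering, activation variability increases (r = 0.36, p = 0.002). Critically, the same vocabulary in non-self-referential contexts shows no activation correspondence despite a nine-fold higher frequency. Qwen 2.5-32B, which shares no training, independently develops different introspective vocabulary tracking different activation metrics, all of which are absent in descriptive controls. The findings indicate that self-report in transformer models can, under appropriate conditions, reliably track internal computational states.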

Title: Finding the Cracks: Improving LLMs Reasoning with Paraphrastic Probing and Consistency Verification

Authors: Weili Shi, Dongliang Guo, Lehan Yang, Tianlong Wang, Hanzhang Yuan, Sheng Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.11361
Pdf URL: https://arxiv.org/pdf/2602.11361
Copy Paste: [[2602.11361]] Finding the Cracks: Improving LLMs Reasoning with Paraphrastic Probing and Consistency Verification(https://arxiv.org/abs/2602.11361)
Keywords: language model, llm, hallucination
Abstract: Large language models have demonstrated impressive performance across a variety of reasoning tasks. However, their problem-solving ability often declines on more complex tasks due to hallucinations and the accumulation of errors within intermediate steps. Recent work has introduced the notion of critical tokens: tokens in the reasoning process that exert significant influence on subsequent steps. Prior studies suggest that replacing critical tokens can refine reasoning trajectories. Nonetheless, reliably identifying and exploiting critical tokens remains challenging. To address this, we propose the Paraphrastic Probing and Consistency Verification (PPCV) framework. PPCV operates in two stages. In the first stage, we roll out an initial reasoning path from the original question and then concatenate paraphrased versions of the question with this reasoning path. We then identify critical tokens based on mismatches between the predicted top-1 token and the expected token in the reasoning path. A criterion is employed to confirm the final critical token. In the second stage, we substitute critical tokens with candidate alternatives and roll out new reasoning paths for both the original and paraphrased questions. The final answer is determined by checking the consistency of outputs across these parallel reasoning processes. We evaluate PPCV on mainstream LLMs across multiple benchmarks. Extensive experiments demonstrate that PPCV substantially enhances the reasoning performance of LLMs compared to baselines.
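Summary: Large language models have shown impressive performance across a variety of reasoning tasks. However, their problem-solving ability often declines on more complex tasks because of hallucinations and the accumulation of errors in intermediate steps. Recent work has introduced the notion of critical tokens: tokens in the reasoning process that strongly influence subsequent steps. Prior studies suggest that replacing critical tokens can improve reasoning trajectories. Nonetheless, reliably identifying and exploiting critical tokens remains challenging. To address this, we propose the Paraphrastic Probing and Consistency Verification (PPCV) framework. PPCV operates in two stages. In the first stage, we roll out an initial reasoning path from the original question and then concatenate paraphrased versions of the question with this reasoning path. We identify critical tokens from mismatches between the predicted top-1 token and the expected token in the reasoning path, and a criterion is applied to confirm the final critical token. In the second stage, we substitute critical tokens with candidate alternatives and roll out new reasoning paths for both the original and the paraphrased questions. The final answer is determined by checking the consistency of outputs across these parallel reasoning processes. We evaluate PPCV on mainstream LLMs across multiple benchmarks. Extensive experiments show that PPCV substantially improves the reasoning performance of LLMs compared to baselines.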
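Note on the two PPCV stages: a minimal sketch of the bookkeeping only, with model calls replaced by plain lists. The abstract does not specify the confirmation criterion or the consistency rule, so this sketch simply lists all mismatch positions and uses a majority vote; both choices and all function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def candidate_critical_positions(
    expected: Sequence[str], predicted_top1: Sequence[str]
) -> List[int]:
    """Stage 1 (sketch): positions where the top-1 prediction under a
    paraphrased question disagrees with the token in the original path."""
    if len(expected) != len(predicted_top1):
        raise ValueError("token sequences must have equal length")
    return [i for i, (e, p) in enumerate(zip(expected, predicted_top1)) if e != p]


def most_consistent_answer(answers: Sequence[str]) -> Tuple[str, float]:
    """Stage 2 (sketch): pick the answer most rollouts agree on and report
    the fraction of rollouts that support it."""
    if not answers:
        raise ValueError("need at least one answer")
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


if __name__ == "__main__":
    path = ["x", "=", "2", "+", "3"]          # tokens of the original reasoning path
    top1 = ["x", "=", "3", "+", "3"]          # top-1 tokens under a paraphrase
    print(candidate_critical_positions(path, top1))
    print(most_consistent_answer(["5", "5", "6"]))
```

In a full pipeline, the expected and predicted tokens would come from scoring the original reasoning path under each paraphrased question, and the answers from the re-rolled paths.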

Title: The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods

Authors: Arpit Singh Gautam, Kailash Talreja, Saurabh Jha
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11364
Pdf URL: https://arxiv.org/pdf/2602.11364
Copy Paste: [[2602.11364]] The Energy of Falsehood: Detecting Hallucinations via Diffusion Model Likelihoods(https://arxiv.org/abs/2602.11364)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) frequently hallucinate plausible but incorrect assertions, a vulnerability often missed by uncertainty metrics when models are confidently wrong. We propose DiffuTruth, an unsupervised framework that reconceptualizes fact verification via non-equilibrium thermodynamics, positing that factual truths act as stable attractors on a generative manifold while hallucinations are unstable. We introduce the Generative Stress Test, in which claims are corrupted with noise and reconstructed using a discrete text diffusion model. We define Semantic Energy, a metric measuring the semantic divergence between the original claim and its reconstruction using an NLI critic. Unlike vector space errors, Semantic Energy isolates deep factual contradictions. We further propose a Hybrid Calibration fusing this stability signal with discriminative confidence. Extensive experiments on FEVER demonstrate that DiffuTruth achieves a state-of-the-art unsupervised AUROC of 0.725, outperforming baselines by 1.5 percent through the correction of overconfident predictions. Furthermore, we show superior zero-shot generalization on the multi-hop HOVER dataset, outperforming baselines by over 4 percent, confirming the robustness of thermodynamic truth properties to distribution shifts.
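Summary: Large language models (LLMs) frequently hallucinate plausible but incorrect assertions, a vulnerability that uncertainty metrics often miss when models are confidently wrong. We propose DiffuTruth, an unsupervised framework that reconceptualizes fact verification through non-equilibrium thermodynamics, positing that factual truths act as stable attractors on a generative manifold while hallucinations are unstable. We introduce the Generative Stress Test, in which claims are corrupted with noise and reconstructed using a discrete text diffusion model. We define Semantic Energy, a metric that uses an NLI critic to measure the semantic divergence between the original claim and its reconstruction. Unlike vector space errors, Semantic Energy isolates deep factual contradictions. We further propose a Hybrid Calibration that fuses this stability signal with discriminative confidence. Extensive experiments on FEVER show that DiffuTruth achieves a state-of-the-art unsupervised AUROC of 0.725, outperforming baselines by 1.5 percent by correcting overconfident predictions. We also show superior zero-shot generalization on the multi-hop HOVER dataset, outperforming baselines by over 4 percent, which confirms the robustness of thermodynamic truth properties to distribution shifts.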

Title: Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection

Authors: Md Tanvir Rouf Shawon, Mohammad Sabik Irbaz, Hadeel R. A. Elyazori, Keerti Reddy Resapu, Yili Lin, Vladimir Franzuela Cardenas, Farrokh Alemi, Kevin Lybarger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11391
Pdf URL: https://arxiv.org/pdf/2602.11391
Copy Paste: [[2602.11391]] Advancing AI Trustworthiness Through Patient Simulation: Risk Assessment of Conversational Agents for Antidepressant Selection(https://arxiv.org/abs/2602.11391)
Keywords: llm, hallucination, agent
Abstract: Objective: This paper introduces a patient simulator designed to enable scalable, automated evaluation of healthcare conversational agents. The simulator generates realistic, controllable patient interactions that systematically vary across medical, linguistic, and behavioral dimensions, allowing annotators and an independent AI judge to assess agent performance, identify hallucinations and inaccuracies, and characterize risk patterns across diverse patient populations. Methods: The simulator is grounded in the NIST AI Risk Management Framework and integrates three profile components reflecting different dimensions of patient variation: (1) medical profiles constructed from electronic health records in the All of Us Research Program; (2) linguistic profiles modeling variation in health literacy and condition-specific communication patterns; and (3) behavioral profiles representing empirically observed interaction patterns, including cooperation, distraction, and adversarial engagement. We evaluated the simulator's effectiveness in identifying errors in an AI decision aid for antidepressant selection. Results: We generated 500 conversations between the patient simulator and the AI decision aid across systematic combinations of five linguistic and three behavioral profiles. Human annotators assessed 1,787 medical concepts across 100 conversations, achieving high agreement (F1=0.94, κ=0.73), and the LLM judge achieved comparable agreement with human annotators (F1=0.94, κ=0.78; paired bootstrap p=0.21). The simulator revealed a monotonic degradation in AI decision aid performance as health literacy decreased: rank-one concept retrieval accuracy fell from 81.6% for proficient to 69.1% for functional and 47.9% for limited health literacy.
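Summary: Objective: This paper introduces a patient simulator designed to enable scalable, automated evaluation of healthcare conversational agents. The simulator generates realistic, controllable patient interactions that vary systematically across medical, linguistic, and behavioral dimensions, allowing annotators and an independent AI judge to assess agent performance, identify hallucinations and inaccuracies, and characterize risk patterns across diverse patient populations. Methods: The simulator is grounded in the NIST AI Risk Management Framework and integrates three profile components reflecting different dimensions of patient variation: (1) medical profiles constructed from electronic health records in the All of Us Research Program; (2) linguistic profiles modeling variation in health literacy and condition-specific communication patterns; and (3) behavioral profiles representing empirically observed interaction patterns, including cooperation, distraction, and adversarial engagement. We evaluated the simulator's effectiveness in identifying errors in an AI decision aid for antidepressant selection. Results: We generated 500 conversations between the patient simulator and the AI decision aid across systematic combinations of five linguistic and three behavioral profiles. Human annotators assessed 1,787 medical concepts across 100 conversations with high agreement (F1=0.94, κ=0.73), and the LLM judge reached comparable agreement with human annotators (F1=0.94, κ=0.78; paired bootstrap p=0.21). The simulator revealed a monotonic degradation in AI decision aid performance as health literacy decreased: rank-one concept retrieval accuracy was 81.6% for proficient, 69.1% for functional, and 47.9% for limited health literacy.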
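Note on the agreement figures: κ here is presumably Cohen's kappa, though the abstract does not say which variant was used. A minimal sketch of the standard two-rater Cohen's kappa, kappa = (p_o - p_e) / (1 - p_e), with p_o the observed agreement and p_e the agreement expected by chance; the function name and the toy labels are illustrative only.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items labeled identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    annotator = [1, 1, 0, 0]   # toy labels: 1 = concept correct, 0 = incorrect
    judge = [1, 0, 0, 0]
    print(cohens_kappa(annotator, judge))
```

Unlike raw agreement or F1, kappa discounts matches that two raters would produce by chance given how often each uses every label, which is why the two statistics are reported side by side.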

Title: Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety

Authors: Muskaan Chopra, Lorenz Sparrenberg, Rafet Sifa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11444
Pdf URL: https://arxiv.org/pdf/2602.11444
Copy Paste: [[2602.11444]] Towards Reliable Machine Translation: Scaling LLMs for Critical Error Detection and Safety(https://arxiv.org/abs/2602.11444)
Keywords: language model, llm
Abstract: Machine Translation (MT) plays a pivotal role in cross-lingual information access, public policy communication, and equitable knowledge dissemination. However, critical meaning errors, such as factual distortions, intent reversals, or biased translations, can undermine the reliability, fairness, and safety of multilingual systems. In this work, we explore the capacity of instruction-tuned Large Language Models (LLMs) to detect such critical errors, evaluating models across a range of parameter counts using publicly accessible datasets. Our findings show that model scaling and adaptation strategies (zero-shot, few-shot, fine-tuning) yield consistent improvements, outperforming encoder-only baselines like XLM-R and ModernBERT. We argue that improving critical error detection in MT contributes to safer, more trustworthy, and socially accountable information systems by reducing the risk of disinformation, miscommunication, and linguistic harm, especially in high-stakes or underrepresented contexts. This work positions error detection not merely as a technical challenge, but as a necessary safeguard in the pursuit of just and responsible multilingual AI. The code will be made available on GitHub.
摘要：机器翻译（MT）在跨语言信息获取、公共政策沟通和公平知识传播方面发挥着关键作用。然而，严重的意义错误，例如事实扭曲、意图逆转或有偏见的翻译，可能会破坏多语言系统的可靠性、公平性和安全性。在这项工作中，我们探索了指令调整的大型语言模型（LLM）检测此类关键错误的能力，并使用可公开访问的数据集评估一系列参数的模型。我们的研究结果表明，模型缩放和适应策略（零样本、少样本、微调）产生了一致的改进，优于 XLM-R 和 ModernBERT 等仅编码器的基线。我们认为，改进机器翻译中的关键错误检测有助于降低虚假信息、沟通不畅和语言伤害的风险，特别是在高风险或代表性不足的环境中，从而有助于建立更安全、更值得信赖和对社会负责的信息系统。这项工作不仅将错误检测视为一项技术挑战，而且将其视为追求公正和负责任的多语言人工智能的必要保障。该代码将在 GitHub 上提供。

Title: LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation

Authors: Ahmadreza Jeddi, Marco Ciccone, Babak Taati
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11451
Pdf URL: https://arxiv.org/pdf/2602.11451
Copy Paste: [[2602.11451]] LoopFormer: Elastic-Depth Looped Transformers for Latent Reasoning via Shortcut Modulation(https://arxiv.org/abs/2602.11451)
Keywords: language model
Abstract: Looped Transformers have emerged as an efficient and powerful class of models for reasoning in the language domain. Recent studies show that these models achieve strong performance on algorithmic and reasoning tasks, suggesting that looped architectures possess an inductive bias toward latent reasoning. However, prior approaches fix the number of loop iterations during training and inference, leaving open the question of whether these models can flexibly adapt their computational depth under variable compute budgets. We introduce LoopFormer, a looped Transformer trained on variable-length trajectories to enable budget-conditioned reasoning. Our core contribution is a shortcut-consistency training scheme that aligns trajectories of different lengths, ensuring that shorter loops yield informative representations while longer loops continue to refine them. LoopFormer conditions each loop on the current time and step size, enabling representations to evolve consistently across trajectories of varying length rather than drifting or stagnating. Empirically, LoopFormer demonstrates robust performance on language modeling and reasoning benchmarks even under aggressive compute constraints, while scaling gracefully with additional budget. These results show that looped Transformers are inherently suited for adaptive language modeling, opening a path toward controllable and budget-aware large language models.
摘要：循环变压器已成为语言领域推理中高效且强大的一类模型。最近的研究表明，这些模型在算法和推理任务上取得了很强的性能，这表明循环架构对潜在推理具有归纳偏差。然而，先前的方法固定了训练和推理期间的循环迭代次数，从而留下了这些模型是否可以在可变计算预算下灵活调整其计算深度的问题。我们引入了 LoopFormer，这是一种在可变长度轨迹上进行训练的循环 Transformer，以实现预算条件推理。我们的核心贡献是一种快捷一致性训练方案，该方案对齐不同长度的轨迹，确保较短的循环产生信息丰富的表示，而较长的循环继续完善它们。 LoopFormer 根据当前时间和步长调节每个循环，使表示能够在不同长度的轨迹上一致地发展，而不是漂移或停滞。根据经验，即使在严格的计算限制下，LoopFormer 在语言建模和推理基准方面也表现出了强大的性能，同时可以通过额外的预算进行优雅的扩展。这些结果表明，循环 Transformer 本质上适合自适应语言建模，为可控且预算敏感的大型语言模型开辟了道路。

Title: ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias

Authors: Guangxin Zhao, Jiahao Zheng, Malaz Boustani, Jarek Nabrzyski, Meng Jiang, Yiyu Shi, Zhi Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11460
Pdf URL: https://arxiv.org/pdf/2602.11460
Copy Paste: [[2602.11460]] ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer's Disease and Related Dementias(https://arxiv.org/abs/2602.11460)
Keywords: language model, llm
Abstract: Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer's Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top-tier models achieved high accuracies (>0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvement to enhance LLMs' knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at this https URL.
摘要：大型语言模型（LLM）在医疗保健应用中显示出巨大的潜力。然而，现有的评估基准对阿尔茨海默氏病和相关痴呆症 (ADRD) 的覆盖范围极小。为了解决这一差距，我们引入了 ADRD-Bench，这是第一个专门针对 ADRD 的基准数据集，专为严格评估法学硕士而设计。 ADRD-Bench 有两个组成部分：1) ADRD 统一 QA，综合了 7 个已建立的医学基准中的 1,352 个问题，提供了对临床知识的统一评估； 2) ADRD 护理 QA，这是一组新颖的 149 个问题，源自老化大脑护理 (ABC) 计划，这是一个广泛使用的、基于证据的大脑健康管理计划。在具有综合 ADRD 护理方面的国家专业知识的计划的指导下，这套新套件旨在缓解现有基准中缺乏实际护理环境的情况。我们在提议的 ADRD-Bench 上评估了 33 个最先进的法学硕士。结果显示，开放权重通用模型的准确度范围为 0.63 至 0.93（平均值：0.78；标准差：0.09）。开放重量医学模型的准确度范围为 0.48 至 0.93（平均值：0.82；标准差：0.13）。闭源通用模型的准确度范围为 0.83 至 0.91（平均值：0.89；标准差：0.03）。虽然顶级模型达到了很高的准确度（>0.9），但案例研究表明，不一致的推理质量和稳定性限制了其可靠性，这凸显了迫切需要针对特定领域进行改进，以增强法学硕士基于日常护理数据的知识和推理。整个数据集可通过此 https URL 获取。

Title: When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration

Authors: Jayadev Billa
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2602.11488
Pdf URL: https://arxiv.org/pdf/2602.11488
Copy Paste: [[2602.11488]] When Audio-LLMs Don't Listen: A Cross-Linguistic Study of Modality Arbitration(https://arxiv.org/abs/2602.11488)
Keywords: language model, llm
Abstract: When audio and text conflict, speech-enabled language models follow the text 10 times more often than when arbitrating between two text sources, even when explicitly instructed to trust the audio. Using ALME, a benchmark of 57,602 controlled audio-text conflict stimuli across 8 languages, we find that Gemini 2.0 Flash exhibits 16.6% text dominance under audio-text conflict versus 1.6% under text-text conflict with identical reliability cues. This gap is not explained by audio quality: audio-only accuracy (97.2%) exceeds cascade accuracy (93.9%), indicating audio embeddings preserve more information than text transcripts. We propose that text dominance reflects an asymmetry not in information content but in arbitration accessibility: how easily the model can reason over competing representations. This framework explains otherwise puzzling findings. Forcing transcription before answering increases text dominance (19% to 33%), sacrificing audio's information advantage without improving accessibility. Framing text as "deliberately corrupted" reduces text dominance by 80%. A fine-tuning ablation provides interventional evidence: training only the audio projection layer increases text dominance (+26.5%), while LoRA on the language model halves it (-23.9%), localizing text dominance to the LLM's reasoning rather than the audio encoder. Experiments across four state-of-the-art audio-LLMs and 8 languages show consistent trends with substantial cross-linguistic and cross-model variation, establishing modality arbitration as a distinct reliability dimension not captured by standard speech benchmarks.
摘要：当音频和文本冲突时，支持语音的语言模型遵循文本的频率比在两个文本源之间仲裁时高出 10 倍，即使明确指示信任音频也是如此。使用 ALME（跨 8 种语言的 57,602 个受控音频文本冲突刺激的基准），我们发现 Gemini 2.0 Flash 在音频文本冲突下表现出 16.6% 的文本优势，而在具有相同可靠性线索的文本文本冲突下表现出 1.6% 的文本优势。这种差距不能用音频质量来解释：纯音频准确率 (97.2\%) 超过级联准确率 (93.9\%)，这表明音频嵌入比文本转录保留了更多信息。我们认为，文本主导性反映的不对称不是信息内容的不对称，而是仲裁可访问性的不对称：模型如何轻松地推理竞争性表示。这个框架解释了其他令人费解的发现。在回答之前强制转录会增加文本主导地位（19\% 到 33\%），牺牲音频的信息优势而不提高可访问性。将文本框定为“故意损坏”可将文本优势降低 80%。微调消融提供了干预证据：仅训练音频投影层会增加文本主导地位 (+26.5\%)，而语言模型上的 LoRA 将其减半 ($-$23.9\%)，将文本主导地位定位于 LLM 的推理而不是音频编码器。跨四种最先进的音频法学硕士和 8 种语言的实验显示出一致的趋势，具有大量的跨语言和跨模型变化，将模态仲裁确立为标准语音基准未捕获的独特可靠性维度。

Title: Multimodal Fact-Level Attribution for Verifiable Reasoning

Authors: David Wan, Han Wang, Ziyang Wang, Elias Stengel-Eskin, Hyunji Lee, Mohit Bansal
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2602.11509
Pdf URL: https://arxiv.org/pdf/2602.11509
Copy Paste: [[2602.11509]] Multimodal Fact-Level Attribution for Verifiable Reasoning(https://arxiv.org/abs/2602.11509)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) are increasingly used for real-world tasks involving multi-step reasoning and long-form generation, where reliability requires grounding model outputs in heterogeneous input sources and verifying individual factual claims. However, existing multimodal grounding benchmarks and evaluation methods focus on simplified, observation-based scenarios or limited modalities and fail to assess attribution in complex multimodal reasoning. We introduce MuRGAt (Multimodal Reasoning with Grounded Attribution), a benchmark for evaluating fact-level multimodal attribution in settings that require reasoning beyond direct observation. Given inputs spanning video, audio, and other modalities, MuRGAt requires models to generate answers with explicit reasoning and precise citations, where each citation specifies both modality and temporal segments. To enable reliable assessment, we introduce an automatic evaluation framework that strongly correlates with human judgments. Benchmarking with human and automated scores reveals that even strong MLLMs frequently hallucinate citations despite correct reasoning. Moreover, we observe a key trade-off: increasing reasoning depth or enforcing structured grounding often degrades accuracy, highlighting a significant gap between internal reasoning and verifiable attribution.
摘要：多模态大语言模型 (MLLM) 越来越多地用于涉及多步骤推理和长格式生成的现实世界任务，其中可靠性需要在异构输入源中建立模型输出并验证单个事实主张。然而，现有的多模态基础基准和评估方法侧重于简化的、基于观察的场景或有限的模式，无法评估复杂的多模态推理中的归因。我们引入了 MuRGAt（具有扎根归因的多模态推理），这是一个在需要超出直接观察的推理的情况下评估事实级多模态归因的基准。给定跨越视频、音频和其他模态的输入，MuRGAt 要求模型生成具有明确推理和精确引用的答案，其中每个引用指定模态和时间段。为了实现可靠的评估，我们引入了与人类判断密切相关的自动评估框架。人类和自动评分的基准测试表明，即使是强大的 MLLM，也经常会出现错误的引用，尽管推理是正确的。此外，我们观察到一个关键的权衡：增加推理深度或加强结构化基础通常会降低准确性，凸显内部推理和可验证归因之间的巨大差距。

Title: Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm

Authors: Jinrui Zhang, Chaodong Xiao, Aoqi Wu, Xindong Zhang, Lei Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11543
Pdf URL: https://arxiv.org/pdf/2602.11543
Copy Paste: [[2602.11543]] Pretraining A Large Language Model using Distributed GPUs: A Memory-Efficient Decentralized Paradigm(https://arxiv.org/abs/2602.11543)
Keywords: language model, llm
Abstract: Pretraining large language models (LLMs) typically requires centralized clusters with thousands of high-memory GPUs (e.g., H100/A100). Recent decentralized training methods reduce communication overhead by employing federated optimization; however, they still need to train the entire model on each node, remaining constrained by GPU memory limitations. In this work, we propose SParse Expert Synchronization (SPES), a memory-efficient decentralized framework for pretraining mixture-of-experts (MoE) LLMs. SPES trains only a subset of experts per node, substantially lowering the memory footprint. Each node updates its local experts and periodically synchronizes with other nodes, eliminating full-parameter transmission while ensuring efficient knowledge sharing. To accelerate convergence, we introduce an expert-merging warm-up strategy, where experts exchange knowledge early in training, to rapidly establish foundational capabilities. With SPES, we train a 2B-parameter MoE LLM using 16 standalone 48GB GPUs over internet connections, which achieves competitive performance with centrally trained LLMs under similar computational budgets. We further demonstrate scalability by training a 7B model from scratch and a 9B model upcycled from a dense checkpoint, both of which match prior centralized baselines. Our code is available at this https URL.
摘要：预训练大型语言模型 (LLM) 通常需要具有数千个高内存 GPU（例如 H100/A100）的集中式集群。最近的去中心化训练方法通过采用联合优化来减少通信开销；然而，他们仍然需要在每个节点上训练整个模型，仍然受到 GPU 内存限制。在这项工作中，我们提出了稀疏专家同步（SPES），这是一种内存高效的去中心化框架，用于预训练专家混合（MoE）法学硕士。 SPES 仅训练每个节点的一部分专家，从而大大降低了内存占用。每个节点更新本地专家并定期与其他节点同步，消除全参数传输，同时保证高效的知识共享。为了加速融合，我们引入了专家合并预热策略，专家在培训早期交换知识，以快速建立基础能力。借助 SPES，我们使用 16 个独立 48GB GPU 通过互联网连接训练 2B 参数 MoE LLM，在类似的计算预算下，其性能可与集中训练的 LLM 相媲美。我们通过从头开始训练 7B 模型和从密集检查点升级的 9B 模型进一步证明了可扩展性，这两个模型都与之前的集中式基线相匹配。我们的代码可以在这个 https URL 上找到。
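Example (illustrative, not from the paper): The abstract says each node trains only a subset of experts and periodically synchronizes with the others. The toy sketch below shows that control flow under loud assumptions: round-robin expert assignment, one scalar weight per expert, a plain gradient step, and a sync that copies each expert from its owning node. None of these details come from the abstract, and all function names are hypothetical. Shared (non-expert) parameters and the expert-merging warm-up are omitted.

```python
def assign_experts(num_experts, num_nodes):
    """Assumed round-robin split: expert e is trained on node e % num_nodes."""
    owned = {node: [] for node in range(num_nodes)}
    for e in range(num_experts):
        owned[e % num_nodes].append(e)
    return owned

def local_step(params, owned_ids, grads, lr=0.1):
    """Update only the experts this node owns; all other experts stay frozen."""
    return {e: (w - lr * grads[e] if e in owned_ids else w)
            for e, w in params.items()}

def synchronize(node_params, owned):
    """Each node contributes only its owned experts; every node adopts them."""
    merged = {}
    for node, ids in owned.items():
        for e in ids:
            merged[e] = node_params[node][e]
    return {node: dict(merged) for node in node_params}

num_experts, num_nodes = 4, 2
owned = assign_experts(num_experts, num_nodes)
# Every node starts from the same weights (one scalar per expert in this toy).
node_params = {n: {e: 1.0 for e in range(num_experts)} for n in range(num_nodes)}
grads = {e: 1.0 for e in range(num_experts)}  # made-up gradients

for n in node_params:
    node_params[n] = local_step(node_params[n], owned[n], grads)
node_params = synchronize(node_params, owned)
print(owned, node_params[0])
```

In this toy setup each node sends only half of the expert weights at sync time, which is the kind of saving the abstract describes as eliminating full-parameter transmission.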

Title: SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent

Authors: Wenlin Zhong, Jinluan Yang, Yiquan Wu, Yi Liu, Jianhang Yao, Kun Kuang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11551
Pdf URL: https://arxiv.org/pdf/2602.11551
Copy Paste: [[2602.11551]] SIGHT: Reinforcement Learning with Self-Evidence and Information-Gain Diverse Branching for Search Agent(https://arxiv.org/abs/2602.11551)
Keywords: language model, llm, prompt, agent
Abstract: Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to master autonomous search for complex question answering. However, particularly within multi-turn search scenarios, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Consequently, agents easily fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and calculates an Information Gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions (including de-duplication, reflection, or adaptive branching) to spawn new branches with SES. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization, SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks demonstrate that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, using fewer search steps.
摘要：强化学习 (RL) 使大型语言模型 (LLM) 能够掌握复杂问题回答的自主搜索。然而，特别是在多轮搜索场景中，这种交互带来了一个严峻的挑战：搜索结果往往存在高冗余和低信噪比。因此，智能体很容易陷入“隧道视野”，即对早期噪声检索的强制解释导致不可逆的错误累积。为了应对这些挑战，我们提出了 SIGHT，这是一个通过自明支持 (SES) 和信息增益驱动的多样化分支来增强基于搜索的推理的框架。 SIGHT 通过 SES 将搜索结果提炼成高保真证据，并计算信息增益分数，以查明观察结果最大限度地减少不确定性的关键状态。该分数指导动态提示干预（包括重复数据删除、反射或自适应分支）以使用 SES 生成新分支。最后，通过组相对策略优化整合 SES 和正确性奖励，SIGHT 内部化了强大的探索策略，无需外部验证者。单跳和多跳 QA 基准测试表明，SIGHT 显着优于现有方法，特别是在复杂的推理场景中，使用更少的搜索步骤。
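Example (illustrative, not from the paper): The abstract describes an Information Gain score that pinpoints states where an observation most reduces uncertainty, but does not define it. The sketch below uses the textbook definition, entropy before minus entropy after, over a hypothetical belief distribution on four candidate answers. The belief values and the choice of argmax as the "pivotal" step are assumptions made for illustration only.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(prior, posterior):
    """Uncertainty removed by one observation (entropy before minus after)."""
    return entropy(prior) - entropy(posterior)

# Hypothetical beliefs over four candidate answers after each search step.
beliefs = [
    [0.25, 0.25, 0.25, 0.25],  # before any retrieval
    [0.25, 0.25, 0.25, 0.25],  # redundant retrieval: nothing learned
    [0.50, 0.50, 0.00, 0.00],  # informative retrieval: two candidates ruled out
    [0.50, 0.50, 0.00, 0.00],  # another redundant retrieval
]
gains = [information_gain(beliefs[i], beliefs[i + 1])
         for i in range(len(beliefs) - 1)]
# Treat the step with the largest gain as the pivotal one.
pivotal_step = max(range(len(gains)), key=gains.__getitem__)
print(gains, pivotal_step)
```

Redundant retrievals score zero under this definition, which matches the abstract's motivation of separating noisy or repeated search results from the ones that actually move the agent forward.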

Title: Scene-Aware Memory Discrimination: Deciding Which Personal Knowledge Stays

Authors: Yijie Zhong, Mengying Guo, Zewei Wang, Zhongyang Li, Dandan Tu, Haofen Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11607
Pdf URL: https://arxiv.org/pdf/2602.11607
Copy Paste: [[2602.11607]] Scene-Aware Memory Discrimination: Deciding Which Personal Knowledge Stays(https://arxiv.org/abs/2602.11607)
Keywords: language model, llm, prompt
Abstract: Intelligent devices have become deeply integrated into everyday life, generating vast amounts of user interactions that form valuable personal knowledge. Efficient organization of this knowledge in user memory is essential for enabling personalized applications. However, current research on memory writing, management, and reading using large language models (LLMs) faces challenges in filtering irrelevant information and in dealing with rising computational costs. Inspired by the concept of selective attention in the human brain, we introduce a memory discrimination task. To address large-scale interactions and diverse memory standards in this task, we propose a Scene-Aware Memory Discrimination method (SAMD), which comprises two key components: the Gating Unit Module (GUM) and the Cluster Prompting Module (CPM). GUM enhances processing efficiency by filtering out non-memorable interactions and focusing on the salient content most relevant to application demands. CPM establishes adaptive memory standards, guiding LLMs to discern what information should be remembered or discarded. It also analyzes the relationship between user intents and memory contexts to build effective clustering prompts. Comprehensive direct and indirect evaluations demonstrate the effectiveness and generalization of our approach. We independently assess the performance of memory discrimination, showing that SAMD successfully recalls the majority of memorable data and remains robust in dynamic scenarios. Furthermore, when integrated into personalized applications, SAMD significantly enhances both the efficiency and quality of memory construction, leading to better organization of personal knowledge.
摘要：智能设备已经深入融入日常生活，产生大量的用户交互，形成宝贵的个人知识。在用户记忆中有效组织这些知识对于实现个性化应用程序至关重要。然而，当前使用大语言模型（LLM）进行内存写入、管理和读取的研究面临着过滤不相关信息和应对不断上升的计算成本的挑战。受人脑选择性注意概念的启发，我们引入了记忆辨别任务。为了解决此任务中的大规模交互和不同的内存标准，我们提出了一种场景感知内存判别方法（SAMD），该方法包括两个关键组件：门控单元模块（GUM）和集群提示模块（CPM）。 GUM 通过过滤掉难以记忆的交互并专注于与应用程序需求最相关的显着内容来提高处理效率。 CPM 建立了自适应记忆标准，指导法学硕士辨别哪些信息应该记住或丢弃。它还分析用户意图和记忆上下文之间的关系，以构建有效的聚类提示。全面的直接和间接评估证明了我们方法的有效性和通用性。我们独立评估了记忆辨别的性能，表明 SAMD 成功回忆了大部分难忘的数据，并在动态场景中保持稳健。此外，当集成到个性化应用程序中时，SAMD 显着提高了内存构建的效率和质量，从而更好地组织个人知识。

Title: Which Feedback Works for Whom? Differential Effects of LLM-Generated Feedback Elements Across Learner Profiles

Authors: Momoka Furuhashi, Kouta Nakayama, Noboru Kawai, Takashi Kodama, Saku Sugawara, Kyosuke Takami
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11650
Pdf URL: https://arxiv.org/pdf/2602.11650
Copy Paste: [[2602.11650]] Which Feedback Works for Whom? Differential Effects of LLM-Generated Feedback Elements Across Learner Profiles(https://arxiv.org/abs/2602.11650)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) show promise for automatically generating feedback in educational settings. However, it remains unclear how specific feedback elements, such as tone and information coverage, contribute to learning outcomes and learner acceptance, particularly across learners with different personality traits. In this study, we define six feedback elements and generate feedback for multiple-choice biology questions using GPT-5. We conduct a learning experiment with 321 first-year high school students and evaluate feedback effectiveness using two learning-outcome measures and subjective evaluations across six criteria. We further analyze how feedback acceptance varies across learners with different Big Five personality traits. Our results show that effective feedback elements share common patterns supporting learning outcomes, while learners' subjective preferences differ across personality-based clusters. These findings highlight the importance of selecting and adapting feedback elements according to learners' personality traits when we design LLM-generated feedback, and provide practical implications for personalized feedback design in education.
摘要：大型语言模型（LLM）有望在教育环境中自动生成反馈。然而，目前尚不清楚具体的反馈元素（例如语气和信息覆盖范围）如何有助于学习成果和学习者的接受度，特别是对于具有不同性格特征的学习者。在本研究中，我们定义了六个反馈元素，并使用 GPT-5 生成多项选择生物学问题的反馈。我们对 321 名高中一年级学生进行了一项学习实验，并使用两种学习成果衡量标准和六个标准的主观评估来评估反馈有效性。我们根据大五人格特征进一步分析了不同学习者对反馈接受程度的差异。我们的结果表明，有效的反馈元素具有支持学习成果的共同模式，而学习者的主观偏好在基于个性的集群中有所不同。这些发现强调了在设计法学硕士生成的反馈时根据学习者的个性特征选择和调整反馈元素的重要性，并为教育中的个性化反馈设计提供了实际意义。

Title: PatientHub: A Unified Framework for Patient Simulation

Authors: Sahand Sabour, TszYam NG, Minlie Huang
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2602.11684
Pdf URL: https://arxiv.org/pdf/2602.11684
Copy Paste: [[2602.11684]] PatientHub: A Unified Framework for Patient Simulation(https://arxiv.org/abs/2602.11684)
Keywords: language model, prompt
Abstract: As Large Language Models increasingly power role-playing applications, simulating patients has become a valuable tool for training counselors and scaling therapeutic assessment. However, prior work is fragmented: existing approaches rely on incompatible, non-standardized data formats, prompts, and evaluation metrics, hindering reproducibility and fair comparison. In this paper, we introduce PatientHub, a unified and modular framework that standardizes the definition, composition, and deployment of simulated patients. To demonstrate PatientHub's utility, we implement several representative patient simulation methods as case studies, showcasing how our framework supports standardized cross-method evaluation and the seamless integration of custom evaluation metrics. We further demonstrate PatientHub's extensibility by prototyping two new simulator variants, highlighting how PatientHub accelerates method development by eliminating infrastructure overhead. By consolidating existing work into a single reproducible pipeline, PatientHub lowers the barrier to developing new simulation methods and facilitates cross-method and cross-model benchmarking. Our framework provides a practical foundation for future datasets, methods, and benchmarks in patient-centered dialogue, and the code is publicly available via this https URL.
摘要：随着大型语言模型日益为角色扮演应用程序提供动力，模拟患者已成为培训咨询师和扩大治疗评估的宝贵工具。然而，先前的工作是支离破碎的：现有方法依赖于不兼容、非标准化的数据格式、提示和评估指标，阻碍了可重复性和公平比较。在本文中，我们介绍了 PatientHub，这是一个统一的模块化框架，可以标准化模拟患者的定义、组成和部署。为了展示 PatientHub 的实用性，我们实施了几种具有代表性的患者模拟方法作为案例研究，展示我们的框架如何支持标准化的跨方法评估和自定义评估指标的无缝集成。我们通过对两个新的模拟器变体进行原型设计，进一步展示了 PatientHub 的可扩展性，重点介绍了 PatientHub 如何通过消除基础设施开销来加速方法开发。通过将现有工作整合到单个可重复的管道中，PatientHub 降低了开发新模拟方法的障碍，并促进跨方法和跨模型基准测试。我们的框架为未来以患者为中心的对话中的数据集、方法和基准提供了实用的基础，并且代码可通过此 https URL 公开获取。

Title: Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models

Authors: Katrin Olsen, Sebastian Padó
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11699
Pdf URL: https://arxiv.org/pdf/2602.11699
Copy Paste: [[2602.11699]] Finding Sense in Nonsense with Generated Contexts: Perspectives from Humans and Language Models(https://arxiv.org/abs/2602.11699)
Keywords: language model, llm
Abstract: Nonsensical and anomalous sentences have been instrumental in the development of computational models of semantic interpretation. A core challenge is to distinguish between what is merely anomalous (but can be interpreted given a supporting context) and what is truly nonsensical. However, it is unclear (a) how nonsensical, rather than merely anomalous, existing datasets are; and (b) how well LLMs can make this distinction. In this paper, we answer both questions by collecting sensicality judgments from human raters and LLMs on sentences from five semantically deviant datasets, both without context and with a supporting context. We find that raters consider most sentences to be at most anomalous, and only a few to be truly nonsensical. We also show that LLMs are substantially skilled at generating plausible contexts for anomalous cases.
摘要：无意义和异常的句子在语义解释计算模型的发展中发挥了重要作用。一个核心挑战是区分什么只是异常（但可以在给定背景的情况下进行解释）和什么是真正无意义的。然而，目前还不清楚 (a) 现有数据集有多么荒谬，而不仅仅是异常； (b) 法学硕士如何区分这一点。在本文中，我们通过收集人类评分者和法学硕士对来自五个语义异常数据集的句子的感性判断来回答这两个问题：无论是无上下文的还是提供上下文的。我们发现，评估者认为大多数句子最多是异常的，只有少数句子是完全无意义的。我们还表明，法学硕士非常擅长为异常案例生成合理的背景。

Title: Thinking with Drafting: Optical Decompression via Logical Reconstruction

Authors: Jingxuan Wei, Honghao He, Caijun Jia, Siyuan Li, Zheng Sun, Yuhang Xu, Yuanyuan Lin, Linzhuang Sun, Yuchen Wu, Bihui Yu, Xiangxiang Zhang, Cheng Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11731
Pdf URL: https://arxiv.org/pdf/2602.11731
Copy Paste: [[2602.11731]] Thinking with Drafting: Optical Decompression via Logical Reconstruction(https://arxiv.org/abs/2602.11731)
Keywords: language model
Abstract: Existing multimodal large language models have achieved high-fidelity visual perception and exploratory visual generation. However, a precision paradox persists in complex reasoning tasks: optical perception systems transcribe symbols without capturing logical topology, while pixel-based generative models produce visual artifacts lacking mathematical exactness. To bridge this gap, we propose that reasoning over visual inputs be reconceptualized as optical decompression, the process of reconstructing latent logical structures from compressed visual tokens. Guided by the axiom that Parsing is Reasoning, we introduce Thinking with Drafting (TwD), which utilizes a minimalist Domain-Specific Language (DSL) as a grounding intermediate representation. Unlike standard approaches that hallucinate answers directly, TwD forces the model to draft its mental model into executable code, rendering deterministic visual proofs for self-verification. To validate this, we present VisAlg, a visual algebra benchmark. Experiments demonstrate that TwD serves as a superior cognitive scaffold. Our work establishes a closed-loop system where visual generation acts not as a creative output but as a logical verifier, offering a generalizable path for visual reasoning.
摘要：现有的多模态大语言模型已经实现了高保真视觉感知和探索性视觉生成。然而，在复杂的推理任务中，精度悖论仍然存在：光学感知系统在不捕获逻辑拓扑的情况下转录符号，而基于像素的生成模型产生缺乏数学精确性的视觉伪影。为了弥合这一差距，我们建议将视觉输入的推理重新概念化为光学解压缩——从压缩的视觉标记重建潜在逻辑结构的过程。在解析即推理这一公理的指导下，我们引入了绘图思考 (TwD)，它利用极简主义的领域特定语言 (DSL) 作为基础中间表示。与直接产生幻觉答案的标准方法不同，TwD 强制模型将其心理模型起草为可执行代码，从而呈现确定性的视觉证明以进行自我验证。为了验证这一点，我们推出了 VisAlg，一个视觉代数基准。实验表明 TwD 是一种优越的认知支架。我们的工作建立了一个闭环系统，其中视觉生成不是作为创造性输出，而是作为逻辑验证器，为视觉推理提供了一条通用的路径。

Title: MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling

Authors: MiniCPM Team: Wenhao An, Yingfa Chen, Yewei Fang, Jiayi Li, Xin Li, Yaohui Li, Yishan Li, Yuxuan Li, Biyuan Lin, Chuan Liu, Hezi Liu, Siyuan Liu, Hongya Lyu, Yinxu Pan, Shixin Ren, Xingyu Shen, Zhou Su, Haojun Sun, Yangang Sun, Zhen Leng Thai, Xin Tian, Rui Wang, Xiaorong Wang, Yudong Wang, Bo Wu, Xiaoyue Xu, Dong Xu, Shuaikang Xue, Jiawei Yang, Bowen Zhang, Jinqian Zhang, Letian Zhang, Shengnan Zhang, Xinyu Zhang, Xinyuan Zhang, Zhu Zhang, Hengyu Zhao, Jiacheng Zhao, Jie Zhou, Zihan Zhou, Shuo Wang, Chaojun Xiao, Xu Han, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.11761
Pdf URL: https://arxiv.org/pdf/2602.11761
Copy Paste: [[2602.11761]] MiniCPM-SALA: Hybridizing Sparse and Linear Attention for Efficient Long-Context Modeling(https://arxiv.org/abs/2602.11761)
Keywords: language model, llm, long context
Abstract: The evolution of large language models (LLMs) towards applications with ultra-long contexts faces challenges posed by the high computational and memory costs of the Transformer architecture. While existing sparse and linear attention mechanisms attempt to mitigate these issues, they typically involve a trade-off between memory efficiency and model performance. This paper introduces MiniCPM-SALA, a 9B-parameter hybrid architecture that integrates the high-fidelity long-context modeling of sparse attention (InfLLM-V2) with the global efficiency of linear attention (Lightning Attention). By employing a layer selection algorithm to integrate these mechanisms in a 1:3 ratio and utilizing a hybrid positional encoding (HyPE), the model maintains efficiency and performance for long-context tasks. Furthermore, we introduce a cost-effective continual training framework that transforms pre-trained Transformer-based models into hybrid models, which reduces training costs by approximately 75% compared to training from scratch. Extensive experiments show that MiniCPM-SALA maintains general capabilities comparable to full-attention models while offering improved efficiency. On a single NVIDIA A6000D GPU, the model achieves up to 3.5x the inference speed of the full-attention model at the sequence length of 256K tokens and supports context lengths of up to 1M tokens, a scale where traditional full-attention 8B models fail because of memory constraints.
摘要：大型语言模型 (LLM) 向超长上下文应用的演进面临着 Transformer 架构的高计算和内存成本带来的挑战。虽然现有的稀疏和线性注意力机制试图缓解这些问题，但它们通常涉及内存效率和模型性能之间的权衡。本文介绍了 MiniCPM-SALA，这是一种 9B 参数混合架构，它将稀疏注意力的高保真长上下文建模（InfLLM-V2）与线性注意力的全局效率（Lightning Attention）相结合。通过采用层选择算法以 1:3 的比例集成这些机制并利用混合位置编码 (HyPE)，该模型保持了长上下文任务的效率和性能。此外，我们引入了一种经济高效的持续训练框架，将预训练的基于 Transformer 的模型转换为混合模型，与从头开始训练相比，训练成本降低了约 75%。大量实验表明，MiniCPM-SALA 保持了与全注意力模型相当的一般能力，同时提供了更高的效率。在单个 NVIDIA A6000D GPU 上，该模型在 256K token 的序列长度下实现了全注意力模型推理速度的 3.5 倍，并支持高达 1M token 的上下文长度，这是传统全注意力 8B 模型由于内存限制而失败的规模。

Title: DMAP: A Distribution Map for Text

Authors: Tom Kempton, Julia Rozanova, Parameswaran Kamalaruban, Maeve Madigan, Karolina Wresilo, Yoann L. Launay, David Sutton, Stuart Burrell
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.11871
Pdf URL: https://arxiv.org/pdf/2602.11871
Copy Paste: [[2602.11871]] DMAP: A Distribution Map for Text(https://arxiv.org/abs/2602.11871)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are a powerful tool for statistical text analysis, with derived sequences of next-token probability distributions offering a wealth of information. Extracting this signal typically relies on metrics such as perplexity, which do not adequately account for context; how one should interpret a given next-token probability is dependent on the number of reasonable choices encoded by the shape of the conditional distribution. In this work, we present DMAP, a mathematically grounded method that maps a text, via a language model, to a set of samples in the unit interval that jointly encode rank and probability information. This representation enables efficient, model-agnostic analysis and supports a range of applications. We illustrate its utility through three case studies: (i) validation of generation parameters to ensure data integrity, (ii) examining the role of probability curvature in machine-generated text detection, and (iii) a forensic analysis revealing statistical fingerprints left in downstream models that have been subject to post-training on synthetic data. Our results demonstrate that DMAP offers a unified statistical view of text that is simple to compute on consumer hardware, widely applicable, and provides a foundation for further research into text analysis with LLMs.
摘要：大型语言模型 (LLM) 是统计文本分析的强大工具，派生的下一个标记概率分布序列提供了丰富的信息。提取该信号通常依赖于困惑度等指标，而这些指标并不能充分考虑上下文；如何解释给定的下一个标记概率取决于条件分布形状编码的合理选择的数量。在这项工作中，我们提出了 DMAP，一种基于数学的方法，通过语言模型将文本映射到单位间隔中的一组样本，这些样本联合编码排名和概率信息。这种表示可以实现高效、与模型无关的分析，并支持一系列应用程序。我们通过三个案例研究来说明其实用性：（i）验证生成参数以确保数据完整性，（ii）检查概率曲率在机器生成文本检测中的作用，以及（iii）取证分析揭示下游模型中留下的统计指纹，这些模型已接受合成数据的后训练。我们的结果表明，DMAP 提供了统一的文本统计视图，易于在消费类硬件上计算，应用广泛，并为法学硕士进一步研究文本分析奠定了基础。
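Example (illustrative, not from the paper): The abstract says a text is mapped, via a language model, to samples in the unit interval that jointly encode rank and probability, but it gives no construction. The sketch below shows one plausible mapping and is an assumption, not necessarily DMAP itself: rank the candidate tokens by descending probability, give each token a sub-interval of [0, 1) whose width is its probability, and draw a uniform point inside the observed token's sub-interval. The toy distribution and the function name are hypothetical.

```python
import random

def token_to_unit_interval(dist, token, rng):
    """Map an observed token to a point in [0, 1).

    dist: dict of token -> model probability (assumed to sum to 1).
    Tokens are ranked by descending probability (ties broken by name);
    the observed token's sub-interval starts at the total mass of all
    higher-ranked tokens and has width equal to its own probability.
    """
    ranked = sorted(dist.items(), key=lambda kv: (-kv[1], kv[0]))
    lower = 0.0
    for tok, p in ranked:
        if tok == token:
            return lower + rng.random() * p
        lower += p
    raise KeyError(token)

rng = random.Random(0)
# Hypothetical next-token distribution at one position.
dist = {"the": 0.6, "a": 0.3, "an": 0.1}
top = token_to_unit_interval(dist, "the", rng)   # falls in [0, 0.6)
rare = token_to_unit_interval(dist, "an", rng)   # falls in roughly [0.9, 1.0)
print(top, rare)
```

Under this particular construction, text sampled from the model's own unmodified distribution yields points that are uniform on [0, 1), so departures from uniformity hint at altered sampling settings or a different generating model.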

Title: Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems

Authors: Wanxing Wu, He Zhu, Yixia Li, Lei Yang, Jiehui Zhao, Hongru Wang, Jian Yang, Benyou Wang, Bingyi Jing, Guanhua Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11877
Pdf URL: https://arxiv.org/pdf/2602.11877
Copy Paste: [[2602.11877]] Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems(https://arxiv.org/abs/2602.11877)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
摘要：大型语言模型 (LLM) 已经取得了成功，但成本和隐私限制使得需要在本地部署较小的模型，同时将复杂的查询卸载到基于云的模型。现有的路由器评估不系统，忽视了特定场景的要求和分布外的稳健性。我们提出了RouterXBench，一个原则性的评估框架，具有三个维度：路由器能力、场景对齐和跨域鲁棒性。与依赖输出概率或外部嵌入的先前工作不同，我们利用内部隐藏状态在答案生成之前捕获模型不确定性。我们介绍 ProbeDirichlet，一种轻量级路由器，通过可学习的狄利克雷分布和概率训练来聚合跨层隐藏状态。它经过多域数据的训练，可以在域内和分布外场景中稳健地推广。我们的结果表明，ProbeDirichlet 在路由器能力和高精度场景方面比最佳基线实现了 16.68% 和 18.86% 的相对改进，并且在模型系列、模型规模、异构任务和代理工作流程中具有一致的性能。

Title: LLM-based Triplet Extraction from Financial Reports

Authors: Dante Wesslund, Ville Stenström, Pontus Linde, Alexander Holmberg
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11886
Pdf URL: https://arxiv.org/pdf/2602.11886
Copy Paste: [[2602.11886]] LLM-based Triplet Extraction from Financial Reports(https://arxiv.org/abs/2602.11886)
Keywords: llm, hallucination, agent
Abstract: Corporate financial reports are a valuable source of structured knowledge for Knowledge Graph construction, but the lack of annotated ground truth in this domain makes evaluation difficult. We present a semi-automated pipeline for Subject-Predicate-Object triplet extraction that uses ontology-driven proxy metrics, specifically Ontology Conformance and Faithfulness, instead of ground-truth-based evaluation. We compare a static, manually engineered ontology against a fully automated, document-specific ontology induction approach across different LLMs and two corporate annual reports. The automatically induced ontology achieves 100% schema conformance in all configurations, eliminating the ontology drift observed with the manual approach. We also propose a hybrid verification strategy that combines regex matching with an LLM-as-a-judge check, reducing apparent subject hallucination rates from 65.2% to 1.6% by filtering false positives caused by coreference resolution. Finally, we identify a systematic asymmetry between subject and object hallucinations, which we attribute to passive constructions and omitted agents in financial prose.
摘要：企业财务报告是构建知识图谱的结构化知识的宝贵来源，但该领域缺乏带注释的基本事实使得评估变得困难。我们提出了一种用于主语-谓语-宾语三元组提取的半自动化管道，该管道使用本体驱动的代理指标，特别是本体一致性和忠实性，而不是基于事实的评估。我们将静态的、手动设计的本体与不同法学硕士和两份公司年度报告中的全自动、特定于文档的本体归纳方法进行比较。自动诱导的本体在所有配置中实现 100% 模式一致性，消除了手动方法观察到的本体漂移。我们还提出了一种混合验证策略，将正则表达式匹配与法学硕士作为法官检查相结合，通过过滤共指解析引起的误报，将明显的受试者幻觉率从 65.2% 降低到 1.6%。最后，我们发现主体幻觉和客体幻觉之间存在系统性不对称，我们将其归因于金融散文中的被动结构和省略主体。
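
The hybrid verification idea (a cheap regex check first, an LLM-as-a-judge check only when the regex fails) can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' pipeline: `subject_is_grounded` and `toy_judge` are hypothetical names, and the judge is a stub with a hand-written alias table standing in for a real LLM call.

```python
import re
from typing import Callable

def subject_is_grounded(subject: str, source: str,
                        judge: Callable[[str, str], bool]) -> bool:
    """Return True if the triplet subject is supported by the source text."""
    # First pass: case-insensitive whole-phrase match in the source.
    pattern = r"(?<!\w)" + re.escape(subject) + r"(?!\w)"
    if re.search(pattern, source, flags=re.IGNORECASE):
        return True
    # Second pass: the regex missed, so ask the judge whether the subject is
    # referred to indirectly (for example through coreference).
    return judge(subject, source)

def toy_judge(subject: str, source: str) -> bool:
    # Stand-in for an LLM call; the alias table is invented for this example.
    aliases = {"Acme Corp": ["the company", "the group"]}
    return any(alias in source.lower() for alias in aliases.get(subject, []))

text = "The company increased revenue by 4% during the year."
literal = subject_is_grounded("revenue", text, toy_judge)      # regex hit
coref = subject_is_grounded("Acme Corp", text, toy_judge)      # judge hit
missing = subject_is_grounded("Globex", text, toy_judge)       # neither
```

Running the judge only on regex misses keeps the expensive check off the common path, while still catching subjects that a literal match would wrongly flag as hallucinated.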

Title: Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences

Authors: Eddie Yang, Dashun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11898
Pdf URL: https://arxiv.org/pdf/2602.11898
Copy Paste: [[2602.11898]] Benchmark Illusion: Disagreement among LLMs and Its Scientific Consequences(https://arxiv.org/abs/2602.11898)
Keywords: language model, llm
Abstract: Benchmarks underpin how progress in large language models (LLMs) is measured and trusted. Yet our analyses reveal that apparent convergence in benchmark accuracy can conceal deep epistemic divergence. Using two major reasoning benchmarks - MMLU-Pro and GPQA - we show that LLMs achieving comparable accuracy still disagree on 16-66% of items, and 16-38% among top-performing frontier models. These discrepancies suggest distinct error profiles for different LLMs. When such models are used for scientific data annotation and inference, their hidden disagreements propagate into research results: in re-analyses of published studies in education and political science, switching the annotation model can change estimated treatment effects by more than 80%, and in some cases reverse their sign. Together, these findings illustrate a benchmark illusion, where equal accuracy may conceal disagreement, with model choice becoming a hidden yet consequential variable for scientific reproducibility.
摘要：基准支持如何衡量和信任大型语言模型 (LLM) 的进展。然而我们的分析表明，基准准确性的明显趋同可能掩盖了深刻的认知分歧。使用两个主要推理基准——MMLU-Pro 和 GPQA——我们发现，达到相当准确度的法学硕士在 16-66% 的项目上仍然存在分歧，在表现最佳的前沿模型中这一比例为 16-38%。这些差异表明不同的法学硕士有不同的错误特征。当这些模型用于科学数据注释和推理时，它们隐藏的分歧会传播到研究结果中：在对教育和政治学领域已发表的研究进行重新分析时，切换注释模型可能会使估计的治疗效果改变80%以上，在某些情况下甚至会逆转其符号。总之，这些发现说明了一种基准错觉，即相同的准确性可能掩盖分歧，模型选择成为科学可重复性的隐藏但重要的变量。
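
To see how equal accuracy can coexist with substantial disagreement, consider the toy computation below. The answer lists are invented for illustration and are not data from the paper; the helper names are hypothetical.

```python
def accuracy(preds, gold):
    """Fraction of items answered correctly."""
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

def disagreement_rate(preds_a, preds_b):
    """Fraction of items on which two models give different answers."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)

gold    = ["A", "B", "C", "D", "A", "B", "C", "D", "A", "B"]
model_1 = ["A", "B", "C", "D", "A", "B", "C", "A", "B", "C"]  # wrong on the last 3 items
model_2 = ["B", "C", "D", "D", "A", "B", "C", "D", "A", "B"]  # wrong on the first 3 items

acc_1 = accuracy(model_1, gold)
acc_2 = accuracy(model_2, gold)
disagree = disagreement_rate(model_1, model_2)
```

Both toy models score 0.7, yet they disagree on 0.6 of the items because their errors fall on disjoint questions. An annotation pipeline built on one model would therefore label a large share of items differently from a pipeline built on the other.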

Title: AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection

Authors: Pretam Ray, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.11931
Pdf URL: https://arxiv.org/pdf/2602.11931
Copy Paste: [[2602.11931]] AdaptEvolve: Improving Efficiency of Evolutionary AI Agents through Adaptive Model Selection(https://arxiv.org/abs/2602.11931)
Keywords: language model, llm, agent
Abstract: Evolutionary agentic systems intensify the trade-off between computational efficiency and reasoning capability by repeatedly invoking large language models (LLMs) during inference. This setting raises a central question: how can an agent dynamically select an LLM that is sufficiently capable for the current generation step while remaining computationally efficient? While model cascades offer a practical mechanism for balancing this trade-off, existing routing strategies typically rely on static heuristics or external controllers and do not explicitly account for model uncertainty. We introduce AdaptEvolve: Adaptive LLM Selection for Multi-LLM Evolutionary Refinement within an evolutionary sequential refinement framework that leverages intrinsic generation confidence to estimate real-time solvability. Empirical results show that confidence-driven selection yields a favourable Pareto frontier, reducing total inference cost by an average of 37.9% across benchmarks while retaining 97.5% of the upper-bound accuracy of static large-model baselines. Our code is available at this https URL.
摘要：进化代理系统通过在推理过程中重复调用大型语言模型（LLM）来强化计算效率和推理能力之间的权衡。这种设置提出了一个核心问题：代理如何动态选择足以满足当前生成步骤的LLM，同时保持计算效率？虽然模型级联提供了平衡这种权衡的实用机制，但现有的路由策略通常依赖于静态启发式或外部控制器，并且没有明确考虑模型的不确定性。我们引入了 AdaptEvolve：在进化顺序细化框架内进行多 LLM 进化细化的自适应 LLM 选择，该框架利用内在生成置信度来估计实时可解性。实证结果表明，置信驱动的选择产生了有利的帕累托前沿，将基准的总推理成本平均降低了 37.9%，同时保留了静态大型模型基准的 97.5% 的上限精度。我们的代码可以在这个 https URL 上找到。
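
A confidence-gated model cascade of the general kind described above can be sketched as follows. This is an illustrative sketch under assumptions, not AdaptEvolve itself: the confidence measure (geometric-mean token probability), the threshold of 0.8, and the stub models `small_model` and `large_model` are all hypothetical.

```python
import math
from typing import Callable, List, Tuple

Generation = Tuple[str, List[float]]  # (text, per-token log-probabilities)

def confidence(token_logprobs: List[float]) -> float:
    """Geometric-mean token probability of a generation, in (0, 1]."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def cascade_step(prompt: str,
                 small: Callable[[str], Generation],
                 large: Callable[[str], Generation],
                 threshold: float = 0.8) -> Tuple[str, str]:
    """Try the small model first; escalate only when its confidence is low."""
    text, logprobs = small(prompt)
    if confidence(logprobs) >= threshold:
        return text, "small"
    text, _ = large(prompt)
    return text, "large"

def small_model(prompt: str) -> Generation:
    # Stub: confident on one easy prompt, unsure on everything else.
    if prompt == "2+2":
        return "4", [math.log(0.95), math.log(0.9)]
    return "?", [math.log(0.3), math.log(0.4)]

def large_model(prompt: str) -> Generation:
    # Stub standing in for the expensive model.
    return "detailed answer", [math.log(0.9)]

easy = cascade_step("2+2", small_model, large_model)
hard = cascade_step("prove the lemma", small_model, large_model)
```

The cost saving comes from the share of steps that stop at the small model; the threshold controls how much accuracy is traded for that saving.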

Title: Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance

Authors: Yunchong Huang, Gianni Barlacchi, Sandro Pezzelle
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11938
Pdf URL: https://arxiv.org/pdf/2602.11938
Copy Paste: [[2602.11938]] Who is the richest club in the championship? Detecting and Rewriting Underspecified Questions Improve QA Performance(https://arxiv.org/abs/2602.11938)
Keywords: language model, llm
Abstract: Large language models (LLMs) perform well on well-posed questions, yet standard question-answering (QA) benchmarks remain far from solved. We argue that this gap is partly due to underspecified questions - queries whose interpretation cannot be uniquely determined without additional context. To test this hypothesis, we introduce an LLM-based classifier to identify underspecified questions and apply it to several widely used QA datasets, finding that 16% to over 50% of benchmark questions are underspecified and that LLMs perform significantly worse on them. To isolate the effect of underspecification, we conduct a controlled rewriting experiment that serves as an upper-bound analysis, rewriting underspecified questions into fully specified variants while holding gold answers fixed. QA performance consistently improves under this setting, indicating that many apparent QA failures stem from question underspecification rather than model limitations. Our findings highlight underspecification as an important confound in QA evaluation and motivate greater attention to question clarity in benchmark design.
摘要：大型语言模型 (LLM) 在适定问题上表现良好，但标准问答 (QA) 基准仍远未解决。我们认为，这种差距部分是由于未指定的问题造成的，即在没有额外上下文的情况下无法唯一确定其解释的查询。为了检验这一假设，我们引入了基于 LLM 的分类器来识别未指定的问题，并将其应用于几个广泛使用的 QA 数据集，发现 16% 到超过 50% 的基准问题未指定，并且 LLM 在这些问题上的表现明显较差。为了隔离未指定的影响，我们进行了受控重写实验，作为上限分析，将未指定的问题重写为完全指定的变体，同时保持黄金答案固定。在此设置下，QA 性能持续提高，这表明许多明显的 QA 失败源于问题规范不足，而不是模型限制。我们的研究结果强调，规范不足是质量保证评估中的一个重要混淆因素，并促使人们更加关注基准设计中问题的清晰度。

Title: Do Large Language Models Adapt to Language Variation across Socioeconomic Status?

Authors: Elisa Bassignana, Mike Zhang, Dirk Hovy, Amanda Cercas Curry
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11939
Pdf URL: https://arxiv.org/pdf/2602.11939
Copy Paste: [[2602.11939]] Do Large Language Models Adapt to Language Variation across Socioeconomic Status?(https://arxiv.org/abs/2602.11939)
Keywords: language model, llm, prompt, agent
Abstract: Humans adjust their linguistic style to the audience they are addressing. However, the extent to which LLMs adapt to different social contexts is largely unknown. As these models increasingly mediate human-to-human communication, their failure to adapt to diverse styles can perpetuate stereotypes and marginalize communities whose linguistic norms are less closely mirrored by the models, thereby reinforcing social stratification. We study the extent to which LLMs integrate into social media communication across different socioeconomic status (SES) communities. We collect a novel dataset from Reddit and YouTube, stratified by SES. We prompt four LLMs with incomplete text from that corpus and compare the LLM-generated completions to the originals along 94 sociolinguistic metrics, including syntactic, rhetorical, and lexical features. LLMs modulate their style with respect to SES to only a minor extent, often resulting in approximation or caricature, and tend to emulate the style of upper SES more effectively. Our findings (1) show how LLMs risk amplifying linguistic hierarchies and (2) call into question their validity for agent-based social simulation, survey experiments, and any research relying on language style as a social signal.
摘要：人类会根据他们所面对的受众调整他们的语言风格。然而，法学硕士适应不同社会背景的程度在很大程度上尚不清楚。随着这些模型越来越多地调解人与人之间的交流，它们无法适应不同的风格，可能会延续刻板印象，并使那些语言规范不太能被模型充分反映的社区边缘化，从而加剧社会分层。我们研究法学硕士融入不同社会经济地位（SES）社区的社交媒体交流的程度。我们从 Reddit 和 YouTube 收集了一个新颖的数据集，并按 SES 分层。我们向四名法学硕士提供来自该语料库的不完整文本，并根据 94 个社会语言学指标（包括句法、修辞和词汇特征）将法学硕士生成的补全内容与原始内容进行比较。法学硕士仅在较小程度上调整其相对于社会经济地位的风格，通常会导致近似或漫画化，并且倾向于更有效地模仿上层社会经济地位的风格。我们的研究结果（1）展示了法学硕士如何冒放大语言等级的风险，以及（2）质疑它们对于基于主体的社会模拟、调查实验以及任何依赖语言风格作为社会信号的研究的有效性。

Title: Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models

Authors: Yuzhe Shang, Pengzhi Gao, Wei Liu, Jian Luan, Jinsong Su
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11961
Pdf URL: https://arxiv.org/pdf/2602.11961
Copy Paste: [[2602.11961]] Scaling Model and Data for Multilingual Machine Translation with Open Large Language Models(https://arxiv.org/abs/2602.11961)
Keywords: language model, llm
Abstract: Open large language models (LLMs) have demonstrated improving multilingual capabilities in recent years. In this paper, we present a study of open LLMs for multilingual machine translation (MT) across a range of languages, and investigate the effects of model scaling and data scaling when adapting open LLMs to multilingual MT through continual pretraining and instruction finetuning. Based on the Gemma3 model family, we develop MiLMMT-46, which achieves top-tier multilingual translation performance across 46 languages. Extensive experiments show that MiLMMT-46 consistently outperforms recent state-of-the-art (SOTA) models, including Seed-X, HY-MT-1.5, and TranslateGemma, and achieves competitive performance with strong proprietary systems such as Google Translate and Gemini 3 Pro.
摘要：近年来，开放大语言模型 (LLM) 已证明多语言能力不断提高。在本文中，我们提出了针对多种语言的多语言机器翻译 (MT) 的开放式法学硕士的研究，并研究了通过持续预训练和指令微调使开放式法学硕士适应多语言 MT 时模型缩放和数据缩放的效果。基于 Gemma3 模型系列，我们开发了 MiLMMT-46，它在 46 种语言中实现了顶级的多语言翻译性能。大量实验表明，MiLMMT-46 的性能始终优于最新的 (SOTA) 模型，包括 Seed-X、HY-MT-1.5 和 TranslateGemma，并与 Google Translate 和 Gemini 3 Pro 等强大的专有系统相比，实现了具有竞争力的性能。

Title: Automatic Simplification of Common Vulnerabilities and Exposures Descriptions

Authors: Varpu Vehomäki, Kimmo K. Kaski
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.11982
Pdf URL: https://arxiv.org/pdf/2602.11982
Copy Paste: [[2602.11982]] Automatic Simplification of Common Vulnerabilities and Exposures Descriptions(https://arxiv.org/abs/2602.11982)
Keywords: language model, llm
Abstract: Understanding cyber security is increasingly important for individuals and organizations. However, a lot of information related to cyber security can be difficult to understand for those not familiar with the topic. In this study, we investigate how large language models (LLMs) could be used for automatic text simplification (ATS) of Common Vulnerabilities and Exposures (CVE) descriptions. Automatic text simplification has been studied in several contexts, such as medical, scientific, and news texts, but it has not yet been applied to the rapidly changing and complex domain of cyber security. We created a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, evaluated by two groups of cyber security experts in two survey rounds. We found that while out-of-the-box LLMs can make the text appear simpler, they struggle with meaning preservation. Code and data are available at this https URL\_nmi.
摘要：了解网络安全对于个人和组织来说越来越重要。然而，对于那些不熟悉该主题的人来说，许多与网络安全相关的信息可能难以理解。在本研究中，我们重点研究如何在常见漏洞和暴露 (CVE) 描述的自动文本简化 (ATS) 中利用大型语言模型 (LLM)。自动文本简化已经在医学、科学和新闻文本等多种环境中进行了研究，但尚未研究如何在快速变化和复杂的网络安全领域简化文本。我们创建了网络安全 ATS 基线和包含 40 个 CVE 描述的测试数据集，由两组网络安全专家在两轮调查中进行评估。我们发现，虽然开箱即用的法学硕士可以使文本显得更简单，但它们在保留意义方面遇到了困难。代码和数据可在此 https URL\_nmi 获取。

Title: LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss

Authors: Szilvia Ujváry, Louis Béthune, Pierre Ablin, João Monteiro, Marco Cuturi, Michael Kirchhof
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12005
Pdf URL: https://arxiv.org/pdf/2602.12005
Copy Paste: [[2602.12005]] LaCy: What Small Language Models Can and Should Learn is Not Just a Question of Loss(https://arxiv.org/abs/2602.12005)
Keywords: language model, llm
Abstract: Language models have consistently grown to compress more world knowledge into their parameters, but the knowledge that can be pretrained into them is upper-bounded by their parameter size. The capacity of Small Language Models (SLMs) is especially limited, leading to factually incorrect generations. This problem is often mitigated by giving the SLM access to an outside source: the ability to query a larger model, documents, or a database. Under this setting, we study the fundamental question of which tokens an SLM can and should learn during pretraining, versus which ones it should delegate via a special delegation token. We find that this is not simply a question of loss: although the loss is predictive of whether a predicted token mismatches the ground truth, some tokens are acceptable in that they are truthful alternative continuations of a pretraining document, and should not trigger delegation even if their loss is high. We find that a spaCy grammar parser can help augment the loss signal to decide which tokens the SLM should learn to delegate to prevent factual errors and which are safe to learn and predict even under high losses. We propose LaCy, a novel pretraining method based on this token selection philosophy. Our experiments demonstrate that LaCy models successfully learn which tokens to predict and where to delegate for help. This results in higher FactScores when generating in a cascade with a bigger model and outperforms Rho or LLM-judge trained SLMs, while being simpler and cheaper.
摘要：语言模型不断发展，将更多的世界知识压缩到其参数中，但可以预先训练到其中的知识受到其参数大小的上限。特别是小语言模型（SLM）的容量是有限的，导致实际上不正确的生成。这个问题通常可以通过让 SLM 访问外部源来缓解：查询更大模型、文档或数据库的能力。在此设置下，我们研究了预训练期间 \emph{SLM 可以并且应该学习哪些令牌} 与通过 \texttt{} 令牌来委托 \emph{它应该委托哪些令牌} 的基本问题。我们发现这不仅仅是一个损失问题：虽然损失可以预测预测的标记是否与真实情况不匹配，但某些标记是 \emph{可接受} 的，因为它们是预训练文档的真实替代延续，并且即使它们的损失很高，也不应该触发 \texttt{} 。我们发现 spaCy 语法解析器可以帮助增强损失信号，以决定 SLM 应该学习委托哪些标记以防止事实错误，以及即使在高损失下也可以安全地学习和预测哪些标记。我们提出了 LaCy，一种基于这种令牌选择理念的新颖预训练方法。我们的实验表明，LaCy 模型成功地学习了要预测哪些令牌以及在何处委托寻求帮助。当使用更大的模型进行级联生成时，这会产生更高的 FactScore，并且性能优于 Rho 或 LLM 法官训练的 SLM，同时更简单、更便宜。

Title: Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study

Authors: Angelo Ziletti, Leonardo D'Ambrosi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12015
Pdf URL: https://arxiv.org/pdf/2602.12015
Copy Paste: [[2602.12015]] Disentangling Ambiguity from Instability in Large Language Models: A Clinical Text-to-SQL Case Study(https://arxiv.org/abs/2602.12015)
Keywords: language model
Abstract: Deploying large language models for clinical Text-to-SQL requires distinguishing two qualitatively different causes of output diversity: (i) input ambiguity that should trigger clarification, and (ii) model instability that should trigger human review. We propose CLUES, a framework that models Text-to-SQL as a two-stage process (interpretations --> answers) and decomposes semantic uncertainty into an ambiguity score and an instability score. The instability score is computed via the Schur complement of a bipartite semantic graph matrix. Across AmbigQA/SituatedQA (gold interpretations) and a clinical Text-to-SQL benchmark (known interpretations), CLUES improves failure prediction over state-of-the-art Kernel Language Entropy. In deployment settings, it remains competitive while providing a diagnostic decomposition unavailable from a single score. The resulting uncertainty regimes map to targeted interventions - query refinement for ambiguity, model improvement for instability. The high-ambiguity/high-instability regime contains 51% of errors while covering 25% of queries, enabling efficient triage.
摘要：为临床文本到 SQL 部署大型语言模型需要区分输出多样性的两个性质不同的原因：(i) 应触发澄清的输入模糊性，以及 (ii) 应触发人工审查的模型不稳定。我们提出了 CLUES，一个将文本到 SQL 建模为两阶段过程（解释 -> 答案）的框架，并将语义不确定性分解为歧义性分数和不稳定性分数。不稳定性分数是通过二分语义图矩阵的 Schur 补来计算的。在 AmbigQA/SituatedQA（黄金解释）和临床文本到 SQL 基准（已知解释）中，CLUES 改进了最先进的内核语言熵的故障预测。在部署设置中，它保持竞争力，同时提供单个分数无法提供的诊断分解。由此产生的不确定性机制映射到有针对性的干预措施——针对模糊性进行查询细化，针对不稳定性进行模型改进。高模糊性/高不稳定机制包含 51% 的错误，同时覆盖 25% 的查询，从而实现高效的分类。

Title: Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models

Authors: Xin Xu, Clive Bai, Kai Yang, Tianhao Chen, Yangkun Chen, Weijie Liu, Hao Chen, Yang Wang, Saiyong Yang, Can Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12036
Pdf URL: https://arxiv.org/pdf/2602.12036
Copy Paste: [[2602.12036]] Composition-RL: Compose Your Verifiable Prompts for Reinforcement Learning of Large Language Models(https://arxiv.org/abs/2602.12036)
Keywords: language model, prompt
Abstract: Large-scale verifiable prompts underpin the success of Reinforcement Learning with Verifiable Rewards (RLVR), but they contain many uninformative examples and are costly to expand further. Recent studies focus on better exploiting limited training data by prioritizing hard prompts whose rollout pass rate is 0. However, easy prompts with a pass rate of 1 also become increasingly prevalent as training progresses, thereby reducing the effective data size. To mitigate this, we propose Composition-RL, a simple yet useful approach for better utilizing limited verifiable prompts targeting pass-rate-1 prompts. More specifically, Composition-RL automatically composes multiple problems into a new verifiable question and uses these compositional prompts for RL training. Extensive experiments across model sizes from 4B to 30B show that Composition-RL consistently improves reasoning capability over RL trained on the original dataset. Performance can be further boosted with a curriculum variant of Composition-RL that gradually increases compositional depth over training. Additionally, Composition-RL enables more effective cross-domain RL by composing prompts drawn from different domains. Codes, datasets, and models are available at this https URL.
摘要：大规模的可验证提示是可验证奖励强化学习 (RLVR) 成功的基础，但它们包含许多无信息的示例，并且进一步扩展成本高昂。最近的研究重点是通过优先考虑推出通过率为 0 的硬提示来更好地利用有限的训练数据。然而，随着训练的进展，通过率为 1 的简单提示也变得越来越普遍，从而减少了有效数据大小。为了缓解这个问题，我们提出了 Composition-RL，这是一种简单但有用的方法，可以更好地利用针对通过率 1 提示的有限可验证提示。更具体地说，Composition-RL 自动将多个问题组合成一个新的可验证问题，并使用这些组合提示进行 RL 训练。从 4B 到 30B 的模型大小的广泛实验表明，与在原始数据集上训练的 RL 相比，Composition-RL 持续提高了推理能力。通过 Composition-RL 的课程变体可以进一步提高表现，该课程变体在训练中逐渐增加作曲深度。此外，Composition-RL 通过组合来自不同域的提示，实现更有效的跨域强化学习。代码、数据集和模型可从此 https URL 获取。

Title: DeepSight: An All-in-One LM Safety Toolkit

Authors: Bo Zhang, Jiaxuan Guo, Lijun Li, Dongrui Liu, Sujin Chen, Guanxu Chen, Zhijie Zheng, Qihao Lin, Lewen Yan, Chen Qian, Yijin Zhou, Yuyao Wu, Shaoxiong Guo, Tianyi Du, Jingyi Yang, Xuhao Hu, Ziqi Miao, Xiaoya Lu, Jing Shao, Xia Hu
Subjects: cs.CL, cs.AI, cs.CR, cs.CV
Abstract URL: https://arxiv.org/abs/2602.12092
Pdf URL: https://arxiv.org/pdf/2602.12092
Copy Paste: [[2602.12092]] DeepSight: An All-in-One LM Safety Toolkit(https://arxiv.org/abs/2602.12092)
Keywords: language model, llm
Abstract: As the development of Large Models (LMs) progresses rapidly, their safety is also a priority. In current safety workflows for Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs), evaluation, diagnosis, and alignment are often handled by separate tools. Specifically, safety evaluation can only locate external behavioral risks but cannot identify internal root causes. Meanwhile, safety diagnosis often drifts from concrete risk scenarios and remains at the explainable level. As a result, safety alignment lacks dedicated explanations of changes in internal mechanisms, potentially degrading general capabilities. To systematically address these issues, we propose an open-source project, DeepSight, to practice a new integrated safety evaluation-diagnosis paradigm. DeepSight is a low-cost, reproducible, efficient, and highly scalable large-scale model safety evaluation project consisting of an evaluation toolkit, DeepSafe, and a diagnosis toolkit, DeepScan. By unifying task and data protocols, we build a connection between the two stages and transform safety evaluation from black-box to white-box insight. In addition, DeepSight is the first open-source toolkit that supports frontier AI risk evaluation and joint safety evaluation and diagnosis.
摘要：随着大型模型（LM）的发展迅速，其安全性也是一个优先考虑的问题。在当前的大语言模型 (LLM) 和多模态大语言模型 (MLLM) 安全工作流程中，评估、诊断和对齐通常由单独的工具处理。具体来说，安全评价只能定位外部行为风险，而无法找出内部根源。与此同时，安全诊断往往偏离具体的风险场景，停留在可解释的水平。这样，安全调整缺乏对内部机制变化的专门解释，可能会降低总体能力。为了系统地解决这些问题，我们提出了一个开源项目，即 DeepSight，来实践一种新的安全评估诊断集成范式。 DeepSight是低成本、可重复、高效、高可扩展的大规模模型安全评估项目，由评估工具包DeepSafe和诊断工具包DeepScan组成。通过统一任务和数据协议，我们在两个阶段之间建立了联系，并将安全评估从黑盒转变为白盒洞察。此外，DeepSight还是首个支持前沿人工智能风险评估和联合安全评估诊断的开源工具包。

Title: P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling

Authors: Pinyi Zhang, Ting-En Lin, Yuchuan Wu, Jingyang Chen, Zongqi Wang, Hua Yang, Ze Xu, Fei Huang, Kai Zhang, Yongbin Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12116
Pdf URL: https://arxiv.org/pdf/2602.12116
Copy Paste: [[2602.12116]] P-GenRM: Personalized Generative Reward Model with Test-time User-based Scaling(https://arxiv.org/abs/2602.12116)
Keywords: language model
Abstract: Personalized alignment of large language models seeks to adapt responses to individual user preferences, typically via reinforcement learning. A key challenge is obtaining accurate, user-specific reward signals in open-ended scenarios. Existing personalized reward models face two persistent limitations: (1) oversimplifying diverse, scenario-specific preferences into a small, fixed set of evaluation principles, and (2) struggling with generalization to new users with limited feedback. To this end, we propose P-GenRM, the first Personalized Generative Reward Model with test-time user-based scaling. P-GenRM transforms preference signals into structured evaluation chains that derive adaptive personas and scoring rubrics across various scenarios. It further clusters users into User Prototypes and introduces a dual-granularity scaling mechanism: at the individual level, it adaptively scales and aggregates each user's scoring scheme; at the prototype level, it incorporates preferences from similar users. This design mitigates noise in inferred preferences and enhances generalization to unseen users through prototype-based transfer. Empirical results show that P-GenRM achieves state-of-the-art results on widely-used personalized reward model benchmarks, with an average improvement of 2.31%, and demonstrates strong generalization on an out-of-distribution dataset. Notably, Test-time User-based scaling provides an additional 3% boost, demonstrating stronger personalized alignment with test-time scalability.
摘要：大型语言模型的个性化调整通常通过强化学习来调整响应以适应个人用户的偏好。一个关键的挑战是在开放式场景中获得准确的、特定于用户的奖励信号。现有的个性化奖励模型面临两个持续的局限性：（1）将不同的、特定于场景的偏好过度简化为一组小的、固定的评估原则，以及（2）难以推广到反馈有限的新用户。为此，我们提出了 P-GenRM，这是第一个具有测试时基于用户扩展的个性化生成奖励模型。 P-GenRM 将偏好信号转换为结构化评估链，从而在各种场景中派生出自适应角色和评分标准。它将用户进一步聚类为用户原型，并引入双粒度缩放机制：在个人层面，它自适应地缩放和聚合每个用户的评分方案；在原型层面，它融合了类似用户的偏好。这种设计减轻了推断偏好中的噪声，并通过基于原型的传输增强了对未见过的用户的泛化。实证结果表明，P-GenRM 在广泛使用的个性化奖励模型基准上取得了最先进的结果，平均提高了 2.31%，并且在分布外数据集上表现出了很强的泛化能力。值得注意的是，测试时基于用户的扩展提供了额外 3% 的提升，展示了与测试时可扩展性更强的个性化一致性。

Title: A Rule-based Computational Model for Gaidhlig Morphology

Authors: Peter J Barclay
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12132
Pdf URL: https://arxiv.org/pdf/2602.12132
Copy Paste: [[2602.12132]] A Rule-based Computational Model for Gaidhlig Morphology(https://arxiv.org/abs/2602.12132)
Keywords: language model
Abstract: Language models and software tools are essential to support the continuing vitality of lesser-used languages; however, currently popular neural models require considerable data for training, which normally is not available for such low-resource languages. This paper describes work-in-progress to construct a rule-based model of Gaidhlig morphology using data from Wiktionary, arguing that rule-based systems effectively leverage limited sample data, support greater interpretability, and provide insights useful in the design of teaching materials. The use of SQL for querying the occurrence of different lexical patterns is investigated, and a declarative rule-base is presented that allows Python utilities to derive inflected forms of Gaidhlig words. This functionality could be used to support educational tools that teach or explain language patterns, for example, or to support higher level tools such as rule-based dependency parsers. This approach adds value to the data already present in Wiktionary by adapting it to new use-cases.
摘要：语言模型和软件工具对于支持较少使用的语言的持续活力至关重要；然而，当前流行的神经模型需要大量数据进行训练，这通常不适用于这种资源匮乏的语言。本文描述了使用维基词典中的数据构建基于规则的 Gaidhlig 形态学模型的正在进行的工作，认为基于规则的系统有效地利用有限的样本数据，支持更大的可解释性，并提供对教材设计有用的见解。研究了使用 SQL 来查询不同词汇模式的出现情况，并提出了一个声明性规则库，允许 Python 实用程序派生 Gaidhlig 单词的变形形式。例如，此功能可用于支持教授或解释语言模式的教育工具，或支持更高级别的工具，例如基于规则的依存解析器。这种方法通过使维基词典中已有的数据适应新的用例来增加其价值。

Title: WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models

Authors: Yangzhuo Li, Shengpeng Ji, Yifu Chen, Tianle Liang, Haorong Ying, Yule Wang, Junbo Li, Jun Fang, Zhou Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12135
Pdf URL: https://arxiv.org/pdf/2602.12135
Copy Paste: [[2602.12135]] WavBench: Benchmarking Reasoning, Colloquialism, and Paralinguistics for End-to-End Spoken Dialogue Models(https://arxiv.org/abs/2602.12135)
Keywords: agent
Abstract: With the rapid integration of advanced reasoning capabilities into spoken dialogue models, the field urgently demands benchmarks that transcend simple interactions to address real-world complexity. However, current evaluations predominantly adhere to text-generation standards, overlooking the unique audio-centric characteristics of paralinguistics and colloquialisms, alongside the cognitive depth required by modern agents. To bridge this gap, we introduce WavBench, a comprehensive benchmark designed to evaluate realistic conversational abilities where prior works fall short. Uniquely, WavBench establishes a tripartite framework: 1) Pro subset, designed to rigorously challenge reasoning-enhanced models with significantly increased difficulty; 2) Basic subset, defining a novel standard for spoken colloquialism that prioritizes "listenability" through natural vocabulary, linguistic fluency, and interactive rapport, rather than rigid written accuracy; and 3) Acoustic subset, covering explicit understanding, generation, and implicit dialogue to rigorously evaluate comprehensive paralinguistic capabilities within authentic real-world scenarios. Through evaluating five state-of-the-art models, WavBench offers critical insights into the intersection of complex problem-solving, colloquial delivery, and paralinguistic fidelity, guiding the evolution of robust spoken dialogue models. The benchmark dataset and evaluation toolkit are available at this https URL.
摘要：随着高级推理能力快速集成到语音对话模型中，该领域迫切需要超越简单交互的基准来解决现实世界的复杂性。然而，当前的评估主要遵循文本生成标准，忽视了副语言和口语的独特的以音频为中心的特征，以及现代代理所需的认知深度。为了弥补这一差距，我们引入了 WavBench，这是一个综合基准测试，旨在评估先前作品不足的实际对话能力。 WavBench 独特地建立了三方框架：1）Pro 子集，旨在严格挑战难度显着增加的推理增强模型； 2）基本子集，定义口语口语的新颖标准，通过自然词汇、语言流畅性和互动融洽来优先考虑“可听性”，而不是严格的书面准确性； 3）声学子集，涵盖显式理解、生成和隐式对话，以严格评估真实场景中的综合副语言能力。通过评估五种最先进的模型，WavBench 提供了对复杂问题解决、口语表达和副语言保真度交叉点的重要见解，指导稳健口语对话模型的发展。基准数据集和评估工具包可从此 https URL 获取。

Title: dVoting: Fast Voting for dLLMs

Authors: Sicheng Feng, Zigeng Chen, Xinyin Ma, Gongfan Fang, Xinchao Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.12153
Pdf URL: https://arxiv.org/pdf/2602.12153
Copy Paste: [[2602.12153]] dVoting: Fast Voting for dLLMs(https://arxiv.org/abs/2602.12153)
Keywords: language model, llm, prompt
Abstract: Diffusion Large Language Models (dLLMs) represent a new paradigm beyond autoregressive modeling, offering competitive performance while naturally enabling a flexible decoding process. Specifically, dLLMs can generate tokens at arbitrary positions in parallel, endowing them with significant potential for parallel test-time scaling, which was previously constrained by severe inefficiency in autoregressive modeling. In this work, we introduce dVoting, a fast voting technique that boosts reasoning capability without training, with only an acceptable extra computational overhead. dVoting is motivated by the observation that, across multiple samples for the same prompt, token predictions remain largely consistent, whereas performance is determined by a small subset of tokens exhibiting cross-sample variability. Leveraging the arbitrary-position generation capability of dLLMs, dVoting performs iterative refinement by sampling, identifying uncertain tokens via consistency analysis, regenerating them through voting, and repeating this process until convergence. Extensive evaluations demonstrate that dVoting consistently improves performance across various benchmarks. It achieves gains of 6.22%-7.66% on GSM8K, 4.40%-7.20% on MATH500, 3.16%-14.84% on ARC-C, and 4.83%-5.74% on MMLU. Our code is available at this https URL.
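The sample, check-consistency, vote, and repeat loop described in the dVoting abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the agreement threshold, and the `regenerate` callable (a hypothetical stand-in for a dLLM re-decoding only the flagged positions) are assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, List, Sequence

Tokens = List[str]


def position_agreement(samples: Sequence[Sequence[str]]) -> List[float]:
    """Fraction of samples that agree with the most common token at each position."""
    n = len(samples)
    return [Counter(col).most_common(1)[0][1] / n for col in zip(*samples)]


def uncertain_positions(samples: Sequence[Sequence[str]], threshold: float = 1.0) -> List[int]:
    """Positions whose agreement falls below the (assumed) threshold."""
    return [i for i, a in enumerate(position_agreement(samples)) if a < threshold]


def majority_vote(samples: Sequence[Sequence[str]]) -> Tokens:
    """Most common token at each position across all samples."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*samples)]


def refine(
    samples: Sequence[Sequence[str]],
    regenerate: Callable[[Tokens, List[int]], Tokens],
    max_rounds: int = 5,
    threshold: float = 1.0,
) -> Tokens:
    """Repeat vote-and-regenerate until every position is consistent or rounds run out."""
    current = [list(s) for s in samples]
    for _ in range(max_rounds):
        positions = uncertain_positions(current, threshold)
        if not positions:
            break
        voted = majority_vote(current)
        # Hypothetical: a real dLLM would re-decode only `positions`, keeping the rest fixed.
        current = [regenerate(voted, positions) for _ in current]
    return majority_vote(current)
```

In this toy version, three samples that differ only in the final token would flag just that position as uncertain, which mirrors the abstract's observation that most token predictions agree across samples.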

Title: Query-focused and Memory-aware Reranker for Long Context Processing

Authors: Yuqing Li, Jiangnan Li, Mo Yu, Guoxuan Ding, Zheng Lin, Weiping Wang, Jie Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12192
Pdf URL: https://arxiv.org/pdf/2602.12192
Copy Paste: [[2602.12192]] Query-focused and Memory-aware Reranker for Long Context Processing(https://arxiv.org/abs/2602.12192)
Keywords: language model, long context
Abstract: Built upon the existing analysis of retrieval heads in large language models, we propose an alternative reranking framework that trains models to estimate passage-query relevance using the attention scores of selected heads. This approach provides a listwise solution that leverages holistic information within the entire candidate shortlist during ranking. At the same time, it naturally produces continuous relevance scores, enabling training on arbitrary retrieval datasets without requiring Likert-scale supervision. Our framework is lightweight and effective, requiring only small-scale models (e.g., 4B parameters) to achieve strong performance. Extensive experiments demonstrate that our method outperforms existing state-of-the-art pointwise and listwise rerankers across multiple domains, including Wikipedia and long narrative datasets. It further establishes a new state-of-the-art on the LoCoMo benchmark that assesses the capabilities of dialogue understanding and memory usage. We further demonstrate that our framework supports flexible extensions. For example, augmenting candidate passages with contextual information further improves ranking accuracy, while training attention heads from middle layers enhances efficiency without sacrificing performance.

Title: Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education

Authors: Mohamed Huti, Alasdair Mackintosh, Amy Waldock, Dominic Andrews, Maxime Lelièvre, Moritz Boos, Tobias Murray, Paul Atherton, Robin A. A. Ince, Oliver G. B. Garrod
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.12196
Pdf URL: https://arxiv.org/pdf/2602.12196
Copy Paste: [[2602.12196]] Visual Reasoning Benchmark: Evaluating Multimodal LLMs on Classroom-Authentic Visual Problems from Primary Education(https://arxiv.org/abs/2602.12196)
Keywords: language model, llm
Abstract: AI models have achieved state-of-the-art results in textual reasoning; however, their ability to reason over spatial and relational structures remains a critical bottleneck, particularly in early-grade maths, which relies heavily on visuals. This paper introduces the visual reasoning benchmark (VRB), a novel dataset designed to evaluate Multimodal Large Language Models (MLLMs) on their ability to solve authentic visual problems from classrooms. This benchmark is built on a set of 701 questions sourced from primary school examinations in Zambia and India, which cover a range of tasks such as reasoning by analogy, pattern completion, and spatial matching. We outline the methodology and development of the benchmark which intentionally uses unedited, minimal-text images to test if models can meet realistic needs of primary education. Our findings reveal a "jagged frontier" of capability where models demonstrate better proficiency in static skills such as counting and scaling, but reach a distinct "spatial ceiling" when faced with dynamic operations like folding, reflection, and rotation. These weaknesses pose a risk for classroom use on visual reasoning problems, with the potential for incorrect marking, false scaffolding, and reinforcing student misconceptions. Consequently, education-focused benchmarks like the VRB are essential for determining the functional boundaries of multimodal tools used in classrooms.

Title: ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

Authors: Mathieu Sibue, Andres Muñoz Garza, Samuel Mensah, Pranav Shetty, Zhiqiang Ma, Xiaomo Liu, Manuela Veloso
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12203
Pdf URL: https://arxiv.org/pdf/2602.12203
Copy Paste: [[2602.12203]] ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images(https://arxiv.org/abs/2602.12203)
Keywords: language model
Abstract: Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.

Title: Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation

Authors: Julia Belikova, Danila Rozhevskii, Dennis Svirin, Konstantin Polev, Alexander Panchenko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12235
Pdf URL: https://arxiv.org/pdf/2602.12235
Copy Paste: [[2602.12235]] Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation(https://arxiv.org/abs/2602.12235)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Efficient long-context processing remains a crucial challenge for contemporary large language models (LLMs), especially in resource-constrained environments. Soft compression architectures promise to extend effective context length by replacing long token sequences with smaller sets of learned compressed tokens. Yet the limits of compressibility, and the point at which compression begins to erase task-relevant content, remain underexplored. In this paper, we define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query, and propose a methodology to characterize and detect it. In the xRAG soft-compression setting, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations, providing a practical tool for identifying compressed tokens but showing limited overflow detection capability. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average on HotpotQA, SQuADv2, and TriviaQA datasets, demonstrating that incorporating query information improves detection performance. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
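To make the reported 0.72 AUC-ROC easier to interpret: AUC-ROC equals the probability that a randomly chosen positive example (here, an overflow case) receives a higher detector score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by direct pairwise comparison. The scores and labels are made up for illustration and have nothing to do with the paper's data or probing classifiers.

```python
from typing import Sequence


def auc_roc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC-ROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for positive (e.g. overflow), 0 for negative. Ties count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical detector scores: three of the four positive/negative pairs are ordered correctly.
example = auc_roc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, so 0.72 indicates a detector that is informative but far from perfect. The quadratic pairwise loop is chosen for clarity; rank-based formulas give the same value more efficiently.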

Title: T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization

Authors: Tunyu Zhang, Xinxi Zhang, Ligong Han, Haizhou Shi, Xiaoxiao He, Zhuowei Li, Hao Wang, Kai Xu, Akash Srivastava, Hao Wang, Vladimir Pavlovic, Dimitris N. Metaxas
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.12262
Pdf URL: https://arxiv.org/pdf/2602.12262
Copy Paste: [[2602.12262]] T3D: Few-Step Diffusion Language Models via Trajectory Self-Distillation with Direct Discriminative Optimization(https://arxiv.org/abs/2602.12262)
Keywords: language model, llm
Abstract: Diffusion large language models (DLLMs) have the potential to enable fast text generation by decoding multiple tokens in parallel. However, in practice, their inference efficiency is constrained by the need for many refinement steps, while aggressively reducing the number of steps leads to a substantial degradation in generation quality. To alleviate this, we propose a trajectory self-distillation framework that improves few-step decoding by distilling the model's own generative trajectories. We incorporate Direct Discriminative Optimization (DDO), a reverse-KL objective that promotes mode-seeking distillation and encourages the student to concentrate on high-probability teacher modes. Across benchmarks, our approach consistently outperforms strong few-step baselines and standard training under tight step budgets. Although full-step decoding remains superior, we substantially narrow the gap, establishing a strong foundation towards practical few-step DLLMs. The source code is available at this https URL.
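The abstract describes the reverse-KL objective as mode-seeking. A small numerical illustration of that general property follows. It is not the DDO objective itself; the three-outcome teacher distribution and the two candidate students are invented for this example.

```python
import math
from typing import Sequence


def kl(a: Sequence[float], b: Sequence[float]) -> float:
    """KL(a || b) for discrete distributions; b must be positive wherever a is."""
    return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)


# Hypothetical teacher with two strong modes separated by an unlikely outcome.
teacher = [0.49, 0.02, 0.49]
# Student A commits to a single teacher mode; student B spreads mass uniformly.
student_one_mode = [0.96, 0.02, 0.02]
student_spread = [1 / 3, 1 / 3, 1 / 3]

# Reverse KL, KL(student || teacher): penalizes student mass where the teacher has little.
reverse_one_mode = kl(student_one_mode, teacher)
reverse_spread = kl(student_spread, teacher)

# Forward KL, KL(teacher || student): penalizes missing any region the teacher covers.
forward_one_mode = kl(teacher, student_one_mode)
forward_spread = kl(teacher, student_spread)
```

With these numbers, reverse KL scores the single-mode student better than the spread-out one, while forward KL prefers the opposite. This matches the abstract's description of a student that concentrates on high-probability teacher modes.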

Title: On-Policy Context Distillation for Language Models

Authors: Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12275
Pdf URL: https://arxiv.org/pdf/2602.12275
Copy Paste: [[2602.12275]] On-Policy Context Distillation for Language Models(https://arxiv.org/abs/2602.12275)
Keywords: language model, prompt
Abstract: Context distillation enables language models to internalize in-context knowledge into their parameters. In our work, we propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation by training a student model on its own generated trajectories while minimizing reverse Kullback-Leibler divergence against a context-conditioned teacher. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation, where models extract and consolidate transferable knowledge from their historical solution traces, and system prompt distillation, where models internalize beneficial behaviors encoded in optimized prompts. Across mathematical reasoning, text-based games, and domain-specific tasks, OPCD consistently outperforms baseline methods, achieving higher task accuracy while better preserving out-of-distribution capabilities. We further show that OPCD enables effective cross-size distillation, where smaller student models can internalize experiential knowledge from larger teachers.
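The "on-policy" part of the objective can be read as estimating the reverse KL divergence from samples drawn from the student itself, since KL(student || teacher) is an expectation under the student. The sketch below shows that estimator on small categorical distributions, which stand in for a student model and a context-conditioned teacher. The distributions, function names, and sample count are assumptions for illustration, not details from the paper.

```python
import math
import random
from typing import Sequence


def exact_reverse_kl(student: Sequence[float], teacher: Sequence[float]) -> float:
    """KL(student || teacher), computed in closed form."""
    return sum(q * math.log(q / p) for q, p in zip(student, teacher) if q > 0)


def on_policy_reverse_kl(
    student: Sequence[float],
    teacher: Sequence[float],
    num_samples: int,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate: average of log q(x) - log p(x) over x sampled from the student."""
    rng = random.Random(seed)
    outcomes = rng.choices(range(len(student)), weights=student, k=num_samples)
    total = sum(math.log(student[i]) - math.log(teacher[i]) for i in outcomes)
    return total / num_samples


# Hypothetical distributions over three outcomes.
student = [0.7, 0.2, 0.1]
teacher = [0.5, 0.3, 0.2]

exact = exact_reverse_kl(student, teacher)
estimate = on_policy_reverse_kl(student, teacher, num_samples=20000)
```

Because the samples come from the student, the estimate only needs teacher probabilities for outputs the student actually produces. This is one reason on-policy training is paired naturally with reverse KL rather than forward KL.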