2026-02-20

Title: References Improve LLM Alignment in Non-Verifiable Domains

Authors: Kejian Shi, Yixin Liu, Peifeng Wang, Alexander R. Fabbri, Shafiq Joty, Arman Cohan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.16802
Pdf URL: https://arxiv.org/pdf/2602.16802
Copy Paste: [[2602.16802]] References Improve LLM Alignment in Non-Verifiable Domains(https://arxiv.org/abs/2602.16802)
Keywords: llm
Abstract: While Reinforcement Learning with Verifiable Rewards (RLVR) has shown strong effectiveness in reasoning tasks, it cannot be directly applied to non-verifiable domains lacking ground-truth verifiers, such as LLM alignment. In this work, we investigate whether reference-guided LLM-evaluators can bridge this gap by serving as soft "verifiers". First, we design evaluation protocols that enhance LLM-based evaluators for LLM alignment using reference outputs. Through comprehensive experiments, we show that a reference-guided approach substantially improves the accuracy of less capable LLM-judges using references from frontier models; stronger LLM-judges can also be enhanced by high-quality (i.e., human-written) references. Building on these improved judges, we demonstrate the utility of high-quality references in alignment tuning, where LLMs guided with references are used as judges to self-improve. We show that reference-guided self-improvement yields clear gains over both direct SFT on reference outputs and self-improvement with reference-free judges, achieving performance comparable to training with ArmoRM, a strong finetuned reward model. Specifically, our method achieves 73.1% and 58.7% on AlpacaEval and Arena-Hard with Llama-3-8B-Instruct, and 70.0% and 74.1% with Qwen2.5-7B, corresponding to average absolute gains of +20.2 / +17.1 points over SFT distillation and +5.3 / +3.6 points over reference-free self-improvement on AlpacaEval / Arena-Hard. These results highlight the potential of using reference-guided LLM-evaluators to enable effective LLM post-training in non-verifiable domains.

Title: Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark

Authors: Charalampos Mastrokostas, Nikolaos Giarelis, Nikos Karacapilidis
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.16811
Pdf URL: https://arxiv.org/pdf/2602.16811
Copy Paste: [[2602.16811]] Evaluating Monolingual and Multilingual Large Language Models for Greek Question Answering: The DemosQA Benchmark(https://arxiv.org/abs/2602.16811)
Keywords: language model, llm, prompt
Abstract: Recent advancements in Natural Language Processing and Deep Learning have enabled the development of Large Language Models (LLMs), which have significantly advanced the state-of-the-art across a wide range of tasks, including Question Answering (QA). Despite these advancements, research on LLMs has primarily targeted high-resourced languages (e.g., English), and only recently has attention shifted toward multilingual models. However, these models demonstrate a training data bias towards a small number of popular languages or rely on transfer learning from high- to under-resourced languages; this may lead to a misrepresentation of social, cultural, and historical aspects. To address this challenge, monolingual LLMs have been developed for under-resourced languages; however, their effectiveness remains less studied when compared to multilingual counterparts on language-specific tasks. In this study, we address this research gap in Greek QA by contributing: (i) DemosQA, a novel dataset, which is constructed using social media user questions and community-reviewed answers to better capture the Greek social and cultural zeitgeist; (ii) a memory-efficient LLM evaluation framework adaptable to diverse QA datasets and languages; and (iii) an extensive evaluation of 11 monolingual and multilingual LLMs on 6 human-curated Greek QA datasets using 3 different prompting strategies. We release our code and data to facilitate reproducibility.

Title: One-step Language Modeling via Continuous Denoising

Authors: Chanhyuk Lee, Jaehoon Yoo, Manan Agarwal, Sheel Shah, Jerry Huang, Aditi Raghunathan, Seunghoon Hong, Nicholas M. Boffi, Jinwoo Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.16813
Pdf URL: https://arxiv.org/pdf/2602.16813
Copy Paste: [[2602.16813]] One-step Language Modeling via Continuous Denoising(https://arxiv.org/abs/2602.16813)
Keywords: language model
Abstract: Language models based on discrete diffusion have attracted widespread interest for their potential to provide faster generation than autoregressive models. In practice, however, they exhibit a sharp degradation of sample quality in the few-step regime, failing to realize this promise. Here we show that language models leveraging flow-based continuous denoising can outperform discrete diffusion in both quality and speed. By revisiting the fundamentals of flows over discrete modalities, we build a flow-based language model (FLM) that performs Euclidean denoising over one-hot token encodings. We show that the model can be trained by predicting the clean data via a cross entropy objective, where we introduce a simple time reparameterization that greatly improves training stability and generation quality. By distilling FLM into its associated flow map, we obtain a distilled flow map language model (FMLM) capable of few-step generation. On the LM1B and OWT language datasets, FLM attains generation quality matching state-of-the-art discrete diffusion models. With FMLM, our approach outperforms recent few-step language models across the board, with one-step generation exceeding their 8-step quality. Our work calls into question the widely held hypothesis that discrete diffusion processes are necessary for generative modeling over discrete modalities, and paves the way toward accelerated flow-based language modeling at scale. Code is available at this https URL.

Title: Claim Automation using Large Language Model

Authors: Zhengda Mo, Zhiyu Quan, Eli O'Donohue, Kaiwen Zhong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.16836
Pdf URL: https://arxiv.org/pdf/2602.16836
Copy Paste: [[2602.16836]] Claim Automation using Large Language Model(https://arxiv.org/abs/2602.16836)
Keywords: language model, llm, prompt
Abstract: While Large Language Models (LLMs) have achieved strong performance on general-purpose language tasks, their deployment in regulated and data-sensitive domains, including insurance, remains limited. Leveraging millions of historical warranty claims, we propose a locally deployed governance-aware language modeling component that generates structured corrective-action recommendations from unstructured claim narratives. We fine-tune pretrained LLMs using Low-Rank Adaptation (LoRA), scoping the model to an initial decision module within the claim processing pipeline to speed up claim adjusters' decisions. We assess this module using a multi-dimensional evaluation framework that combines automated semantic similarity metrics with human evaluation, enabling a rigorous examination of both practical utility and predictive accuracy. Our results show that domain-specific fine-tuning substantially outperforms commercial general-purpose and prompt-based LLMs, with approximately 80% of the evaluated cases achieving near-identical matches to ground-truth corrective actions. Overall, this study provides both theoretical and empirical evidence to prove that domain-adaptive fine-tuning can align model output distributions more closely with real-world operational data, demonstrating its promise as a reliable and governable building block for insurance applications.

Title: BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization

Authors: Ahmed Rafid, Rumman Adib, Fariya Ahmed, Ajwad Abrar, Mohammed Saidul Islam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.16843
Pdf URL: https://arxiv.org/pdf/2602.16843
Copy Paste: [[2602.16843]] BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization(https://arxiv.org/abs/2602.16843)
Keywords: language model
Abstract: Evaluating factual consistency is essential for reliable text summarization, particularly in high-stakes domains such as healthcare and news. However, most existing evaluation metrics overlook Bangla, a widely spoken yet under-resourced language, and often depend on reference summaries. We introduce BanglaSummEval, a reference-free, question-answering-based framework for evaluating factual consistency in Bangla summarization. The proposed method assesses both factual accuracy and content coverage through automatically generated questions and answers derived from the source document and the summary. A single multilingual instruction-tuned language model handles question generation, question answering, candidate answer extraction, and question importance weighting. This unified design reduces system complexity and computational cost. To capture semantic consistency beyond surface-level overlap, we use BERTScore-Recall for answer comparison. We validate BanglaSummEval on 300 human-written summaries from educational and medical domains, demonstrating strong correlation with expert human judgments (Pearson's $r = 0.694$, Spearman's $\rho = 0.763$). By providing interpretable, step-wise diagnostics alongside reliable evaluation scores, BanglaSummEval offers a practical and transparent solution for factual consistency evaluation in low-resource language settings.
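The agreement figures quoted above (Pearson's r and Spearman's rho) are standard correlation statistics between metric scores and human ratings. The following is a minimal sketch of how such correlations can be computed, not the authors' evaluation code; the function names and the toy score lists are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Hypothetical toy data: automatic metric scores vs. human ratings.
metric_scores = [0.2, 0.4, 0.5, 0.9]
human_ratings = [1, 2, 4, 5]
print(pearson(metric_scores, human_ratings))
print(spearman(metric_scores, human_ratings))
```

Spearman's rho depends only on the ordering of the scores, so it can exceed Pearson's r when the metric ranks summaries correctly but is not linearly related to the human scale.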

Title: Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect

Authors: Minh Duc Bui, Manuel Mager, Peter Herbert Kann, Katharina von der Wense
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.16852
Pdf URL: https://arxiv.org/pdf/2602.16852
Copy Paste: [[2602.16852]] Meenz bleibt Meenz, but Large Language Models Do Not Speak Its Dialect(https://arxiv.org/abs/2602.16852)
Keywords: language model, llm
Abstract: Meenzerisch, the dialect spoken in the German city of Mainz, is also the traditional language of the Mainz carnival, a yearly celebration well known throughout Germany. However, Meenzerisch is on the verge of dying out, a fate it shares with many other German dialects. Natural language processing (NLP) has the potential to help with the preservation and revival efforts of languages and dialects. However, so far no NLP research has looked at Meenzerisch. This work presents the first research in the field of NLP that is explicitly focused on the dialect of Mainz. We introduce a digital dictionary, an NLP-ready dataset derived from an existing resource (Schramm, 1966), to support researchers in modeling and benchmarking the language. It contains 2,351 words in the dialect paired with their meanings described in Standard German. We then use this dataset to answer the following research questions: (1) Can state-of-the-art large language models (LLMs) generate definitions for dialect words? (2) Can LLMs generate words in Meenzerisch, given their definitions? Our experiments show that LLMs can do neither: the best model for definitions reaches only 6.27% accuracy and the best word generation model's accuracy is 1.51%. We then conduct two additional experiments in order to see if accuracy is improved by few-shot learning and by extracting rules from the training set, which are then passed to the LLM. While those approaches are able to improve the results, accuracy remains below 10%. This highlights that additional resources and an intensification of research efforts focused on German dialects are desperately needed.

Title: ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders

Authors: Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz, Jihwan Jeong, Amir Globerson, Craig Boutilier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.16938
Pdf URL: https://arxiv.org/pdf/2602.16938
Copy Paste: [[2602.16938]] ConvApparel: A Benchmark Dataset and Validation Framework for User Simulators in Conversational Recommenders(https://arxiv.org/abs/2602.16938)
Keywords: llm, prompt, agent
Abstract: The promise of LLM-based user simulators to improve conversational AI is hindered by a critical "realism gap," leading to systems that are optimized for simulated interactions, but may fail to perform well in the real world. We introduce ConvApparel, a new dataset of human-AI conversations designed to address this gap. Its unique dual-agent data collection protocol -- using both "good" and "bad" recommenders -- enables counterfactual validation by capturing a wide spectrum of user experiences, enriched with first-person annotations of user satisfaction. We propose a comprehensive validation framework that combines statistical alignment, a human-likeness score, and counterfactual validation to test for generalization. Our experiments reveal a significant realism gap across all simulators. However, the framework also shows that data-driven simulators outperform a prompted baseline, particularly in counterfactual validation where they adapt more realistically to unseen behaviors, suggesting they embody more robust, if imperfect, user models.

Title: Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History

Authors: Serin Kim, Sangam Lee, Dongha Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.17003
Pdf URL: https://arxiv.org/pdf/2602.17003
Copy Paste: [[2602.17003]] Persona2Web: Benchmarking Personalized Web Agents for Contextual Reasoning with User History(https://arxiv.org/abs/2602.17003)
Keywords: language model, agent
Abstract: Large language models have advanced web agents, yet current agents lack personalization capabilities. Since users rarely specify every detail of their intent, practical web agents must be able to interpret ambiguous queries by inferring user preferences and contexts. To address this challenge, we present Persona2Web, the first benchmark for evaluating personalized web agents on the real open web, built upon the clarify-to-personalize principle, which requires agents to resolve ambiguity based on user history rather than relying on explicit instructions. Persona2Web consists of: (1) user histories that reveal preferences implicitly over long time spans, (2) ambiguous queries that require agents to infer implicit user preferences, and (3) a reasoning-aware evaluation framework that enables fine-grained assessment of personalization. We conduct extensive experiments across various agent architectures, backbone models, history access schemes, and queries with varying ambiguity levels, revealing key challenges in personalized web agent behavior. For reproducibility, our codes and datasets are publicly available at this https URL.

Title: ReIn: Conversational Error Recovery with Reasoning Inception

Authors: Takyoung Kim, Jinseok Nam, Chandrayee Basu, Xing Fan, Chengyuan Ma, Heng Ji, Gokhan Tur, Dilek Hakkani-Tür
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.17022
Pdf URL: https://arxiv.org/pdf/2602.17022
Copy Paste: [[2602.17022]] ReIn: Conversational Error Recovery with Reasoning Inception(https://arxiv.org/abs/2602.17022)
Keywords: language model, llm, prompt, agent
Abstract: Conversational agents powered by large language models (LLMs) with tool integration achieve strong performance on fixed task-oriented dialogue datasets but remain vulnerable to unanticipated, user-induced errors. Rather than focusing on error prevention, this work focuses on error recovery, which necessitates the accurate diagnosis of erroneous dialogue contexts and execution of proper recovery plans. Under realistic constraints precluding model fine-tuning or prompt modification due to significant cost and time requirements, we explore whether agents can recover from contextually flawed interactions and how their behavior can be adapted without altering model parameters and prompts. To this end, we propose Reasoning Inception (ReIn), a test-time intervention method that plants an initial reasoning into the agent's decision-making process. Specifically, an external inception module identifies predefined errors within the dialogue context and generates recovery plans, which are subsequently integrated into the agent's internal reasoning process to guide corrective actions, without modifying its parameters or system prompts. We evaluate ReIn by systematically simulating conversational failure scenarios that directly hinder successful completion of user goals: user's ambiguous and unsupported requests. Across diverse combinations of agent models and inception modules, ReIn substantially improves task success and generalizes to unseen error types. Moreover, it consistently outperforms explicit prompt-modification approaches, underscoring its utility as an efficient, on-the-fly method. In-depth analysis of its operational mechanism, particularly in relation to instruction hierarchy, indicates that jointly defining recovery tools with ReIn can serve as a safe and effective strategy for improving the resilience of conversational agents without modifying the backbone models or system prompts.

Title: Large Language Models Persuade Without Planning Theory of Mind

Authors: Jared Moore, Rasmus Overmark, Ned Cooper, Beba Cibralic, Nick Haber, Cameron R. Jones
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17045
Pdf URL: https://arxiv.org/pdf/2602.17045
Copy Paste: [[2602.17045]] Large Language Models Persuade Without Planning Theory of Mind(https://arxiv.org/abs/2602.17045)
Keywords: language model, llm, agent
Abstract: A growing body of work attempts to evaluate the theory of mind (ToM) abilities of humans and large language models (LLMs) using static, non-interactive question-and-answer benchmarks. However, theoretical work in the field suggests that first-personal interaction is a crucial part of ToM and that such predictive, spectatorial tasks may fail to evaluate it. We address this gap with a novel ToM task that requires an agent to persuade a target to choose one of three policy proposals by strategically revealing information. Success depends on a persuader's sensitivity to a given target's knowledge states (what the target knows about the policies) and motivational states (how much the target values different outcomes). We varied whether these states were Revealed to persuaders or Hidden, in which case persuaders had to inquire about or infer them. In Experiment 1, participants persuaded a bot programmed to make only rational inferences. LLMs excelled in the Revealed condition but performed below chance in the Hidden condition, suggesting difficulty with the multi-step planning required to elicit and use mental state information. Humans performed moderately well in both conditions, indicating an ability to engage such planning. In Experiment 2, where a human target role-played the bot, and in Experiment 3, where we measured whether human targets' real beliefs changed, LLMs outperformed human persuaders across all conditions. These results suggest that effective persuasion can occur without explicit ToM reasoning (e.g., through rhetorical strategies) and that LLMs excel at this form of persuasion. Overall, our results caution against attributing human-like ToM to LLMs while highlighting LLMs' potential to influence people's beliefs and behavior.

Title: BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios

Authors: Yunseung Lee, Subin Kim, Youngjun Kwak, Jaegul Choo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17072
Pdf URL: https://arxiv.org/pdf/2602.17072
Copy Paste: [[2602.17072]] BankMathBench: A Benchmark for Numerical Reasoning in Banking Scenarios(https://arxiv.org/abs/2602.17072)
Keywords: language model, llm, chat
Abstract: Large language model (LLM)-based chatbots are increasingly being adopted in the financial domain, particularly in digital banking, to handle customer inquiries about products such as deposits, savings, and loans. However, these models still exhibit low accuracy in core banking computations, including total payout estimation, comparison of products with varying interest rates, and interest calculation under early repayment conditions. Such tasks require multi-step numerical reasoning and contextual understanding of banking products, yet existing LLMs often make systematic errors: misinterpreting product types, applying conditions incorrectly, or failing basic calculations involving exponents and geometric progressions. Such errors have rarely been captured by existing benchmarks. Mathematical datasets focus on fundamental math problems, whereas financial benchmarks primarily target financial documents, leaving everyday banking scenarios underexplored. To address this limitation, we propose BankMathBench, a domain-specific dataset that reflects realistic banking tasks. BankMathBench is organized in three levels of difficulty (basic, intermediate, and advanced), corresponding to single-product reasoning, multi-product comparison, and multi-condition scenarios, respectively. When trained on BankMathBench, open-source LLMs exhibited notable improvements in both formula generation and numerical reasoning accuracy, demonstrating the dataset's effectiveness in enhancing domain-specific reasoning. With tool-augmented fine-tuning, the models achieved average accuracy increases of 57.6%p (basic), 75.1%p (intermediate), and 62.9%p (advanced), representing significant gains over zero-shot baselines. These findings highlight BankMathBench as a reliable benchmark for evaluating and advancing LLMs' numerical reasoning in real-world banking scenarios.
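To make concrete the kind of calculation the abstract refers to (total payout estimation involving exponents and geometric progressions), here is a minimal sketch. It is not taken from BankMathBench; the product terms, function names, and the assumptions of monthly compounding, deposits made at the start of each month, and no tax are all hypothetical.

```python
def deposit_payout(principal, annual_rate, years, compounds_per_year=12):
    """Lump-sum deposit with periodic compounding (assumed product terms)."""
    periodic_rate = annual_rate / compounds_per_year
    return principal * (1 + periodic_rate) ** (years * compounds_per_year)


def installment_savings_payout(monthly_deposit, annual_rate, months):
    """Installment savings, deposits at the start of each month.

    The k-th most recent deposit has grown for k months, so the total is
    the geometric series  d * sum_{k=1..n} (1 + r)^k
                        = d * (1 + r) * ((1 + r)^n - 1) / r.
    """
    r = annual_rate / 12
    if r == 0:
        return monthly_deposit * months
    return monthly_deposit * (1 + r) * ((1 + r) ** months - 1) / r


# Hypothetical comparison of two products over one year.
print(round(deposit_payout(1_200_000, 0.03, 1), 2))
print(round(installment_savings_payout(100_000, 0.04, 12), 2))
```

A multi-product comparison in this style reduces to evaluating each product's payout under its own terms and comparing the results, which is where misapplied conditions or a wrong exponent change the answer.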

Title: The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI

Authors: Dusan Bosnjakovic
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17127
Pdf URL: https://arxiv.org/pdf/2602.17127
Copy Paste: [[2602.17127]] The Emergence of Lab-Driven Alignment Signatures: A Psychometric Framework for Auditing Latent Bias and Compounding Risk in Generative AI(https://arxiv.org/abs/2602.17127)
Keywords: language model, llm, chat, agent
Abstract: As Large Language Models (LLMs) transition from standalone chat interfaces to foundational reasoning layers in multi-agent systems and recursive evaluation loops (LLM-as-a-judge), the detection of durable, provider-level behavioral signatures becomes a critical requirement for safety and governance. Traditional benchmarks measure transient task accuracy but fail to capture stable, latent response policies -- the "prevailing mindsets" embedded during training and alignment that outlive individual model versions. This paper introduces a novel auditing framework that utilizes psychometric measurement theory -- specifically latent trait estimation under ordinal uncertainty -- to quantify these tendencies without relying on ground-truth labels. Utilizing forced-choice ordinal vignettes masked by semantically orthogonal decoys and governed by cryptographic permutation-invariance, the research audits nine leading models across dimensions including Optimization Bias, Sycophancy, and Status-Quo Legitimization. Using Mixed Linear Models (MixedLM) and Intraclass Correlation Coefficient (ICC) analysis, the research identifies that while item-level framing drives high variance, a persistent "lab signal" accounts for significant behavioral clustering. These findings demonstrate that in "locked-in" provider ecosystems, latent biases are not merely static errors but compounding variables that risk creating recursive ideological echo chambers in multi-layered AI architectures.

Title: Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study

Authors: Kensuke Okada, Yui Furukawa, Kyosuke Bunji
Subjects: cs.CL, stat.ME
Abstract URL: https://arxiv.org/abs/2602.17262
Pdf URL: https://arxiv.org/pdf/2602.17262
Copy Paste: [[2602.17262]] Quantifying and Mitigating Socially Desirable Responding in LLMs: A Desirability-Matched Graded Forced-Choice Psychometric Study(https://arxiv.org/abs/2602.17262)
Keywords: language model, llm
Abstract: Human self-report questionnaires are increasingly used in NLP to benchmark and audit large language models (LLMs), from persona consistency to safety and bias assessments. Yet these instruments presume honest responding; in evaluative contexts, LLMs can instead gravitate toward socially preferred answers, a form of socially desirable responding (SDR), biasing questionnaire-derived scores and downstream conclusions. We propose a psychometric framework to quantify and mitigate SDR in questionnaire-based evaluation of LLMs. To quantify SDR, the same inventory is administered under HONEST versus FAKE-GOOD instructions, and SDR is computed as a direction-corrected standardized effect size from item response theory (IRT)-estimated latent scores. This enables comparisons across constructs and response formats, as well as against human instructed-faking benchmarks. For mitigation, we construct a graded forced-choice (GFC) Big Five inventory by selecting 30 cross-domain pairs from an item pool via constrained optimization to match desirability. Across nine instruction-tuned LLMs evaluated on synthetic personas with known target profiles, Likert-style questionnaires show consistently large SDR, whereas desirability-matched GFC substantially attenuates SDR while largely preserving the recovery of the intended persona profiles. These results highlight a model-dependent SDR-recovery trade-off and motivate SDR-aware reporting practices for questionnaire-based benchmarking and auditing of LLMs.

Title: Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective

Authors: Yukun Chen, Xinyu Zhang, Jialong Tang, Yu Wan, Baosong Yang, Yiming Li, Zhan Qin, Kui Ren
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.17283
Pdf URL: https://arxiv.org/pdf/2602.17283
Copy Paste: [[2602.17283]] Towards Cross-lingual Values Assessment: A Consensus-Pluralism Perspective(https://arxiv.org/abs/2602.17283)
Keywords: language model, llm
Abstract: While large language models (LLMs) have become pivotal to content safety, current evaluation paradigms primarily focus on detecting explicit harms (e.g., violence or hate speech), neglecting the subtler value dimensions conveyed in digital content. To bridge this gap, we introduce X-Value, a novel Cross-lingual Values Assessment Benchmark designed to evaluate LLMs' ability to assess deep-level values of content from a global perspective. X-Value consists of more than 5,000 QA pairs across 18 languages, systematically organized into 7 core domains grounded in Schwartz's Theory of Basic Human Values and categorized into easy and hard levels for discriminative evaluation. We further propose a unique two-stage annotation framework that first identifies whether an issue falls under global consensus (e.g., human rights) or pluralism (e.g., religion), and subsequently conducts a multi-party evaluation of the latent values embedded within the content. Systematic evaluations on X-Value reveal that current SOTA LLMs exhibit deficiencies in cross-lingual values assessment ($Acc < 77\%$), with significant performance disparities across different languages ($\Delta Acc > 20\%$). This work highlights the urgent need to improve the nuanced, values-aware content assessment capability of LLMs. Our X-Value is available at: this https URL.

Title: Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

Authors: Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.17316
Pdf URL: https://arxiv.org/pdf/2602.17316
Copy Paste: [[2602.17316]] Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation(https://arxiv.org/abs/2602.17316)
Keywords: language model, llm, prompt
Abstract: The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.

Title: RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering

Authors: Yiming Zhang, Siyue Zhang, Junbo Zhao, Chen Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17366
Pdf URL: https://arxiv.org/pdf/2602.17366
Copy Paste: [[2602.17366]] RPDR: A Round-trip Prediction-Based Data Augmentation Framework for Long-Tail Question Answering(https://arxiv.org/abs/2602.17366)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Long-tail question answering presents significant challenges for large language models (LLMs) due to their limited ability to acquire and accurately recall less common knowledge. Retrieval-augmented generation (RAG) systems have shown great promise in mitigating this limitation by integrating external retrieval mechanisms. However, dense retrieval models often face the same difficulties when generalizing to rare or niche knowledge. In this study, we introduce RPDR, a novel data augmentation framework that selects high-quality easy-to-learn training data, to enhance dense retrievers. Our approach is built around three core components: synthetic data generation, data selection with Round-Trip prediction to identify easy-to-learn instances, and retriever training with these instances. We evaluate RPDR on two long-tail retrieval benchmarks, PopQA and EntityQuestions, demonstrating substantial improvements over existing retrievers like BM25 and Contriever, especially on extremely long-tail categories. We identify the strengths and limitations of RPDR through detailed human analysis and propose a dynamic mechanism that routes queries to specialized retrieval modules to further improve retrieval performance.

Title: The Role of the Availability Heuristic in Multiple-Choice Answering Behaviour

Authors: Leonidas Zotos, Hedderik van Rijn, Malvina Nissim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17377
Pdf URL: https://arxiv.org/pdf/2602.17377
Copy Paste: [[2602.17377]] The Role of the Availability Heuristic in Multiple-Choice Answering Behaviour(https://arxiv.org/abs/2602.17377)
Keywords: llm
Abstract: When students are unsure of the correct answer to a multiple-choice question (MCQ), guessing is common practice. The availability heuristic, proposed by A. Tversky and D. Kahneman in 1973, suggests that the ease with which relevant instances come to mind, typically operationalised by the mere frequency of exposure, can offer a mental shortcut for problems in which the test-taker does not know the exact answer. Is simply choosing the option that comes most readily to mind a good strategy for answering MCQs? We propose a computational method of assessing the cognitive availability of MCQ options operationalised by concepts' prevalence in large corpora. The key finding, across three large question sets, is that correct answers, independently of the question stem, are significantly more available than incorrect MCQ options. Specifically, using Wikipedia as the retrieval corpus, we find that always selecting the most available option leads to scores 13.5% to 32.9% above the random-guess baseline. We further find that LLM-generated MCQ options show similar patterns of availability compared to expert-created options, despite the LLMs' frequentist nature and their training on large collections of textual data. Our findings suggest that availability should be considered in current and future work when computationally modelling student behaviour.
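
The "always pick the most available option" strategy can be sketched as follows. This is a toy stand-in, assuming availability is approximated by the number of corpus documents that mention an option; the paper operationalises availability over Wikipedia, and the helper names and the three-sentence corpus here are hypothetical.

```python
def availability(option, documents):
    """Count documents that mention the option (case-insensitive substring match)."""
    needle = option.lower()
    return sum(needle in doc.lower() for doc in documents)


def most_available(options, documents):
    """Pick the option with the highest availability, ignoring the question stem."""
    return max(options, key=lambda opt: availability(opt, documents))


# Hypothetical mini-corpus standing in for a large retrieval corpus.
corpus = [
    "Paris is the capital of France.",
    "Paris hosted the Olympics.",
    "Lyon is a city in France.",
]
guess = most_available(["Lyon", "Paris", "Nice"], corpus)
```

Note that the question itself never enters the computation, which mirrors the abstract's point that correct answers tend to be more available independently of the stem.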

Title: Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics

Authors: Sanjeev Kumar, Preethi Jyothi, Pushpak Bhattacharyya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17425
Pdf URL: https://arxiv.org/pdf/2602.17425
Copy Paste: [[2602.17425]] Evaluating Extremely Low-Resource Machine Translation: A Comparative Study of ChrF++ and BLEU Metrics(https://arxiv.org/abs/2602.17425)
Keywords: language model, llm, hallucination
Abstract: Evaluating machine translation (MT) quality in extremely low-resource language (ELRL) scenarios poses unique challenges, as widely used metrics such as BLEU, effective in high-resource settings, often misrepresent quality in data-scarce contexts. This work presents a comparative analysis of BLEU, an n-gram-based metric, and ChrF++, a character-based metric, for MT evaluation in ELRL settings. We examine how each metric responds to translation artifacts, including hallucinations, repetition, source-text copying, and diacritic (matra) variations across three ELRLs: Magahi, Bhojpuri, and Chhattisgarhi, with a focus on outputs from large language models (LLMs) and neural MT (NMT) systems. While recent work often relies solely on ChrF++, our findings show that BLEU, despite its lower absolute scores, provides complementary lexical-precision insights that improve interpretability.
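
To illustrate why word-level and character-level metrics can react differently to a small spelling variation, here is a deliberately simplified sketch. It is not BLEU or ChrF++ as implemented in standard toolkits: it compares plain word unigram precision with a character-bigram F-score (beta = 2, whitespace removed), and the romanized strings are a made-up stand-in for a vowel-sign variation.

```python
from collections import Counter


def ngrams(items, n):
    """Multiset of n-grams over a sequence (list of words or string of characters)."""
    return Counter(tuple(items[i:i + n]) for i in range(len(items) - n + 1))


def word_precision(hyp, ref, n=1):
    """Clipped word n-gram precision, the core ingredient of BLEU-style scoring."""
    h, r = ngrams(hyp.split(), n), ngrams(ref.split(), n)
    total = sum(h.values())
    return sum((h & r).values()) / total if total else 0.0


def char_fscore(hyp, ref, n=2, beta=2.0):
    """Character n-gram F-beta score, the core ingredient of chrF-style scoring."""
    h, r = ngrams(hyp.replace(" ", ""), n), ngrams(ref.replace(" ", ""), n)
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / sum(h.values()), overlap / sum(r.values())
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)


reference = "ghar jaata hoon"   # toy reference
hypothesis = "ghar jata hoon"   # one word differs by a single vowel character
word_score = word_precision(hypothesis, reference)
char_score = char_fscore(hypothesis, reference)
```

In this toy case the single-character difference costs a whole word at the word level but only one bigram at the character level, so the two scores carry different information about the same error.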

Title: Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study

Authors: Dylan Bouchard, Mohit Singh Chauhan, Viren Bajaj, David Skarbrevik
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.17431
Pdf URL: https://arxiv.org/pdf/2602.17431
Copy Paste: [[2602.17431]] Fine-Grained Uncertainty Quantification for Long-Form Language Model Outputs: A Comparative Study(https://arxiv.org/abs/2602.17431)
Keywords: language model, llm, hallucination
Abstract: Uncertainty quantification has emerged as an effective approach to closed-book hallucination detection for LLMs, but existing methods are largely designed for short-form outputs and do not generalize well to long-form generation. We introduce a taxonomy for fine-grained uncertainty quantification in long-form LLM outputs that distinguishes methods by design choices at three stages: response decomposition, unit-level scoring, and response-level aggregation. We formalize several families of consistency-based black-box scorers, providing generalizations and extensions of existing methods. In our experiments across multiple LLMs and datasets, we find 1) claim-response entailment consistently performs better than or on par with more complex claim-level scorers, 2) claim-level scoring generally yields better results than sentence-level scoring, and 3) uncertainty-aware decoding is highly effective for improving the factuality of long-form outputs. Our framework clarifies relationships between prior methods, enables apples-to-apples comparisons, and provides practical guidance for selecting components for fine-grained UQ.

Title: AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue

Authors: Adib Sakhawat, Fardeen Sadab, Rakin Shahriar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17443
Pdf URL: https://arxiv.org/pdf/2602.17443
Copy Paste: [[2602.17443]] AIDG: Evaluating Asymmetry Between Information Extraction and Containment in Multi-Turn Dialogue(https://arxiv.org/abs/2602.17443)
Keywords: language model, llm
Abstract: Evaluating the strategic reasoning capabilities of Large Language Models (LLMs) requires moving beyond static benchmarks to dynamic, multi-turn interactions. We introduce AIDG (Adversarial Information Deduction Game), a game-theoretic framework that probes the asymmetry between information extraction (active deduction) and information containment (state maintenance) in dialogue. We propose two complementary tasks: AIDG-I, measuring pragmatic strategy in social deduction, and AIDG-II, measuring constraint satisfaction in a structured "20 Questions" setting. Across 439 games with six frontier LLMs, we observe a clear capability asymmetry: models perform substantially better at containment than deduction, with a 350-point Elo advantage on defense (Cohen's d = 5.47). We identify two bottlenecks driving this gap: (1) Information Dynamics, where confirmation strategies are 7.75x more effective than blind deduction (p < 0.00001), and (2) Constraint Adherence, where instruction-following degrades under conversational load, accounting for 41.3% of deductive failures. These findings suggest that while LLMs excel at local defensive coherence, they struggle with the global state tracking required for strategic inquiry.
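
For a sense of scale, the standard Elo logistic model maps a rating gap to an expected score as sketched below. This assumes the conventional 400-point scale; the abstract does not say which Elo variant the authors use, so the resulting number is only indicative.

```python
def expected_score(rating_gap, scale=400.0):
    """Expected score of the higher-rated side under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# A 350-point gap corresponds to an expected score of roughly 0.88.
defense_edge = expected_score(350)
```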

Title: ABCD: All Biases Come Disguised

Authors: Mateusz Nowak, Xavier Cadet, Peter Chin
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.17445
Pdf URL: https://arxiv.org/pdf/2602.17445
Copy Paste: [[2602.17445]] ABCD: All Biases Come Disguised(https://arxiv.org/abs/2602.17445)
Keywords: llm, prompt
Abstract: Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label-position-few-shot-prompt bias, where the model either uses the answer position, the label in front of the answer, the distributions of correct answers present in the few-shot prompt, or a combination of all to answer each MCQ. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to use the whole answer presented. With a simple sentence similarity model, we demonstrate improved robustness and lower standard deviation between different permutations of answers with a minimal drop in LLM's performance, exposing the LLM's capabilities under reduced evaluation artifacts, without any help from the prompt examples or the option labels. Across multiple benchmarks and models, this protocol substantially improves the robustness to answer permutations, reducing mean accuracy variance by $3\times$ with only a minimal decrease in mean model performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than the standard ones.

Title: Entropy-Based Data Selection for Language Models

Authors: Hongming Li, Yang Liu, Chao Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17465
Pdf URL: https://arxiv.org/pdf/2602.17465
Copy Paste: [[2602.17465]] Entropy-Based Data Selection for Language Models(https://arxiv.org/abs/2602.17465)
Keywords: language model, llm
Abstract: Modern language models (LMs) increasingly require two critical resources: computational resources and data resources. Data selection techniques can effectively reduce the amount of training data required for fine-tuning LMs. However, their effectiveness is closely related to computational resources, which always require a high compute budget. Owing to the resource limitations in practical fine-tuning scenarios, we systematically reveal the relationship between data selection and uncertainty estimation of selected data. Although large language models (LLMs) exhibit exceptional capabilities in language understanding and generation, which provide new ways to alleviate data scarcity, evaluating data usability remains a challenging task. This makes efficient data selection indispensable. To mitigate these issues, we propose the Entropy-Based Unsupervised Data Selection (EUDS) framework. Empirical experiments on sentiment analysis (SA), topic classification (Topic-CLS), and question answering (Q&A) tasks validate its effectiveness. EUDS establishes a computationally efficient data-filtering mechanism. Theoretical analysis and experimental results confirm the effectiveness of our approach. EUDS significantly reduces computational costs and improves training time efficiency with less data requirement. This provides an innovative solution for the efficient fine-tuning of LMs in compute-constrained scenarios.
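
A generic sketch of entropy-based filtering, assuming each unlabeled sample comes with a predicted class distribution and that samples are ranked by Shannon entropy. The abstract does not specify EUDS's scoring or whether high- or low-entropy samples are kept, so the `highest` flag and all names here are illustrative assumptions.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_by_entropy(samples, k, highest=True):
    """Keep k samples ranked by predictive entropy.

    samples: list of (text, probs) pairs, where probs is a model's predicted
    class distribution for that text. No labels are needed.
    """
    ranked = sorted(samples, key=lambda s: shannon_entropy(s[1]), reverse=highest)
    return [text for text, _ in ranked[:k]]


# Toy pool: one uncertain, one fairly confident, one fully confident prediction.
pool = [("a", [0.5, 0.5]), ("b", [0.9, 0.1]), ("c", [1.0, 0.0])]
most_uncertain = select_by_entropy(pool, k=1)
most_confident = select_by_entropy(pool, k=1, highest=False)
```

Scoring a sample this way needs only a forward pass, which is why entropy-style filters are attractive when the compute budget is tight.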

Title: PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions

Authors: Greta Damo, Stéphane Petiot, Elena Cabrio, Serena Villata
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17467
Pdf URL: https://arxiv.org/pdf/2602.17467
Copy Paste: [[2602.17467]] PEACE 2.0: Grounded Explanations and Counter-Speech for Combating Hate Expressions(https://arxiv.org/abs/2602.17467)
Keywords: retrieval-augmented generation
Abstract: The increasing volume of hate speech on online platforms poses significant societal challenges. While the Natural Language Processing community has developed effective methods to automatically detect the presence of hate speech, responses to it, called counter-speech, are still an open challenge. We present PEACE 2.0, a novel tool that, besides analysing and explaining why a message is considered hateful or not, also generates a response to it. More specifically, PEACE 2.0 has three main new functionalities: leveraging a Retrieval-Augmented Generation (RAG) pipeline i) to ground hate speech (HS) explanations in evidence and facts, ii) to automatically generate evidence-grounded counter-speech, and iii) to explore the characteristics of counter-speech replies. By integrating these capabilities, PEACE 2.0 enables in-depth analysis and response generation for both explicit and implicit hateful messages.

Title: Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian

Authors: Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, Bernardo Magnini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17475
Pdf URL: https://arxiv.org/pdf/2602.17475
Copy Paste: [[2602.17475]] Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian(https://arxiv.org/abs/2602.17475)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks among Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pretraining). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers strong lower-resource alternatives. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration based on Qwen3-1.7B achieving an average score 9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian Hospital, and 175M words from various sources that we used for continual pre-training.

Title: Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics

Authors: Baris Karacan, Barbara Di Eugenio, Patrick Thornton
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17513
Pdf URL: https://arxiv.org/pdf/2602.17513
Copy Paste: [[2602.17513]] Bridging the Domain Divide: Supervised vs. Zero-Shot Clinical Section Segmentation from MIMIC-III to Obstetrics(https://arxiv.org/abs/2602.17513)
Keywords: language model, hallucination
Abstract: Clinical free-text notes contain vital patient information. They are structured into labelled sections; recognizing these sections has been shown to support clinical decision-making and downstream NLP tasks. In this paper, we advance clinical section segmentation through three key contributions. First, we curate a new de-identified, section-labeled obstetrics notes dataset, to supplement the medical domains covered in public corpora such as MIMIC-III, on which most existing segmentation approaches are trained. Second, we systematically evaluate transformer-based supervised models for section segmentation on a curated subset of MIMIC-III (in-domain), and on the new obstetrics dataset (out-of-domain). Third, we conduct the first head-to-head comparison of supervised models for medical section segmentation with zero-shot large language models. Our results show that while supervised models perform strongly in-domain, their performance drops substantially out-of-domain. In contrast, zero-shot models demonstrate robust out-of-domain adaptability once hallucinated section headers are corrected. These findings underscore the importance of developing domain-specific clinical resources and highlight zero-shot segmentation as a promising direction for applying healthcare NLP beyond well-studied corpora, as long as hallucinations are appropriately managed.

Title: Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems

Authors: Zhangqi Duan, Arnav Kankaria, Dhruv Kartik, Andrew Lan
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2602.17542
Pdf URL: https://arxiv.org/pdf/2602.17542
Copy Paste: [[2602.17542]] Using LLMs for Knowledge Component-level Correctness Labeling in Open-ended Coding Problems(https://arxiv.org/abs/2602.17542)
Keywords: language model, llm
Abstract: Fine-grained skill representations, commonly referred to as knowledge components (KCs), are fundamental to many approaches in student modeling and learning analytics. However, KC-level correctness labels are rarely available in real-world datasets, especially for open-ended programming tasks where solutions typically involve multiple KCs simultaneously. Simply propagating problem-level correctness to all associated KCs obscures partial mastery and often leads to poorly fitted learning curves. To address this challenge, we propose an automated framework that leverages large language models (LLMs) to label KC-level correctness directly from student-written code. Our method assesses whether each KC is correctly applied and further introduces a temporal context-aware Code-KC mapping mechanism to better align KCs with individual student code. We evaluate the resulting KC-level correctness labels in terms of learning curve fit and predictive performance using the power law of practice and the Additive Factors Model. Experimental results show that our framework leads to learning curves that are more consistent with cognitive theory and improves predictive performance, compared to baselines. Human evaluation further demonstrates substantial agreement between LLM and expert annotations.

Title: Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning

Authors: Jyotin Goel, Souvik Maji, Pratik Mazumder
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.17546
Pdf URL: https://arxiv.org/pdf/2602.17546
Copy Paste: [[2602.17546]] Learning to Stay Safe: Adaptive Regularization Against Safety Degradation during Fine-Tuning(https://arxiv.org/abs/2602.17546)
Keywords: language model
Abstract: Instruction-following language models are trained to be helpful and safe, yet their safety behavior can deteriorate under benign fine-tuning and worsen under adversarial updates. Existing defenses often offer limited protection or force a trade-off between safety and utility. We introduce a training framework that adapts regularization in response to safety risk, enabling models to remain aligned throughout fine-tuning. To estimate safety risk at training time, we explore two distinct approaches: a judge-based Safety Critic that assigns high-level harm scores to training batches, and an activation-based risk predictor built with a lightweight classifier trained on intermediate model activations to estimate harmful intent. Each approach provides a risk signal that is used to constrain updates deemed higher risk to remain close to a safe reference policy, while lower-risk updates proceed with standard training. We empirically verify that harmful intent signals are predictable from pre-generation activations and that judge scores provide effective high-recall safety guidance. Across multiple model families and attack scenarios, adaptive regularization with either risk estimation approach consistently lowers attack success rate compared to standard fine-tuning, preserves downstream performance, and adds no inference-time cost. This work demonstrates a principled mechanism for maintaining safety without sacrificing utility.
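
The idea of constraining only higher-risk updates toward a safe reference policy can be sketched as a risk-gated penalty. This is a simplified illustration, assuming a KL-divergence penalty, a hard risk threshold, and a penalty weight proportional to the risk score; none of these specifics are given in the abstract, and all names are hypothetical.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def adaptive_loss(task_loss, policy_probs, reference_probs, risk, threshold=0.5, max_weight=1.0):
    """Task loss plus a KL penalty that switches on only for higher-risk batches.

    risk: score in [0, 1] from some risk estimator (a judge model or an
    activation-based classifier in the abstract; a plain number here).
    """
    # Lower-risk batches train normally; higher-risk ones are pulled toward the reference.
    weight = max_weight * risk if risk >= threshold else 0.0
    return task_loss + weight * kl_divergence(policy_probs, reference_probs)


policy, reference = [0.5, 0.5], [0.9, 0.1]
low_risk_loss = adaptive_loss(1.0, policy, reference, risk=0.1)
high_risk_loss = adaptive_loss(1.0, policy, reference, risk=0.9)
```

Because the penalty only enters the training objective, a scheme like this adds nothing at inference time, consistent with the abstract's claim of no inference-time cost.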

Title: Modeling Distinct Human Interaction in Web Agents

Authors: Faria Huq, Zora Zhiruo Wang, Zhanqiu Guo, Venu Arvind Arangarajan, Tianyue Ou, Frank Xu, Shuyan Zhou, Graham Neubig, Jeffrey P. Bigham
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2602.17588
Pdf URL: https://arxiv.org/pdf/2602.17588
Copy Paste: [[2602.17588]] Modeling Distinct Human Interaction in Web Agents(https://arxiv.org/abs/2602.17588)
Keywords: language model, agent
Abstract: Despite rapid progress in autonomous web agents, human involvement remains essential for shaping preferences and correcting agent behavior as tasks unfold. However, current agentic systems lack a principled understanding of when and why humans intervene, often proceeding autonomously past critical decision points or requesting unnecessary confirmation. In this work, we introduce the task of modeling human intervention to support collaborative web task execution. We collect CowCorpus, a dataset of 400 real-user web navigation trajectories containing over 4,200 interleaved human and agent actions. We identify four distinct patterns of user interaction with agents -- hands-off supervision, hands-on oversight, collaborative task-solving, and full user takeover. Leveraging these insights, we train language models (LMs) to anticipate when users are likely to intervene based on their interaction styles, yielding a 61.4-63.4% improvement in intervention prediction accuracy over base LMs. Finally, we deploy these intervention-aware models in live web navigation agents and evaluate them in a user study, finding a 26.5% increase in user-rated agent usefulness. Together, our results show structured modeling of human intervention leads to more adaptive, collaborative agents.

Title: The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?

Authors: Jayadev Billa
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2602.17598
Pdf URL: https://arxiv.org/pdf/2602.17598
Copy Paste: [[2602.17598]] The Cascade Equivalence Hypothesis: When Do Speech LLMs Behave Like ASR$\rightarrow$LLM Pipelines?(https://arxiv.org/abs/2602.17598)
Keywords: llm
Abstract: Current speech LLMs largely perform implicit ASR: on tasks solvable from a transcript, they are behaviorally and mechanistically equivalent to simple Whisper$\to$LLM cascades. We show this through matched-backbone testing across four speech LLMs and six tasks, controlling for the LLM backbone for the first time. Ultravox is statistically indistinguishable from its matched cascade ($\kappa{=}0.93$); logit lens reveals literal text emerging in hidden states; LEACE concept erasure confirms text representations are causally necessary in both architectures tested, collapsing accuracy to near-zero. Qwen2-Audio genuinely diverges, revealing cascade equivalence is architecture-dependent, not universal. For most deployed use cases, current speech LLMs are expensive cascades, and under noise, they are worse ones, with clean-condition advantages reversing by up to 7.6% at 0 dB.
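
The abstract above reports agreement between a speech LLM and its matched cascade as a kappa value. Assuming this denotes Cohen's kappa (the abstract does not say which statistic is used), the sketch below shows how chance-corrected agreement between two systems' predictions could be computed. The function name and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance agreement)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both systems pick the same label independently.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both systems always emit the same single label
    return (observed - expected) / (1.0 - expected)
```

For example, if the two systems agree on three of four items but one of them strongly favours a single label, kappa falls well below the raw 75% agreement, because some of that agreement is expected by chance.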

Title: Unmasking the Factual-Conceptual Gap in Persian Language Models

Authors: Alireza Sakhaeirad, Ali Ma'manpoosh, Arshia Hemmat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17623
Pdf URL: https://arxiv.org/pdf/2602.17623
Copy Paste: [[2602.17623]] Unmasking the Factual-Conceptual Gap in Persian Language Models(https://arxiv.org/abs/2602.17623)
Keywords: language model, llm
Abstract: While emerging Persian NLP benchmarks have expanded into pragmatics and politeness, they rarely distinguish between memorized cultural facts and the ability to reason about implicit social norms. We introduce DivanBench, a diagnostic benchmark focused on superstitions and customs: arbitrary, context-dependent rules that resist simple logical deduction. Through 315 questions across three task types (factual retrieval, paired scenario verification, and situational reasoning), we evaluate seven Persian LLMs and reveal three critical failures: most models exhibit severe acquiescence bias, correctly identifying appropriate behaviors but failing to reject clear violations; continuous Persian pretraining amplifies this bias rather than improving reasoning, often degrading the model's ability to discern contradictions; and all models show a 21% performance gap between retrieving factual knowledge and applying it in scenarios. These findings demonstrate that cultural competence requires more than scaling monolingual data, as current models learn to mimic cultural patterns without internalizing the underlying schemas.

Title: Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

Authors: Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17653
Pdf URL: https://arxiv.org/pdf/2602.17653
Copy Paste: [[2602.17653]] Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking(https://arxiv.org/abs/2602.17653)
Keywords: language model, gpt
Abstract: Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.

Title: What Language is This? Ask Your Tokenizer

Authors: Clara Meister, Ahmetcan Yavuz, Pietro Lesci, Tiago Pimentel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.17655
Pdf URL: https://arxiv.org/pdf/2602.17655
Copy Paste: [[2602.17655]] What Language is This? Ask Your Tokenizer(https://arxiv.org/abs/2602.17655)
Keywords: language model
Abstract: Language Identification (LID) is an important component of many multilingual natural language processing pipelines, where it facilitates corpus curation, training data analysis, and cross-lingual evaluation of large language models. Despite near-perfect performance on high-resource languages, existing systems remain brittle in low-resource and closely related language settings. We introduce UniLID, a simple and efficient LID method based on the UnigramLM tokenization algorithm, leveraging its probabilistic framing, parameter estimation technique and inference strategy. In short, we learn language-conditional unigram distributions over a shared tokenizer vocabulary but treat segmentation as a language-specific phenomenon. Our formulation is data- and compute-efficient, supports incremental addition of new languages without retraining existing models, and can naturally be integrated into existing language model tokenization pipelines. Empirical evaluations against widely used baselines, including fastText, GlotLID, and CLD3, show that UniLID achieves competitive performance on standard benchmarks, substantially improves sample efficiency in low-resource settings - surpassing 70% accuracy with as few as five labeled samples per language - and delivers large gains on fine-grained dialect identification.
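
The abstract above describes learning language-conditional unigram distributions over a shared vocabulary while letting segmentation vary by language. A toy sketch of that idea follows: each language scores the text by its own best (Viterbi) segmentation, and the highest-scoring language wins. The vocabulary, the probabilities, the uniform language prior, the unknown-character penalty, and all names are made up for illustration. This is not the UniLID implementation and includes no parameter estimation.

```python
import math


def viterbi_log_prob(text, log_probs, max_len=8, unk_log_prob=-20.0):
    """Log-probability of the best segmentation of `text` under one unigram model."""
    n = len(text)
    best = [-math.inf] * (n + 1)
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            piece = text[start:end]
            if piece in log_probs:
                score = log_probs[piece]
            elif len(piece) == 1:
                score = unk_log_prob  # assumed fallback for unseen characters
            else:
                continue
            best[end] = max(best[end], best[start] + score)
    return best[n]


def identify(text, models):
    """Return the best-scoring language and all per-language scores."""
    scores = {lang: viterbi_log_prob(text, lp) for lang, lp in models.items()}
    return max(scores, key=scores.get), scores


# Shared vocabulary, different (invented) probabilities per language.
_CHARS = {c: 0.096 for c in "thedr"}
MODELS = {
    "en": {k: math.log(v) for k, v in {"the": 0.5, "der": 0.02, **_CHARS}.items()},
    "de": {k: math.log(v) for k, v in {"der": 0.5, "the": 0.02, **_CHARS}.items()},
}
```

Calling `identify("the", MODELS)` selects `"en"` and `identify("der", MODELS)` selects `"de"`, since each string is a high-probability single piece in only one of the two toy models. Adding a language in this sketch means adding one more dictionary, without touching the others.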

Title: Sink-Aware Pruning for Diffusion Language Models

Authors: Aidar Myrzakhan, Tianyi Li, Bowei Guo, Shengkun Tang, Zhiqiang Shen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.17664
Pdf URL: https://arxiv.org/pdf/2602.17664
Copy Paste: [[2602.17664]] Sink-Aware Pruning for Diffusion Language Models(https://arxiv.org/abs/2602.17664)
Keywords: language model, llm
Abstract: Diffusion Language Models (DLMs) incur high inference cost due to iterative denoising, motivating efficient pruning. Existing pruning heuristics, largely inherited from autoregressive (AR) LLMs, typically preserve attention sink tokens because AR sinks serve as stable global anchors. We show that this assumption does not hold for DLMs: the attention-sink position exhibits substantially higher variance over the full generation trajectory (measured by how the dominant sink locations shift across timesteps), indicating that sinks are often transient and less structurally essential than in AR models. Based on this observation, we propose Sink-Aware Pruning, which automatically identifies and prunes unstable sinks in DLMs (prior studies usually keep sinks for AR LLMs). Without retraining, our method achieves a better quality-efficiency trade-off and outperforms strong prior pruning baselines under matched compute. Our code is available at this https URL.
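
The abstract above measures sink stability by how the dominant sink location shifts across timesteps. A rough sketch of such a measurement follows. It assumes the dominant sink at each step is the key position receiving the most total attention, and it uses population variance of that position as the instability score. Both choices and all names are assumptions, not the paper's definition.

```python
from statistics import pvariance


def dominant_sink(attention):
    """Index of the key position with the largest total attention received.

    `attention` is one timestep's matrix as a list of rows (queries) over keys.
    """
    n_keys = len(attention[0])
    received = [sum(row[k] for row in attention) for k in range(n_keys)]
    return max(range(n_keys), key=received.__getitem__)


def sink_position_variance(attention_per_step):
    """Dominant sink position at each timestep and the variance of those positions."""
    positions = [dominant_sink(a) for a in attention_per_step]
    return positions, pvariance(positions)
```

A variance of zero means the same position dominates at every step, the situation the abstract attributes to AR models. A larger value indicates a sink that moves along the trajectory.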