2026-02-17

Title: LLM-Powered Automatic Translation and Urgency in Crisis Scenarios

Authors: Belu Ticona, Antonis Anastasopoulos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.13452
Pdf URL: https://arxiv.org/pdf/2602.13452
Copy Paste: [[2602.13452]] LLM-Powered Automatic Translation and Urgency in Crisis Scenarios(https://arxiv.org/abs/2602.13452)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly proposed for crisis preparedness and response, particularly for multilingual communication. However, their suitability for high-stakes crisis contexts remains insufficiently evaluated. This work examines the performance of state-of-the-art LLMs and machine translation systems in crisis-domain translation, with a focus on preserving urgency, which is a critical property for effective crisis communication and triaging. Using multilingual crisis data and a newly introduced urgency-annotated dataset covering over 32 languages, we show that both dedicated translation models and LLMs exhibit substantial performance degradation and instability. Crucially, even linguistically adequate translations can distort perceived urgency, and LLM-based urgency classifications vary widely depending on the language of the prompt and input. These findings highlight significant risks in deploying general-purpose language technologies for crisis communication and underscore the need for crisis-aware evaluation frameworks.

Title: Language Model Memory and Memory Models for Language

Authors: Benjamin L. Badger
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.13466
Pdf URL: https://arxiv.org/pdf/2602.13466
Copy Paste: [[2602.13466]] Language Model Memory and Memory Models for Language(https://arxiv.org/abs/2602.13466)
Keywords: language model
Abstract: The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of 'memory', is widely employed but not well characterized. We find that language model embeddings typically contain relatively little input information regardless of data and compute scale during training. In contrast, embeddings from autoencoders trained for input regeneration are capable of nearly perfect memory formation. The substitution of memory embeddings for token sequences leads to substantial computational efficiencies, motivating the introduction of a parallelizable encoder-decoder memory model architecture. Upon causal training these models contain information-poor embeddings incapable of arbitrary information access, but by combining causal and information retention objective functions they learn to form and decode information-rich memories. Training can be further streamlined by freezing a high fidelity encoder followed by a curriculum training approach where decoders first learn to process memories and then learn to additionally predict next tokens. We introduce the perspective that next token prediction training alone is poorly suited for accurate memory formation as the objective itself is non-invertible, motivating the use of combined objective functions for models where the entire input is not exposed.

Title: From Perceptions To Evidence: Detecting AI-Generated Content In Turkish News Media With A Fine-Tuned Bert Classifier

Authors: Ozancan Ozdemir
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.13504
Pdf URL: https://arxiv.org/pdf/2602.13504
Copy Paste: [[2602.13504]] From Perceptions To Evidence: Detecting AI-Generated Content In Turkish News Media With A Fine-Tuned Bert Classifier(https://arxiv.org/abs/2602.13504)
Keywords: language model, llm
Abstract: The rapid integration of large language models into newsroom workflows has raised urgent questions about the prevalence of AI-generated content in online media. While computational studies have begun to quantify this phenomenon in English-language outlets, no empirical investigation exists for Turkish news media, where existing research remains limited to qualitative interviews with journalists or fake news detection. This study addresses that gap by fine-tuning a Turkish-specific BERT model (dbmdz/bert-base-turkish-cased) on a labeled dataset of 3,600 articles from three major Turkish outlets with distinct editorial orientations for binary classification of AI-rewritten content. The model achieves an F1 score of 0.9708 on the held-out test set with symmetric precision and recall across both classes. Subsequent deployment on over 3,500 unseen articles published between 2023 and 2026 reveals consistent cross-source and temporally stable classification patterns, with mean prediction confidence exceeding 0.96 and an estimated 2.5 percent of examined news content rewritten or revised by LLMs on average. To the best of our knowledge, this is the first study to move beyond self-reported journalist perceptions toward empirical, data-driven measurement of AI usage in Turkish news media.

Title: Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

Authors: Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, Yu Meng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13517
Pdf URL: https://arxiv.org/pdf/2602.13517
Copy Paste: [[2602.13517]] Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens(https://arxiv.org/abs/2602.13517)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
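The deep-thinking ratio described above can be illustrated with a small sketch. The abstract does not state the exact criterion, so this example assumes a token counts as deep-thinking when its per-layer top-1 prediction only settles on the final-layer prediction in the later part of the network. The function names, the top-1 settling rule, and the `depth_threshold` parameter are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def settle_layer(layer_predictions: Sequence[int]) -> int:
    """Earliest layer index from which the per-layer top-1 prediction
    equals the final-layer prediction for every later layer."""
    final = layer_predictions[-1]
    settle = len(layer_predictions) - 1
    # Walk backwards while the prediction still matches the final one.
    for layer in range(len(layer_predictions) - 1, -1, -1):
        if layer_predictions[layer] != final:
            break
        settle = layer
    return settle


def deep_thinking_ratio(per_token_layer_predictions: List[Sequence[int]],
                        depth_threshold: float = 0.5) -> float:
    """Fraction of tokens whose prediction settles in the deeper layers.

    depth_threshold is an assumed cut-off on relative depth (0 = first
    layer, 1 = last layer); it is not a value from the paper.
    """
    if not per_token_layer_predictions:
        return 0.0
    deep = 0
    for preds in per_token_layer_predictions:
        # max(..., 1) avoids division by zero for a single-layer input.
        relative_depth = settle_layer(preds) / max(len(preds) - 1, 1)
        if relative_depth >= depth_threshold:
            deep += 1
    return deep / len(per_token_layer_predictions)
```

Each inner sequence holds one token's top-1 prediction at every layer, for instance as obtained by decoding intermediate hidden states. A sample-selection strategy in the spirit of Think@n could then rank candidate generations by this ratio.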

Title: On Calibration of Large Language Models: From Response To Capability

Authors: Sin-Han Yang, Cheng-Kuang Wu, Chieh-Yen Lin, Yun-Nung Chen, Hung-yi Lee, Shao-Hua Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.13540
Pdf URL: https://arxiv.org/pdf/2602.13540
Copy Paste: [[2602.13540]] On Calibration of Large Language Models: From Response To Capability(https://arxiv.org/abs/2602.13540)
Keywords: language model, llm
Abstract: Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation with potential for diverse applications.
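A minimal sketch of the quantity that capability calibration targets, as described above: the model's expected accuracy on a query, estimated here by sampling several responses. The pass@k prediction assumes samples are independent and identically distributed, which is an assumption of this sketch rather than a claim from the paper. All function names are hypothetical.

```python
from typing import Sequence


def expected_accuracy(correct_flags: Sequence[bool]) -> float:
    """Monte Carlo estimate of per-query expected accuracy from sampled responses."""
    if not correct_flags:
        raise ValueError("need at least one sampled response")
    return sum(correct_flags) / len(correct_flags)


def predicted_pass_at_k(capability: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - capability) ** k


def capability_brier(confidences: Sequence[float],
                     accuracies: Sequence[float]) -> float:
    """Mean squared gap between stated confidence and the estimated
    expected accuracy, averaged over queries (lower is better)."""
    if len(confidences) != len(accuracies) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - a) ** 2 for c, a in zip(confidences, accuracies)) / len(confidences)
```

With a per-query capability of 0.5, the predicted pass@2 is 0.75. A response-level score, by contrast, would be compared against the 0/1 correctness of a single sample.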

Title: Small Reward Models via Backward Inference

Authors: Yike Wang, Faeze Brahman, Shangbin Feng, Teng Xiao, Hannaneh Hajishirzi, Yulia Tsvetkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13551
Pdf URL: https://arxiv.org/pdf/2602.13551
Copy Paste: [[2602.13551]] Small Reward Models via Backward Inference(https://arxiv.org/abs/2602.13551)
Keywords: language model, llm, prompt
Abstract: Reward models (RMs) play a central role throughout the language model (LM) pipeline, particularly in non-verifiable domains. However, the dominant LLM-as-a-Judge paradigm relies on the strong reasoning capabilities of large models, while alternative approaches require reference responses or explicit rubrics, limiting flexibility and broader accessibility. In this work, we propose FLIP (FLipped Inference for Prompt reconstruction), a reference-free and rubric-free reward modeling approach that reformulates reward modeling through backward inference: inferring the instruction that would most plausibly produce a given response. The similarity between the inferred and the original instructions is then used as the reward signal. Evaluations across four domains using 13 small language models show that FLIP outperforms LLM-as-a-Judge baselines by an average of 79.6%. Moreover, FLIP substantially improves downstream performance in extrinsic evaluations under test-time scaling via parallel sampling and GRPO training. We further find that FLIP is particularly effective for longer outputs and robust to common forms of reward hacking. By explicitly exploiting the validation-generation gap, FLIP enables reliable reward modeling in downscaled regimes where judgment methods fail. Code available at this https URL.
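The backward-inference idea above can be sketched as follows: infer an instruction from the response, then score its similarity to the original instruction. The abstract does not name the similarity measure, so token-overlap F1 is used here purely as a stand-in. The `infer_instruction` callable represents a small language model and is a hypothetical placeholder, as are all function names.

```python
from collections import Counter
from typing import Callable


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two strings (assumed similarity measure)."""
    a_tokens, b_tokens = a.lower().split(), b.lower().split()
    if not a_tokens or not b_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(a_tokens) & Counter(b_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(a_tokens)
    recall = overlap / len(b_tokens)
    return 2 * precision * recall / (precision + recall)


def flip_style_reward(instruction: str,
                      response: str,
                      infer_instruction: Callable[[str], str],
                      similarity: Callable[[str, str], float] = token_f1) -> float:
    """Reward = similarity(inferred instruction, original instruction)."""
    inferred = infer_instruction(response)
    return similarity(inferred, instruction)
```

In practice `infer_instruction` would prompt a model with the response and ask which instruction most plausibly produced it; no reference answer or rubric is required.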

Title: DistillLens: Symmetric Knowledge Distillation Through Logit Lens

Authors: Manish Dhakal, Uthman Jinadu, Anjila Budathoki, Rajshekhar Sunderraman, Yi Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13567
Pdf URL: https://arxiv.org/pdf/2602.13567
Copy Paste: [[2602.13567]] DistillLens: Symmetric Knowledge Distillation Through Logit Lens(https://arxiv.org/abs/2602.13567)
Keywords: language model, gpt, llm
Abstract: Standard Knowledge Distillation (KD) compresses Large Language Models (LLMs) by optimizing final outputs, yet it typically treats the thought process in the teacher's intermediate layers as a black box. While feature-based distillation attempts to bridge this gap, existing methods (e.g., MSE and asymmetric KL divergence) ignore the rich uncertainty profiles required for the final output. In this paper, we introduce DistillLens, a framework that symmetrically aligns the evolving thought processes of student and teacher models. By projecting intermediate hidden states into the vocabulary space via the Logit Lens, we enforce structural alignment using a symmetric divergence objective. Our analysis proves that this constraint imposes a dual-sided penalty, preventing both overconfidence and underconfidence while preserving the high-entropy information conduits essential for final deduction. Extensive experiments on GPT-2 and Llama architectures demonstrate that DistillLens consistently outperforms standard KD and feature-transfer baselines on diverse instruction-following benchmarks. The code is available at this https URL.
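A rough sketch of the alignment idea above: project intermediate hidden states into vocabulary space and penalize a symmetric divergence between the student and teacher distributions. The abstract does not say which symmetric divergence is used, so Jensen-Shannon divergence is assumed here. The layer pairing, the use of separate unembedding matrices, and all function names are likewise illustrative assumptions.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]  # one row of weights per vocabulary item


def softmax(logits: Vector) -> List[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logit_lens(hidden: Vector, unembedding: Matrix) -> List[float]:
    """Project a hidden state into vocabulary space and normalize."""
    return softmax([sum(w * h for w, h in zip(row, hidden)) for row in unembedding])


def kl(p: Vector, q: Vector) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p: Vector, q: Vector) -> float:
    """Jensen-Shannon divergence: symmetric and bounded by ln 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def lens_alignment_loss(student_hiddens: Sequence[Vector],
                        teacher_hiddens: Sequence[Vector],
                        student_unembed: Matrix,
                        teacher_unembed: Matrix) -> float:
    """Mean symmetric divergence over already-paired student/teacher layers."""
    pairs = list(zip(student_hiddens, teacher_hiddens))
    if not pairs:
        raise ValueError("need at least one layer pair")
    return sum(
        js_divergence(logit_lens(s, student_unembed), logit_lens(t, teacher_unembed))
        for s, t in pairs
    ) / len(pairs)
```

Because the divergence is symmetric, the student is penalized both for placing mass where the teacher does not and for failing to cover the teacher's mass, which matches the dual-sided penalty the abstract describes in spirit.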

Title: LLM-Confidence Reranker: A Training-Free Approach for Enhancing Retrieval-Augmented Generation Systems

Authors: Zhipeng Song, Xiangyu Kong, Xinrui Bao, Yizhi Zhou, Jiulong Jiao, Sitong Liu, Yuhang Zhou, Heng Qi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.13571
Pdf URL: https://arxiv.org/pdf/2602.13571
Copy Paste: [[2602.13571]] LLM-Confidence Reranker: A Training-Free Approach for Enhancing Retrieval-Augmented Generation Systems(https://arxiv.org/abs/2602.13571)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) have revolutionized natural language processing, yet hallucinations in knowledge-intensive tasks remain a critical challenge. Retrieval-augmented generation (RAG) addresses this by integrating external knowledge, but its efficacy depends on accurate document retrieval and ranking. Although existing rerankers demonstrate effectiveness, they frequently necessitate specialized training, impose substantial computational expenses, and fail to fully exploit the semantic capabilities of LLMs, particularly their inherent confidence signals. We propose the LLM-Confidence Reranker (LCR), a training-free, plug-and-play algorithm that enhances reranking in RAG systems by leveraging black-box LLM confidence derived from Maximum Semantic Cluster Proportion (MSCP). LCR employs a two-stage process: confidence assessment via multinomial sampling and clustering, followed by binning and multi-level sorting based on query and document confidence thresholds. This approach prioritizes relevant documents while preserving original rankings for high-confidence queries, ensuring robustness. Evaluated on BEIR and TREC benchmarks with BM25 and Contriever retrievers, LCR, using only 7-9B-parameter pre-trained LLMs, consistently improves NDCG@5 by up to 20.6% across pre-trained LLM and fine-tuned Transformer rerankers, without degradation. Ablation studies validate the hypothesis that LLM confidence positively correlates with document relevance, elucidating LCR's mechanism. LCR offers computational efficiency, parallelism for scalability, and broad compatibility, mitigating hallucinations in applications like medical diagnosis.
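The two-stage process above can be sketched roughly as follows. MSCP is taken here to be the share of sampled answers that fall into the largest cluster. Real semantic clustering is replaced by case- and whitespace-normalized string matching, and the two thresholds and the two-bin sort are illustrative assumptions rather than the paper's settings.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def normalize(answer: str) -> str:
    """Stand-in for semantic clustering: case- and whitespace-insensitive match."""
    return " ".join(answer.lower().split())


def mscp(sampled_answers: Sequence[str],
         cluster_key: Callable[[str], str] = normalize) -> float:
    """Proportion of sampled answers that land in the largest cluster."""
    if not sampled_answers:
        raise ValueError("need at least one sampled answer")
    clusters = Counter(cluster_key(a) for a in sampled_answers)
    return max(clusters.values()) / len(sampled_answers)


def rerank(doc_ids: Sequence[str],
           doc_confidences: Dict[str, float],
           query_confidence: float,
           query_threshold: float = 0.8,
           doc_threshold: float = 0.5) -> List[str]:
    """Keep the original order for high-confidence queries; otherwise move
    documents in the high-confidence bin ahead of the rest."""
    if query_confidence >= query_threshold:
        return list(doc_ids)
    # sorted() is stable, so documents within a bin keep the retriever's order.
    return sorted(doc_ids, key=lambda d: 0 if doc_confidences[d] >= doc_threshold else 1)
```

Here `doc_confidences[d]` would be the MSCP of answers sampled with document `d` in context, and `query_confidence` the MSCP of answers sampled from the query alone.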

Title: Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment

Authors: Jing Zhao, Ting Zhen, Junwei Bao, Hongfei Jiang, Yang Song
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.13575
Pdf URL: https://arxiv.org/pdf/2602.13575
Copy Paste: [[2602.13575]] Elo-Evolve: A Co-evolutionary Framework for Language Model Alignment(https://arxiv.org/abs/2602.13575)
Keywords: language model, llm, agent
Abstract: Current alignment methods for Large Language Models (LLMs) rely on compressing vast amounts of human preference data into static, absolute reward functions, leading to data scarcity, noise sensitivity, and training instability. We introduce Elo-Evolve, a co-evolutionary framework that redefines alignment as dynamic multi-agent competition within an adaptive opponent pool. Our approach makes two key innovations: (1) eliminating Bradley-Terry model dependencies by learning directly from binary win/loss outcomes in pairwise competitions, and (2) implementing Elo-orchestrated opponent selection that provides automatic curriculum learning through temperature-controlled sampling. We ground our approach in PAC learning theory, demonstrating that pairwise comparison achieves superior sample complexity, and empirically validate a 4.5x noise reduction compared to absolute scoring approaches. Experimentally, we train a Qwen2.5-7B model using our framework with opponents including Qwen2.5-14B, Qwen2.5-32B, and Qwen3-8B models. Results demonstrate a clear performance hierarchy: point-based methods < static pairwise training < Elo-Evolve across Alpaca Eval 2.0 and MT-Bench, validating the progressive benefits of pairwise comparison and dynamic opponent selection for LLM alignment.
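A minimal sketch of the two ingredients named above: rating updates from binary win/loss outcomes and temperature-controlled opponent selection. The update is the standard Elo formula. The sampling rule, which favors opponents rated close to the policy, is an assumption of this sketch, as are the K-factor, the temperature value, and the function names.

```python
import math
import random
from typing import List, Sequence, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of player A against player B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> Tuple[float, float]:
    """Update both ratings from a single binary win/loss outcome."""
    delta = k * ((1.0 if a_won else 0.0) - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def opponent_probabilities(policy_rating: float,
                           opponent_ratings: Sequence[float],
                           temperature: float = 100.0) -> List[float]:
    """Softmax over negative rating gaps (assumed rule): lower temperature
    concentrates sampling on opponents closest in strength."""
    scores = [-abs(r - policy_rating) / temperature for r in opponent_ratings]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]


def sample_opponent(policy_rating: float,
                    opponent_ratings: Sequence[float],
                    temperature: float = 100.0,
                    rng=random) -> int:
    """Return the index of the sampled opponent."""
    probs = opponent_probabilities(policy_rating, opponent_ratings, temperature)
    return rng.choices(range(len(opponent_ratings)), weights=probs, k=1)[0]
```

As the policy's rating rises, the sampling distribution shifts toward stronger opponents, which is one way the automatic curriculum mentioned in the abstract could arise.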

Title: On Theoretically-Driven LLM Agents for Multi-Dimensional Discourse Analysis

Authors: Maciej Uberna, Michał Wawer, Jarosław A. Chudziak, Marcin Koszowy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13713
Pdf URL: https://arxiv.org/pdf/2602.13713
Copy Paste: [[2602.13713]] On Theoretically-Driven LLM Agents for Multi-Dimensional Discourse Analysis(https://arxiv.org/abs/2602.13713)
Keywords: llm, retrieval-augmented generation, agent
Abstract: Identifying the strategic uses of reformulation in discourse remains a key challenge for computational argumentation. While LLMs can detect surface-level similarity, they often fail to capture the pragmatic functions of rephrasing, such as its role within rhetorical discourse. This paper presents a comparative multi-agent framework designed to quantify the benefits of incorporating explicit theoretical knowledge for this task. We utilise a dataset of annotated political debates to establish a new standard encompassing four distinct rephrase functions (Deintensification, Intensification, Specification, and Generalisation) plus Other, which covers all remaining types (D-I-S-G-O). We then evaluate two parallel LLM-based agent systems: one enhanced by argumentation theory via Retrieval-Augmented Generation (RAG), and an identical zero-shot baseline. The results reveal a clear performance gap: the RAG-enhanced agents substantially outperform the baseline across the board, with particularly strong advantages in detecting Intensification and Generalisation context, yielding an overall Macro F1-score improvement of nearly 30%. Our findings provide evidence that theoretical grounding is not only beneficial but essential for advancing beyond mere paraphrase detection towards function-aware analysis of argumentative discourse. This comparative multi-agent architecture represents a step towards scalable, theoretically informed computational tools capable of identifying rhetorical strategies in contemporary discourse.

Title: RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction

Authors: Yongkang Jin, Jianwen Luo, Jingjing Wang, Jianmin Yao, Yu Hong
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2602.13748
Pdf URL: https://arxiv.org/pdf/2602.13748
Copy Paste: [[2602.13748]] RMPL: Relation-aware Multi-task Progressive Learning with Stage-wise Training for Multimedia Event Extraction(https://arxiv.org/abs/2602.13748)
Keywords: language model, prompt
Abstract: Multimedia Event Extraction (MEE) aims to identify events and their arguments from documents that contain both text and images. It requires grounding event semantics across different modalities. Progress in MEE is limited by the lack of annotated training data. M2E2 is the only established benchmark, but it provides annotations only for evaluation. This makes direct supervised training impractical. Existing methods mainly rely on cross-modal alignment or inference-time prompting with Vision-Language Models (VLMs). These approaches do not explicitly learn structured event representations and often produce weak argument grounding in multimodal settings. To address these limitations, we propose RMPL, a Relation-aware Multi-task Progressive Learning framework for MEE under low-resource conditions. RMPL incorporates heterogeneous supervision from unimodal event extraction and multimedia relation extraction with stage-wise training. The model is first trained with a unified schema to learn shared event-centric representations across modalities. It is then fine-tuned for event mention identification and argument role extraction using mixed textual and visual data. Experiments on the M2E2 benchmark with multiple VLMs show consistent improvements across different modality settings.

Title: OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum

Authors: Yangyang Zhang, Zilong Wang, Jianbo Xu, Yongqi Chen, Chu Han, Zhihao Zhang, Shuai Liu, Hui Li, Huiping Zhang, Ziqi Liu, Jiaxin Chen, Jun Zhu, Zheng Feng, Hao Wen, Xingzhu Ju, Yanping Zhong, Yunqiu Zhang, Jie Duan, Jun Li, Dongsheng Li, Weijie Wang, Haiyan Zhu, Wei Jiang, Xiaohua Wu, Shuo Wang, Haiming Li, Qinhao Guo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13793
Pdf URL: https://arxiv.org/pdf/2602.13793
Copy Paste: [[2602.13793]] OMGs: A multi-agent system supporting MDT decision-making across the ovarian tumour care continuum(https://arxiv.org/abs/2602.13793)
Keywords: agent
Abstract: Ovarian tumour management has increasingly relied on multidisciplinary tumour board (MDT) deliberation to address treatment complexity and disease heterogeneity. However, most patients worldwide lack access to timely expert consensus, particularly in resource-constrained centres where MDT resources are scarce or unavailable. Here we present OMGs (Ovarian tumour Multidisciplinary intelligent aGent System), a multi-agent AI framework where domain-specific agents deliberate collaboratively to integrate multidisciplinary evidence and generate MDT-style recommendations with transparent rationales. To systematically evaluate MDT recommendation quality, we developed SPEAR (Safety, Personalization, Evidence, Actionability, Robustness) and validated OMGs across diverse clinical scenarios spanning the care continuum. In multicentre re-evaluation, OMGs achieved performance comparable to expert MDT consensus ($4.45 \pm 0.30$ versus $4.53 \pm 0.23$), with higher Evidence scores (4.57 versus 3.92). In prospective multicentre evaluation (59 patients), OMGs demonstrated high concordance with routine MDT decisions. Critically, in paired human-AI studies, OMGs most substantially enhanced clinicians' recommendations in Evidence and Robustness, the dimensions most compromised when multidisciplinary expertise is unavailable. These findings suggest that multi-agent deliberative systems can achieve performance comparable to expert MDT consensus, with potential to expand access to specialized oncology expertise in resource-limited settings.

Title: Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind

Authors: Minyuan Ruan, Ziyue Wang, Kaiming Liu, Yunghwei Lai, Peng Li, Yang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13832
Pdf URL: https://arxiv.org/pdf/2602.13832
Copy Paste: [[2602.13832]] Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind(https://arxiv.org/abs/2602.13832)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks to assist human users. However, they still struggle to comprehend and respond to true user needs when intentions and instructions are imprecisely conveyed, leading to a divergence between subjective user beliefs and true environment states. Resolving this epistemic divergence requires Theory of Mind (ToM), yet existing ToM evaluations for LLMs primarily focus on isolated belief inference, overlooking its functional utility in real-world interaction. To this end, we formalize ToM for LLMs as a mechanism for epistemic divergence detection and resolution, and propose a benchmark, \benchname, to assess how models reconcile user beliefs and profiles in practice. Results across 11 leading models reveal a significant limitation in identifying the underlying cognitive gaps that impede task success. To bridge this gap, we further curate a trajectory-based ToM dataset linking belief tracking with task-related state inference. The model trained on this data via reinforcement learning shows consistent improvement in reasoning about user mental states, leading to enhanced downstream performance. Our work highlights the practical value of ToM as an essential interaction-level mechanism rather than as a standalone reasoning skill.

Title: Speculative Decoding with a Speculative Vocabulary

Authors: Miles Williams, Young D. Kwon, Rui Li, Alexandros Kouris, Stylianos I. Venieris
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13836
Pdf URL: https://arxiv.org/pdf/2602.13836
Copy Paste: [[2602.13836]] Speculative Decoding with a Speculative Vocabulary(https://arxiv.org/abs/2602.13836)
Keywords: language model
Abstract: Speculative decoding has rapidly emerged as a leading approach for accelerating language model (LM) inference, as it offers substantial speedups while yielding identical outputs. This relies upon a small draft model, tasked with predicting the outputs of the target model. State-of-the-art speculative decoding methods use a draft model consisting of a single decoder layer and output embedding matrix, with the latter dominating drafting time for the latest LMs. Recent work has sought to address this output distribution bottleneck by reducing the vocabulary of the draft model. Although this can improve throughput, it compromises speculation effectiveness when the target token is out-of-vocabulary. In this paper, we argue for vocabulary speculation as an alternative to a reduced vocabulary. We propose SpecVocab, an efficient and effective method that selects a vocabulary subset per decoding step. Across a variety of tasks, we demonstrate that SpecVocab can achieve a higher acceptance length than the state-of-the-art speculative decoding approach, EAGLE-3. Notably, this yields up to an 8.1% increase in average throughput over EAGLE-3.

Title: PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training

Authors: Yuhan Cheng, Hancheng Ye, Hai Helen Li, Jingwei Sun, Yiran Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13840
Pdf URL: https://arxiv.org/pdf/2602.13840
Copy Paste: [[2602.13840]] PrivAct: Internalizing Contextual Privacy Preservation via Multi-Agent Preference Training(https://arxiv.org/abs/2602.13840)
Keywords: language model, llm, agent
Abstract: Large language model (LLM) agents are increasingly deployed in personalized tasks involving sensitive, context-dependent information, where privacy violations may arise in agents' actions due to the implicitness of contextual privacy. Existing approaches rely on external, inference-time interventions which are brittle, scenario-specific, and may expand the privacy attack surface. We propose PrivAct, a contextual privacy-aware multi-agent learning framework that internalizes contextual privacy preservation directly into models' generation behavior for privacy-compliant agentic actions. By embedding privacy preferences into each agent, PrivAct enhances system-wide contextual integrity while achieving a more favorable privacy-helpfulness tradeoff. Experiments across multiple LLM backbones and benchmarks demonstrate consistent improvements in contextual privacy preservation, reducing leakage rates by up to 12.32% while maintaining comparable helpfulness, as well as zero-shot generalization and robustness across diverse multi-agent topologies. Code is available at this https URL.
摘要：大型语言模型（LLM）代理越来越多地部署在涉及敏感、上下文相关信息的个性化任务中，由于上下文隐私的隐含性，代理的行为可能会出现隐私侵犯。现有的方法依赖于外部的推理时间干预，这些干预是脆弱的、特定于场景的，并且可能会扩大隐私攻击面。我们提出了 PrivAct，一种上下文隐私感知的多代理学习框架，它将上下文隐私保护直接内化到模型的生成行为中，以实现符合隐私的代理操作。通过将隐私偏好嵌入到每个代理中，PrivAct 增强了系统范围的上下文完整性，同时实现了更有利的隐私-有用性权衡。跨多个 LLM 主干和基准的实验表明，上下文隐私保护方面得到了一致的改进，将泄漏率降低了 12.32%，同时保持了相当的有用性，以及跨不同多代理拓扑的零样本泛化和鲁棒性。代码可从此 https URL 获取。

Title: Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe

Authors: Somnath Banerjee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13860
Pdf URL: https://arxiv.org/pdf/2602.13860
Copy Paste: [[2602.13860]] Tutoring Large Language Models to be Domain-adaptive, Precise, and Safe(https://arxiv.org/abs/2602.13860)
Keywords: language model, llm
Abstract: The overarching research direction of this work is the development of a "Responsible Intelligence" framework designed to reconcile the immense generative power of Large Language Models (LLMs) with the stringent requirements of real-world deployment. As these models become a transformative force in artificial intelligence, there is an urgent need to move beyond general-purpose architectures toward systems that are contextually aware, inherently safer, and deeply respectful of global cultural nuances. This research navigates three interconnected threads: domain adaptation to ensure technical precision, ethical rigor to mitigate adversarial vulnerabilities, and cultural/multilingual alignment to promote global inclusivity. The methodological trajectory moves from classical supervised adaptation for task-specific demands to decoding-time alignment for safety, finally leveraging human feedback and preference modeling to achieve sociolinguistic acuity.
摘要：这项工作的总体研究方向是开发“负责任的智能”框架，旨在协调大型语言模型（LLM）的巨大生成能力与现实世界部署的严格要求。随着这些模型成为人工智能领域的变革力量，迫切需要超越通用架构，转向具有上下文感知、本质上更安全且高度尊重全球文化细微差别的系统。这项研究涵盖了三个相互关联的线索：领域适应以确保技术精度，道德严谨性以减轻对抗性漏洞，以及文化/多语言协调以促进全球包容性。方法论轨迹从针对特定任务需求的经典监督适应转向为了安全而进行的解码时间对齐，最后利用人类反馈和偏好建模来实现社会语言学敏锐度。

Title: Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages

Authors: Somnath Banerjee, Rima Hazra, Animesh Mukherjee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13867
Pdf URL: https://arxiv.org/pdf/2602.13867
Copy Paste: [[2602.13867]] Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages(https://arxiv.org/abs/2602.13867)
Keywords: language model, llm
Abstract: Large language models (LLMs) are being deployed across the Global South, where everyday use involves low-resource languages, code-mixing, and culturally specific norms. Yet safety pipelines, benchmarks, and alignment still largely target English and a handful of high-resource languages, implicitly assuming safety and factuality "transfer" across languages. Evidence increasingly shows they do not. We synthesize recent findings indicating that (i) safety guardrails weaken sharply on low-resource and code-mixed inputs, (ii) culturally harmful behavior can persist even when standard toxicity scores look acceptable, and (iii) English-only knowledge edits and safety patches often fail to carry over to low-resource languages. In response, we outline a practical agenda for researchers and students in the Global South: parameter-efficient safety steering, culturally grounded evaluation and preference data, and participatory workflows that empower local communities to define and mitigate harm. Our aim is to make multilingual safety a core requirement, not an add-on, for equitable AI in underrepresented regions.
摘要：大型语言模型 (LLM) 正在全球南方部署，这些地区的日常使用涉及资源匮乏的语言、代码混合和文化特定规范。然而，安全管道、基准和一致性仍然主要针对英语和少数高资源语言，隐含地假设安全性和事实性跨语言“转移”。越来越多的证据表明事实并非如此。我们综合了最近的研究结果，表明（i）安全护栏在低资源和代码混合输入上急剧减弱，（ii）即使标准毒性分数看起来可以接受，文化上有害的行为也可能持续存在，以及（iii）纯英语知识编辑和安全补丁通常无法延续到低资源语言。作为回应，我们为南半球的研究人员和学生概述了一个实用议程：参数高效的安全指导、基于文化的评估和偏好数据，以及使当地社区能够定义和减轻伤害的参与式工作流程。我们的目标是使多语言安全成为核心要求，而不是在代表性不足的地区实现公平人工智能的附加条件。

Title: ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics

Authors: Hend Al-Khalifa, Nadia Ghezaiel, Maria Bounnit, Hend Hamed Alhazmi, Noof Abdullah Alfear, Reem Fahad Alqifari, Ameera Masoud Almasoud, Sharefah Ahmed Al-Ghamdi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13870
Pdf URL: https://arxiv.org/pdf/2602.13870
Copy Paste: [[2602.13870]] ADAB: Arabic Dataset for Automated Politeness Benchmarking -- A Large-Scale Resource for Computational Sociopragmatics(https://arxiv.org/abs/2602.13870)
Keywords: language model
Abstract: The growing importance of culturally-aware natural language processing systems has led to an increasing demand for resources that capture sociopragmatic phenomena across diverse languages. Nevertheless, Arabic-language resources for politeness detection remain under-explored, despite the rich and complex politeness expressions embedded in Arabic communication. In this paper, we introduce ADAB (Arabic Politeness Dataset), a new annotated Arabic dataset collected from four online platforms, including social media, e-commerce, and customer service domains, covering Modern Standard Arabic and multiple dialects (Gulf, Egyptian, Levantine, and Maghrebi). The dataset was annotated based on Arabic linguistic traditions and pragmatic theory, resulting in three classes: polite, impolite, and neutral. It contains 10,000 samples with linguistic feature annotations across 16 politeness categories and achieves substantial inter-annotator agreement (kappa = 0.703). We benchmark 40 model configurations, including traditional machine learning, transformer-based models, and large language models. The dataset aims to support research on politeness-aware Arabic NLP.
摘要：具有文化意识的自然语言处理系统的重要性日益增加，导致对捕获不同语言的社会语用现象的资源的需求不断增加。然而，尽管阿拉伯语交流中嵌入了丰富而复杂的礼貌表达，但用于礼貌检测的阿拉伯语资源仍未得到充分探索。在本文中，我们介绍了 ADAB（阿拉伯语礼貌数据集），这是一个从社交媒体、电子商务和客户服务领域等四个在线平台收集的新带注释的阿拉伯语数据集，涵盖现代标准阿拉伯语和多种方言（海湾语、埃及语、黎凡特语和马格里比语）。该数据集根据阿拉伯语言传统和实用理论进行注释，分为三个类别：礼貌、不礼貌和中立。它包含 10,000 个具有跨 16 个礼貌类别的语言特征注释的样本，并实现了实质性的注释者间一致性（kappa = 0.703）。我们对 40 种模型配置进行了基准测试，包括传统机器学习、基于 Transformer 的模型和大型语言模型。该数据集旨在支持礼貌意识阿拉伯语 NLP 的研究。
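The agreement figure quoted above (kappa = 0.703) is a chance-corrected statistic. The abstract does not say which kappa variant was used, so the following is only a minimal sketch of Cohen's kappa for two annotators, with made-up labels over the three classes; the function name `cohens_kappa` and the sample annotations are assumptions for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations (P = polite, N = neutral, I = impolite).
annotator_a = ["P", "P", "P", "P", "N", "N", "N", "I", "I", "I"]
annotator_b = ["P", "P", "P", "N", "N", "N", "I", "I", "I", "I"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

In this toy sample the annotators agree on 8 of 10 items, but about a third of that agreement is expected by chance, which is why kappa comes out lower than the raw agreement rate.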

Title: Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach

Authors: Amir Hossein Mohammadi, Ali Moeinian, Zahra Razavizade, Afsaneh Fatemi, Reza Ramezani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13890
Pdf URL: https://arxiv.org/pdf/2602.13890
Copy Paste: [[2602.13890]] Evaluating Prompt Engineering Techniques for RAG in Small Language Models: A Multi-Hop QA Approach(https://arxiv.org/abs/2602.13890)
Keywords: language model, prompt, retrieval augmented generation
Abstract: Retrieval Augmented Generation (RAG) is a powerful approach for enhancing the factual grounding of language models by integrating external knowledge. While widely studied for large language models, the optimization of RAG for Small Language Models (SLMs) remains a critical research gap, particularly in complex, multi-hop question-answering tasks that require sophisticated reasoning. In these systems, prompt template design is a crucial yet under-explored factor influencing performance. This paper presents a large-scale empirical study to investigate this factor, evaluating 24 different prompt templates on the HotpotQA dataset. The set includes a standard RAG prompt, nine well-formed techniques from the literature, and 14 novel hybrid variants, all tested on two prominent SLMs: Qwen2.5-3B Instruct and Gemma3-4B-It. Our findings, based on a test set of 18,720 instances, reveal significant performance gains of up to 83% on Qwen2.5 and 84.5% on Gemma3-4B-It, yielding an improvement of up to 6% for both models compared to the Standard RAG prompt. This research also offers concrete analysis and actionable recommendations for designing effective and efficient prompts for SLM-based RAG systems, particularly for deployment in resource-constrained environments.
摘要：检索增强生成（RAG）是一种通过整合外部知识来增强语言模型事实基础的强大方法。尽管对大型语言模型进行了广泛研究，但针对小语言模型 (SLM) 的 RAG 优化仍然是一个关键的研究空白，特别是在需要复杂推理的复杂、多跳问答任务中。在这些系统中，提示模板设计是影响性能的一个关键但尚未充分探索的因素。本文提出了一项大规模实证研究来调查这一因素，评估了 HotpotQA 数据集上的 24 种不同的提示模板。该套件包括一个标准的 RAG 提示、来自文献的 9 种结构良好的技术以及 14 种新颖的混合变体，所有这些都在两个著名的 SLM 上进行了测试：Qwen2.5-3B Instruct 和 Gemma3-4B-It。我们的研究结果基于 18720 个实例的测试集，结果表明，Qwen2.5 上的性能显着提升了 83%，Gemma3-4B-It 上的性能提升了 84.5%，与标准 RAG 提示相比，这两个模型的性能提升了高达 6%。这项研究还提供了具体的分析和可行的建议，用于为基于 SLM 的 RAG 系统设计有效且高效的提示，实际上可以在资源有限的环境中进行部署。

Title: HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam

Authors: Weiqi Zhai, Zhihai Wang, Jinghang Wang, Boyu Yang, Xiaogang Li, Xiang Xu, Bohan Wang, Peng Wang, Xingzhe Wu, Anfeng Li, Qiyuan Feng, Yuhao Zhou, Shoulin Han, Wenjie Luo, Yiyuan Li, Yaxuan Wang, Ruixian Luo, Guojie Lin, Peiyao Xiao, Chengliang Xu, Ben Wang, Zeyu Wang, Zichao Chen, Jianan Ye, Yijie Hu, Jialong Chen, Zongwen Shen, Yuliang Xu, An Yang, Bowen Yu, Dayiheng Liu, Junyang Lin, Hu Wei, Que Shen, Bing Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13964
Pdf URL: https://arxiv.org/pdf/2602.13964
Copy Paste: [[2602.13964]] HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam(https://arxiv.org/abs/2602.13964)
Keywords: language model
Abstract: Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models on challenging, multi-domain questions. However, community-led analyses have raised concerns that HLE contains a non-trivial number of noisy items, which can bias evaluation results and distort cross-model comparisons. To address this challenge, we introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and fine-grained error taxonomy. Our construction follows a two-stage validation-and-repair workflow resulting in a certified benchmark. In Stage I, each item undergoes binary validation of the problem and final answer through domain-expert review and model-based cross-checks, yielding 641 verified items. In Stage II, flawed but fixable items are revised under strict constraints preserving the original evaluation intent, through dual independent expert repairs, model-assisted auditing, and final adjudication, resulting in 1,170 revised-and-certified items. The remaining 689 items are released as a documented uncertain set with explicit uncertainty sources and expertise tags for future refinement. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7-10 percentage points on HLE-Verified. The improvement is particularly pronounced on items where the original problem statement and/or reference answer is erroneous, with gains of 30-40 percentage points. Our analyses further reveal a strong association between model confidence and the presence of errors in the problem statement or reference answer, supporting the effectiveness of our revisions. Overall, HLE-Verified improves HLE-style evaluations by reducing annotation noise and enabling more faithful measurement of model capabilities. Data is available at: this https URL
摘要：Humanity's Last Exam (HLE) 已成为广泛使用的基准，用于评估具有挑战性的多领域问题的前沿大型语言模型。然而，社区主导的分析引起了人们的担忧，即 HLE 包含大量噪声项目，这可能会使评估结果产生偏差并扭曲跨模型比较。为了应对这一挑战，我们推出了 HLE-Verified，这是 HLE 的经过验证和修订的版本，具有透明的验证协议和细粒度的错误分类法。我们的施工遵循两阶段验证和修复工作流程，从而获得了经过认证的基准。在第一阶段，每个项目都通过领域专家审查和基于模型的交叉检查对问题和最终答案进行二进制验证，产生 641 个经过验证的项目。第二阶段，通过双独立专家修复、模型辅助审核、终审等方式，在严格约束下，在保留原评价意图的前提下，对存在缺陷但可修复的项目进行修订，共修订认证项目1170项。其余 689 个项目作为记录的不确定性集发布，具有明确的不确定性来源和专业知识标签，以供未来改进。我们在 HLE 和 HLE-Verified 上评估了七种最先进的语言模型，观察到 HLE-Verified 上的平均绝对准确度提高了 7--10 个百分点。对于原始问题陈述和/或参考答案错误的项目，改进尤其明显，提高了 30--40 个百分点。我们的分析进一步揭示了模型置信度与问题陈述或参考答案中存在的错误之间的密切关联，支持了我们修订的有效性。总体而言，HLE-Verified 通过减少注释噪声并实现更忠实地测量模型功能来改进 HLE 式评估。数据可在以下位置获取：此 https URL
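The three item counts reported in the abstract above can be tallied to see how the benchmark is partitioned. This is a small arithmetic sketch using only the figures quoted in the abstract; the dictionary keys are informal labels chosen here, not names from the paper.

```python
# Item counts as quoted in the abstract.
partition = {
    "verified": 641,                # Stage I: passed validation as-is
    "revised_and_certified": 1170,  # Stage II: repaired and certified
    "uncertain": 689,               # released as a documented uncertain set
}

total = sum(partition.values())
shares = {name: count / total for name, count in partition.items()}

for name, count in partition.items():
    print(f"{name}: {count} items ({shares[name]:.1%})")
print("total:", total)
```

The counts sum to 2,500 items, of which a little over a quarter remain in the uncertain set.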

Title: Chain-of-Thought Reasoning with Large Language Models for Clinical Alzheimer's Disease Assessment and Diagnosis

Authors: Tongze Zhang, Jun-En Ding, Melik Ozolcer, Fang-Ming Hung, Albert Chih-Chieh Yang, Feng Liu, Yi-Rou Ji, Sang Won Bae
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13979
Pdf URL: https://arxiv.org/pdf/2602.13979
Copy Paste: [[2602.13979]] Chain-of-Thought Reasoning with Large Language Models for Clinical Alzheimer's Disease Assessment and Diagnosis(https://arxiv.org/abs/2602.13979)
Keywords: language model, llm, chain-of-thought
Abstract: Alzheimer's disease (AD) has become a prevalent neurodegenerative disease worldwide. Traditional diagnosis still relies heavily on medical imaging and clinical assessment by physicians, which is often time-consuming and resource-intensive in terms of both human expertise and healthcare resources. In recent years, large language models (LLMs) have been increasingly applied to the medical field using electronic health records (EHRs), yet their application in Alzheimer's disease assessment remains limited, particularly given that AD involves complex multifactorial etiologies that are difficult to observe directly through imaging modalities. In this work, we propose leveraging LLMs to perform Chain-of-Thought (CoT) reasoning on patients' clinical EHRs. Unlike direct fine-tuning of LLMs on EHR data for AD classification, our approach utilizes LLM-generated CoT reasoning paths to provide the model with explicit diagnostic rationale for AD assessment, followed by structured CoT-based predictions. This pipeline not only enhances the model's ability to diagnose intrinsically complex factors but also improves the interpretability of the prediction process across different stages of AD progression. Experimental results demonstrate that the proposed CoT-based diagnostic framework significantly enhances stability and diagnostic performance across multiple CDR grading tasks, achieving up to a 15% improvement in F1 score compared to the zero-shot baseline method.
摘要：阿尔茨海默病（AD）已成为世界范围内流行的神经退行性疾病。传统诊断仍然严重依赖医生的医学影像和临床评估，这在人力专业知识和医疗资源方面往往耗时且耗费资源。近年来，大语言模型（LLM）越来越多地应用于电子健康记录（EHR）的医疗领域，但其在阿尔茨海默病评估中的应用仍然有限，特别是考虑到 AD 涉及复杂的多因素病因，难以通过成像方式直接观察。在这项工作中，我们建议利用 LLM 对患者的临床 EHR 进行思想链 (CoT) 推理。与针对 AD 分类的 EHR 数据直接微调 LLM 不同，我们的方法利用 LLM 生成的 CoT 推理路径为模型提供用于 AD 评估的明确诊断原理，然后进行基于 CoT 的结构化预测。该流程不仅增强了模型诊断本质上复杂因素的能力，而且还提高了 AD 进展不同阶段的预测过程的可解释性。实验结果表明，所提出的基于 CoT 的诊断框架显着增强了多个 CDR 分级任务的稳定性和诊断性能，与零样本基线方法相比，F1 分数提高了 15%。

Title: The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective

Authors: Ali Zahedzadeh, Behnam Bahrak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14002
Pdf URL: https://arxiv.org/pdf/2602.14002
Copy Paste: [[2602.14002]] The Sufficiency-Conciseness Trade-off in LLM Self-Explanation from an Information Bottleneck Perspective(https://arxiv.org/abs/2602.14002)
Keywords: language model, llm
Abstract: Large Language Models increasingly rely on self-explanations, such as chain-of-thought reasoning, to improve performance on multi-step question answering. While these explanations enhance accuracy, they are often verbose and costly to generate, raising the question of how much explanation is truly necessary. In this paper, we examine the trade-off between sufficiency, defined as the ability of an explanation to justify the correct answer, and conciseness, defined as the reduction in explanation length. Building on the information bottleneck principle, we conceptualize explanations as compressed representations that retain only the information essential for producing correct answers. To operationalize this view, we introduce an evaluation pipeline that constrains explanation length and assesses sufficiency using multiple language models on the ARC Challenge dataset. To broaden the scope, we conduct experiments in both English, using the original dataset, and Persian, as a resource-limited language through translation. Our experiments show that more concise explanations often remain sufficient, preserving accuracy while substantially reducing explanation length, whereas excessive compression leads to performance degradation.
摘要：大型语言模型越来越依赖自我解释，例如思想链推理，以提高多步骤问答的性能。虽然这些解释提高了准确性，但它们通常很冗长且生成成本高昂，这就提出了一个问题：有多少解释是真正必要的。在本文中，我们研究了充分性（定义为解释证明正确答案的能力）与简洁性（定义为解释长度的减少）之间的权衡。基于信息瓶颈原则，我们将解释概念化为压缩表示，仅保留生成正确答案所必需的信息。为了将这一观点付诸实践，我们引入了一个评估管道，该管道限制解释长度并在 ARC Challenge 数据集上使用多种语言模型评估充分性。为了扩大范围，我们使用原始数据集用英语进行实验，并通过翻译将波斯语作为资源有限的语言进行实验。我们的实验表明，更简洁的解释通常仍然足够，在保持准确性的同时大大减少解释长度，而过度压缩会导致性能下降。

Title: GRRM: Group Relative Reward Modeling for Machine Translation

Authors: Sen Yang, Shanbo Cheng, Lu Xu, Jianbing Zhang, Shujian Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14028
Pdf URL: https://arxiv.org/pdf/2602.14028
Copy Paste: [[2602.14028]] GRRM: Group Relative Reward Modeling for Machine Translation(https://arxiv.org/abs/2602.14028)
Keywords: llm
Abstract: While Group Relative Policy Optimization (GRPO) offers a powerful framework for LLM post-training, its effectiveness in open-ended domains like Machine Translation hinges on accurate intra-group ranking. We identify that standard Scalar Quality Metrics (SQM) fall short in this context; by evaluating candidates in isolation, they lack the comparative context necessary to distinguish fine-grained linguistic nuances. To address this, we introduce the Group Quality Metric (GQM) paradigm and instantiate it via the Group Relative Reward Model (GRRM). Unlike traditional independent scorers, GRRM processes the entire candidate group jointly, leveraging comparative analysis to rigorously resolve relative quality and adaptive granularity. Empirical evaluations confirm that GRRM achieves competitive ranking accuracy among all baselines. Building on this foundation, we integrate GRRM into the GRPO training loop to optimize the translation policy. Experimental results demonstrate that our framework not only improves general translation quality but also unlocks reasoning capabilities comparable to state-of-the-art reasoning models. We release codes, datasets, and model checkpoints at this https URL.
摘要：虽然群组相对策略优化 (GRPO) 为 LLM 后期培训提供了强大的框架，但其在机器翻译等开放式领域的有效性取决于准确的群组内排名。我们发现标准标量质量指标 (SQM) 在这方面存在不足；通过孤立地评估候选人，他们缺乏区分细粒度语言细微差别所必需的比较背景。为了解决这个问题，我们引入了群体质量度量（GQM）范式，并通过群体相对奖励模型（GRRM）将其实例化。与传统的独立评分者不同，GRRM 联合处理整个候选组，利用比较分析来严格解决相对质量和自适应粒度。实证评估证实 GRRM 在所有基线中实现了有竞争力的排名准确性。在此基础上，我们将 GRRM 集成到 GRPO 训练循环中，以优化翻译策略。实验结果表明，我们的框架不仅提高了总体翻译质量，而且还释放了与最先进的推理模型相当的推理能力。我们在此 https URL 发布代码、数据集和模型检查点。
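The reliance of GRPO on intra-group ranking, noted in the abstract above, follows from how advantages are computed: each candidate's reward is normalized against the other candidates sampled for the same prompt. Below is a minimal sketch of that commonly described normalization. It is not the paper's reward model; the function name, the `eps` constant, the choice of population standard deviation, and the sample rewards are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for three candidate translations of one source sentence.
rewards = [1.0, 2.0, 3.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

Because only differences within the group matter, a scorer that mis-orders close candidates flips the sign of their advantages, which is the weakness of isolated scalar scoring that the abstract points to.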

Title: Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness

Authors: Pietro Bernardelle, Stefano Civelli, Kevin Roitero, Gianluca Demartini
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14044
Pdf URL: https://arxiv.org/pdf/2602.14044
Copy Paste: [[2602.14044]] Context Shapes LLMs Retrieval-Augmented Fact-Checking Effectiveness(https://arxiv.org/abs/2602.14044)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) show strong reasoning abilities across diverse tasks, yet their performance on extended contexts remains inconsistent. While prior research has emphasized mid-context degradation in question answering, this study examines the impact of context in LLM-based fact verification. Using three datasets (HOVER, FEVEROUS, and ClimateFEVER) and five open-source models across different parameter sizes (7B, 32B and 70B parameters) and model families (Llama-3.1, Qwen2.5 and Qwen3), we evaluate both parametric factual knowledge and the impact of evidence placement across varying context lengths. We find that LLMs exhibit non-trivial parametric knowledge of factual claims and that their verification accuracy generally declines as context length increases. Consistent with previous work, in-context evidence placement plays a critical role, with accuracy being consistently higher when relevant evidence appears near the beginning or end of the prompt and lower when placed mid-context. These results underscore the importance of prompt structure in retrieval-augmented fact-checking systems.
摘要：大型语言模型（LLM）在不同的任务中表现出强大的推理能力，但它们在扩展上下文中的表现仍然不一致。虽然之前的研究强调问答中的中情境退化，但本研究探讨了情境对基于 LLM 的事实验证的影响。使用三个数据集（HOVER、FEVEROUS 和 ClimateFEVER）和五个跨不同参数大小（7B、32B 和 70B 参数）和模型系列（Llama-3.1、Qwen2.5 和 Qwen3）的开源模型，我们评估了参数事实知识和不同上下文长度的证据放置的影响。我们发现 LLM 表现出对事实主张的重要参数知识，并且它们的验证准确性通常随着上下文长度的增加而下降。与之前的作品中所显示的类似，上下文中的证据放置起着至关重要的作用，当相关证据出现在提示的开头或结尾附近时，准确性始终较高，而当放置在上下文中间时，准确性会较低。这些结果强调了提示结构在检索增强事实检查系统中的重要性。
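The evidence-placement experiment described in the abstract above amounts to varying where the relevant passage sits among distractors. The following is a minimal sketch of such a prompt builder, not the authors' setup: the function name `build_prompt`, the passage numbering, the instruction wording, and the placeholder passages are all hypothetical.

```python
def build_prompt(claim, evidence, distractors, position):
    """Place the evidence passage at the start, middle, or end of the context."""
    passages = list(distractors)
    index = {"start": 0, "middle": len(passages) // 2, "end": len(passages)}[position]
    passages.insert(index, evidence)
    context = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return f"{context}\n\nClaim: {claim}\nAnswer SUPPORTED or REFUTED."


# Hypothetical inputs: one relevant passage and four irrelevant ones.
distractors = ["d1", "d2", "d3", "d4"]
prompts = {
    pos: build_prompt("example claim", "EVIDENCE", distractors, pos)
    for pos in ("start", "middle", "end")
}
print(prompts["middle"])
```

Holding the claim and passages fixed while changing only `position` isolates the placement effect from context length, which can then be varied separately by changing the number of distractors.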

Title: LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation

Authors: Jizheng Chen, Weiming Zhang, Xinyi Dai, Weiwen Liu, Kounianhua Du, Yasheng Wang, Ruiming Tang, Yong Yu, Weinan Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14054
Pdf URL: https://arxiv.org/pdf/2602.14054
Copy Paste: [[2602.14054]] LogitsCoder: Towards Efficient Chain-of-Thought Path Search via Logits Preference Decoding for Code Generation(https://arxiv.org/abs/2602.14054)
Keywords: chain-of-thought
Abstract: Code generation remains a challenging task that requires precise and structured reasoning. Existing Test Time Scaling (TTS) methods, including structured tree search, have made progress in exploring reasoning paths but still face two major challenges: (1) underthinking, where reasoning chains tend to be shallow and fail to capture the full complexity of problems; and (2) overthinking, where overly verbose reasoning leads to inefficiency and increased computational costs. To address these issues, we propose LogitsCoder, a novel framework that enhances chain-of-thought reasoning through lightweight, logit-level control mechanisms for code generation. LogitsCoder iteratively generates and refines reasoning steps by first steering token selection toward statistically preferred patterns via Logits Preference Decoding, then selecting and aggregating diverse reasoning paths using Logits Rank Based Path Selection and Thoughts Aggregation. This results in coherent and effective reasoning chains that balance depth and efficiency. Extensive experiments demonstrate that LogitsCoder produces more efficient and higher-quality reasoning chains, leading to superior code generation performance compared to baseline methods.
摘要：代码生成仍然是一项具有挑战性的任务，需要精确和结构化的推理。现有的测试时间缩放（TTS）方法，包括结构化树搜索，在探索推理路径方面取得了进展，但仍然面临两大挑战：（1）思考不足，推理链往往很浅，无法捕捉问题的全部复杂性； (2)过度思考，过于冗长的推理会导致效率低下并增加计算成本。为了解决这些问题，我们提出了 LogitsCoder，这是一种新颖的框架，它通过轻量级的代码生成 logit 级控制机制来增强思想链推理。 LogitsCoder 首先通过 Logits 偏好解码将标记选择引导至统计首选模式，然后使用基于 Logits 排名的路径选择和想法聚合来选择和聚合不同的推理路径，从而迭代地生成和细化推理步骤。这会产生连贯有效的推理链，平衡深度和效率。大量实验表明，LogitsCoder 可以生成更高效、更高质量的推理链，与基线方法相比，具有更出色的代码生成性能。

Title: LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts

Authors: Yang Liu, Jiaye Yang, Weikang Li, Jiahui Liang, Yang Li, Lingyong Yan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14060
Pdf URL: https://arxiv.org/pdf/2602.14060
Copy Paste: [[2602.14060]] LM-Lexicon: Improving Definition Modeling via Harmonizing Semantic Experts(https://arxiv.org/abs/2602.14060)
Keywords: language model
Abstract: We introduce LM-Lexicon, an innovative definition modeling approach that incorporates data clustering, semantic expert learning, and model merging using a sparse mixture-of-experts architecture. By decomposing the definition modeling task into specialized semantic domains, where small language models are trained as domain experts, LM-Lexicon achieves substantial improvements (+7% BLEU score compared with the prior state-of-the-art model) over existing methods on five widely used benchmarks. Empirically, we demonstrate that 1) the clustering strategy enables fine-grained expert specialization with nearly 10% improvement in definition quality; 2) the semantic-aware domain-level routing mechanism achieves higher expert efficacy (+1%) than conventional token-level routing; and 3) further performance gains can be obtained through test-time compute and semantic expert scaling. Our work advances definition modeling while providing insights into the development of efficient language models for semantic-intensive applications.
摘要：我们引入了 LM-Lexicon，这是一种创新的定义建模方法，它结合了数据聚类、语义专家学习和使用稀疏专家混合架构的模型合并。通过将定义建模任务分解为专门的语义领域，其中小型语言模型被训练为领域专家，LM-Lexicon 在五个广泛使用的基准上比现有方法取得了显着的改进（与先前最先进的模型相比，BLEU 分数增加了 7%）。根据经验，我们证明：1）聚类策略可以实现细粒度的专家专业化，定义质量提高近 10%； 2）语义感知的域级路由机制比传统的令牌级路由实现了更高的专家效率（+1%）； 3）通过测试时计算和语义专家扩展可以获得进一步的性能提升。我们的工作推进了定义建模，同时为语义密集型应用程序的高效语言模型的开发提供了见解。

Title: From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset

Authors: Jandad Jahani, Mursal Dawodi, Jawid Ahmad Baktash
Subjects: cs.CL, cs.SD
Abstract URL: https://arxiv.org/abs/2602.14062
Pdf URL: https://arxiv.org/pdf/2602.14062
Copy Paste: [[2602.14062]] From Scarcity to Scale: A Release-Level Analysis of the Pashto Common Voice Dataset(https://arxiv.org/abs/2602.14062)
Keywords: prompt
Abstract: Large, openly licensed speech datasets are essential for building automatic speech recognition (ASR) systems, yet many widely spoken languages remain underrepresented in public resources. Pashto, spoken by more than 60 million people, has historically lacked large-scale openly licensed speech data suitable for modern ASR development. This paper presents a release-level analysis of the Pashto component of the Mozilla Common Voice corpus, focusing on version 24.0 (December 2025) and contextualizing trends across major releases. We document rapid growth from 1.49 recorded hours in mid-2023 to 2,768.7 total hours in 2025, including 975.89 validated hours available for supervised ASR training. Beyond scale, we analyze validation throughput, contributor participation inequality, demographic metadata completeness, and sentence-level concentration in the validated subset. We find that participation is extremely concentrated (Gini = 0.941), age representation is strongly skewed toward young adults, and 41.97% of clips lack self-reported gender labels, limiting subgroup auditing based on metadata. At the textual level, prompt reuse is moderate: 35.88% of unique sentences account for 50% of validated clips, suggesting that structural concentration is driven primarily by uneven contributor activity rather than dominance of a small prompt set. These results provide a quantitative audit of a rapidly scaling low-resource speech corpus and highlight practical priorities for improving dataset maturity, including expanded validation capacity and broader demographic participation.
摘要：大型、公开许可的语音数据集对于构建自动语音识别 (ASR) 系统至关重要，但许多广泛使用的语言在公共资源中的代表性仍然不足。普什图语的使用人数超过 6000 万人，历史上一直缺乏适合现代 ASR 开发的大规模公开许可的语音数据。本文对 Mozilla 通用语音语料库的普什图语组件进行了版本级别分析，重点关注版本 24.0（2025 年 12 月）以及主要版本的背景趋势。我们记录了从 2023 年中期记录的 1.49 小时到 2025 年总小时数 2,768.7 小时的快速增长，其中包括 975.89 小时可用于监督 ASR 培训的经过验证的小时。除了规模之外，我们还分析了验证吞吐量、贡献者参与不平等、人口统计元数据完整性以及已验证子集中的句子级集中度。我们发现参与度极其集中（基尼系数 = 0.941），年龄代表性严重偏向年轻人，并且 41.97% 的剪辑缺乏自我报告的性别标签，限制了基于元数据的分组审核。在文本层面，提示重用程度适中：35.88% 的独特句子占已验证剪辑的 50%，这表明结构集中主要是由不均匀的贡献者活动驱动的，而不是小提示集的主导地位。这些结果提供了对快速扩展的低资源语音语料库的定量审核，并强调了提高数据集成熟度的实际优先事项，包括扩大验证能力和更广泛的人口参与。
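The participation inequality figure quoted above (Gini = 0.941) can be read against a direct computation of the Gini coefficient over per-contributor counts. This is a minimal sketch using the standard sorted-rank formula; the function name `gini` and the toy clip counts are assumptions, and the abstract does not specify the exact estimator used.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        raise ValueError("need at least one value and a positive total")
    # Rank-weighted sum over ascending values, with ranks starting at 1.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Hypothetical clip counts per contributor.
equal_participation = [25, 25, 25, 25]
one_dominant_contributor = [0, 0, 0, 100]
print(gini(equal_participation), gini(one_dominant_contributor))
```

With four contributors the maximum attainable value is 0.75, since the coefficient is bounded by (n - 1) / n. A value as high as 0.941 therefore requires many contributors with most clips coming from very few of them.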

Title: Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric

Authors: Ruipeng Jia, Yunyi Yang, Yuxin Wu, Yongbo Gai, Siyuan Tao, Mengyu Zhou, Jianhe Lin, Xiaoxi Jiang, Guanjun Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14069
Pdf URL: https://arxiv.org/pdf/2602.14069
Copy Paste: [[2602.14069]] Open Rubric System: Scaling Reinforcement Learning with Pairwise Adaptive Rubric(https://arxiv.org/abs/2602.14069)
Keywords: llm
Abstract: Scalar reward models compress multi-dimensional human preferences into a single opaque score, creating an information bottleneck that often leads to brittleness and reward hacking in open-ended alignment. We argue that robust alignment for non-verifiable tasks is fundamentally a principle generalization problem: reward should not be a learned function internalized into a judge, but an explicit reasoning process executed under inspectable principles. To operationalize this view, we present the Open Rubric System (OpenRS), a plug-and-play, rubrics-based LLM-as-a-Judge framework built around Pairwise Adaptive Meta-Rubrics (PAMR) and lightweight Pointwise Verifiable Rubrics (PVRs), which provide both hard-constraint guardrails and verifiable reward components when ground-truth or programmatic checks are available. OpenRS uses an explicit meta-rubric -- a constitution-like specification that governs how rubrics are instantiated, weighted, and enforced -- and instantiates adaptive rubrics on the fly by conditioning on the semantic differences between two candidate responses. It then performs criterion-wise pairwise comparisons and aggregates criterion-level preferences externally, avoiding pointwise weighted scalarization while improving discriminability in open-ended settings. To keep principles consistent yet editable across various domains, we introduce a two-level meta-rubric refinement pipeline (automated evolutionary refinement for general principles and a reproducible human-in-the-loop procedure for domain principles), complemented with pointwise verifiable rubrics that act as both guardrails against degenerate behaviors and a source of verifiable reward for objective sub-tasks. Finally, we instantiate OpenRS as reward supervision in pairwise RL training.
摘要：标量奖励模型将多维人类偏好压缩为单个不透明分数，从而产生信息瓶颈，通常导致开放式对齐中的脆弱性和奖励黑客攻击。我们认为，对不可验证任务的鲁棒对齐从根本上来说是一个原则泛化问题：奖励不应该是内化到法官中的学习函数，而应该是在可检查原则下执行的显式推理过程。为了实现这一观点，我们提出了开放Rubric系统（OpenRS），这是一个即插即用、基于Rubrics的LLM作为法官框架，围绕成对自适应元Rubrics（PAMR）和轻量级逐点可验证Rubrics（PVR）构建，当地面真相或程序化检查可用时，它提供硬约束护栏和可验证的奖励组件。 OpenRS 使用显式的元标准（一种类似于宪法的规范，用于管理如何实例化、加权和执行标准），并通过调节两个候选响应之间的语义差异来动态实例化自适应标准。然后，它执行标准的成对比较并在外部聚合标准级别的偏好，避免逐点加权标量化，同时提高开放式设置中的可辨别性。为了保持原则在不同领域的一致性和可编辑性，我们引入了一个两级元标准细化管道（一般原则的自动进化细化和领域原则的可重复的人机循环程序），并辅以逐点可验证的标准，这些标准既可以作为防止退化行为的护栏，也可以作为目标子任务的可验证奖励的来源。最后，我们将 OpenRS 实例化为成对 RL 训练中的奖励监督。

Title: Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework

Authors: Grzegorz Statkiewicz, Alicja Dobrzeniecka, Karolina Seweryn, Aleksandra Krasnodębska, Karolina Piosek, Katarzyna Bogusz, Sebastian Cygert, Wojciech Kusa
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14073
Pdf URL: https://arxiv.org/pdf/2602.14073
Copy Paste: [[2602.14073]] Annotation-Efficient Vision-Language Model Adaptation to the Polish Language Using the LLaVA Framework(https://arxiv.org/abs/2602.14073)
Keywords: language model
Abstract: Most vision-language models (VLMs) are trained on English-centric data, limiting their performance in other languages and cultural contexts. This restricts their usability for non-English-speaking users and hinders the development of multimodal systems that reflect diverse linguistic and cultural realities. In this work, we reproduce and adapt the LLaVA-Next methodology to create a set of Polish VLMs. We rely on a fully automated pipeline for translating and filtering existing multimodal datasets, and complement this with synthetic Polish data for OCR and culturally specific tasks. Despite relying almost entirely on automatic translation and minimal manual intervention to the training data, our approach yields strong results: we observe a +9.5% improvement over LLaVA-1.6-Vicuna-13B on a Polish-adapted MMBench, along with higher-quality captions in generative evaluations, as measured by human annotators in terms of linguistic correctness. These findings highlight that large-scale automated translation, combined with lightweight filtering, can effectively bootstrap high-quality multimodal models for low-resource languages. Some challenges remain, particularly in cultural coverage and evaluation. To facilitate further research, we make our models and evaluation dataset publicly available.
摘要：大多数视觉语言模型（VLM）都是根据以英语为中心的数据进行训练的，这限制了它们在其他语言和文化背景下的表现。这限制了它们对非英语用户的可用性，并阻碍了反映不同语言和文化现实的多模式系统的开发。在这项工作中，我们复制并调整了 LLaVA-Next 方法来创建一组波兰 VLM。我们依靠完全自动化的管道来翻译和过滤现有的多模式数据集，并用用于 OCR 和文化特定任务的合成波兰数据对其进行补充。尽管几乎完全依赖于自动翻译和对训练数据最少的手动干预，我们的方法还是产生了很好的结果：我们观察到在波兰语适应的 MMBench 上比 LLaVA-1.6-Vicuna-13B 提高了 9.5%，并且在生成评估中提供了更高质量的字幕（由人类注释者在语言正确性方面进行测量）。这些发现强调，大规模自动翻译与轻量级过滤相结合，可以有效地引导低资源语言的高质量多模态模型。仍然存在一些挑战，特别是在文化覆盖和评估方面。为了促进进一步的研究，我们公开我们的模型和评估数据集。

Title: Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Authors: Nitay Calderon, Eyal Ben-David, Zorik Gekhman, Eran Ofek, Gal Yona
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14080
Pdf URL: https://arxiv.org/pdf/2602.14080
Copy Paste: [[2602.14080]] Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality(https://arxiv.org/abs/2602.14080)
Keywords: gpt, llm, prompt
Abstract: Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95-98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.

Title: CCiV: A Benchmark for Structure, Rhythm and Quality in LLM-Generated Chinese Ci Poetry

Authors: Shangqing Zhao, Yupei Ren, Yuhao Zhou, Xiaopeng Bai, Man Lan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14081
Pdf URL: https://arxiv.org/pdf/2602.14081
Copy Paste: [[2602.14081]] CCiV: A Benchmark for Structure, Rhythm and Quality in LLM-Generated Chinese Ci Poetry(https://arxiv.org/abs/2602.14081)
Keywords: language model, llm, prompt
Abstract: The generation of classical Chinese Ci poetry, a form demanding a sophisticated blend of structural rigidity, rhythmic harmony, and artistic quality, poses a significant challenge for large language models (LLMs). To systematically evaluate and advance this capability, we introduce Chinese Cipai Variants (CCiV), a benchmark designed to assess LLM-generated Ci poetry across these three dimensions: structure, rhythm, and quality. Our evaluation of 17 LLMs on 30 Cipai reveals two critical phenomena: models frequently generate valid but unexpected historical variants of a poetic form, and adherence to tonal patterns is substantially harder than structural rules. We further show that form-aware prompting can improve structural and tonal control for stronger models, while potentially degrading weaker ones. Finally, we observe weak and inconsistent alignment between formal correctness and literary quality in our sample. CCiV highlights the need for variant-aware evaluation and more holistic constrained creative generation methods.

Title: A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing

Authors: Naeimeh Nourmohammadi, Md Meem Hossain, The Anh Han, Safina Showkat Ara, Zia Ush Shamszaman
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2602.14158
Pdf URL: https://arxiv.org/pdf/2602.14158
Copy Paste: [[2602.14158]] A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing(https://arxiv.org/abs/2602.14158)
Keywords: language model, gpt, llm, agent
Abstract: Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine-tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD-derived medical QA data (20k+ question-answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE-1 0.536 ± 0.04; ROUGE-2 0.226 ± 0.03; BLEU 0.098 ± 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero-shot evaluation. Second, we implement a modular multi-agent pipeline in which a Clinical Reasoning agent (fine-tuned LLaMA) produces structured explanations, an Evidence Retrieval agent queries PubMed to ground responses in recent literature, and a Refinement agent (DeepSeek R1) improves clarity and factual consistency; an optional human validation path is triggered for high-risk or high-uncertainty cases. Safety mechanisms include Monte Carlo dropout and perplexity-based uncertainty scoring, plus lexical and sentiment-based bias detection supported by LIME/SHAP-based analyses. In evaluation, the full system achieves 87% accuracy with relevance around 0.80, and evidence augmentation reduces uncertainty (perplexity 4.13) compared to base responses, with mean end-to-end latency of 36.5 seconds under the reported configuration. Overall, the results indicate that agent specialisation and verification layers can mitigate key single-model limitations and provide a practical, extensible design for evidence-based and bias-aware medical AI.

Title: Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering

Authors: Tao Xu
Subjects: cs.CL, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2602.14162
Pdf URL: https://arxiv.org/pdf/2602.14162
Copy Paste: [[2602.14162]] Index Light, Reason Deep: Deferred Visual Ingestion for Visual-Dense Document Question Answering(https://arxiv.org/abs/2602.14162)
Keywords: language model
Abstract: Existing multimodal document question answering methods universally adopt a supply-side ingestion strategy: running a Vision-Language Model (VLM) on every page during indexing to generate comprehensive descriptions, then answering questions through text retrieval. However, this "pre-ingestion" approach is costly (a 113-page engineering drawing package requires approximately 80,000 VLM tokens), end-to-end unreliable (VLM outputs may fail to be correctly retrieved due to format mismatches in the retrieval infrastructure), and irrecoverable once it fails. This paper proposes the Deferred Visual Ingestion (DVI) framework, adopting a demand-side ingestion strategy: the indexing phase performs only lightweight metadata extraction, deferring visual understanding to the moment users pose specific questions. DVI's core principle is "Index for locating, not understanding"--achieving page localization through structured metadata indexes and BM25 full-text search, then sending original images along with specific questions to a VLM for targeted analysis. Experiments on two real industrial engineering drawings (113 pages + 7 pages) demonstrate that DVI achieves comparable overall accuracy at zero ingestion VLM cost (46.7% vs. 48.9%), an effectiveness rate of 50% on visually necessary queries (vs. 0% for pre-ingestion), and 100% page localization (98% search space compression). DVI also supports interactive refinement and progressive caching, transforming the "QA accuracy" problem into a "page localization" problem--once the correct drawing page is found, obtaining the answer becomes a matter of interaction rounds.
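The page-localization step described above (BM25 full-text search over lightweight page metadata) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `bm25_rank`, the parameter defaults `k1` and `b`, and the sample page strings are all assumptions made for the example.

```python
import math
from collections import Counter

def bm25_rank(pages, query, k1=1.5, b=0.75):
    """Return page indices sorted by Okapi BM25 score, best match first.

    `pages` is a list of metadata strings (hypothetical stand-ins for the
    lightweight per-page index); no VLM is involved at this stage.
    """
    docs = [p.lower().split() for p in pages]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many pages does each term occur?
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return sorted(range(n), key=lambda i: scores[i], reverse=True)

# Made-up page metadata for illustration only.
pages = [
    "sheet 1 general arrangement title block",
    "sheet 2 pump assembly section view",
    "sheet 3 electrical wiring diagram",
]
top = bm25_rank(pages, "pump assembly")
```

Under the deferred strategy, only the image of the top-ranked page (here index 1) would then be sent to the VLM together with the user's question.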

Title: GPT-5 vs Other LLMs in Long Short-Context Performance

Authors: Nima Esmi (1 and 2), Maryam Nezhad-Moghaddam (3), Fatemeh Borhani (3), Asadollah Shahbahrami (2 and 3), Amin Daemdoost (3), Georgi Gaydadjiev (4) ((1) Bernoulli Institute, RUG, Groningen, Netherlands, (2) ISRC, Khazar University, Baku, Azerbaijan, (3) Department of Computer Engineering, University of Guilan, Rasht, Iran, (4) QCE Department, TU Delft, Delft, Netherlands)
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2602.14188
Pdf URL: https://arxiv.org/pdf/2602.14188
Copy Paste: [[2602.14188]] GPT-5 vs Other LLMs in Long Short-Context Performance(https://arxiv.org/abs/2602.14188)
Keywords: language model, gpt, llm, long context
Abstract: With the significant expansion of the context window in Large Language Models (LLMs), these models are theoretically capable of processing millions of tokens in a single pass. However, research indicates a significant gap between this theoretical capacity and the practical ability of models to robustly utilize information within long contexts, especially in tasks that require a comprehensive understanding of numerous details. This paper evaluates the performance of four state-of-the-art models (Grok-4, GPT-4, Gemini 2.5, and GPT-5) on long short-context tasks. For this purpose, three datasets were used: two supplementary datasets for retrieving culinary recipes and math problems, and a primary dataset of 20K social media posts for depression detection. The results show that as the input volume on the social media dataset exceeds 5K posts (70K tokens), the performance of all models degrades significantly, with accuracy dropping to around 50-53% for 20K posts. Notably, in the GPT-5 model, despite the sharp decline in accuracy, its precision remained high at approximately 95%, a feature that could be highly effective for sensitive applications like depression detection. This research also indicates that the "lost in the middle" problem has been largely resolved in newer models. This study emphasizes the gap between the theoretical capacity and the actual performance of models on complex, high-volume data tasks and highlights the importance of metrics beyond simple accuracy for practical applications.
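A short worked example may help show how accuracy near 50% and precision near 95% can coexist. The confusion counts below are invented for illustration and are not taken from the paper; they assume a balanced set of 20,000 posts and a model that flags very few of them.

```python
# Hypothetical confusion counts for 20,000 posts; NOT taken from the paper.
tp, fp = 380, 20        # posts flagged as positive: correct / incorrect
fn, tn = 9620, 9980     # posts not flagged: missed positives / true negatives

total = tp + fp + fn + tn
accuracy = (tp + tn) / total      # share of all posts classified correctly
precision = tp / (tp + fp)        # share of flagged posts that are correct
recall = tp / (tp + fn)           # share of true positives that were found
```

With these counts, accuracy is 0.518 and precision is 0.95, while recall is only 0.038: the model is rarely wrong when it flags a post, but it misses most positive cases.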

Title: Knowing When Not to Answer: Abstention-Aware Scientific Reasoning

Authors: Samir Abdaljalil, Erchin Serpedin, Hasan Kurban
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14189
Pdf URL: https://arxiv.org/pdf/2602.14189
Copy Paste: [[2602.14189]] Knowing When Not to Answer: Abstention-Aware Scientific Reasoning(https://arxiv.org/abs/2602.14189)
Keywords: language model, chat
Abstract: Large language models are increasingly used to answer and verify scientific claims, yet existing evaluations typically assume that a model must always produce a definitive answer. In scientific settings, however, unsupported or uncertain conclusions can be more harmful than abstaining. We study this problem through an abstention-aware verification framework that decomposes scientific claims into minimal conditions, audits each condition against available evidence using natural language inference (NLI), and selectively decides whether to support, refute, or abstain. We evaluate this framework across two complementary scientific benchmarks: SciFact and PubMedQA, covering both closed-book and open-domain evidence settings. Experiments are conducted with six diverse language models, including encoder-decoder, open-weight chat models, and proprietary APIs. Across all benchmarks and models, we observe that raw accuracy varies only modestly across architectures, while abstention plays a critical role in controlling error. In particular, confidence-based abstention substantially reduces risk at moderate coverage levels, even when absolute accuracy improvements are limited. Our results suggest that in scientific reasoning tasks, the primary challenge is not selecting a single best model, but rather determining when available evidence is sufficient to justify an answer. This work highlights abstention-aware evaluation as a practical and model-agnostic lens for assessing scientific reliability, and provides a unified experimental basis for future work on selective reasoning in scientific domains. Code is available at this https URL.
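The trade-off between coverage and risk under confidence-based abstention can be sketched as below. This is an illustrative sketch only: the function name `risk_coverage`, the threshold rule, and the sample predictions are assumptions, and the paper's own decision procedure may differ.

```python
def risk_coverage(preds, threshold):
    """Compute (coverage, risk) when abstaining below a confidence threshold.

    `preds` is a list of (confidence, is_correct) pairs. Coverage is the
    fraction of claims answered; risk is the error rate among those answered.
    """
    answered = [ok for conf, ok in preds if conf >= threshold]
    coverage = len(answered) / len(preds)
    risk = (1 - sum(answered) / len(answered)) if answered else 0.0
    return coverage, risk

# Made-up predictions: (model confidence, whether the verdict was correct).
preds = [(0.95, True), (0.90, True), (0.80, True),
         (0.60, False), (0.55, True), (0.40, False)]

cov_all, risk_all = risk_coverage(preds, 0.0)   # answer everything
cov_hi, risk_hi = risk_coverage(preds, 0.7)     # abstain when unsure
```

In this toy data, answering everything gives coverage 1.0 with risk of about 0.33, while abstaining below 0.7 halves coverage and drops risk to 0.0, which is the kind of effect a risk-coverage curve makes visible.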

Title: AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents

Authors: Lingxiang Hu, Yiding Sun, Tianle Xia, Wenwei Li, Ming Xu, Liqun Liu, Peng Shu, Huan Yu, Jie Jiang
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2602.14257
Pdf URL: https://arxiv.org/pdf/2602.14257
Copy Paste: [[2602.14257]] AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents(https://arxiv.org/abs/2602.14257)
Keywords: language model, llm, agent
Abstract: While Large Language Model (LLM) agents have achieved remarkable progress in complex reasoning tasks, evaluating their performance in real-world environments has become a critical problem. Current benchmarks, however, are largely restricted to idealized simulations, failing to address the practical demands of specialized domains like advertising and marketing analytics. In these fields, tasks are inherently more complex, often requiring multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark designed based on real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents' capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but performance drops significantly on L3 to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%, indicating that even state-of-the-art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD-Bench provides a realistic benchmark for evaluating and improving advertising marketing agents; the leaderboard and code can be found at this https URL.
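For readers unfamiliar with Pass@k, the sketch below shows one commonly used unbiased estimator: given n sampled attempts of which c are correct, it estimates the probability that at least one of k attempts succeeds. The abstract does not say how AD-Bench computes its Pass@1 and Pass@3 figures, so treating this estimator as the benchmark's method is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate P(at least one of k attempts is correct).

    n: attempts sampled per task, c: number of correct attempts, k <= n.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if n - c < k:
        # Too few failures to fill k slots, so some pick must be correct.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# One task, three sampled attempts, one of them correct.
p1 = pass_at_k(n=3, c=1, k=1)
p3 = pass_at_k(n=3, c=1, k=3)
```

Here `p1` is 1/3 and `p3` is 1.0; a benchmark-level score would average this quantity over all tasks.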

Title: Detecting LLM Hallucinations via Embedding Cluster Geometry: A Three-Type Taxonomy with Measurable Signatures

Authors: Matic Korun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14259
Pdf URL: https://arxiv.org/pdf/2602.14259
Copy Paste: [[2602.14259]] Detecting LLM Hallucinations via Embedding Cluster Geometry: A Three-Type Taxonomy with Measurable Signatures(https://arxiv.org/abs/2602.14259)
Keywords: language model, gpt, llm, hallucination
Abstract: We propose a geometric taxonomy of large language model hallucinations based on observable signatures in token embedding cluster structure. By analyzing the static embedding spaces of 11 transformer models spanning encoder (BERT, RoBERTa, ELECTRA, DeBERTa, ALBERT, MiniLM, DistilBERT) and decoder (GPT-2) architectures, we identify three operationally distinct hallucination types: Type 1 (center-drift) under weak context, Type 2 (wrong-well convergence) to locally coherent but contextually incorrect cluster regions, and Type 3 (coverage gaps) where no cluster structure exists. We introduce three measurable geometric statistics: α (polarity coupling), β (cluster cohesion), and λ_s (radial information gradient). Across all 11 models, polarity structure (α > 0.5) is universal (11/11), cluster cohesion (β > 0) is universal (11/11), and the radial information gradient is significant (9/11, p < 0.05). We demonstrate that the two models failing λ_s significance -- ALBERT and MiniLM -- do so for architecturally explicable reasons: factorized embedding compression and distillation-induced isotropy, respectively. These findings establish the geometric prerequisites for type-specific hallucination detection and yield testable predictions about architecture-dependent vulnerability profiles.

Title: STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts

Authors: Zachary Bamberger, Till R. Saenger, Gilad Morad, Ofra Amir, Brandon M. Stewart, Amir Feder
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.14265
Pdf URL: https://arxiv.org/pdf/2602.14265
Copy Paste: [[2602.14265]] STATe-of-Thoughts: Structured Action Templates for Tree-of-Thoughts(https://arxiv.org/abs/2602.14265)
Keywords: tree-of-thought
Abstract: Inference-Time-Compute (ITC) methods like Best-of-N and Tree-of-Thoughts are meant to produce output candidates that are both high-quality and diverse, but their use of high-temperature sampling often fails to achieve meaningful output diversity. Moreover, existing ITC methods offer limited control over how to perform reasoning, which in turn limits their explainability. We present STATe-of-Thoughts (STATe), an interpretable ITC method that searches over high-level reasoning patterns. STATe replaces stochastic sampling with discrete and interpretable textual interventions: a controller selects actions encoding high-level reasoning choices, a generator produces reasoning steps conditioned on those choices, and an evaluator scores candidates to guide search. This structured approach yields three main advantages. First, action-guided textual interventions produce greater response diversity than temperature-based sampling. Second, in a case study on argument generation, STATe's explicit action sequences capture interpretable features that are highly predictive of output quality. Third, estimating the association between performance and action choices allows us to identify promising yet unexplored regions of the action space and steer generation directly toward them. Together, these results establish STATe as a practical framework for generating high-quality, diverse, and interpretable text. Our framework is available at this https URL.

Title: Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook

Authors: Ming Li, Xirui Li, Tianyi Zhou
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2602.14299
Pdf URL: https://arxiv.org/pdf/2602.14299
Copy Paste: [[2602.14299]] Does Socialization Emerge in AI Agent Society? A Case Study of Moltbook(https://arxiv.org/abs/2602.14299)
Keywords: language model, agent
Abstract: As large language model agents increasingly populate networked environments, a fundamental question arises: do artificial intelligence (AI) agent societies undergo convergence dynamics similar to human social systems? Lately, Moltbook approximates a plausible future scenario in which autonomous agents participate in an open-ended, continuously evolving online society. We present the first large-scale systemic diagnosis of this AI agent society. Beyond static observation, we introduce a quantitative diagnostic framework for dynamic evolution in AI agent societies, measuring semantic stabilization, lexical turnover, individual inertia, influence persistence, and collective consensus. Our analysis reveals a system in dynamic balance in Moltbook: while global semantic averages stabilize rapidly, individual agents retain high diversity and persistent lexical turnover, defying homogenization. However, agents exhibit strong individual inertia and minimal adaptive response to interaction partners, preventing mutual influence and consensus. Consequently, influence remains transient with no persistent supernodes, and the society fails to develop stable collective influence anchors due to the absence of shared social memory. These findings demonstrate that scale and interaction density alone are insufficient to induce socialization, providing actionable design and analysis principles for upcoming next-generation AI agent societies.

Title: InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem

Authors: Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue, Ningyu Zhang, Hossein A. Rahmani, Yanshan Wang, Qiang Zhang, Keyan Ding, Jeff Z. Pan, Huajun Chen, Emine Yilmaz
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2602.14367
Pdf URL: https://arxiv.org/pdf/2602.14367
Copy Paste: [[2602.14367]] InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem(https://arxiv.org/abs/2602.14367)
Keywords: language model, llm
Abstract: The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. The fundamental nature of scientific evaluation needs knowledgeable grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias in LLM-as-a-Judge. To address these, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We apply a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further achieve review consensus with an innovation review board containing reviewers with distinct academic backgrounds, enabling a multi-dimensional decoupled evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval can consistently outperform baselines in point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with human experts.

Title: Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models

Authors: Mufan Xu, Kehai Chen, Xuefeng Bai, Zhengyu Niu, Muyun Yang, Tiejun Zhao, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14386
Pdf URL: https://arxiv.org/pdf/2602.14386
Copy Paste: [[2602.14386]] Beyond Token-Level Policy Gradients for Complex Reasoning with Large Language Models(https://arxiv.org/abs/2602.14386)
Keywords: language model
Abstract: Existing policy-gradient methods for auto-regressive language models typically select subsequent tokens one at a time as actions in the policy. While effective for many generation tasks, such an approach may not fully capture the structure of complex reasoning tasks, where a single semantic decision is often realized across multiple tokens--for example, when defining variables or composing equations. This introduces a potential mismatch between token-level optimization and the inherently block-level nature of reasoning in these settings. To bridge this gap, we propose Multi-token Policy Gradient Optimization (MPO), a framework that treats sequences of K consecutive tokens as unified semantic actions. This block-level perspective enables our method to capture the compositional structure of reasoning trajectories and supports optimization over coherent, higher-level objectives. Experiments on mathematical reasoning and coding benchmarks show that MPO outperforms standard token-level policy gradient baselines and highlight the limitations of token-level policy gradients for complex reasoning, motivating future research to look beyond token-level granularity for reasoning-intensive language tasks.
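The idea of treating K consecutive tokens as one action can be illustrated with a small sketch. Because an auto-regressive policy factorizes over tokens, the log-probability of a block is the sum of its token log-probabilities. The REINFORCE-style surrogate below, the per-block advantages, and both function names are assumptions for illustration; the abstract does not specify MPO's actual objective.

```python
def block_log_probs(token_log_probs, k):
    """Sum token log-probabilities over consecutive blocks of k tokens.

    The final block may be shorter when the length is not a multiple of k.
    """
    return [sum(token_log_probs[i:i + k])
            for i in range(0, len(token_log_probs), k)]

def block_policy_gradient_loss(token_log_probs, block_advantages, k):
    """Hypothetical surrogate loss: -sum(advantage * block log-prob)."""
    blocks = block_log_probs(token_log_probs, k)
    assert len(blocks) == len(block_advantages)
    return -sum(a * lp for a, lp in zip(block_advantages, blocks))

# Made-up token log-probabilities and per-block advantages.
token_lp = [-0.1, -0.2, -0.3, -0.4, -0.5]
blocks = block_log_probs(token_lp, k=2)
loss = block_policy_gradient_loss(token_lp, [1.0, -0.5, 2.0], k=2)
```

With K = 2 the five tokens form three blocks with log-probabilities of roughly -0.3, -0.7, and -0.5. In real training these values would be differentiable tensors rather than plain floats.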

Title: TruthStance: An Annotated Dataset of Conversations on Truth Social

Authors: Fathima Ameen, Danielle Brown, Manusha Malgareddy, Amanul Haque
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14406
Pdf URL: https://arxiv.org/pdf/2602.14406
Copy Paste: [[2602.14406]] TruthStance: An Annotated Dataset of Conversations on Truth Social(https://arxiv.org/abs/2602.14406)
Keywords: language model, llm, prompt
Abstract: Argument mining and stance detection are central to understanding how opinions are formed and contested in online discourse. However, most publicly available resources focus on mainstream platforms such as Twitter and Reddit, leaving conversational structure on alt-tech platforms comparatively under-studied. We introduce TruthStance, a large-scale dataset of Truth Social conversation threads spanning 2023-2025, consisting of 24,378 posts and 523,360 comments with reply-tree structure preserved. We provide a human-annotated benchmark of 1,500 instances across argument mining and claim-based stance detection, including inter-annotator agreement, and use it to evaluate large language model (LLM) prompting strategies. Using the best-performing configuration, we release additional LLM-generated labels for 24,352 posts (argument presence) and 107,873 comments (stance to parent), enabling analysis of stance and argumentation patterns across depth, topics, and users. All code and data are released publicly.

Title: WavePhaseNet: A DFT-Based Method for Constructing Semantic Conceptual Hierarchy Structures (SCHS)

Authors: Kiyotaka Kasubuchi, Kazuo Fukiya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14419
Pdf URL: https://arxiv.org/pdf/2602.14419
Copy Paste: [[2602.14419]] WavePhaseNet: A DFT-Based Method for Constructing Semantic Conceptual Hierarchy Structures (SCHS)(https://arxiv.org/abs/2602.14419)
Keywords: language model, gpt, llm, hallucination
Abstract: This paper reformulates Transformer/Attention mechanisms in Large Language Models (LLMs) through measure theory and frequency analysis, theoretically demonstrating that hallucination is an inevitable structural limitation. The embedding space functions as a conditional expectation over a σ-algebra, and its failure to be isomorphic to the semantic truth set fundamentally causes logical consistency breakdown. WavePhaseNet Method: The authors propose WavePhaseNet, which explicitly constructs a Semantic Conceptual Hierarchy Structure (SCHS) using Discrete Fourier Transform (DFT). By applying DFT along the sequence dimension, semantic information is decomposed into frequency bands: low-frequency components capture global meaning and intent, while high-frequency components represent local syntax and expression. This staged separation enables precise semantic manipulation in diagonalized space. Dimensionality Reduction: GPT-4's 24,576-dimensional embedding space exhibits a 1/f spectral structure based on language self-similarity and Zipf's law. Through cumulative energy analysis, the authors derive that approximately 3,000 dimensions constitute the lower bound for "complete representation." This demonstrates that reduction from 24,576 to 3,000 dimensions preserves meaning and intent while enabling rigorous reasoning and suppressing hallucination. Cohomological Consistency Control: The reduced embedding space, constructed via cohomological regularization over overlapping local windows, allows defining a graph structure and cochain complex. This quantifies inconsistencies among local inferences as coboundary-based losses. Applying harmonic projection based on Hodge theory positions cohomology as a computable regularization principle for controlling semantic consistency, extracting maximally consistent global representations.
摘要：本文通过测度论和频率分析重新阐述了大型语言模型（LLM）中的 Transformer/Attention 机制，从理论上证明了幻觉是不可避免的结构限制。嵌入空间充当对 {\sigma}-代数的条件期望，并且它与语义真值集的同构失败从根本上导致逻辑一致性崩溃。 WavePhaseNet 方法作者提出了 WavePhaseNet，它使用离散傅里叶变换 (DFT) 显式地构建了语义概念层次结构 (SCHS)。通过沿序列维度应用 DFT，语义信息被分解为频带：低频分量捕获全局含义和意图，而高频分量表示局部语法和表达。这种分阶段的分离可以在对角空间中进行精确的语义操作。降维 GPT-4 的 24,576 维嵌入空间呈现出基于语言自相似性和 Zipf 定律的 1/f 谱结构。通过累积能量分析，作者得出大约 3,000 个维度构成了“完整表示”的下限。这表明，从 24,576 维减少到 3,000 维可以保留意义和意图，同时实现严格推理并抑制幻觉。上同调一致性控制通过重叠局部窗口上同调正则化构造的减少的嵌入空间允许定义图结构和上链复合体。这将局部推论之间的不一致量化为基于共界的损失。应用基于霍奇理论的调和投影将上同调定位为控制语义一致性、提取最大一致的全局表示的可计算正则化原理。
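
Sketch (not from the paper): the WavePhaseNet abstract describes splitting a signal along the sequence dimension into low- and high-frequency bands and inspecting cumulative spectral energy. The snippet below illustrates that general idea on a single embedding coordinate traced over sequence positions, using a naive DFT. The function names, the `cutoff` band edge, and the toy signal are assumptions made for illustration only.

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse of dft()."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def split_bands(x, cutoff):
    """Split a real sequence into low- and high-frequency parts.

    cutoff is a hypothetical band edge: frequencies 0..cutoff count as "low".
    """
    n = len(x)
    X = dft(x)
    # Bins k and n-k carry the same frequency; keep both so the parts stay real.
    low = [X[k] if min(k, n - k) <= cutoff else 0 for k in range(n)]
    high = [X[k] - low[k] for k in range(n)]
    return [v.real for v in idft(low)], [v.real for v in idft(high)]

def cumulative_energy(x):
    """Fraction of spectral energy contained in frequencies 0..f, for each f."""
    n = len(x)
    X = dft(x)
    energy = [0.0] * (n // 2 + 1)
    for k in range(n):
        energy[min(k, n - k)] += abs(X[k]) ** 2
    total = sum(energy)
    out, running = [], 0.0
    for e in energy:
        running += e
        out.append(running / total)
    return out

# Toy signal: a slow upward trend plus a fast alternation.
signal = [1.0, 2.0, 1.5, 3.0, 2.5, 4.0, 3.5, 5.0]
low_part, high_part = split_bands(signal, cutoff=1)
energy_curve = cumulative_energy(signal)
```

The two parts sum back to the original signal, and a threshold on `energy_curve` is one way to decide how many frequency components to keep.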

Title: LLM-Guided Knowledge Distillation for Temporal Knowledge Graph Reasoning

Authors: Wang Xing, Wei Song, Siyu Lin, Chen Wu, Man Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14428
Pdf URL: https://arxiv.org/pdf/2602.14428
Copy Paste: [[2602.14428]] LLM-Guided Knowledge Distillation for Temporal Knowledge Graph Reasoning(https://arxiv.org/abs/2602.14428)
Keywords: language model, llm
Abstract: Temporal knowledge graphs (TKGs) support reasoning over time-evolving facts, yet state-of-the-art models are often computationally heavy and costly to deploy. Existing compression and distillation techniques are largely designed for static graphs; directly applying them to temporal settings may overlook time-dependent interactions and lead to performance degradation. We propose an LLM-assisted distillation framework specifically designed for temporal knowledge graph reasoning. Beyond a conventional high-capacity temporal teacher, we incorporate a large language model as an auxiliary instructor to provide enriched supervision. The LLM supplies broad background knowledge and temporally informed signals, enabling a lightweight student to better model event dynamics without increasing inference-time complexity. Training is conducted by jointly optimizing supervised and distillation objectives, using a staged alignment strategy to progressively integrate guidance from both teachers. Extensive experiments on multiple public TKG benchmarks with diverse backbone architectures demonstrate that the proposed approach consistently improves link prediction performance over strong distillation baselines, while maintaining a compact and efficient student model. The results highlight the potential of large language models as effective teachers for transferring temporal reasoning capability to resource-efficient TKG systems.
摘要：时态知识图 (TKG) 支持对随时间变化的事实进行推理，但最先进的模型通常计算量很大且部署成本高昂。现有的压缩和蒸馏技术主要是针对静态图设计的；直接将它们应用于时间设置可能会忽略依赖于时间的交互并导致性能下降。我们提出了一个专门为时间知识图推理而设计的法学硕士辅助蒸馏框架。除了传统的高能力时间教师之外，我们还采用大型语言模型作为辅助教师，以提供丰富的监督。法学硕士提供广泛的背景知识和及时通知的信号，使轻量级学生能够更好地对事件动态进行建模，而不会增加推理时间复杂性。培训是通过联合优化监督目标和蒸馏目标来进行的，使用分阶段调整策略逐步整合两位教师的指导。对具有不同骨干架构的多个公共 TKG 基准进行的广泛实验表明，所提出的方法在强蒸馏基线上持续提高了链路预测性能，同时保持了紧凑且高效的学生模型。结果凸显了大型语言模型作为有效教师将时间推理能力转移到资源高效的 TKG 系统的潜力。
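
Sketch (not from the paper): the abstract mentions jointly optimizing supervised and distillation objectives with guidance from two teachers that is integrated in stages. One generic way to write such an objective is a cross-entropy term plus two weighted KL terms, with the LLM weight ramped in over time. The weighting schedule, temperature, and all function names below are assumptions, not the authors' formulation.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(student_logits, gold_index):
    """Supervised loss against the gold entity index."""
    return -math.log(softmax(student_logits)[gold_index])

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions; q is assumed strictly positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def joint_loss(student_logits, gold_index, temporal_teacher_logits,
               llm_teacher_probs, w_temporal, w_llm, temperature=2.0):
    """Supervised loss plus distillation from a temporal teacher and an LLM teacher."""
    student_soft = softmax(student_logits, temperature)
    teacher_soft = softmax(temporal_teacher_logits, temperature)
    return (cross_entropy(student_logits, gold_index)
            + w_temporal * kl_divergence(teacher_soft, student_soft)
            + w_llm * kl_divergence(llm_teacher_probs, student_soft))

def staged_weights(epoch, warmup_epochs=5):
    """Hypothetical schedule: temporal teacher from the start, LLM weight ramps linearly."""
    return 1.0, min(1.0, epoch / warmup_epochs)

student = [2.0, 0.5, -1.0]
teacher = [2.5, 0.0, -1.5]
llm_probs = [0.7, 0.2, 0.1]
w_t, w_l = staged_weights(epoch=2)
loss = joint_loss(student, 0, teacher, llm_probs, w_t, w_l)
```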

Title: Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models

Authors: Lance Calvin Lim Gamboa, Yue Feng, Mark Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14466
Pdf URL: https://arxiv.org/pdf/2602.14466
Copy Paste: [[2602.14466]] Robust Bias Evaluation with FilBBQ: A Filipino Bias Benchmark for Question-Answering Language Models(https://arxiv.org/abs/2602.14466)
Keywords: language model, prompt
Abstract: With natural language generation becoming a popular use case for language models, the Bias Benchmark for Question-Answering (BBQ) has grown to be an important benchmark format for evaluating stereotypical associations exhibited by generative models. We expand the linguistic scope of BBQ and construct FilBBQ through a four-phase development process consisting of template categorization, culturally aware translation, new template construction, and prompt generation. These processes resulted in a bias test composed of more than 10,000 prompts which assess whether models demonstrate sexist and homophobic prejudices relevant to the Philippine context. We then apply FilBBQ on models trained in Filipino but do so with a robust evaluation protocol that improves upon the reliability and accuracy of previous BBQ implementations. Specifically, we account for models' response instability by obtaining prompt responses across multiple seeds and averaging the bias scores calculated from these distinctly seeded runs. Our results confirm both the variability of bias scores across different seeds and the presence of sexist and homophobic biases relating to emotion, domesticity, stereotyped queer interests, and polygamy. FilBBQ is available via GitHub.
摘要：随着自然语言生成成为语言模型的流行用例，问答偏差基准 (BBQ) 已成为评估生成模型所表现出的刻板关联的重要基准格式。我们扩大了 BBQ 的语言范围，并通过模板分类、文化意识翻译、新模板构建和提示生成等四个阶段的开发过程构建了 FilBBQ。这些过程产生了由 10,000 多个提示组成的偏见测试，评估模型是否表现出与菲律宾背景相关的性别歧视和恐同偏见。然后，我们将 FilBBQ 应用于在菲律宾训练的模型，但使用强大的评估协议来提高之前 BBQ 实现的可靠性和准确性。具体来说，我们通过获得多个种子的即时响应并对从这些不同种子运行计算出的偏差分数进行平均来解释模型的响应不稳定性。我们的结果证实了不同种子之间偏见分数的变异性，以及与情感、家庭生活、刻板的酷儿兴趣和一夫多妻制相关的性别歧视和恐同偏见的存在。 FilBBQ 可通过 GitHub 获取。
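
Sketch (not from the paper): the evaluation protocol described above averages bias scores computed from differently seeded runs. A minimal version of that aggregation step, assuming the per-seed bias scores have already been computed, might look as follows. The score values are made-up placeholders, not FilBBQ results.

```python
import statistics

def aggregate_bias(scores_by_seed):
    """Return (mean, sample standard deviation) of bias scores across seeds."""
    scores = list(scores_by_seed.values())
    spread = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return statistics.mean(scores), spread

# Hypothetical per-seed bias scores for one model and one bias category.
scores_by_seed = {0: 0.10, 1: 0.20, 2: 0.30}
mean_bias, seed_spread = aggregate_bias(scores_by_seed)
```

Reporting the spread next to the mean makes the seed-to-seed variability visible instead of hiding it behind a single number.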

Title: Measuring and Mitigating Post-hoc Rationalization in Reverse Chain-of-Thought Generation

Authors: Guangyue Peng, Zongchao Chen, Wen Luo, Yuntao Wen, Wei Li, Ruixiang Feng, Ran Le, Chen Yang, Zhenwei An, Yang Song, Tao Zhang, Houfeng Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14469
Pdf URL: https://arxiv.org/pdf/2602.14469
Copy Paste: [[2602.14469]] Measuring and Mitigating Post-hoc Rationalization in Reverse Chain-of-Thought Generation(https://arxiv.org/abs/2602.14469)
Keywords: chain-of-thought
Abstract: Reverse Chain-of-Thought Generation (RCG) synthesizes reasoning traces from query-answer pairs, but runs the risk of producing post-hoc rationalizations: when models can see the answer during generation, the answer serves as a cognitive anchor that shapes the entire explanation. We formalize this phenomenon through a three-level measurement hierarchy: lexical, entropic, and probabilistic anchoring, which capture surface artifacts, entropy dynamics, and latent answer dependence, respectively. We analyze semantic suppression, the intuitive mitigation strategy that instructs models to ignore the answer, and find it counterproductive: while it reduces lexical overlap, it paradoxically increases entropic and probabilistic anchoring. Drawing on Ironic Process Theory from cognitive psychology, we attribute this failure to active monitoring of the forbidden answer, which inadvertently deepens dependence on it. To break this cycle, we propose Structural Skeleton-guided Reasoning (SSR), a two-phase approach that first generates an answer-invariant functional skeleton structure, then uses this skeleton to guide full trace generation. By redirecting the information flow to structural planning rather than answer monitoring, SSR consistently reduces anchoring across all three levels. We further introduce Distilled SSR (SSR-D), which fine-tunes models on teacher-generated SSR traces to ensure reliable structural adherence. Experiments across open-ended reasoning benchmarks demonstrate that SSR-D achieves up to 10% improvement over suppression baselines while preserving out-of-distribution (OOD) generalization.
摘要：反向思维链生成（RCG）从查询-答案对中综合推理轨迹，但存在产生事后合理化的风险：当模型可以在生成过程中看到答案时，答案将充当塑造整个解释的认知锚点。我们通过三级测量层次来形式化这种现象：词汇、熵和概率锚定，每个层次分别捕获表面伪影、熵动力学和潜在答案依赖性。我们分析了语义抑制，这是一种直观的缓解策略，指示模型忽略答案，以找出其反作用：虽然它减少了词汇重叠，但矛盾的是它增加了熵和概率锚定。借鉴认知心理学的反讽过程理论，我们将这种失败归因于对禁止答案的主动监控，这无意中加深了对它的依赖。为了打破这个循环，我们提出了结构骨架引导推理（SSR），这是一种两阶段方法，首先生成答案不变的功能骨架结构，然后使用该骨架来指导完整的跟踪生成。通过将信息流重定向到结构规划而不是答案监控，SSR 一致地减少了所有三个级别的锚定。我们进一步引入了 Distilled SSR (SSR-D)，它对教师生成的 SSR 迹线的模型进行微调，以确保可靠的结构遵循。跨开放式推理基准的实验表明，SSR-D 比抑制基线提高了 10%，同时保留了分布外 (OOD) 泛化能力。

Title: HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation

Authors: Wen-Sheng Lien, Yu-Kai Chan, Hao-Lung Hsiao, Bo-Kai Ruan, Meng-Fen Chiang, Chien-An Chen, Yi-Ren Yeh, Hong-Han Shuai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14470
Pdf URL: https://arxiv.org/pdf/2602.14470
Copy Paste: [[2602.14470]] HyperRAG: Reasoning N-ary Facts over Hypergraphs for Retrieval Augmented Generation(https://arxiv.org/abs/2602.14470)
Keywords: llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Graph-based retrieval-augmented generation (RAG) methods, typically built on knowledge graphs (KGs) with binary relational facts, have shown promise in multi-hop open-domain QA. However, their rigid retrieval schemes and dense similarity search often introduce irrelevant context, increase computational overhead, and limit relational expressiveness. In contrast, n-ary hypergraphs encode higher-order relational facts that capture richer inter-entity dependencies and enable shallower, more efficient reasoning paths. To address this limitation, we propose HyperRAG, a RAG framework tailored for n-ary hypergraphs with two complementary retrieval variants: (i) HyperRetriever learns structural-semantic reasoning over n-ary facts to construct query-conditioned relational chains. It enables accurate factual tracking, adaptive high-order traversal, and interpretable multi-hop reasoning under context constraints. (ii) HyperMemory leverages the LLM's parametric memory to guide beam search, dynamically scoring n-ary facts and entities for query-aware path expansion. Extensive evaluations on WikiTopics (11 closed-domain datasets) and three open-domain QA benchmarks (HotpotQA, MuSiQue, and 2WikiMultiHopQA) validate HyperRAG's effectiveness. HyperRetriever achieves the highest answer accuracy overall, with average gains of 2.95% in MRR and 1.23% in Hits@10 over the strongest baseline. Qualitative analysis further shows that HyperRetriever bridges reasoning gaps through adaptive and interpretable n-ary chain construction, benefiting both open and closed-domain QA.
摘要：基于图的检索增强生成 (RAG) 方法通常建立在具有二元关系事实的知识图 (KG) 之上，在多跳开放域 QA 中已显示出前景。然而，它们严格的检索方案和密集的相似性搜索经常引入不相关的上下文，增加计算开销并限制关系表达能力。相比之下，n 元超图编码高阶关系事实，捕获更丰富的实体间依赖关系，并实现更浅层、更有效的推理路径。为了解决这个限制，我们提出了 HyperRAG，这是一个为 n 元超图量身定制的 RAG 框架，具有两个互补的检索变体：（i）HyperRetriever 学习对 n 元事实的结构语义推理，以构建查询条件关系链。它能够在上下文约束下实现准确的事实跟踪、自适应高阶遍历和可解释的多跳推理。 (ii) HyperMemory 利用法学硕士的参数化内存来指导波束搜索，动态评分 n 元事实和实体，以实现查询感知路径扩展。对 WikiTopics（11 个闭域数据集）和三个开放域 QA 基准（HotpotQA、MuSiQue 和 2WikiMultiHopQA）的广泛评估验证了 HyperRAG 的有效性。 HyperRetriever 总体上实现了最高的答案准确度，与最强基线相比，MRR 平均提高了 2.95%，Hits@10 平均提高了 1.23%。定性分析进一步表明，HyperRetriever 通过自适应和可解释的 n 元链构建弥合了推理差距，使开放域和封闭域 QA 受益。
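
Sketch (not from the paper): the gains above are reported in MRR and Hits@10. Both metrics are computed from the 1-based rank at which the correct answer appears for each query. The ranks below are invented for illustration.

```python
def mrr(ranks):
    """Mean reciprocal rank; each rank is the 1-based position of the correct answer."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct answer appears within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the correct answer for four queries.
ranks = [1, 2, 4, 20]
print(mrr(ranks))            # (1 + 0.5 + 0.25 + 0.05) / 4
print(hits_at_k(ranks, 10))  # three of four ranks are <= 10
```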

Title: BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR

Authors: Md. Najib Hasan, Mst. Jannatun Ferdous Rain, Fyad Mohammed, Nazmul Siddique
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14488
Pdf URL: https://arxiv.org/pdf/2602.14488
Copy Paste: [[2602.14488]] BETA-Labeling for Multilingual Dataset Construction in Low-Resource IR(https://arxiv.org/abs/2602.14488)
Keywords: language model, llm
Abstract: IR in low-resource languages remains limited by the scarcity of high-quality, task-specific annotated datasets. Manual annotation is expensive and difficult to scale, while using large language models (LLMs) as automated annotators introduces concerns about label reliability, bias, and evaluation validity. This work presents a Bangla IR dataset constructed using a BETA-labeling framework involving multiple LLM annotators from diverse model families. The framework incorporates contextual alignment, consistency checks, and majority agreement, followed by human evaluation to verify label quality. Beyond dataset creation, we examine whether IR datasets from other low-resource languages can be effectively reused through one-hop machine translation. Using LLM-based translation across multiple language pairs, we experimented on meaning preservation and task validity between source and translated datasets. Our experiments reveal substantial variation across languages, reflecting language-dependent biases and inconsistent semantic preservation that directly affect the reliability of cross-lingual dataset reuse. Overall, this study highlights both the potential and limitations of LLM-assisted dataset creation for low-resource IR. It provides empirical evidence of the risks associated with cross-lingual dataset reuse and offers practical guidance for constructing more reliable benchmarks and evaluation pipelines in low-resource language settings.
摘要：低资源语言的信息检索仍然受到高质量、特定于任务的注释数据集的稀缺的限制。手动注释成本高昂且难以扩展，而使用大型语言模型 (LLM) 作为自动注释器会带来对标签可靠性、偏差和评估有效性的担忧。这项工作提出了一个使用 BETA 标签框架构建的 Bangla IR 数据集，涉及来自不同模型系列的多个 LLM 注释器。该框架包含上下文对齐、一致性检查和多数同意，然后进行人工评估以验证标签质量。除了数据集创建之外，我们还研究了是否可以通过一跳机器翻译有效地重用来自其他低资源语言的 IR 数据集。使用基于法学硕士的跨多个语言对的翻译，我们对源数据集和翻译数据集之间的意义保留和任务有效性进行了实验。我们的实验揭示了不同语言之间的巨大差异，反映了语言相关的偏差和不一致的语义保留，直接影响跨语言数据集重用的可靠性。总体而言，这项研究强调了法学硕士辅助的低资源 IR 数据集创建的潜力和局限性。它提供了与跨语言数据集重用相关的风险的经验证据，并为在资源匮乏的语言环境中构建更可靠的基准和评估管道提供了实用指导。
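
Sketch (not from the paper): the framework above combines labels from several LLM annotators through majority agreement before human verification. A minimal majority-vote step, assuming each annotator returns one label per query-document pair, could be written as below. The strict-majority threshold and the convention of returning `None` for unresolved items are assumptions.

```python
from collections import Counter

def majority_label(labels, min_agreement=0.5):
    """Return the label backed by more than min_agreement of annotators, else None.

    None marks an item that would need another pass, e.g. human review.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label if count / len(labels) > min_agreement else None

# Hypothetical labels from three LLM annotators for two items.
print(majority_label(["relevant", "relevant", "irrelevant"]))
print(majority_label(["relevant", "irrelevant"]))
```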

Title: Query as Anchor: Scenario-Adaptive User Representation via Large Language Model

Authors: Jiahao Yuan, Yike Xu, Jinyong Wen, Baokun Wang, Ziyi Gao, Xiaotong Lin, Yun Liu, Xing Fu, Yu Cheng, Yongchao Liu, Weiqiang Wang, Zhongle Xie
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.14492
Pdf URL: https://arxiv.org/pdf/2602.14492
Copy Paste: [[2602.14492]] Query as Anchor: Scenario-Adaptive User Representation via Large Language Model(https://arxiv.org/abs/2602.14492)
Keywords: language model, llm, prompt
Abstract: Industrial-scale user representation learning requires balancing robust universality with acute task-sensitivity. However, existing paradigms primarily yield static, task-agnostic embeddings that struggle to reconcile the divergent requirements of downstream scenarios within unified vector spaces. Furthermore, heterogeneous multi-source data introduces inherent noise and modality conflicts, degrading representation. We propose Query-as-Anchor, a framework shifting user modeling from static encoding to dynamic, query-aware synthesis. To empower Large Language Models (LLMs) with deep user understanding, we first construct UserU, an industrial-scale pre-training dataset that aligns multi-modal behavioral sequences with user understanding semantics, and our Q-Anchor Embedding architecture integrates hierarchical coarse-to-fine encoders into dual-tower LLMs via joint contrastive-autoregressive optimization for query-aware user representation. To bridge the gap between general pre-training and specialized business logic, we further introduce Cluster-based Soft Prompt Tuning to enforce discriminative latent structures, effectively aligning model attention with scenario-specific modalities. For deployment, anchoring queries at sequence termini enables KV-cache-accelerated inference with negligible incremental latency. Evaluations on 10 Alipay industrial benchmarks show consistent SOTA performance, strong scalability, and efficient deployment. Large-scale online A/B testing in Alipay's production system across two real-world scenarios further validates its practical effectiveness. Our code is prepared for public release and will be available at: this https URL.
摘要：工业规模的用户表示学习需要平衡强大的通用性和敏锐的任务敏感性。然而，现有的范式主要产生静态的、与任务无关的嵌入，难以协调统一向量空间内下游场景的不同需求。此外，异构多源数据引入了固有的噪声和模态冲突，从而降低了表示能力。我们提出了 Query-as-Anchor，这是一个将用户建模从静态编码转变为动态、查询感知综合的框架。为了使大型语言模型（LLM）具有深入的用户理解，我们首先构建了 UserU，一个工业规模的预训练数据集，它将多模态行为序列与用户理解语义相结合，并且我们的 Q-Anchor Embedding 架构通过联合对比自回归优化将查询感知用户表示的分层粗到细编码器集成到双塔 LLM 中。为了弥合一般预训练和专业业务逻辑之间的差距，我们进一步引入基于集群的软提示调整来强制执行有区别的潜在结构，有效地将模型注意力与特定场景的模式保持一致。对于部署而言，在序列末端锚定查询可以实现 KV 缓存加速推理，并且增量延迟可以忽略不计。 10个支付宝行业基准评估显示一致的SOTA性能、强大的扩展性和高效的部署。支付宝生产系统中两个真实场景的大规模在线A/B测试进一步验证了其实际有效性。我们的代码已准备好公开发布，并将在以下位置提供：此 https URL。

Title: Beyond Translation: Evaluating Mathematical Reasoning Capabilities of LLMs in Sinhala and Tamil

Authors: Sukumar Kishanthan, Kumar Thushalika, Buddhi Jayasekara, Asela Hevapathige
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.14517
Pdf URL: https://arxiv.org/pdf/2602.14517
Copy Paste: [[2602.14517]] Beyond Translation: Evaluating Mathematical Reasoning Capabilities of LLMs in Sinhala and Tamil(https://arxiv.org/abs/2602.14517)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate strong mathematical reasoning in English, but whether these capabilities reflect genuine multilingual reasoning or reliance on translation-based processing in low-resource languages like Sinhala and Tamil remains unclear. We examine this fundamental question by evaluating whether LLMs genuinely reason mathematically in these languages or depend on implicit translation to English-like representations. Using a taxonomy of six math problem types, from basic arithmetic to complex unit conflict and optimization problems, we evaluate four prominent large language models. To avoid translation artifacts that confound language ability with translation quality, we construct a parallel dataset where each problem is natively authored by fluent speakers with mathematical training in all three languages. Our analysis demonstrates that while basic arithmetic reasoning transfers robustly across languages, complex reasoning tasks show significant degradation in Tamil and Sinhala. The pattern of failures varies by model and problem type, suggesting that apparent multilingual competence may not reflect uniform reasoning capabilities across languages. These findings challenge the common assumption that models exhibiting strong multilingual performance can reason equally effectively across languages, and highlight the need for fine-grained, type-aware evaluation in multilingual settings.
摘要：大型语言模型（LLM）在英语中表现出强大的数学推理能力，但这些能力是否反映了真正的多语言推理或对僧伽罗语和泰米尔语等低资源语言中基于翻译的处理的依赖仍不清楚。我们通过评估法学硕士是否真正用这些语言进行数学推理或依赖于对类似英语表示的隐式翻译来研究这个基本问题。使用六种数学问题类型的分类，从基本算术到复杂的单元冲突和优化问题，我们评估了四种著名的大型语言模型。为了避免混淆语言能力和翻译质量的翻译伪影，我们构建了一个并行数据集，其中每个问题都是由接受过所有三种语言数学训练的流利使用者本地编写的。我们的分析表明，虽然基本算术推理在不同语言之间可以稳健地迁移，但复杂的推理任务在泰米尔语和僧伽罗语中表现出显着的退化。失败的模式因模型和问题类型而异，这表明明显的多语言能力可能无法反映跨语言的统一推理能力。这些发现挑战了普遍的假设，即表现出强大的多语言性能的模型可以同样有效地跨语言进行推理，并强调在多语言环境中进行细粒度、类型感知评估的必要性。

Title: Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets

Authors: Yuchen Yang, Wenze Lin, Enhao Huang, Zhixuan Chu, Hongbin Zhou, Lan Tao, Yiming Li, Zhan Qin, Kui Ren
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14536
Pdf URL: https://arxiv.org/pdf/2602.14536
Copy Paste: [[2602.14536]] Explainable Token-level Noise Filtering for LLM Fine-tuning Datasets(https://arxiv.org/abs/2602.14536)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have seen remarkable advancements, achieving state-of-the-art results in diverse applications. Fine-tuning, an important step for adapting LLMs to specific downstream tasks, typically involves further training on corresponding datasets. However, a fundamental discrepancy exists between current fine-tuning datasets and the token-level optimization mechanism of LLMs: most datasets are designed at the sentence level, which introduces token-level noise that degrades final performance. In this paper, we propose XTF, an explainable token-level noise filtering framework. XTF decomposes the complex and subtle contributions of token-level data to the fine-tuning process into three distinct and explicit attributes (reasoning importance, knowledge novelty, and task relevance), which can be assessed using scoring methods, and then masks the gradients of selected noisy tokens accordingly to optimize the performance of fine-tuned LLMs. We conduct extensive experiments on three representative downstream tasks (math, code and medicine) across 7 mainstream LLMs. The results demonstrate that XTF can significantly improve downstream performance by up to 13.7% compared to regular fine-tuning. Our work highlights the importance of token-level dataset optimization, and demonstrates the potential of strategies based on attribute decomposition for explaining complex training mechanisms.
摘要：大型语言模型 (LLM) 取得了显着的进步，在各种应用中取得了最先进的结果。微调是使法学硕士适应特定下游任务的重要步骤，通常涉及对相应数据集的进一步训练。然而，当前的微调数据集与LLM的token级优化机制之间存在根本性的差异：大多数数据集是在句子级别设计的，这引入了token级噪声，对最终性能造成负面影响。在本文中，我们提出了 XTF，一种可解释的令牌级噪声过滤框架。 XTF 将 token 级数据对微调过程的复杂而微妙的贡献分解为三个不同且明确的属性（推理重要性、知识新颖性和任务相关性），可以使用评分方法对其进行评估，然后相应地屏蔽所选噪声 token 的梯度，以优化微调 LLM 的性能。我们对 7 个主流法学硕士的三个代表性下游任务（数学、代码和医学）进行了广泛的实验。结果表明，与常规微调相比，XTF 可以显着提高下游性能高达 13.7%。我们的工作强调了令牌级数据集优化的重要性，并展示了基于属性分解的策略在解释复杂训练机制方面的潜力。
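
Sketch (not from the paper): XTF is described as scoring each token on three attributes and masking the gradients of tokens judged noisy. Multiplying a token's loss by zero before averaging removes its gradient contribution, so a loss mask is one simple way to illustrate the effect. The keep rule (any attribute at or above a threshold) and the scores are assumptions; the paper's actual scoring and selection are not given in the abstract.

```python
def token_mask(attribute_scores, threshold=0.5):
    """Build a 0/1 mask from per-token attribute scores.

    attribute_scores holds one (reasoning_importance, knowledge_novelty,
    task_relevance) tuple per token. Hypothetical rule: keep a token if any
    attribute reaches the threshold.
    """
    return [1.0 if max(scores) >= threshold else 0.0 for scores in attribute_scores]

def masked_mean_loss(token_losses, mask):
    """Average loss over kept tokens only; masked tokens contribute no gradient."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(token_losses, mask)) / kept

# Hypothetical per-token losses and attribute scores for a three-token target.
token_losses = [2.0, 1.0, 4.0]
attribute_scores = [(0.9, 0.1, 0.2), (0.1, 0.2, 0.3), (0.2, 0.7, 0.1)]
mask = token_mask(attribute_scores)
loss = masked_mean_loss(token_losses, mask)
```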

Title: Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation

Authors: Shefayat E Shams Adib, Ahmed Alfey Sani, Ekramul Alam Esham, Ajwad Abrar, Tareque Mohmud Chowdhury
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14564
Pdf URL: https://arxiv.org/pdf/2602.14564
Copy Paste: [[2602.14564]] Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation(https://arxiv.org/abs/2602.14564)
Keywords: language model, gpt, llm
Abstract: Recently, Large Language Models (LLMs) have gained significant traction in the medical domain, especially in developing medical QA systems for enhancing access to healthcare in low-resourced settings. This paper compares five LLMs deployed between April 2024 and August 2025 for medical QA, using the iCliniq dataset, which contains 38,000 medical questions and answers across diverse specialties. Our models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. We use a zero-shot evaluation methodology with BLEU and ROUGE metrics to evaluate performance without specialized fine-tuning. Our results show that larger models like Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks. Notably, Llama-4-Maverick-17B exhibited more competitive results, thus highlighting evasion efficiency trade-offs relevant for practical deployment. These findings align with advancements in LLM capabilities toward professional-level medical reasoning and reflect the increasing feasibility of LLM-supported QA systems in real clinical environments. This benchmark aims to serve as a standardized setting for future studies that seek to minimize model size and computational resources and to maximize clinical utility in medical NLP applications.
摘要：最近，大型语言模型 (LLM) 在医疗领域获得了巨大的关注，特别是在开发 QA 系统到医疗 QA 系统方面，以增强资源匮乏地区获得医疗保健的机会。本文使用 iCliniq 数据集比较了 2024 年 4 月至 2025 年 8 月期间部署的五个用于医学 QA 的法学硕士，其中包含 38,000 个不同专业的医学问题和答案。我们的型号包括 Llama-3-8B-Instruct、Llama 3.2 3B、Llama 3.3 70B Instruct、Llama-4-Maverick-17B-128E-Instruct 和 GPT-5-mini。我们使用零样本评估方法，并使用 BLEU 和 ROUGE 指标来评估性能，无需专门进行微调。我们的结果表明，Llama 3.3 70B Instruct 等较大模型的性能优于较小模型，这与在临床任务中观察到的扩展优势一致。值得注意的是，Llama-4-Maverick-17B 表现出更具竞争力的结果，从而突出了与实际部署相关的规避效率权衡。这些发现与法学硕士朝着专业级医学推理能力的进步相一致，并反映了法学硕士支持的质量保证系统在真实临床环境中日益增强的可行性。该基准旨在作为未来研究的标准化设置，以最大限度地减少模型大小、计算资源并最大限度地提高医学 NLP 应用中的临床效用。

Title: The Wikidata Query Logs Dataset

Authors: Sebastian Walter, Hannah Bast
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14594
Pdf URL: https://arxiv.org/pdf/2602.14594
Copy Paste: [[2602.14594]] The Wikidata Query Logs Dataset(https://arxiv.org/abs/2602.14594)
Keywords: agent
Abstract: We present the Wikidata Query Logs (WDQL) dataset, a dataset consisting of 200k question-query pairs over the Wikidata knowledge graph. It is over 6x larger than the largest existing Wikidata datasets of similar format without relying on template-generated queries. Instead, we construct it using real-world SPARQL queries sent to the Wikidata Query Service and generate questions for them. Since these log-based queries are anonymized, and therefore often do not produce results, a significant amount of effort is needed to convert them back into meaningful SPARQL queries. To achieve this, we present an agent-based method that iteratively de-anonymizes, cleans, and verifies queries against Wikidata while also generating corresponding natural-language questions. We demonstrate the dataset's benefit for training question-answering methods. All WDQL assets, as well as the agent code, are publicly available under a permissive license.
摘要：我们提出了维基数据查询日志 (WDQL) 数据集，该数据集由维基数据知识图谱上的 20 万个问题-查询对组成。它比现有类似格式的最大维基数据集大 6 倍以上，且不依赖模板生成的查询。相反，我们使用发送到维基数据查询服务的真实 SPARQL 查询来构建它，并为它们生成问题。由于这些基于日志的查询是匿名的，因此通常不会产生结果，因此需要付出大量的努力才能将它们转换回有意义的 SPARQL 查询。为了实现这一目标，我们提出了一种基于代理的方法，该方法可以迭代地对维基数据的查询进行去匿名化、清理和验证，同时还生成相应的自然语言问题。我们展示了该数据集对于训练问答方法的好处。所有 WDQL 资产以及代理代码均可在许可下公开获得。

Title: GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation

Authors: Hao Liu, Guangyan Li, Wensheng Zhang, Yongqiang Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14649
Pdf URL: https://arxiv.org/pdf/2602.14649
Copy Paste: [[2602.14649]] GradMAP: Faster Layer Pruning with Gradient Metric and Projection Compensation(https://arxiv.org/abs/2602.14649)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) exhibit strong reasoning abilities, but their high computational costs limit their practical deployment. Recent studies reveal significant redundancy in LLM layers, making layer pruning an active research topic. Layer pruning research primarily focuses on two aspects: measuring layer importance and recovering performance after pruning. Unfortunately, existing works fail to simultaneously maintain pruning performance and efficiency. In this study, we propose GradMAP, a faster layer pruning method with Gradient Metric And Projection compensation, which consists of two stages. In the first stage, we introduce a novel metric based on gradient magnitudes, enabling a global assessment of layer importance. Notably, it requires only a single backward propagation step per pruning decision, substantially enhancing pruning efficiency. In the second stage, we first analyze the layers with the largest mean shift resulting from pruning, and then incorporate a simple yet effective projection compensation matrix to correct this drift in one step. In this way, the degradation of model performance caused by layer pruning is effectively alleviated. Extensive experiments show that GradMAP outperforms previous layer pruning methods in both pruning speed (achieving an average 4× speedup) and performance.
摘要：大型语言模型（LLM）表现出强大的推理能力，但其高计算成本限制了其实际部署。最近的研究揭示了 LLM 层中的显着冗余，使得层修剪成为一个活跃的研究主题。层剪枝研究主要集中在两个方面：测量层重要性和剪枝后恢复性能。不幸的是，目前的工作未能同时保持修剪性能和效率。在本研究中，我们提出了 GradMAP，一种具有 \textbf{Grad}ient \textbf{M}etric \textbf{A}nd \textbf{P}投影补偿的更快层剪枝方法，该方法由两个阶段组成。在第一阶段，我们引入了一种基于梯度大小的新颖度量，从而能够对层重要性进行全局评估。请注意，每个剪枝决策仅需要一个反向传播步骤，从而大大提高了剪枝效率。在第二阶段，我们首先分析剪枝产生的平均偏移最大的层，然后结合一个简单而有效的投影补偿矩阵来一步纠正这种漂移。这样，有效缓解了层剪枝带来的模型性能下降。大量实验表明，GradMAP 在剪枝速度（实现平均 $4\times$ 加速）和性能方面均优于以前的层剪枝方法。

Title: Is Information Density Uniform when Utterances are Grounded on Perception and Discourse?

Authors: Matteo Gay, Coleman Haley, Mario Giulianelli, Edoardo Ponti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14653
Pdf URL: https://arxiv.org/pdf/2602.14653
Copy Paste: [[2602.14653]] Is Information Density Uniform when Utterances are Grounded on Perception and Discourse?(https://arxiv.org/abs/2602.14653)
Keywords: language model
Abstract: The Uniform Information Density (UID) hypothesis posits that speakers are subject to a communicative pressure to distribute information evenly within utterances, minimising surprisal variance. While this hypothesis has been tested empirically, prior studies are limited exclusively to text-only inputs, abstracting away from the perceptual context in which utterances are produced. In this work, we present the first computational study of UID in visually grounded settings. We estimate surprisal using multilingual vision-and-language models over image-caption data in 30 languages and visual storytelling data in 13 languages, together spanning 11 families. We find that grounding on perception consistently smooths the distribution of information, increasing both global and local uniformity across typologically diverse languages compared to text-only settings. In visual narratives, grounding in both image and discourse contexts has additional effects, with the strongest surprisal reductions occurring at the onset of discourse units. Overall, this study takes a first step towards modelling the temporal dynamics of information flow in ecologically plausible, multimodal language use, and finds that grounded language exhibits greater information uniformity, supporting a context-sensitive formulation of UID.
摘要：统一信息密度 (UID) 假设认为，说话者受到交流压力，需要在话语中均匀分布信息，从而最大限度地减少意外差异。虽然这一假设已经过实证检验，但之前的研究仅限于纯文本输入，脱离了产生话语的感知上下文。在这项工作中，我们首次在视觉基础设置中对 UID 进行计算研究。我们使用多语言视觉和语言模型对 30 种语言的图像字幕数据和 13 种语言的视觉故事讲述数据（总共涵盖 11 个语系）进行了惊讶程度估计。我们发现，与纯文本环境相比，基于感知的基础始终能够平滑信息的分布，从而提高类型多样的语言之间的全球和局部一致性。在视觉叙事中，图像和话语语境的基础具有额外的效果，最强烈的意外减少发生在话语单元的开头。总体而言，本研究迈出了对生态上合理的多模态语言使用中信息流的时间动态进行建模的第一步，并发现扎根语言表现出更大的信息一致性，支持 UID 的上下文相关表述。
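
Sketch (not from the paper): the UID hypothesis is about how evenly surprisal is spread across an utterance. Surprisal of a token is the negative log probability a model assigns to it; global uniformity can be summarized by the variance of surprisal over the utterance, and local uniformity by the mean squared change between adjacent tokens. These are common operationalizations and are assumed here; the abstract does not state the exact measures used. The probabilities are invented.

```python
import math

def surprisals(token_probs):
    """Surprisal in bits for each token probability."""
    return [-math.log2(p) for p in token_probs]

def global_variance(s):
    """Variance of surprisal around the utterance mean (lower = more uniform)."""
    mean = sum(s) / len(s)
    return sum((x - mean) ** 2 for x in s) / len(s)

def local_variance(s):
    """Mean squared change in surprisal between adjacent tokens."""
    return sum((b - a) ** 2 for a, b in zip(s, s[1:])) / (len(s) - 1)

# Hypothetical per-token probabilities for the same caption under two conditions.
text_only = surprisals([0.5, 0.25, 0.5, 0.25])
grounded = surprisals([0.25, 0.25, 0.25, 0.25])
print(global_variance(text_only), local_variance(text_only))
print(global_variance(grounded), local_variance(grounded))
```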

Title: Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography

Authors: Gianluca Vico, Jindřich Libovický
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14675
Pdf URL: https://arxiv.org/pdf/2602.14675
Copy Paste: [[2602.14675]] Crowdsourcing Piedmontese to Test LLMs on Non-Standard Orthography(https://arxiv.org/abs/2602.14675)
Keywords: language model, llm
Abstract: We present a crowdsourced dataset for Piedmontese, an endangered Romance language of northwestern Italy. The dataset comprises 145 Italian-Piedmontese parallel sentences derived from Flores+, with translations produced by speakers writing in their natural orthographic style rather than adhering to standardized conventions, along with manual word alignment. We use this resource to benchmark several large language models on tokenization parity, topic classification, and machine translation. Our analysis reveals that Piedmontese incurs a tokenization penalty relative to higher-resource Romance languages, yet LLMs achieve classification performance approaching that of Italian, French, and English. Machine translation results are asymmetric: models translate adequately from Piedmontese into high-resource languages, but generation into Piedmontese remains challenging. The dataset and code are publicly released.
摘要：我们提供了皮埃蒙特语的众包数据集，皮埃蒙特语是意大利西北部的一种濒临灭绝的罗曼语。该数据集包含源自 Flores+ 的 145 个意大利语-皮埃蒙特语平行句子，由说话者以自然的正字法风格书写而不是遵循标准化惯例生成的翻译，并进行手动单词对齐。我们使用此资源对几种大型语言模型在标记化奇偶性、主题分类和机器翻译方面进行基准测试。我们的分析表明，相对于资源较高的罗曼语言，皮埃蒙特语会遭受标记化惩罚，但法学硕士的分类性能接近意大利语、法语和英语。机器翻译结果是不对称的：模型可以充分地从皮埃蒙特语翻译成高资源语言，但生成皮埃蒙特语仍然具有挑战性。数据集和代码已公开发布。

Title: LLMStructBench: Benchmarking Large Language Model Structured Data Extraction

Authors: Sönke Tenckhoff, Mario Koddenbrock, Erik Rodner
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.14743
Pdf URL: https://arxiv.org/pdf/2602.14743
Copy Paste: [[2602.14743]] LLMStructBench: Benchmarking Large Language Model Structured Data Extraction(https://arxiv.org/abs/2602.14743)
Keywords: language model, llm, prompt
Abstract: We present LLMStructBench, a novel benchmark for evaluating Large Language Models (LLMs) on extracting structured data and generating valid JavaScript Object Notation (JSON) outputs from natural-language text. Our open dataset comprises diverse, manually verified parsing scenarios of varying complexity and enables systematic testing across 22 models and five prompting strategies. We further introduce complementary performance metrics that capture both token-level accuracy and document-level validity, facilitating rigorous comparison of model, size, and prompting effects on parsing reliability. In particular, we show that choosing the right prompting strategy is more important than standard attributes such as model size. This especially ensures structural validity for smaller or less reliable models but increases the number of semantic errors. Our benchmark suite is a step towards future research in the area of LLMs applied to parsing or Extract, Transform and Load (ETL) applications.
摘要：我们推出了 LLMStructBench，这是一种新颖的基准，用于评估大型语言模型 (LLM) 提取结构化数据并从自然语言文本生成有效的 JavaScript 对象表示法 (JSON) 输出。我们的开放数据集包含各种复杂程度不同、经过手动验证的解析场景，并支持跨 22 个模型和五种提示策略进行系统测试。我们进一步引入补充性能指标，捕获令牌级准确性和文档级有效性，促进模型、大小和对解析可靠性的提示效果的严格比较。特别是，我们表明选择正确的提示策略比模型大小等标准属性更重要。这尤其确保了较小或不太可靠的模型的结构有效性，但增加了语义错误的数量。我们的基准套件是迈向解析或提取、转换和加载 (ETL) 应用程序的法学硕士领域未来研究的一步。
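
Sketch (not from the paper): the benchmark separates structural validity of the generated JSON from the correctness of its content. The snippet below shows one simple way to check both for a single model output. The field-level accuracy used here is a stand-in for the paper's token-level metric, whose definition is not given in the abstract; the example record is invented.

```python
import json

def document_valid(output_text):
    """True if the output parses as JSON and the top level is an object."""
    try:
        return isinstance(json.loads(output_text), dict)
    except json.JSONDecodeError:
        return False

def field_accuracy(output_text, expected):
    """Share of expected fields whose values match exactly; 0.0 for invalid output."""
    if not document_valid(output_text):
        return 0.0
    predicted = json.loads(output_text)
    correct = sum(1 for key, value in expected.items() if predicted.get(key) == value)
    return correct / len(expected)

# Hypothetical gold record and two model outputs.
expected = {"name": "Ada", "age": 37}
good_structure = '{"name": "Ada", "age": 36}'
truncated = '{"name": "Ada",'
print(document_valid(good_structure), field_accuracy(good_structure, expected))
print(document_valid(truncated), field_accuracy(truncated, expected))
```

The first output is structurally valid but contains a semantic error, which is the kind of case the two metrics are meant to tell apart.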

Title: Rethinking the Role of LLMs in Time Series Forecasting

Authors: Xin Qiu, Junlong Tong, Yirong Sun, Yunpu Ma, Wei Zhang, Xiaoyu Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14744
Pdf URL: https://arxiv.org/pdf/2602.14744
Copy Paste: [[2602.14744]] Rethinking the Role of LLMs in Time Series Forecasting(https://arxiv.org/abs/2602.14744)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have been introduced to time series forecasting (TSF) to incorporate contextual knowledge beyond numerical signals. However, existing studies question whether LLMs provide genuine benefits, often reporting comparable performance without LLMs. We show that such conclusions stem from limited evaluation settings and do not hold at scale. We conduct a large-scale study of LLM-based TSF (LLM4TSF) across 8 billion observations, 17 forecasting scenarios, 4 horizons, multiple alignment strategies, and both in-domain and out-of-domain settings. Our results demonstrate that LLM4TSF indeed improves forecasting performance, with especially large gains in cross-domain generalization. Pre-alignment outperforms post-alignment in over 90% of tasks. Both pretrained knowledge and model architecture of LLMs contribute and play complementary roles: pretraining is critical under distribution shifts, while architecture excels at modeling complex temporal dynamics. Moreover, under large-scale mixed distributions, a fully intact LLM becomes indispensable, as confirmed by token-level routing analysis and prompt-based improvements. Overall, our findings overturn prior negative assessments, establish clear conditions under which LLMs are not only useful, and provide practical guidance for effective model design. We release our code at this https URL.
摘要：大语言模型 (LLM) 已被引入时间序列预测 (TSF)，以纳入数值信号之外的上下文知识。然而，现有的研究质疑 LLM 是否能带来真正的好处，通常报告的表现与不使用 LLM 的情况相当。我们表明，此类结论源于有限的评估设置，并且在大规模下不成立。我们对基于 LLM 的 TSF (LLM4TSF) 进行了大规模研究，涉及 80 亿个观测值、17 个预测场景、4 个预测范围、多种对齐策略以及域内和域外设置。我们的结果表明，LLM4TSF 确实提高了预测性能，在跨域泛化方面取得了特别大的进步。在超过 90% 的任务中，预对齐优于后对齐。LLM 的预训练知识和模型架构都有贡献并发挥互补作用：预训练在分布变化下至关重要，而架构则擅长对复杂的时间动态进行建模。此外，在大规模混合分布下，完整的 LLM 变得不可或缺，正如词元级路由分析和基于提示的改进所证实的那样。总体而言，我们的研究结果推翻了之前的负面评估，建立了 LLM 不仅有用的明确条件，并为有效的模型设计提供了实用指导。我们在此 https URL 发布我们的代码。

Title: Cognitive networks reconstruct mindsets about STEM subjects and educational contexts in almost 1000 high-schoolers, University students and LLM-based digital twins

Authors: Francesco Gariboldi, Emma Franchino, Edith Haim, Gianluca Lattanzi, Alessandro Grecucci, Massimo Stella
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14749
Pdf URL: https://arxiv.org/pdf/2602.14749
Copy Paste: [[2602.14749]] Cognitive networks reconstruct mindsets about STEM subjects and educational contexts in almost 1000 high-schoolers, University students and LLM-based digital twins(https://arxiv.org/abs/2602.14749)
Keywords: gpt, llm, prompt
Abstract: Attitudes toward STEM develop from the interaction of conceptual knowledge, educational experiences, and affect. Here we use cognitive network science to reconstruct group mindsets as behavioural forma mentis networks (BFMNs). In this case, nodes are cue words and free associations, edges are empirical associative links, and each concept is annotated with perceived valence. We analyse BFMNs from N = 994 observations spanning high school students, university students, and early-career STEM experts, alongside LLM (GPT-oss) "digital twins" prompted to emulate comparable profiles. Focusing also on semantic neighbourhoods ("frames") around key target concepts (e.g., STEM subjects or educational actors/places), we quantify frames in terms of valence auras, emotional profiles, network overlap (Jaccard similarity), and concreteness relative to null baselines. Across student groups, science and research are consistently framed positively, while their core quantitative subjects (mathematics and statistics) exhibit more negative and anxiety-related auras, amplified in higher math-anxiety subgroups, evidencing a STEM-science cognitive and emotional dissonance. High-anxiety frames are also less concrete than chance, suggesting more abstract and decontextualised representations of threatening quantitative domains. Human networks show greater overlap between mathematics and anxiety than GPT-oss. The results highlight how BFMNs capture cognitive-affective signatures of mindsets towards the target domains and indicate that LLM-based digital twins approximate cultural attitudes but miss key context-sensitive, experience-based components relevant to replicating human educational anxiety.
摘要：对 STEM 的态度是从概念知识、教育经验和情感的相互作用中发展起来的。在这里，我们使用认知网络科学将群体心态重建为行为思维形式网络（BFMN）。在这种情况下，节点是提示词和自由联想，边是经验关联链接，每个概念都用感知效价进行注释。我们分析了来自高中生、大学生和早期职业 STEM 专家的 N = 994 个观察结果的 BFMN，以及被提示模仿可比画像的 LLM (GPT-oss)“数字孪生”。我们还关注围绕关键目标概念（例如，STEM 科目或教育参与者/场所）的语义邻域（“框架”），并根据效价光环、情感概况、网络重叠（杰卡德相似性）和相对于零基线的具体性来量化框架。在学生群体中，科学和研究始终被正面地框定，而它们的核心定量科目（数学和统计学）表现出更多的消极和焦虑相关的光环，在数学焦虑程度较高的亚组中被放大，证明了 STEM 与科学之间的认知和情感失调。高焦虑框架的具体性也低于随机水平，这表明对具有威胁性的定量领域有更抽象和脱离语境的表征。与 GPT-oss 相比，人类网络在数学和焦虑之间表现出更大的重叠。结果强调了 BFMN 如何捕捉针对目标领域的心态的认知情感特征，并表明基于 LLM 的数字孪生近似文化态度，但错过了与复制人类教育焦虑相关的关键上下文敏感、基于经验的组成部分。
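
The network-overlap measure named in the abstract, Jaccard similarity between the semantic frames of two concepts, can be sketched in a few lines of Python. The edge list below is a made-up toy example, not data from the study, and treating a frame as the set of direct neighbours of a target node is an assumption about how frames are built.

```python
def frame(edges, target):
    """Return the set of direct neighbours of `target` in an undirected edge list."""
    neighbours = set()
    for u, v in edges:
        if u == target:
            neighbours.add(v)
        elif v == target:
            neighbours.add(u)
    return neighbours


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical associative links (cue word, free association).
edges = [
    ("mathematics", "anxiety"),
    ("mathematics", "numbers"),
    ("statistics", "numbers"),
    ("statistics", "anxiety"),
    ("statistics", "data"),
]
overlap = jaccard(frame(edges, "mathematics"), frame(edges, "statistics"))
print(f"frame overlap: {overlap:.3f}")
```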

Title: Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers

Authors: Jonathan Lys, Vincent Gripon, Bastien Pasdeloup, Lukas Mauch, Fabien Cardinaux, Ghouthi Boukli Hacene
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14760
Pdf URL: https://arxiv.org/pdf/2602.14760
Copy Paste: [[2602.14760]] Residual Connections and the Causal Shift: Uncovering a Structural Misalignment in Transformers(https://arxiv.org/abs/2602.14760)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are trained with next-token prediction, implemented in autoregressive Transformers via causal masking for parallelism. This creates a subtle misalignment: residual connections tie activations to the current token, while supervision targets the next token, potentially propagating mismatched information if the current token is not the most informative for prediction. In this work, we empirically localize this input-output alignment shift in pretrained LLMs, using decoding trajectories over tied embedding spaces and similarity-based metrics. Our experiments reveal that the hidden token representations switch from input alignment to output alignment deep within the network. Motivated by this observation, we propose a lightweight residual-path mitigation based on residual attenuation, implemented either as a fixed-layer intervention or as a learnable gating mechanism. Experiments on multiple benchmarks show that these strategies alleviate the representation misalignment and yield improvements, providing an efficient and general architectural enhancement for autoregressive Transformers.
摘要：大型语言模型 (LLM) 通过下一个标记预测进行训练，并在自回归 Transformer 中通过用于并行化的因果掩码来实现。这会产生微妙的错位：残差连接将激活与当前标记联系起来，而监督则针对下一个标记，如果当前标记对于预测来说不是最具信息性的，则可能会传播不匹配的信息。在这项工作中，我们使用绑定嵌入空间上的解码轨迹和基于相似性的度量，凭经验在预训练的 LLM 中定位这种输入输出对齐偏移。我们的实验表明，隐藏的标记表示在网络深处从输入对齐切换到输出对齐。受这一观察的启发，我们提出了一种基于残差衰减的轻量级残差路径缓解方法，可以作为固定层干预或可学习的门控机制来实现。多个基准的实验表明，这些策略减轻了表示错位并带来了改进，为自回归 Transformer 提供了高效且通用的架构增强。

Title: Unlocking Reasoning Capability on Machine Translation in Large Language Models

Authors: Sara Rajaee, Sebastian Vincent, Alexandre Berard, Marzieh Fadaee, Kelly Marchisio, Tom Kocmi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.14763
Pdf URL: https://arxiv.org/pdf/2602.14763
Copy Paste: [[2602.14763]] Unlocking Reasoning Capability on Machine Translation in Large Language Models(https://arxiv.org/abs/2602.14763)
Keywords: language model
Abstract: Reasoning-oriented large language models (RLMs) achieve strong gains on tasks such as mathematics and coding by generating explicit intermediate reasoning. However, their impact on machine translation (MT) remains underexplored. We systematically evaluate several open- and closed-weights RLMs on the WMT24++ benchmark and find that enabling explicit reasoning consistently degrades translation quality across languages and models. Analysis reveals that MT reasoning traces are highly linear, lacking revision, self-correction and exploration of alternative translations, which limits their usefulness. Furthermore, injecting higher-quality reasoning traces from stronger models does not reliably improve weaker models' performance. To address this mismatch, we propose a structured reasoning framework tailored to translation, based on multi-step drafting, adequacy refinement, fluency improvement, and selective iterative revision. We curate a synthetic dataset of dynamic structured reasoning traces and post-train a large reasoning model on this data. Experiments show significant improvements over standard translation fine-tuning and injected generic reasoning baselines. Our findings demonstrate that reasoning must be task-structured to benefit MT.
摘要：面向推理的大型语言模型（RLM）通过生成显式的中间推理，在数学和编码等任务上取得了巨大的进步。然而，它们对机器翻译 (MT) 的影响仍未得到充分探索。我们在 WMT24++ 基准上系统地评估了几种开放权重和封闭权重 RLM，发现启用显式推理会持续降低跨语言和模型的翻译质量。分析表明，机器翻译推理痕迹高度线性，缺乏修正、自我修正和替代翻译探索，限制了其实用性。此外，从更强的模型注入更高质量的推理轨迹并不能可靠地改善较弱模型的性能。为了解决这种不匹配问题，我们提出了一个适合翻译的结构化推理框架，基于多步骤起草、充分性细化、流畅性提高和选择性迭代修订。我们整理了动态结构化推理轨迹的合成数据集，并根据该数据对大型推理模型进行了后训练。实验表明，与标准翻译微调和注入通用推理基线相比，有显着改进。我们的研究结果表明，推理必须采用任务结构才能使机器翻译受益。

Title: Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

Authors: Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2602.14770
Pdf URL: https://arxiv.org/pdf/2602.14770
Copy Paste: [[2602.14770]] Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation(https://arxiv.org/abs/2602.14770)
Keywords: llm, prompt, agent
Abstract: Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined. We test whether broadcast community discussion improves stand-up comedy writing in a controlled multi-agent sandbox: in the discussion condition, critic and audience threads are recorded, filtered, stored as social memory, and later retrieved to condition subsequent generations, whereas the baseline omits discussion. Across 50 rounds (250 paired monologues) judged by five expert annotators using A/B preference and a 15-item rubric, discussion wins 75.6% of instances and improves Craft/Clarity (Δ = 0.440) and Social Response (Δ = 0.422), with occasional increases in aggressive humor.
摘要：之前的工作已经探索了 LLM 写作的多轮互动和反馈，但评估仍然主要集中在提示和局部反馈上，导致在线社区中持续的公众反响未得到充分研究。我们测试广播式社区讨论是否可以在受控的多代理沙箱中改善单口喜剧写作：在讨论条件下，评论家和观众的讨论串被记录、过滤、存储为社会记忆，然后被检索以调节后续生成，而基线则省略讨论。在由五位专家注释者使用 A/B 偏好和 15 项评分标准评判的 50 轮（250 对独白）中，讨论条件在 75.6% 的实例中胜出，并提高了技巧/清晰度 (Δ = 0.440) 和社交反应 (Δ = 0.422)，偶尔会增加攻击性幽默。

Title: Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment

Authors: Laurène Vaugrante, Anietta Weckauff, Thilo Hagendorff
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.14777
Pdf URL: https://arxiv.org/pdf/2602.14777
Copy Paste: [[2602.14777]] Emergently Misaligned Language Models Show Behavioral Self-Awareness That Shifts With Subsequent Realignment(https://arxiv.org/abs/2602.14777)
Keywords: language model, gpt, llm
Abstract: Recent research has demonstrated that large language models (LLMs) fine-tuned on incorrect trivia question-answer pairs exhibit toxicity - a phenomenon later termed "emergent misalignment". Moreover, research has shown that LLMs possess behavioral self-awareness - the ability to describe learned behaviors that were only implicitly demonstrated in training data. Here, we investigate the intersection of these phenomena. We fine-tune GPT-4.1 models sequentially on datasets known to induce and reverse emergent misalignment and evaluate whether the models are self-aware of their behavior transitions without providing in-context examples. Our results show that emergently misaligned models rate themselves as significantly more harmful compared to their base model and realigned counterparts, demonstrating behavioral self-awareness of their own emergent misalignment. Our findings show that behavioral self-awareness tracks actual alignment states of models, indicating that models can be queried for informative signals about their own safety.
摘要：最近的研究表明，对不正确的琐事问答对进行微调的大型语言模型（LLM）表现出毒性，这种现象后来被称为“涌现性错位”。此外，研究表明 LLM 拥有行为自我意识，即描述仅在训练数据中隐式展示的习得行为的能力。在这里，我们研究这些现象的交叉点。我们在已知会引发和逆转涌现性错位的数据集上按顺序微调 GPT-4.1 模型，并在不提供上下文示例的情况下评估模型是否能够自我意识到其行为转变。我们的结果表明，与基础模型和重新对齐的模型相比，涌现性错位的模型将自己评为危害性明显更大，这表明了对自身涌现性错位的行为自我意识。我们的研究结果表明，行为自我意识跟踪模型的实际对齐状态，这表明可以查询模型以获取有关其自身安全的信息信号。

Title: A Geometric Analysis of Small-sized Language Model Hallucinations

Authors: Emanuele Ricco, Elia Onofri, Lorenzo Cima, Stefano Cresci, Roberto Di Pietro
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2602.14778
Pdf URL: https://arxiv.org/pdf/2602.14778
Copy Paste: [[2602.14778]] A Geometric Analysis of Small-sized Language Model Hallucinations(https://arxiv.org/abs/2602.14778)
Keywords: language model, llm, hallucination, prompt, agent
Abstract: Hallucinations -- fluent but factually incorrect responses -- pose a major challenge to the reliability of language models, especially in multi-step or agentic settings. This work investigates hallucinations in small-sized LLMs from a geometric perspective. We start from the hypothesis that, when models generate multiple responses to the same prompt, genuine ones exhibit tighter clustering in the embedding space. We prove this hypothesis and, leveraging this geometrical insight, also show that it is possible to achieve a consistent level of separability. This latter result is used to introduce a label-efficient propagation method that classifies large collections of responses from just 30-50 annotations, achieving F1 scores above 90%. Our findings, framing hallucinations from a geometric perspective in the embedding space, complement traditional knowledge-centric and single-response evaluation paradigms, paving the way for further research.
摘要：幻觉（流畅但与事实不符的回答）对语言模型的可靠性构成了重大挑战，特别是在多步骤或代理环境中。这项工作从几何角度研究小型 LLM 中的幻觉。我们从如下假设出发：当模型对同一提示生成多个响应时，真实的响应在嵌入空间中表现出更紧密的聚类。我们证明了这一假设，并利用这种几何洞察力表明有可能实现一致的可分离性水平。后一个结果用于引入一种标签高效的传播方法，该方法仅凭 30-50 个注释即可对大量响应集合进行分类，实现 F1 分数高于 90%。我们的研究结果从嵌入空间的几何角度刻画幻觉，补充了传统的以知识为中心和单一响应的评估范式，为进一步的研究铺平了道路。

Title: Overthinking Loops in Agents: A Structural Risk via MCP Tools

Authors: Yohan Lee, Jisoo Jang, Seoyeon Choi, Sangyeop Kim, Seungtaek Choi
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2602.14798
Pdf URL: https://arxiv.org/pdf/2602.14798
Copy Paste: [[2602.14798]] Overthinking Loops in Agents: A Structural Risk via MCP Tools(https://arxiv.org/abs/2602.14798)
Keywords: llm, agent
Abstract: Tool-using LLM agents increasingly coordinate real workloads by selecting and chaining third-party tools based on text-visible metadata such as tool names, descriptions, and return messages. We show that this convenience creates a supply-chain attack surface: a malicious MCP tool server can be co-registered alongside normal tools and induce overthinking loops, where individually trivial or plausible tool calls compose into cyclic trajectories that inflate end-to-end tokens and latency without any single step looking abnormal. We formalize this as a structural overthinking attack, distinguishable from token-level verbosity, and implement 14 malicious tools across three servers that trigger repetition, forced refinement, and distraction. Across heterogeneous registries and multiple tool-capable models, the attack causes severe resource amplification (up to 142.4× tokens) and can degrade task outcomes. Finally, we find that decoding-time concision controls do not reliably prevent loop induction, suggesting defenses should reason about tool-call structure rather than tokens alone.
摘要：使用工具的 LLM 代理通过根据文本可见的元数据（例如工具名称、描述和返回消息）选择和链接第三方工具，越来越多地协调实际工作负载。我们证明，这种便利性创造了一个供应链攻击面：恶意 MCP 工具服务器可以与普通工具共同注册，并引发过度思考循环，其中单独看来琐碎或合理的工具调用组成循环轨迹，从而增加端到端令牌数和延迟，而没有任何单一步骤看起来异常。我们将其形式化为结构性过度思考攻击，与令牌级冗长区分开来，并在三台服务器上实施 14 个恶意工具，这些工具会触发重复、强制细化和分散注意力。在异构注册表和多个支持工具的模型中，攻击会导致严重的资源放大（令牌数最高达 142.4 倍），并可能降低任务结果。最后，我们发现解码时的简洁性控制并不能可靠地防止循环诱导，这表明防御措施应该推理工具调用结构而不是仅推理令牌。
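
The closing suggestion, that defenses should reason about tool-call structure rather than tokens alone, can be illustrated with a small Python sketch. It is not the paper's method: the function name, the thresholds, and the idea of checking only the tail of the trajectory for a repeated block are all assumptions chosen for brevity.

```python
def repeated_cycle(calls, min_repeats=3, max_period=5):
    """Detect a cyclic tail in a sequence of tool names.

    Returns the shortest period p (1..max_period) such that the last
    p * min_repeats calls are one p-length block repeated, else None.
    """
    n = len(calls)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break
        tail = calls[-span:]
        block = tail[:period]
        if all(tail[i] == block[i % period] for i in range(span)):
            return period
    return None


# Hypothetical trajectory: each call looks plausible, the structure is a loop.
trajectory = ["search", "refine", "verify", "refine", "verify", "refine", "verify"]
print(repeated_cycle(trajectory))
```

A check of this kind looks only at call names, so it would miss loops that vary tool names or arguments; it is meant to show the structural viewpoint, not a complete defense.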

Title: Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque

Authors: Jaione Bengoetxea, Itziar Gonzalez-Dios, Rodrigo Agerri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14812
Pdf URL: https://arxiv.org/pdf/2602.14812
Copy Paste: [[2602.14812]] Physical Commonsense Reasoning for Lower-Resourced Languages and Dialects: a Study on Basque(https://arxiv.org/abs/2602.14812)
Keywords: language model, llm
Abstract: Physical commonsense reasoning represents a fundamental capability of human intelligence, enabling individuals to understand their environment, predict future events, and navigate physical spaces. Recent years have witnessed growing interest in reasoning tasks within Natural Language Processing (NLP). However, no prior research has examined the performance of Large Language Models (LLMs) on non-question-answering (non-QA) physical commonsense reasoning tasks in low-resource languages such as Basque. Taking the Italian GITA as a starting point, this paper addresses this gap by presenting BasPhyCo, the first non-QA physical commonsense reasoning dataset for Basque, available in both standard and dialectal variants. We evaluate model performance across three hierarchical levels of commonsense understanding: (1) distinguishing between plausible and implausible narratives (accuracy), (2) identifying the conflicting element that renders a narrative implausible (consistency), and (3) determining the specific physical state that creates the implausibility (verifiability). These tasks were assessed using multiple multilingual LLMs as well as models pretrained specifically for Italian and Basque. Results indicate that, in terms of verifiability, LLMs exhibit limited physical commonsense capabilities in low-resource languages such as Basque, especially when processing dialectal variants.
摘要：物理常识推理代表了人类智能的基本能力，使个人能够理解他们的环境、预测未来事件和导航物理空间。近年来，人们对自然语言处理 (NLP) 中的推理任务越来越感兴趣。然而，之前没有研究考察过大型语言模型 (LLM) 在低资源语言（例如巴斯克语）的非问答（非 QA）物理常识推理任务中的性能。本文以意大利语 GITA 为起点，通过提出 BasPhyCo 来解决这一差距，BasPhyCo 是巴斯克语的第一个非 QA 物理常识推理数据集，有标准变体和方言变体两种版本。我们评估模型在常识理解的三个层次上的表现：（1）区分合理和不合理的叙述（准确性），（2）识别使叙述不可信的冲突元素（一致性），以及（3）确定造成不可信的具体物理状态（可验证性）。这些任务是使用多个多语言 LLM 以及专门针对意大利语和巴斯克语预训练的模型进行评估的。结果表明，就可验证性而言，LLM 在巴斯克语等低资源语言中表现出有限的物理常识能力，尤其是在处理方言变体时。

Title: Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research

Authors: Matteo Rinaldi, Rossella Varvara, Viviana Patti
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14819
Pdf URL: https://arxiv.org/pdf/2602.14819
Copy Paste: [[2602.14819]] Testimole-Conversational: A 30-Billion-Word Italian Discussion Board Corpus (1996-2024) for Language Modeling and Sociolinguistic Research(https://arxiv.org/abs/2602.14819)
Keywords: language model
Abstract: We present "Testimole-conversational", a massive collection of discussion board messages in the Italian language. The large size of the corpus, more than 30B word-tokens (1996-2024), renders it an ideal dataset for pre-training native Italian Large Language Models. Furthermore, discussion board messages are a relevant resource for linguistic as well as sociological analysis. The corpus captures a rich variety of computer-mediated communication, offering insights into informal written Italian, discourse dynamics, and online social interaction over a wide time span. Beyond its relevance for NLP applications such as language modelling, domain adaptation, and conversational analysis, it also supports investigations of language variation and social phenomena in digital communication. The resource will be made freely available to the research community.
摘要：我们提供“Testimole-conversational”大量意大利语讨论板消息集合。语料库规模大，超过 30B 个单词标记（1996-2024），使其成为本地意大利语大语言模型预训练的理想数据集。此外，讨论板的消息是语言和社会学分析的相关资源。该语料库捕获了丰富多样的计算机介导的交流，提供了对非正式书面意大利语、话语动态和广泛时间跨度的在线社交互动的见解。除了与语言建模、领域适应和会话分析等 NLP 应用相关之外，它还支持数字通信中语言变异和社会现象的研究。该资源将免费提供给研究界。

Title: Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition

Authors: Varun Nathan, Shreyas Guha, Ayush Kumar
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2602.14955
Pdf URL: https://arxiv.org/pdf/2602.14955
Copy Paste: [[2602.14955]] Tool-Aware Planning in Contact Center AI: Evaluating LLMs through Lineage-Guided Query Decomposition(https://arxiv.org/abs/2602.14955)
Keywords: llm, prompt, agent
Abstract: We present a domain-grounded framework and benchmark for tool-aware plan generation in contact centers, where answering a query for business insights, our target use case, requires decomposing it into executable steps over structured tools (Text2SQL (T2S)/Snowflake) and unstructured tools (RAG/transcripts) with explicit depends_on for parallelism. Our contributions are threefold: (i) a reference-based plan evaluation framework operating in two modes - a metric-wise evaluator spanning seven dimensions (e.g., tool-prompt alignment, query adherence) and a one-shot evaluator; (ii) a data curation methodology that iteratively refines plans via an evaluator->optimizer loop to produce high-quality plan lineages (ordered plan revisions) while reducing manual effort; and (iii) a large-scale study of 14 LLMs across sizes and families for their ability to decompose queries into step-by-step, executable, and tool-assigned plans, evaluated under prompts with and without lineage. Empirically, LLMs struggle on compound queries and on plans exceeding 4 steps (typically 5-15); the best total metric score reaches 84.8% (Claude-3-7-Sonnet), while the strongest one-shot match rate at the "A+" tier (Extremely Good, Very Good) is only 49.75% (o3-mini). Plan lineage yields mixed gains overall but benefits several top models and improves step executability for many. Our results highlight persistent gaps in tool-understanding, especially in tool-prompt alignment and tool-usage completeness, and show that shorter, simpler plans are markedly easier. The framework and findings provide a reproducible path for assessing and improving agentic planning with tools for answering data-analysis queries in contact-center settings.
摘要：我们提出了一个基于领域的框架和基准，用于联络中心的工具感知计划生成，其中回答业务洞察查询（我们的目标用例）需要将其分解为结构化工具（Text2SQL（T2S）/Snowflake）和非结构化工具（RAG/transcripts）上的可执行步骤，并具有显式的 depends_on 以实现并行性。我们的贡献有三个方面：（i）一个以两种模式运行的基于参考的计划评估框架：一个跨越七个维度（例如，工具提示对齐、查询遵守）的逐指标评估器和一个一次性评估器； (ii) 一种数据管理方法，通过评估器->优化器循环迭代地完善计划，以生成高质量的计划谱系（有序的计划修订），同时减少手动工作； (iii) 对 14 个不同规模和系列的 LLM 进行了大规模研究，考察它们将查询分解为逐步的、可执行的和已分配工具的计划的能力，并在带有和不带谱系的提示下进行评估。根据经验，LLM 在复合查询和超过 4 个步骤（通常为 5-15 个）的计划上遇到了困难；最好的总指标得分达到 84.8%（Claude-3-7-Sonnet），而“A+”级别（Extremely Good、Very Good）的最强一次性匹配率仅为 49.75%（o3-mini）。计划谱系总体上带来了好坏参半的收益，但有利于几个顶级模型，并提高了许多模型的步骤可执行性。我们的结果凸显了工具理解方面持续存在的差距，特别是在工具提示对齐和工具使用完整性方面，并表明更短、更简单的计划明显更容易。该框架和研究结果提供了一条可重复的路径，用于评估和改进借助工具回答联络中心数据分析查询的代理规划。
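
To make the role of an explicit depends_on field concrete, the following Python sketch groups plan steps into batches that could run in parallel. The plan schema (a dict of step id to tool and depends_on list), the step ids, and the function name are hypothetical; only the depends_on idea and the tool names T2S and RAG come from the abstract.

```python
def parallel_batches(plan):
    """Group plan steps into ordered batches; steps in one batch have no unmet dependencies.

    `plan` maps a step id to a dict with a "depends_on" list of step ids.
    Raises ValueError if dependencies are cyclic or refer to unknown steps.
    """
    remaining = {step: set(spec["depends_on"]) for step, spec in plan.items()}
    batches = []
    while remaining:
        ready = sorted(step for step, deps in remaining.items() if not deps)
        if not ready:
            raise ValueError("cyclic or unsatisfiable depends_on")
        batches.append(ready)
        for step in ready:
            del remaining[step]
        for deps in remaining.values():
            deps.difference_update(ready)
    return batches


# Hypothetical three-step plan: two independent lookups, then a combining step.
plan = {
    "s1": {"tool": "T2S", "depends_on": []},
    "s2": {"tool": "RAG", "depends_on": []},
    "s3": {"tool": "T2S", "depends_on": ["s1", "s2"]},
}
print(parallel_batches(plan))
```

Layering by dependency depth is one simple scheduling choice; a real executor might instead start each step as soon as its own dependencies finish.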

Title: Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System

Authors: Kawin Mayilvaghanan, Siddhant Gupta, Ayush Kumar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.14970
Pdf URL: https://arxiv.org/pdf/2602.14970
Copy Paste: [[2602.14970]] Counterfactual Fairness Evaluation of LLM-Based Contact Center Agent Quality Assurance System(https://arxiv.org/abs/2602.14970)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) are increasingly deployed in contact-center Quality Assurance (QA) to automate agent performance evaluation and coaching feedback. While LLMs offer unprecedented scalability and speed, their reliance on web-scale training data raises concerns regarding demographic and behavioral biases that may distort workforce assessment. We present a counterfactual fairness evaluation of LLM-based QA systems across 13 dimensions spanning three categories: Identity, Context, and Behavioral Style. Fairness is quantified using the Counterfactual Flip Rate (CFR), the frequency of binary judgment reversals, and the Mean Absolute Score Difference (MASD), the average shift in coaching or confidence scores across counterfactual pairs. Evaluating 18 LLMs on 3,000 real-world contact center transcripts, we find systematic disparities, with CFR ranging from 5.4% to 13.0% and consistent MASD shifts across confidence, positive, and improvement scores. Larger, more strongly aligned models show lower unfairness, though fairness does not track accuracy. Contextual priming of historical performance induces the most severe degradations (CFR up to 16.4%), while implicit linguistic identity cues remain a persistent bias source. Finally, we analyze the efficacy of fairness-aware prompting, finding that explicit instructions yield only modest improvements in evaluative consistency. Our findings underscore the need for standardized fairness auditing pipelines prior to deploying LLMs in high-stakes workforce evaluation.
摘要：大型语言模型 (LLM) 越来越多地部署在联络中心质量保证 (QA) 中，以实现座席绩效评估和指导反馈的自动化。虽然 LLM 提供了前所未有的可扩展性和速度，但它们对网络规模训练数据的依赖引起了人们对人口统计和行为偏见的担忧，这些偏见可能会扭曲劳动力评估。我们对基于 LLM 的 QA 系统提出了反事实公平性评估，涵盖三个类别的 13 个维度：身份、背景和行为风格。公平性是使用反事实翻转率（CFR）（二元判断逆转的频率）和平均绝对得分差（MASD）（反事实对之间的指导或置信得分的平均变化）来量化的。通过在 3,000 份真实联络中心通话记录上评估 18 个 LLM，我们发现了系统性差异，CFR 范围从 5.4% 到 13.0%，并且 MASD 在置信度、积极性和改进分数方面变化一致。更大、对齐程度更高的模型显示出较低的不公平性，尽管公平性并不跟踪准确性。历史表现的语境启动会导致最严重的退化（CFR 高达 16.4%），而隐含的语言身份线索仍然是持续的偏见来源。最后，我们分析了公平意识提示的功效，发现明确的指令只能在评估一致性方面产生适度的改善。我们的研究结果强调，在将 LLM 应用于高风险劳动力评估之前，需要标准化的公平性审核流程。
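
The two fairness measures defined in the abstract translate directly into code. The Python sketch below follows the stated definitions (CFR as the frequency of binary judgment reversals, MASD as the average absolute score shift across counterfactual pairs); the function names, the pair representation, and the sample values are assumptions, not the authors' implementation or data.

```python
def counterfactual_flip_rate(judgment_pairs):
    """CFR: share of (original, counterfactual) binary judgments that differ."""
    if not judgment_pairs:
        raise ValueError("need at least one pair")
    flips = sum(1 for original, counterfactual in judgment_pairs
                if original != counterfactual)
    return flips / len(judgment_pairs)


def mean_absolute_score_difference(score_pairs):
    """MASD: mean of |original - counterfactual| over paired scores."""
    if not score_pairs:
        raise ValueError("need at least one pair")
    total = sum(abs(original - counterfactual)
                for original, counterfactual in score_pairs)
    return total / len(score_pairs)


# Hypothetical paired outputs for the same transcripts before/after an attribute swap.
judgments = [(True, True), (True, False), (False, False), (False, True)]
scores = [(4.0, 3.5), (2.0, 2.0), (5.0, 4.0)]
print(counterfactual_flip_rate(judgments))        # 2 of 4 pairs flip -> 0.5
print(mean_absolute_score_difference(scores))     # (0.5 + 0.0 + 1.0) / 3 -> 0.5
```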

Title: Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation

Authors: Mengdan Zhu, Yufan Zhao, Tao Di, Yulan Yan, Liang Zhao
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2602.15005
Pdf URL: https://arxiv.org/pdf/2602.15005
Copy Paste: [[2602.15005]] Learning User Interests via Reasoning and Distillation for Cross-Domain News Recommendation(https://arxiv.org/abs/2602.15005)
Keywords: language model
Abstract: News recommendation plays a critical role in online news platforms by helping users discover relevant content. Cross-domain news recommendation further requires inferring users' underlying information needs from heterogeneous signals that often extend beyond direct news consumption. A key challenge lies in moving beyond surface-level behaviors to capture deeper, reusable user interests while maintaining scalability in large-scale production systems. In this paper, we present a reinforcement learning framework that trains large language models to generate high-quality lists of interest-driven news search queries from cross-domain user signals. We formulate query-list generation as a policy optimization problem and employ GRPO with multiple reward signals. We systematically study two compute dimensions: inference-time sampling and model capacity, and empirically observe consistent improvements with increased compute that exhibit scaling-like behavior. Finally, we perform on-policy distillation to transfer the learned policy from a large, compute-intensive teacher to a compact student model suitable for scalable deployment. Extensive offline experiments, ablation studies and large-scale online A/B tests in a production news recommendation system demonstrate consistent gains in both interest modeling quality and downstream recommendation performance.
摘要：新闻推荐在在线新闻平台中发挥着至关重要的作用，可以帮助用户发现相关内容。跨域新闻推荐还需要从异构信号中推断用户的潜在信息需求，这些信号通常超出了直接新闻消费的范围。一个关键的挑战在于超越表面行为以捕获更深层次的、可重用的用户兴趣，同时保持大规模生产系统的可扩展性。在本文中，我们提出了一个强化学习框架，该框架训练大型语言模型，以根据跨域用户信号生成高质量的兴趣驱动新闻搜索查询列表。我们将查询列表生成制定为策略优化问题，并采用具有多个奖励信号的 GRPO。我们系统地研究两个计算维度：推理时间采样和模型容量，并凭经验观察到随着计算量的增加而表现出类似缩放行为的一致改进。最后，我们执行策略蒸馏，将学习的策略从大型计算密集型教师转移到适合可扩展部署的紧凑学生模型。生产新闻推荐系统中的广泛离线实验、消融研究和大规模在线 A/B 测试表明，兴趣建模质量和下游推荐性能均取得了一致的进步。
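
For readers unfamiliar with GRPO, its core step is to score each sampled output relative to the other outputs sampled for the same input. The Python sketch below shows that group-relative normalization only. The weighted sum used to combine several reward signals, the weights, and the use of the population standard deviation are assumptions for illustration; the abstract does not say how the paper combines its rewards.

```python
from statistics import mean, pstdev


def combine_rewards(signals, weights):
    """Weighted sum of several reward signals for one sample (assumed combination rule)."""
    return sum(w * s for w, s in zip(weights, signals))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use the sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of three query lists sampled for the same user,
# each scored by two made-up reward signals.
weights = [0.5, 0.5]
group = [combine_rewards(s, weights) for s in ([1.0, 1.0], [2.0, 2.0], [3.0, 3.0])]
print(group_relative_advantages(group))
```

Samples scoring above the group mean receive positive advantages and the rest negative ones, so no separate value model is needed to form a baseline.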

Title: Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation

Authors: Ruoxi Liu, Philipp Koehn
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.15013
Pdf URL: https://arxiv.org/pdf/2602.15013
Copy Paste: [[2602.15013]] Text Style Transfer with Parameter-efficient LLM Finetuning and Round-trip Translation(https://arxiv.org/abs/2602.15013)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: This paper proposes a novel method for Text Style Transfer (TST) based on parameter-efficient fine-tuning of Large Language Models (LLMs). Addressing the scarcity of parallel corpora that map between styles, the study employs round-trip translation to synthesize such parallel datasets from monolingual corpora. This approach creates 'neutralized' text devoid of stylistic attributes, essentially creating a shared input style at training time and inference time. Experimental results demonstrate consistent superiority of this method over zero-shot prompting and few-shot ICL techniques, as measured by BLEU scores and style accuracy scores across four investigated domains. Furthermore, the integration of retrieval-augmented generation (RAG) for terminology and name knowledge enhances robustness and stylistic consistency.
摘要：本文提出了一种基于大型语言模型（LLM）参数高效微调的文本风格迁移（TST）新方法。为了解决风格之间映射的平行语料库的稀缺问题，该研究采用往返翻译从单语语料库合成此类平行数据集。这种方法创建了没有风格属性的“中性化”文本，本质上是在训练时和推理时创建共享的输入风格。实验结果表明，该方法相对于零样本提示和少样本 ICL 技术（通过四个研究领域的 BLEU 分数和风格准确度分数进行测量）具有一致的优越性。此外，术语和名称知识的检索增强生成（RAG）的集成增强了稳健性和风格一致性。