2026-02-16

Title: A Lightweight LLM Framework for Disaster Humanitarian Information Classification

Authors: Han Jinzhen, Kim Jisung, Yang Jong Soo, Yun Hong Sik
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.12284
Pdf URL: https://arxiv.org/pdf/2602.12284
Copy Paste: [[2602.12284]] A Lightweight LLM Framework for Disaster Humanitarian Information Classification(https://arxiv.org/abs/2602.12284)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Timely classification of humanitarian information from social media is critical for effective disaster response. However, deploying large language models (LLMs) for this task faces challenges in resource-constrained emergency settings. This paper develops a lightweight, cost-effective framework for disaster tweet classification using parameter-efficient fine-tuning. We construct a unified experimental corpus by integrating and normalizing the HumAID dataset (76,484 tweets across 19 disaster events) into a dual-task benchmark: humanitarian information categorization and event type identification. Through systematic evaluation of prompting strategies, LoRA fine-tuning, and retrieval-augmented generation (RAG) on Llama 3.1 8B, we demonstrate that: (1) LoRA achieves 79.62% humanitarian classification accuracy (+37.79% over zero-shot) while training only ~2% of parameters; (2) QLoRA enables efficient deployment with 99.4% of LoRA performance at 50% memory cost; (3) contrary to common assumptions, RAG strategies degrade fine-tuned model performance due to label noise from retrieved examples. These findings establish a practical, reproducible pipeline for building reliable crisis intelligence systems with limited computational resources.
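Sketch: To make the "training only ~2% of parameters" claim concrete, the count below compares LoRA adapter parameters with the frozen weights they attach to. The layer shapes and the rank are hypothetical, not the authors' configuration, and the fraction is measured against the adapted matrices only, not the full 8B model.

```python
def lora_trainable_fraction(layer_shapes, rank):
    """Return (adapter_params, base_params, fraction) for LoRA adapters.

    Each adapted weight of shape (d_out, d_in) gets two low-rank factors:
    B with shape (d_out, rank) and A with shape (rank, d_in).
    """
    base = sum(d_out * d_in for d_out, d_in in layer_shapes)
    adapter = sum(rank * (d_out + d_in) for d_out, d_in in layer_shapes)
    return adapter, base, adapter / base


# Hypothetical example: four 4096x4096 projection matrices, rank 16.
shapes = [(4096, 4096)] * 4
adapter, base, fraction = lora_trainable_fraction(shapes, rank=16)
print(adapter, base, f"{fraction:.2%}")
```

The adapter cost grows linearly in the rank and in d_out + d_in, while the frozen weight grows with d_out * d_in, which is why the trainable share stays small for large layers.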

Title: From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness

Authors: Linbo Cao, Lihao Sun, Yang Yue
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.12285
Pdf URL: https://arxiv.org/pdf/2602.12285
Copy Paste: [[2602.12285]] From Biased Chatbots to Biased Agents: Examining Role Assignment Effects on LLM Agent Robustness(https://arxiv.org/abs/2602.12285)
Keywords: language model, llm, prompt, chat, agent
Abstract: Large Language Models (LLMs) are increasingly deployed as autonomous agents capable of actions with real-world impacts beyond text generation. While persona-induced biases in text generation are well documented, their effects on agent task performance remain largely unexplored, even though such effects pose more direct operational risks. In this work, we present the first systematic case study showing that demographic-based persona assignments can alter LLM agents' behavior and degrade performance across diverse domains. Evaluating widely deployed models on agentic benchmarks spanning strategic reasoning, planning, and technical operations, we uncover substantial performance variations, with up to 26.2% degradation driven by task-irrelevant persona cues. These shifts appear across task types and model architectures, indicating that persona conditioning and simple prompt injections can distort an agent's decision-making reliability. Our findings reveal an overlooked vulnerability in current LLM agentic systems: persona assignments can introduce implicit biases and increase behavioral volatility, raising concerns for the safe and robust deployment of LLM agents.

Title: Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction

Authors: Junjie An, Jingguang Tian, Tianyi Wang, Yu Gao, Xiaofeng Mou, Yi Xu
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2602.12287
Pdf URL: https://arxiv.org/pdf/2602.12287
Copy Paste: [[2602.12287]] Retrieval-Augmented Self-Taught Reasoning Model with Adaptive Chain-of-Thought for ASR Named Entity Correction(https://arxiv.org/abs/2602.12287)
Keywords: language model, llm, retrieval-augmented generation, chain-of-thought
Abstract: End-to-end automatic speech recognition (ASR) systems frequently misrecognize domain-specific phrases like named entities, which can cause catastrophic failures in downstream tasks. A new family of named entity correction methods based on large language models (LLMs) has recently emerged. However, these approaches have yet to fully exploit the sophisticated reasoning capabilities inherent to LLMs. To bridge this gap, we propose a novel retrieval-augmented generation framework for correcting named entity errors in ASR. Our approach consists of two key components: (1) a rephrasing language model (RLM) for named entity recognition, followed by candidate retrieval using a phonetic-level edit distance; and (2) a novel self-taught reasoning model with adaptive chain-of-thought (A-STAR) that dynamically adjusts the depth of its reasoning based on task difficulty. Experiments on the AISHELL-1 and Homophone datasets demonstrate the effectiveness of our method, which achieves relative reductions in the named entity character error rate of 17.96% and 34.42%, respectively, compared to a strong baseline.
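Sketch: Candidate retrieval with a phonetic-level edit distance can be illustrated as a Levenshtein distance computed over phonetic units rather than characters. This is a minimal illustration, not the authors' implementation: the entity list, the use of toneless pinyin syllables as units, and the distance threshold are all assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of phonetic units."""
    prev = list(range(len(b) + 1))
    for i, unit_a in enumerate(a, start=1):
        curr = [i]
        for j, unit_b in enumerate(b, start=1):
            cost = 0 if unit_a == unit_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def retrieve_candidates(query, entity_phonetics, max_distance=1):
    """Return entity names within max_distance of the query, closest first."""
    scored = [(edit_distance(query, phones), name)
              for name, phones in entity_phonetics.items()]
    return [name for dist, name in sorted(scored) if dist <= max_distance]


# Hypothetical entity list keyed by name, with syllable-level phonetic forms.
entities = {
    "Zhang Wei": ["zhang", "wei"],
    "Zhang Hui": ["zhang", "hui"],
    "Wang Fang": ["wang", "fang"],
}
candidates = retrieve_candidates(["zhang", "wei"], entities)
print(candidates)
```

Working on syllables means a misrecognized homophone has distance 0 from the correct entity, which is the case a character-level distance would miss.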

Title: Grandes Modelos de Linguagem Multimodais (MLLMs): Da Teoria à Prática

Authors: Neemias da Silva, Júlio C. W. Scholz, John Harrison, Marina Borges, Paulo Ávila, Frances A Santos, Myriam Delgado, Rodrigo Minetto, Thiago H Silva
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2602.12302
Pdf URL: https://arxiv.org/pdf/2602.12302
Copy Paste: [[2602.12302]] Grandes Modelos de Linguagem Multimodais (MLLMs): Da Teoria à Prática(https://arxiv.org/abs/2602.12302)
Keywords: language model, llm, prompt
Abstract: Multimodal Large Language Models (MLLMs) combine the natural language understanding and generation capabilities of LLMs with perception skills in modalities such as image and audio, representing a key advancement in contemporary AI. This chapter presents the main fundamentals of MLLMs and emblematic models. Practical techniques for preprocessing, prompt engineering, and building multimodal pipelines with LangChain and LangGraph are also explored. For further practical study, supplementary material is publicly available online: this https URL. Finally, the chapter discusses the challenges and highlights promising trends.

Title: propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale

Authors: Maximilian Idahl, Benedikt Droste, Björn Plüster, Jan Philipp Harries
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12414
Pdf URL: https://arxiv.org/pdf/2602.12414
Copy Paste: [[2602.12414]] propella-1: Multi-Property Document Annotation for LLM Data Curation at Scale(https://arxiv.org/abs/2602.12414)
Keywords: llm
Abstract: Since FineWeb-Edu, data curation for LLM pretraining has predominantly relied on single scalar quality scores produced by small classifiers. A single score conflates multiple quality dimensions, prevents flexible filtering, and offers no interpretability. We introduce propella-1, a family of small multilingual LLMs (0.6B, 1.7B, 4B parameters) that annotate text documents across 18 properties organized into six categories: core content, classification, quality and value, audience and purpose, safety and compliance, and geographic relevance. The models support 57 languages and produce structured JSON annotations conforming to a predefined schema. Evaluated against a frontier commercial LLM as a reference annotator, the 4B model achieves higher agreement than much larger general-purpose models. We release propella-annotations, a dataset of over three billion document annotations covering major pretraining corpora including data from FineWeb-2, FinePDFs, HPLT 3.0, and Nemotron-CC. Using these annotations, we present a multi-dimensional compositional analysis of widely used pretraining datasets, revealing substantial differences in quality, reasoning depth, and content composition that single-score approaches cannot capture. All model weights and annotations are released under permissive, commercial-use licenses.

Title: RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

Authors: Ziqian Zhang, Xingjian Hu, Yue Huang, Kai Zhang, Ruoxi Chen, Yixin Liu, Qingsong Wen, Kaidi Xu, Xiangliang Zhang, Neil Zhenqiang Gong, Lichao Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.12424
Pdf URL: https://arxiv.org/pdf/2602.12424
Copy Paste: [[2602.12424]] RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty(https://arxiv.org/abs/2602.12424)
Keywords: language model, llm
Abstract: Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM's core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question's difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.
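Sketch: The bidirectional score propagation described above can be illustrated with a small fixed-point iteration over a model-by-question correctness matrix. This is not the RankLLM algorithm itself: the update rule, the damping term, and the toy data are assumptions chosen to show the intuition that answering hard questions raises competency and defeating strong models raises difficulty.

```python
def _normalize(scores, damping):
    """Scale scores to sum to 1, then blend with a uniform floor."""
    n = len(scores)
    total = sum(scores)
    if total == 0:
        return [1.0 / n] * n
    return [damping * s / total + (1.0 - damping) / n for s in scores]


def rank_models(correct, iterations=50, damping=0.85):
    """correct[m][q] is True when model m answers question q correctly.

    Returns (competency, difficulty); each list sums to 1.
    """
    n_models, n_questions = len(correct), len(correct[0])
    competency = [1.0 / n_models] * n_models
    difficulty = [1.0 / n_questions] * n_questions
    for _ in range(iterations):
        # A question gains difficulty from each model it defeats,
        # weighted by that model's current competency.
        difficulty = _normalize(
            [sum(competency[m] for m in range(n_models) if not correct[m][q])
             for q in range(n_questions)],
            damping,
        )
        # A model gains competency from each question it answers,
        # weighted by that question's current difficulty.
        competency = _normalize(
            [sum(difficulty[q] for q in range(n_questions) if correct[m][q])
             for m in range(n_models)],
            damping,
        )
    return competency, difficulty


# Hypothetical results: 3 models (rows) on 4 questions (columns).
correct = [
    [True, True, True, False],
    [True, True, False, False],
    [True, False, False, False],
]
competency, difficulty = rank_models(correct)
print(competency, difficulty)
```

The uniform floor in `_normalize` is a design choice for this sketch: without it, a question that no model answers absorbs all the difficulty mass and every competency score collapses to zero.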

Title: RBCorr: Response Bias Correction in Language Models

Authors: Om Bhatt, Anna A. Ivanova
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.12445
Pdf URL: https://arxiv.org/pdf/2602.12445
Copy Paste: [[2602.12445]] RBCorr: Response Bias Correction in Language Models(https://arxiv.org/abs/2602.12445)
Keywords: language model, prompt
Abstract: Language models (LMs) are known to be prone to response biases, which present as option preference biases in fixed-response questions. It is therefore imperative to develop low-cost and effective response bias correction methods to improve LM performance and enable more accurate evaluations of model abilities. Here, we propose a simple response bias correction strategy (RBCorr) and test it on 12 open-weight language models using yes-no, entailment, and multiple choice questions. We show that response bias is prevalent in LMs pre-correction and that RBCorr effectively eliminates bias and boosts model performance. We also explore the generalizability of bias behavior across models, datasets, and prompt formats, showing that LogProbs-based correction is highly dependent on all three of these aspects. Overall, RBCorr is an easy-to-use method that can boost the performance of smaller LMs and ensure that LM performance on closed-response benchmarks aligns more closely with their true capabilities.

Title: Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats

Authors: Pengxiang Zhao, Hui-Ling Zhen, Xing Li, Han Bao, Weizhe Lin, Zhiyuan Yang, Ziwei Yu, Xin Wang, Mingxuan Yuan, Xianzhi Yu, Zhenhua Dong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2602.12635
Pdf URL: https://arxiv.org/pdf/2602.12635
Copy Paste: [[2602.12635]] Unleashing Low-Bit Inference on Ascend NPUs: A Comprehensive Evaluation of HiFloat Formats(https://arxiv.org/abs/2602.12635)
Keywords: llm
Abstract: As LLMs scale, low-bit floating-point formats like MXFP and NVFP4 offer new opportunities for precision and efficiency. In this work, we evaluate HiFloat (HiF8 and HiF4), a family of formats tailored for Ascend NPUs. Through rigorous comparison across weight-activation and KV-cache tasks, we provide three key insights: (1) INT8 suits narrow-range data, while floating-point formats excel with high-variance data; (2) in 4-bit regimes, HiF4's hierarchical scaling prevents the accuracy collapse seen in integer formats; and (3) HiFloat is fully compatible with state-of-the-art post-training quantization frameworks. Overall, HiFloat provides a solution for high-efficiency LLM inference on NPUs.

Title: CLASE: A Hybrid Method for Chinese Legalese Stylistic Evaluation

Authors: Yiran Rex Ma, Yuxiao Ye, Huiyuan Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12639
Pdf URL: https://arxiv.org/pdf/2602.12639
Copy Paste: [[2602.12639]] CLASE: A Hybrid Method for Chinese Legalese Stylistic Evaluation(https://arxiv.org/abs/2602.12639)
Keywords: language model, llm
Abstract: Legal text generated by large language models (LLMs) can usually achieve reasonable factual accuracy, but it frequently fails to adhere to the specialised stylistic norms and linguistic conventions of legal writing. In order to improve stylistic quality, a crucial first step is to establish a reliable evaluation method. However, having legal experts manually develop such a metric is impractical, as the implicit stylistic requirements in legal writing practice are difficult to formalise into explicit rubrics. Meanwhile, existing automatic evaluation methods also fall short: reference-based metrics conflate semantic accuracy with stylistic fidelity, and LLM-as-a-judge evaluations suffer from opacity and inconsistency. To address these challenges, we introduce CLASE (Chinese LegAlese Stylistic Evaluation), a hybrid evaluation method that focuses on the stylistic performance of legal text. The method incorporates a hybrid scoring mechanism that combines 1) linguistic feature-based scores and 2) experience-guided LLM-as-a-judge scores. Both the feature coefficients and the LLM scoring experiences are learned from contrastive pairs of authentic legal documents and their LLM-restored counterparts. This hybrid design captures both surface-level features and implicit stylistic norms in a transparent, reference-free manner. Experiments on 200 Chinese legal documents show that CLASE achieves substantially higher alignment with human judgments than traditional metrics and pure LLM-as-a-judge methods. Beyond improved alignment, CLASE provides interpretable score breakdowns and suggestions for improvements, offering a scalable and practical solution for professional stylistic evaluation in legal text generation (Code and data for CLASE are available at: this https URL).

Title: Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

Authors: Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee, Kyomin Jung
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.12642
Pdf URL: https://arxiv.org/pdf/2602.12642
Copy Paste: [[2602.12642]] Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR(https://arxiv.org/abs/2602.12642)
Keywords: llm, prompt
Abstract: Reward-maximizing RL methods enhance the reasoning performance of LLMs, but often reduce the diversity among outputs. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through an accuracy estimate error-prioritized replay. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for a more sample efficient distribution-matching training for LLMs.

Title: Learning Ordinal Probabilistic Reward from Preferences

Authors: Longze Chen, Lu Wang, Renke Shan, Ze Gong, Run Luo, Jiaming Li, Jing Luo, Qiyao Wang, Min Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12660
Pdf URL: https://arxiv.org/pdf/2602.12660
Copy Paste: [[2602.12660]] Learning Ordinal Probabilistic Reward from Preferences(https://arxiv.org/abs/2602.12660)
Keywords: language model, llm
Abstract: Reward models are crucial for aligning large language models (LLMs) with human values and intentions. Existing approaches follow either Generative (GRMs) or Discriminative (DRMs) paradigms, yet both suffer from limitations: GRMs typically demand costly point-wise supervision, while DRMs produce uncalibrated relative scores that lack probabilistic interpretation. To address these challenges, we introduce a novel reward modeling paradigm: Probabilistic Reward Model (PRM). Instead of modeling reward as a deterministic scalar, our approach treats it as a random variable, learning a full probability distribution for the quality of each response. To make this paradigm practical, we present its closed-form, discrete realization: the Ordinal Probabilistic Reward Model (OPRM), which discretizes the quality score into a finite set of ordinal ratings. Building on OPRM, we propose a data-efficient training strategy called Region Flooding Tuning (RgFT). It enables rewards to better reflect absolute text quality by incorporating quality-level annotations, which guide the model to concentrate the probability mass within corresponding rating sub-regions. Experiments on various reward model benchmarks show that our method improves accuracy by 2.9% to 7.4% compared to prior reward models, demonstrating strong performance and data efficiency. Analysis of the score distribution provides evidence that our method captures not only relative rankings but also absolute quality.
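Sketch: Treating reward as a distribution over a finite set of ordinal ratings yields both an absolute score (the expected rating) and a pairwise preference probability. The code below is a generic illustration of that idea, not the OPRM training objective: the rating scale, the example distributions, the independence assumption, and the tie-splitting rule are all assumptions.

```python
def expected_rating(probs, ratings):
    """Mean rating under a discrete distribution over ordinal ratings."""
    return sum(p * r for p, r in zip(probs, ratings))


def prob_preferred(probs_a, probs_b):
    """P(rating of A > rating of B) plus half the tie mass.

    Assumes the two ratings are drawn independently.
    """
    win = sum(pa * pb
              for i, pa in enumerate(probs_a)
              for j, pb in enumerate(probs_b)
              if i > j)
    tie = sum(pa * pb for pa, pb in zip(probs_a, probs_b))
    return win + 0.5 * tie


# Hypothetical 5-level rating scale and predicted distributions.
ratings = [1, 2, 3, 4, 5]
response_a = [0.0, 0.1, 0.2, 0.4, 0.3]
response_b = [0.3, 0.4, 0.2, 0.1, 0.0]

score_a = expected_rating(response_a, ratings)
score_b = expected_rating(response_b, ratings)
pref_ab = prob_preferred(response_a, response_b)
print(score_a, score_b, pref_ab)
```

Unlike a single relative score, the two distributions can be compared directly or read on their own, which is what allows absolute quality to be reported alongside rankings.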

Title: $\mathcal{X}$-KD: General Experiential Knowledge Distillation for Large Language Models

Authors: Yuang Cai, Yuyu Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12674
Pdf URL: https://arxiv.org/pdf/2602.12674
Copy Paste: [[2602.12674]] $\mathcal{X}$-KD: General Experiential Knowledge Distillation for Large Language Models(https://arxiv.org/abs/2602.12674)
Keywords: language model, llm
Abstract: Knowledge Distillation (KD) for Large Language Models (LLMs) has become increasingly important as models grow in size and complexity. While existing distillation approaches focus on imitating teacher behavior, they often overlook the original learning environment that shaped the teacher's knowledge. Inspired by the experiential learning theory and inverse reinforcement learning, we propose Experiential Knowledge Distillation ($\mathcal{X}$-KD), a novel and general framework that enables student models to learn in the teacher's original learning environment. $\mathcal{X}$-KD adopts the Approximated Variational Reward Imitation Learning (AVRIL) framework to jointly model the teacher's original reward function and perform policy distillation, encouraging consistency between the student policy and the original reward function. Our derivation demonstrates that $\mathcal{X}$-KD follows the supervised learning framework and applies to both sequence-level and divergence-based distillation methods, underlining the simplicity and flexibility of our approach. Empirical results show that $\mathcal{X}$-KD outperforms the generalized KD and MiniLLM baselines on abstractive summarization, machine translation, and arithmetic reasoning tasks. Additionally, $\mathcal{X}$-KD achieves better performance-diversity trade-off and data efficiency than baseline KD approaches.

Title: MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs

Authors: Baorong Shi, Bo Cui, Boyuan Jiang, Deli Yu, Fang Qian, Haihua Yang, Huichao Wang, Jiale Chen, Jianfei Pan, Jieqiong Cao, Jinghao Lin, Kai Wu, Lin Yang, Shengsheng Yao, Tao Chen, Xiaojun Xiao, Xiaozhong Ji, Xu Wang, Yijun He, Zhixiong Yang
Subjects: cs.CL, cs.AI, cs.CV, eess.IV
Abstract URL: https://arxiv.org/abs/2602.12705
Pdf URL: https://arxiv.org/pdf/2602.12705
Copy Paste: [[2602.12705]] MedXIAOHE: A Comprehensive Recipe for Building Medical MLLMs(https://arxiv.org/abs/2602.12705)
Keywords: llm, hallucination, agent
Abstract: We present MedXIAOHE, a medical vision-language foundation model designed to advance general-purpose medical understanding and reasoning in real-world clinical applications. MedXIAOHE achieves state-of-the-art performance across diverse medical benchmarks and surpasses leading closed-source multimodal systems on multiple capabilities. To achieve this, we propose an entity-aware continual pretraining framework that organizes heterogeneous medical corpora to broaden knowledge coverage and reduce long-tail gaps (e.g., rare diseases). For medical expert-level reasoning and interaction, MedXIAOHE incorporates diverse medical reasoning patterns via reinforcement learning and tool-augmented agentic training, enabling multi-step diagnostic reasoning with verifiable decision traces. To improve reliability in real-world use, MedXIAOHE integrates user-preference rubrics, evidence-grounded reasoning, and low-hallucination long-form report generation, with improved adherence to medical instructions. We release this report to document our practical design choices, scaling insights, and evaluation framework, hoping to inspire further research.

Title: ReFilter: Improving Robustness of Retrieval-Augmented Generation via Gated Filter

Authors: Yixin Chen, Ying Xiong, Shangyu Wu, Xiangrui Ke, Nan Guan, Chun Jason Xue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12709
Pdf URL: https://arxiv.org/pdf/2602.12709
Copy Paste: [[2602.12709]] ReFilter: Improving Robustness of Retrieval-Augmented Generation via Gated Filter(https://arxiv.org/abs/2602.12709)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has become a dominant paradigm for grounding large language models (LLMs) with external evidence in knowledge-intensive question answering. A core design choice is how to fuse retrieved samples into the LLMs, where existing internal fusion approaches broadly fall into query-based fusion, parametric fusion, and latent-based fusion. Despite their effectiveness at modest retrieval scales, these methods often fail to scale gracefully as the number of retrieved candidates k increases: Larger k improves evidence coverage, yet realistic top-k retrieval inevitably contains irrelevant or redundant content and increases the inference cost. To address these limitations, we propose ReFilter, a novel latent-based fusion framework that performs token-level filtering and fusion. ReFilter consists of three key components: a context encoder for encoding context features, a gated filter for weighting each token, and a token fusion module for integrating the weighted token feature into the LLM's hidden states. Our experiments across four general-domain QA benchmarks show that ReFilter consistently achieves the best average performance under both in-domain adaptation and out-of-domain transfer. ReFilter further generalizes to five biomedical QA benchmarks in zero-shot transfer without domain fine-tuning, reaching 70.01% average accuracy with Qwen2.5-14B-Instruct.
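Sketch: Token-level gating can be illustrated as a sigmoid weight computed per token and applied to that token's feature vector before fusion. This is a toy illustration of the general gating idea only: the linear scoring function, the weights, and the two-dimensional features are assumptions, and the context encoder and fusion into hidden states are not modeled.

```python
import math


def gate_tokens(token_features, weights, bias=0.0):
    """Return (gate, gated_features) per token, with each gate in (0, 1)."""
    gated = []
    for features in token_features:
        score = sum(w * x for w, x in zip(weights, features)) + bias
        gate = 1.0 / (1.0 + math.exp(-score))
        gated.append((gate, [gate * x for x in features]))
    return gated


# Hypothetical features: the first token looks relevant, the second does not.
tokens = [[2.0, 0.5], [-2.0, 0.5]]
result = gate_tokens(tokens, weights=[3.0, 0.0])
gates = [gate for gate, _ in result]
print(gates)
```

A soft gate keeps the operation differentiable, so irrelevant retrieved tokens are down-weighted rather than dropped by a hard threshold.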

Title: RAT-Bench: A Comprehensive Benchmark for Text Anonymization

Authors: Nataša Krčo, Zexi Yao, Matthieu Meeus, Yves-Alexandre de Montjoye
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2602.12806
Pdf URL: https://arxiv.org/pdf/2602.12806
Copy Paste: [[2602.12806]] RAT-Bench: A Comprehensive Benchmark for Text Anonymization(https://arxiv.org/abs/2602.12806)
Keywords: language model, llm
Abstract: Data containing personal information is increasingly used to train, fine-tune, or query Large Language Models (LLMs). Text is typically scrubbed of identifying information prior to use, often with tools such as Microsoft's Presidio or Anthropic's PII purifier. These tools have traditionally been evaluated on their ability to remove specific identifiers (e.g., names), yet their effectiveness at preventing re-identification remains unclear. We introduce RAT-Bench, a comprehensive benchmark for text anonymization tools based on re-identification risk. Using U.S. demographic statistics, we generate synthetic text containing various direct and indirect identifiers across domains, languages, and difficulty levels. We evaluate a range of NER- and LLM-based text anonymization tools and, based on the attributes an LLM-based attacker is able to correctly infer from the anonymized text, we report the risk of re-identification in the U.S. population, while properly accounting for the disparate impact of identifiers. We find that, while capabilities vary widely, even the best tools are far from perfect in particular when direct identifiers are not written in standard ways and when indirect identifiers enable re-identification. Overall we find LLM-based anonymizers, including new iterative anonymizers, to provide a better privacy-utility trade-off albeit at a higher computational cost. Importantly, we also find them to work well across languages. We conclude with recommendations for future anonymization tools and will release the benchmark and encourage community efforts to expand it, in particular to other geographies.

Title: Left-right asymmetry in predicting brain activity from LLMs' representations emerges with their formal linguistic competence

Authors: Laurent Bonnasse-Gahot, Christophe Pallier
Subjects: cs.CL, cs.AI, q-bio.NC
Abstract URL: https://arxiv.org/abs/2602.12811
Pdf URL: https://arxiv.org/pdf/2602.12811
Copy Paste: [[2602.12811]] Left-right asymmetry in predicting brain activity from LLMs' representations emerges with their formal linguistic competence(https://arxiv.org/abs/2602.12811)
Keywords: language model, llm
Abstract: When humans and large language models (LLMs) process the same text, activations in the LLMs correlate with brain activity measured, e.g., with functional magnetic resonance imaging (fMRI). Moreover, it has been shown that, as the training of an LLM progresses, the performance in predicting brain activity from its internal activations improves more in the left hemisphere than in the right one. The aim of the present work is to understand which kind of competence acquired by the LLMs underlies the emergence of this left-right asymmetry. Using the OLMo-2 7B language model at various training checkpoints and fMRI data from English participants, we compare the evolution of the left-right asymmetry in brain scores alongside performance on several benchmarks. We observe that the asymmetry co-emerges with the formal linguistic abilities of the LLM. These abilities are demonstrated in two ways: by the model's capacity to assign a higher probability to an acceptable sentence than to a grammatically unacceptable one within a minimal contrasting pair, and by its ability to produce well-formed text. By contrast, the left-right asymmetry does not correlate with the performance on arithmetic or Dyck language tasks, nor with text-based tasks involving world knowledge and reasoning. We generalize these results to another family of LLMs (Pythia) and another language, namely French. Our observations indicate that the left-right asymmetry in brain predictivity matches the progress in formal linguistic competence (knowledge of linguistic patterns).
摘要：当人类和大语言模型 (LLM) 处理相同的文本时，LLM 中的激活与测量的大脑活动相关，例如通过功能磁共振成像 (fMRI) 测量。此外，研究表明，随着法学硕士训练的进展，左半球根据内部激活预测大脑活动的性能比右半球提高得更多。目前工作的目的是了解法学硕士获得的哪种能力是这种左右不对称出现的基础。使用不同训练检查点的 OLMo-2 7B 语言模型和来自英语参与者的功能磁共振成像数据，我们比较了大脑分数左右不对称性的演变以及多个基准的表现。我们观察到这种不对称性与法学硕士的形式语言能力同时出现。这些能力通过两种方式证明：模型能够为可接受的句子分配比最小对比对中语法上不可接受的句子更高的概率，或者模型生成结构良好的文本的能力。相反，左右不对称与算术或 Dyck 语言任务的表现无关。也不涉及涉及世界知识和推理的基于文本的任务。我们将这些结果推广到另一个法学硕士家族（Pythia）和另一种语言，即法语。我们的观察表明，大脑预测能力的左右不对称与形式语言能力（语言模式知识）的进步相匹配。

Title: AIWizards at MULTIPRIDE: A Hierarchical Approach to Slur Reclamation Detection

Authors: Luca Tedeschini, Matteo Fasulo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12818
Pdf URL: https://arxiv.org/pdf/2602.12818
Copy Paste: [[2602.12818]] AIWizards at MULTIPRIDE: A Hierarchical Approach to Slur Reclamation Detection(https://arxiv.org/abs/2602.12818)
Keywords: llm
Abstract: Detecting reclaimed slurs represents a fundamental challenge for hate speech detection systems, as the same lexical items can function either as abusive expressions or as in-group affirmations depending on social identity and context. In this work, we address Subtask B of the MultiPRIDE shared task at EVALITA 2026 by proposing a hierarchical approach to modeling the slur reclamation process. Our core assumption is that members of the LGBTQ+ community are more likely, on average, to employ certain slurs in a reclamatory manner. Based on this hypothesis, we decompose the task into two stages. First, using a weakly supervised LLM-based annotation, we assign fuzzy labels to users indicating the likelihood of belonging to the LGBTQ+ community, inferred from the tweet and the user bio. These soft labels are then used to train a BERT-like model to predict community membership, encouraging the model to learn latent representations associated with LGBTQ+ identity. In the second stage, we integrate this latent space with a newly initialized model for the downstream slur reclamation detection task. The intuition is that the first model encodes user-oriented sociolinguistic signals, which are then fused with representations learned by a model pretrained for hate speech detection. Experimental results on Italian and Spanish show that our approach achieves performance statistically comparable to a strong BERT-based baseline, while providing a modular and extensible framework for incorporating sociolinguistic context into hate speech modeling. We argue that more fine-grained hierarchical modeling of user identity and discourse context may further improve the detection of reclaimed language. We release our code at this https URL.
摘要：检测回收的诽谤性内容对仇恨言论检测系统来说是一个根本性的挑战，因为根据社会身份和背景，相同的词汇可以充当辱骂性表达，也可以充当群体内的肯定。在这项工作中，我们通过提出一种对诽谤回收过程进行建模的分层方法来解决 EVALITA 2026 上 MultiPRIDE 共享任务的子任务 B。我们的核心假设是，平均而言，LGBTQ+ 群体的成员更有可能以热烈的方式使用某些诽谤性言论。基于这个假设，我们将任务分解为两个阶段。首先，使用基于 LLM 的弱监督注释，我们为用户分配模糊标签，表明根据推文和用户简介推断属于 LGBTQ+ 社区的可能性。然后，使用这些软标签来训练类似 BERT 的模型来预测社区成员资格，从而鼓励模型学习与 LGBTQ+ 身份相关的潜在表示。在第二阶段，我们将这个潜在空间与新初始化的模型集成起来，用于下游连线回收检测任务。直觉是，第一个模型对面向用户的社会语言信号进行编码，然后将其与针对仇恨语音检测进行预训练的模型学习到的表示融合。意大利语和西班牙语的实验结果表明，我们的方法在统计上达到了与基于 BERT 的强大基线相当的性能，同时提供了一个模块化和可扩展的框架，用于将社会语言背景纳入仇恨言论建模。我们认为，对用户身份和话语上下文进行更细粒度的分层建模可能会进一步改善回收语言的检测。我们在此 https URL 发布我们的代码。

Title: MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models

Authors: Hoyun Song, Migyeong Kang, Jisu Shin, Jihyun Kim, Chanbi Park, Hangyeol Yoo, Jihyun An, Alice Oh, Jinyoung Han, KyungTae Lim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12871
Pdf URL: https://arxiv.org/pdf/2602.12871
Copy Paste: [[2602.12871]] MentalBench: A Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models(https://arxiv.org/abs/2602.12871)
Keywords: language model, llm
Abstract: We introduce MentalBench, a benchmark for evaluating psychiatric diagnostic decision-making in large language models (LLMs). Existing mental health benchmarks largely rely on social media data, limiting their ability to assess DSM-grounded diagnostic judgments. At the core of MentalBench is MentalKG, a psychiatrist-built and validated knowledge graph encoding DSM-5 diagnostic criteria and differential diagnostic rules for 23 psychiatric disorders. Using MentalKG as a golden-standard logical backbone, we generate 24,750 synthetic clinical cases that systematically vary in information completeness and diagnostic complexity, enabling low-noise and interpretable evaluation. Our experiments show that while state-of-the-art LLMs perform well on structured queries probing DSM-5 knowledge, they struggle to calibrate confidence in diagnostic decision-making when distinguishing between clinically overlapping disorders. These findings reveal evaluation gaps not captured by existing benchmarks.
摘要：我们推出 MentalBench，这是一个在大语言模型 (LLM) 中评估精神病学诊断决策的基准。现有的心理健康基准在很大程度上依赖于社交媒体数据，限制了它们评估基于 DSM 的诊断判断的能力。 MentalBench 的核心是 MentalKG，这是一个由精神病学家构建并经过验证的知识图，编码了 23 种精神疾病的 DSM-5 诊断标准和鉴别诊断规则。使用 MentalKG 作为黄金标准逻辑主干，我们生成 24,750 个合成临床病例，这些病例的信息完整性和诊断复杂性系统性地变化，从而实现低噪声和可解释的评估。我们的实验表明，虽然最先进的法学硕士在探索 DSM-5 知识的结构化查询方面表现良好，但在区分临床重叠疾病时，他们很难校准诊断决策的信心。这些发现揭示了现有基准未涵盖的评估差距。

Title: BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models

Authors: Jiangxi Chen, Qian Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12889
Pdf URL: https://arxiv.org/pdf/2602.12889
Copy Paste: [[2602.12889]] BaziQA-Benchmark: Evaluating Symbolic and Temporally Compositional Reasoning in Large Language Models(https://arxiv.org/abs/2602.12889)
Keywords: language model, prompt
Abstract: We present BaziQA-Benchmark, a standardized benchmark for evaluating symbolic and temporally compositional reasoning in large language models. The benchmark is derived from 200 professionally curated, multiple-choice problems from the Global Fortune-teller Competition (2021--2025), where each instance requires structured inference over a fixed symbolic chart and interacting temporal conditions. Unlike anecdotal or prompt-driven evaluations, BaziQA-Benchmark enables objective scoring and controlled comparison across years, domains, and model families. We evaluate contemporary language models under a multi-turn setting and analyze performance variation across temporal difficulty, reasoning domains, and inference this http URL further probe reasoning behavior, we introduce a lightweight Structured Reasoning Protocol that constrains inference order without adding domain knowledge. Results show that models consistently outperform chance but remain far from saturation, exhibiting pronounced sensitivity to temporal composition and reasoning order, as well as systematic failures on precise temporal localization and multi-condition symbolic judgments.
摘要：我们提出了 BaziQA-Benchmark，这是一个用于评估大型语言模型中的符号和时间组合推理的标准化基准。该基准源自全球算命先生大赛（2021--2025）中的 200 个专业策划的多项选择题，其中每个实例都需要对固定符号图表和交互时间条件进行结构化推理。与轶事或提示驱动的评估不同，BaziQA-Benchmark 可以实现跨年份、领域和模型系列的客观评分和受控比较。我们在多轮设置下评估当代语言模型，并分析跨时间难度、推理领域的性能变化，并推断此 http URL 进一步探测推理行为，我们引入了一种轻量级结构化推理协议，该协议在不添加领域知识的情况下限制推理顺序。结果表明，模型的表现始终优于机会，但仍远未达到饱和，对时间构成和推理顺序表现出明显的敏感性，以及在精确时间定位和多条件符号判断方面的系统性失败。

Title: When Words Don't Mean What They Say: Figurative Understanding in Bengali Idioms

Authors: Adib Sakhawat, Shamim Ara Parveen, Md Ruhul Amin, Shamim Al Mahmud, Md Saiful Islam, Tahera Khatun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12921
Pdf URL: https://arxiv.org/pdf/2602.12921
Copy Paste: [[2602.12921]] When Words Don't Mean What They Say: Figurative Understanding in Bengali Idioms(https://arxiv.org/abs/2602.12921)
Keywords: language model, llm
Abstract: Figurative language understanding remains a significant challenge for Large Language Models (LLMs), especially for low-resource languages. To address this, we introduce a new idiom dataset, a large-scale, culturally-grounded corpus of 10,361 Bengali idioms. Each idiom is annotated under a comprehensive 19-field schema, established and refined through a deliberative expert consensus process, that captures its semantic, syntactic, cultural, and religious dimensions, providing a rich, structured resource for computational linguistics. To establish a robust benchmark for Bangla figurative language understanding, we evaluate 30 state-of-the-art multilingual and instruction-tuned LLMs on the task of inferring figurative meaning. Our results reveal a critical performance gap, with no model surpassing 50% accuracy, a stark contrast to significantly higher human performance (83.4%). This underscores the limitations of existing models in cross-linguistic and cultural reasoning. By releasing the new idiom dataset and benchmark, we provide foundational infrastructure for advancing figurative language understanding and cultural grounding in LLMs for Bengali and other low-resource languages.
摘要：比喻语言理解仍然是大型语言模型（LLM）的一个重大挑战，特别是对于资源匮乏的语言。为了解决这个问题，我们引入了一个新的习语数据集，这是一个包含 10,361 个孟加拉语习语的大规模、基于文化的语料库。每个习语都在一个全面的 19 领域模式下进行注释，通过深思熟虑的专家共识过程建立和完善，捕捉其语义、句法、文化和宗教维度，为计算语言学提供丰富的结构化资源。为了建立孟加拉比喻语言理解的稳健基准，我们评估了 30 个最先进的多语言和经过教学调整的法学硕士的比喻意义推断任务。我们的结果揭示了关键的性能差距，没有模型的准确率超过 50%，这与人类更高的性能 (83.4%) 形成鲜明对比。这强调了现有模型在跨语言和文化推理方面的局限性。通过发布新的习语数据集和基准，我们为孟加拉语和其他资源匮乏语言的法学硕士中推进比喻语言理解和文化基础提供了基础设施。

Title: Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models

Authors: Ali Mekky, Mohamed El Zeftawy, Lara Hassan, Amr Keleg, Preslav Nakov
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2602.12937
Pdf URL: https://arxiv.org/pdf/2602.12937
Copy Paste: [[2602.12937]] Curriculum Learning and Pseudo-Labeling Improve the Generalization of Multi-Label Arabic Dialect Identification Models(https://arxiv.org/abs/2602.12937)
Keywords: gpt
Abstract: Although Arabic Dialect Identification (ADI) has long been modeled as a single-label classification task, recent work has argued that it should be framed as a multi-label classification task. However, ADI remains constrained by the availability of single-label datasets, with no large-scale multi-label resources available for training. By analyzing models trained on single-label ADI data, we show that the main difficulty in repurposing such datasets for Multi-Label Arabic Dialect Identification (MLADI) lies in the selection of negative samples, as many sentences treated as negative could be acceptable in multiple dialects. To address these issues, we construct a multi-label dataset by generating automatic multi-label annotations using GPT-4o and binary dialect acceptability classifiers, with aggregation guided by the Arabic Level of Dialectness (ALDi). Afterward, we train a BERT-based multi-label classifier using curriculum learning strategies aligned with dialectal complexity and label cardinality. On the MLADI leaderboard, our best-performing LAHJATBERT model achieves a macro F1 of 0.69, compared to 0.55 for the strongest previously reported system. Code and data are available at this https URL.
摘要：长期以来，阿拉伯语方言识别（ADI）一直被建模为单标签分类任务，但最近的工作认为，阿拉伯语方言识别（ADI）应该被构建为多标签分类任务。然而，ADI 仍然受到单标签数据集可用性的限制，没有大规模的多标签资源可用于训练。通过分析单标签 ADI 数据训练的模型，我们发现将此类数据集重新用于多标签阿拉伯方言识别（MLADI）的主要困难在于负样本的选择，因为许多被视为负样本的句子在多种方言中是可以接受的。为了解决这些问题，我们通过使用 GPT-4o 和二进制方言可接受性分类器生成自动多标签注释来构建多标签数据集，并以阿拉伯语方言级别 (ALDi) 为指导进行聚合。随后，我们使用与方言复杂性和标签基数相一致的课程学习策略来训练基于 BERT 的多标签分类器。在 MLADI 排行榜上，我们表现最好的 LAHJATBERT 模型的宏观 F1 为 0.69，而之前报告的最强系统的宏观 F1 为 0.55。代码和数据可从此 https URL 获取。

Title: ProbeLLM: Automating Principled Diagnosis of LLM Failures

Authors: Yue Huang, Zhengzhe Jiang, Yuchen Ma, Yu Jiang, Xiangqi Wang, Yujun Zhou, Yuexing Hao, Kehan Guo, Pin-Yu Chen, Stefan Feuerriegel, Xiangliang Zhang
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2602.12966
Pdf URL: https://arxiv.org/pdf/2602.12966
Copy Paste: [[2602.12966]] ProbeLLM: Automating Principled Diagnosis of LLM Failures(https://arxiv.org/abs/2602.12966)
Keywords: language model, llm
Abstract: Understanding how and why large language models (LLMs) fail is becoming a central challenge as models rapidly evolve and static evaluations fall behind. While automated probing has been enabled by dynamic test generation, existing approaches often discover isolated failure cases, lack principled control over exploration, and provide limited insight into the underlying structure of model weaknesses. We propose ProbeLLM, a benchmark-agnostic automated probing framework that elevates weakness discovery from individual failures to structured failure modes. ProbeLLM formulates probing as a hierarchical Monte Carlo Tree Search, explicitly allocating limited probing budgets between global exploration of new failure regions and local refinement of recurring error patterns. By restricting probing to verifiable test cases and leveraging tool-augmented generation and verification, ProbeLLM grounds failure discovery in reliable evidence. Discovered failures are further consolidated into interpretable failure modes via failure-aware embeddings and boundary-aware induction. Across diverse benchmarks and LLMs, ProbeLLM reveals substantially broader, cleaner, and more fine-grained failure landscapes than static benchmarks and prior automated methods, supporting a shift from case-centric evaluation toward principled weakness discovery.
摘要：随着模型的快速发展和静态评估的落后，理解大型语言模型 (LLM) 失败的方式和原因正在成为一个核心挑战。虽然通过动态测试生成实现了自动探测，但现有方法经常发现孤立的故障案例，缺乏对探索的原则性控制，并且对模型弱点的底层结构提供的洞察有限。我们提出了 ProbeLLM，这是一个与基准无关的自动探测框架，可将弱点发现从个体故障提升到结构化故障模式。 ProbeLLM 将探测制定为分层蒙特卡罗树搜索，在新故障区域的全局探索和重复错误模式的局部细化之间明确分配有限的探测预算。通过将探测限制在可验证的测试用例并利用工具增强的生成和验证，ProbeLLM 将故障发现建立在可靠的证据之上。通过故障感知嵌入和边界感知归纳，发现的故障进一步整合为可解释的故障模式。在不同的基准和法学硕士中，ProbeLLM 比静态基准和先前的自动化方法揭示了更广泛、更清晰和更细粒度的故障场景，支持从以案例为中心的评估向有原则的弱点发现的转变。
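
The budget split between "global exploration of new failure regions" and "local refinement of recurring error patterns" is the classic exploration/exploitation trade-off in tree search. The sketch below shows one common way to make that choice, the UCB1 rule used in many Monte Carlo Tree Search implementations. It is an illustration only: the abstract does not state which selection rule ProbeLLM uses, and the node fields (`reward` as the number of failures found, `visits` as the number of probes spent) and the constant `c` are hypothetical.

```python
import math

def ucb1(reward: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """UCB1 score: mean reward plus an exploration bonus for rarely probed nodes."""
    if visits == 0:
        return math.inf  # unvisited regions are always tried first
    mean = reward / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean + bonus

def select_child(children: list[dict], c: float = 1.4) -> dict:
    """Pick the child region that should receive the next probe."""
    # Hypothetical node layout: {"name": str, "reward": float, "visits": int}
    parent_visits = sum(ch["visits"] for ch in children) or 1
    return max(children, key=lambda ch: ucb1(ch["reward"], ch["visits"], parent_visits, c))

regions = [
    {"name": "date arithmetic", "reward": 8.0, "visits": 10},  # many failures found
    {"name": "unit conversion", "reward": 1.0, "visits": 10},  # few failures found
    {"name": "unexplored topic", "reward": 0.0, "visits": 0},  # never probed
]
print(select_child(regions)["name"])
```

A larger `c` spends more of the budget on under-probed regions, while `c = 0` always refines the region with the highest observed failure rate.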

Title: SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents

Authors: Yujiong Shen, Yajie Yang, Zhiheng Xi, Binze Hu, Huayu Sha, Jiazheng Zhang, Qiyuan Peng, Junlin Shang, Jixuan Huang, Yutao Fan, Jingqi Tong, Shihan Dou, Ming Zhang, Lei Bai, Zhenfei Yin, Tao Gui, Xingjun Ma, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12984
Pdf URL: https://arxiv.org/pdf/2602.12984
Copy Paste: [[2602.12984]] SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents(https://arxiv.org/abs/2602.12984)
Keywords: gpt, llm, agent
Abstract: Scientific reasoning inherently demands integrating sophisticated toolkits to navigate domain-specific knowledge. Yet, current benchmarks largely overlook agents' ability to orchestrate tools for such rigorous workflows. To bridge this gap, we introduce SciAgentGym, a scalable interactive environment featuring 1,780 domain-specific tools across four natural science disciplines, supported by a robust execution infrastructure. Complementing this, we present SciAgentBench, a tiered evaluation suite designed to stress-test agentic capabilities from elementary actions to long-horizon workflows. Our evaluation identifies a critical bottleneck: state-of-the-art models struggle with complex scientific tool-use. Even for a leading model like GPT-5, success rates drop sharply from 60.6% to 30.9% as interaction horizons extend, primarily due to failures in multi-step workflow execution. To address this, we propose SciForge, a data synthesis method that models the tool action space as a dependency graph to generate logic-aware training trajectories. By fine-tuning on these trajectories, our SciAgent-8B outperforms the significantly larger Qwen3-VL-235B-Instruct while exhibiting positive cross-domain transfer of scientific tool-use capabilities. These results underscore the promising potential of next-generation autonomous scientific agents.
摘要：科学推理本质上需要集成复杂的工具包来导航特定领域的知识。然而，当前的基准在很大程度上忽视了代理为如此严格的工作流程编排工具的能力。为了弥补这一差距，我们推出了 SciAgentGym，这是一个可扩展的交互式环境，具有跨四个自然科学学科的 1,780 个特定领域工具，并由强大的执行基础设施提供支持。作为补充，我们推出了 SciAgentBench，这是一个分层评估套件，旨在对从基本操作到长期工作流程的代理功能进行压力测试。我们的评估发现了一个关键瓶颈：最先进的模型难以应对复杂的科学工具的使用。即使对于像 GPT-5 这样的领先模型，随着交互范围的扩展，成功率也会从 60.6% 急剧下降到 30.9%，这主要是由于多步骤工作流执行失败。为了解决这个问题，我们提出了 SciForge，一种数据合成方法，它将工具操作空间建模为依赖图，以生成逻辑感知的训练轨迹。通过对这些轨迹进行微调，我们的 SciAgent-8B 的性能明显优于更大的 Qwen3-VL-235B-Instruct，同时表现出科学工具使用能力的积极跨域转移。这些结果强调了下一代自主科学代理的巨大潜力。

Title: Evaluating the Homogeneity of Keyphrase Prediction Models

Authors: Maël Houbre, Florian Boudin, Beatrice Daille
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.12989
Pdf URL: https://arxiv.org/pdf/2602.12989
Copy Paste: [[2602.12989]] Evaluating the Homogeneity of Keyphrase Prediction Models(https://arxiv.org/abs/2602.12989)
Keywords: prompt
Abstract: Keyphrases, which are useful in several NLP and IR applications, are either extracted from text or predicted by generative models. In contrast to keyphrase extraction approaches, keyphrase generation models can predict keyphrases that do not appear in a document's text, called "absent keyphrases". This ability means that keyphrase generation models can associate a document with a notion that is not explicitly mentioned in its text. Intuitively, this suggests that for two documents treating the same subjects, a keyphrase generation model is more likely to be homogeneous in its indexing, i.e., to predict the same keyphrases for both documents regardless of whether those keyphrases appear in their respective texts, something a keyphrase extraction model would fail to do. Yet, homogeneity of keyphrase prediction models is not covered by current benchmarks. In this work, we introduce a method to evaluate the homogeneity of keyphrase prediction models and study whether absent keyphrase generation capabilities actually help the model to be more homogeneous. To our surprise, we show that keyphrase extraction methods are competitive with generative models, and that the ability to generate absent keyphrases can actually have a negative impact on homogeneity. Our data, code and prompts are available on huggingface and github.
摘要：在一些 NLP 和 IR 应用程序中有用的关键短语要么是从文本中提取的，要么是通过生成模型预测的。与关键短语提取方法相反，关键短语生成模型可以预测文档文本中未出现的关键短语，称为“缺席关键短语”。这种能力意味着关键短语生成模型可以将文档与其文本中未明确提及的概念相关联。直观上，这表明对于处理相同主题的两个文档，关键短语生成模型在其索引中更有可能是同质的，即预测两个文档的相同关键短语，无论这些关键短语是否出现在各自的文本中；这是关键词提取模型无法做到的。然而，当前的基准并未涵盖关键短语预测模型的同质性。在这项工作中，我们介绍了一种评估关键短语预测模型同质性的方法，并研究缺乏关键短语生成功能是否确实有助于模型更加同质。令我们惊讶的是，我们表明关键短语提取方法与生成模型具有竞争力，并且生成缺失关键短语的能力实际上会对同质性产生负面影响。我们的数据、代码和提示可以在 Huggingface 和 github 上找到。

Title: Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models

Authors: Hao Chen, Ye He, Yuchun Fan, Yukun Yan, Zhenghao Liu, Qingfu Zhu, Maosong Sun, Wanxiang Che
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.12996
Pdf URL: https://arxiv.org/pdf/2602.12996
Copy Paste: [[2602.12996]] Know More, Know Clearer: A Meta-Cognitive Framework for Knowledge Augmentation in Large Language Models(https://arxiv.org/abs/2602.12996)
Keywords: language model, llm
Abstract: Knowledge augmentation has significantly enhanced the performance of Large Language Models (LLMs) in knowledge-intensive tasks. However, existing methods typically operate on the simplistic premise that model performance equates with internal knowledge, overlooking the knowledge-confidence gaps that lead to overconfident errors or uncertain truths. To bridge this gap, we propose a novel meta-cognitive framework for reliable knowledge augmentation via differentiated intervention and alignment. Our approach leverages internal cognitive signals to partition the knowledge space into mastered, confused, and missing regions, guiding targeted knowledge expansion. Furthermore, we introduce a cognitive consistency mechanism to synchronize subjective certainty with objective accuracy, ensuring calibrated knowledge boundaries. Extensive experiments demonstrate that our framework consistently outperforms strong baselines, validating its rationality in not only enhancing knowledge capabilities but also fostering cognitive behaviors that better distinguish knowns from unknowns.
摘要：知识增强显着增强了大型语言模型（LLM）在知识密集型任务中的性能。然而，现有方法通常在模型性能与内部知识等同的简单化前提下运行，忽略了导致过度自信错误或不确定真相的知识置信度差距。为了弥补这一差距，我们提出了一种新颖的元认知框架，通过差异化的干预和调整来实现可靠的知识增强。我们的方法利用内部认知信号将知识空间划分为已掌握、困惑和缺失的区域，从而指导有针对性的知识扩展。此外，我们引入了认知一致性机制，使主观确定性与客观准确性同步，确保校准的知识边界。大量的实验表明，我们的框架始终优于强大的基线，验证了其合理性，不仅可以增强知识能力，还可以促进更好地区分已知与未知的认知行为。

Title: TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution

Authors: Tejas Anvekar, Junha Park, Rajat Jha, Devanshu Gupta, Poojah Ganesan, Puneeth Mathur, Vivek Gupta
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13059
Pdf URL: https://arxiv.org/pdf/2602.13059
Copy Paste: [[2602.13059]] TraceBack: Multi-Agent Decomposition for Fine-Grained Table Attribution(https://arxiv.org/abs/2602.13059)
Keywords: agent
Abstract: Question answering (QA) over structured tables requires not only accurate answers but also transparency about which cells support them. Existing table QA systems rarely provide fine-grained attribution, so even correct answers often lack verifiable grounding, limiting trust in high-stakes settings. We address this with TraceBack, a modular multi-agent framework for scalable, cell-level attribution in single-table QA. TraceBack prunes tables to relevant rows and columns, decomposes questions into semantically coherent sub-questions, and aligns each answer span with its supporting cells, capturing both explicit and implicit evidence used in intermediate reasoning steps. To enable systematic evaluation, we release CITEBench, a benchmark with phrase-to-cell annotations drawn from ToTTo, FetaQA, and AITQA. We further propose FairScore, a reference-less metric that compares atomic facts derived from predicted cells and answers to estimate attribution precision and recall without human cell labels. Experiments show that TraceBack substantially outperforms strong baselines across datasets and granularities, while FairScore closely tracks human judgments and preserves relative method rankings, supporting interpretable and scalable evaluation of table-based QA.
摘要：通过结构化表格进行问答 (QA) 不仅需要准确的答案，还需要透明地了解哪些单元格支持它们。现有的桌面 QA 系统很少提供细粒度的归因，因此即使是正确的答案也往往缺乏可验证的基础，从而限制了对高风险环境的信任。我们使用 TraceBack 来解决这个问题，TraceBack 是一个模块化多代理框架，用于在单表 QA 中进行可扩展的单元级归因。 TraceBack 将表格修剪为相关的行和列，将问题分解为语义上连贯的子问题，并将每个答案范围与其支持单元格对齐，捕获中间推理步骤中使用的显式和隐式证据。为了实现系统评估，我们发布了 CITEBench，这是一个基准，包含来自 ToTTo、FetaQA 和 AITQA 的短语到单元格注释。我们进一步提出 FairScore，这是一种无参考指标，可以比较从预测细胞得出的原子事实和答案，以估计归因精度和召回率，而无需人类细胞标签。实验表明，TraceBack 在数据集和粒度方面大大优于强大的基线，而 FairScore 密切跟踪人类判断并保留相对方法排名，支持基于表的 QA 的可解释和可扩展的评估。

Title: Exploring a New Competency Modeling Process with Large Language Models

Authors: Silin Du, Manqing Xin, Raymond Jia Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2602.13084
Pdf URL: https://arxiv.org/pdf/2602.13084
Copy Paste: [[2602.13084]] Exploring a New Competency Modeling Process with Large Language Models(https://arxiv.org/abs/2602.13084)
Keywords: language model, llm
Abstract: Competency modeling is widely used in human resource management to select, develop, and evaluate talent. However, traditional expert-driven approaches rely heavily on manual analysis of large volumes of interview transcripts, making them costly and prone to randomness, ambiguity, and limited reproducibility. This study proposes a new competency modeling process built on large language models (LLMs). Instead of merely automating isolated steps, we reconstruct the workflow by decomposing expert practices into structured computational components. Specifically, we leverage LLMs to extract behavioral and psychological descriptions from raw textual data and map them to predefined competency libraries through embedding-based similarity. We further introduce a learnable parameter that adaptively integrates different information sources, enabling the model to determine the relative importance of behavioral and psychological signals. To address the long-standing challenge of validation, we develop an offline evaluation procedure that allows systematic model selection without requiring additional large-scale data collection. Empirical results from a real-world implementation in a software outsourcing company demonstrate strong predictive validity, cross-library consistency, and structural robustness. Overall, our framework transforms competency modeling from a largely qualitative and expert-dependent practice into a transparent, data-driven, and evaluable analytical process.
摘要：胜任力模型广泛应用于人力资源管理中，用于选拔、开发和评估人才。然而，传统的专家驱动方法严重依赖于对大量访谈笔录的手动分析，这使得它们成本高昂，并且容易出现随机性、模糊性和可重复性有限的情况。本研究提出了一种基于大型语言模型 (LLM) 的新能力建模流程。我们不是仅仅自动化孤立的步骤，而是通过将专家实践分解为结构化计算组件来重建工作流程。具体来说，我们利用法学硕士从原始文本数据中提取行为和心理描述，并通过基于嵌入的相似性将它们映射到预定义的能力库。我们进一步引入了一个可学习的参数，该参数可以自适应地集成不同的信息源，使模型能够确定行为和心理信号的相对重要性。为了解决验证的长期挑战，我们开发了一种离线评估程序，可以进行系统的模型选择，而无需额外的大规模数据收集。一家软件外包公司的实际实施的实证结果证明了强大的预测有效性、跨库一致性和结构稳健性。总体而言，我们的框架将能力建模从很大程度上定性和依赖专家的实践转变为透明的、数据驱动的和可评估的分析过程。
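
The mapping step described above (embedding-based similarity against a competency library, with a learnable parameter that balances behavioral and psychological signals) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the weighting form `w * sim_behavior + (1 - w) * sim_psych`, the sigmoid parameterization of `w`, the function names, and the toy two-dimensional vectors are all hypothetical.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def competency_scores(behavior_vec, psych_vec, library, theta):
    """Score each library competency by a weighted mix of two similarity signals.

    theta is the (assumed) learnable parameter; sigmoid keeps the weight in (0, 1).
    """
    w = sigmoid(theta)
    return {
        name: w * cosine(behavior_vec, vec) + (1.0 - w) * cosine(psych_vec, vec)
        for name, vec in library.items()
    }

# Toy embeddings standing in for real sentence embeddings.
library = {"teamwork": [1.0, 0.0], "rigor": [0.0, 1.0]}
scores = competency_scores([1.0, 0.0], [0.0, 1.0], library, theta=0.0)
print(scores)
```

With `theta = 0.0` both signals count equally. In a trained system `theta` would be fitted against an outcome measure, which is how the relative importance of the two sources could be read off.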

Title: SCOPE: Selective Conformal Optimized Pairwise LLM Judging

Authors: Sher Badshah, Ali Emami, Hassan Sajjad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2602.13110
Pdf URL: https://arxiv.org/pdf/2602.13110
Copy Paste: [[2602.13110]] SCOPE: Selective Conformal Optimized Pairwise LLM Judging(https://arxiv.org/abs/2602.13110)
Keywords: language model, llm, chat
Abstract: Large language models (LLMs) are increasingly used as judges to replace costly human preference labels in pairwise evaluation. Despite their practicality, LLM judges remain prone to miscalibration and systematic biases. This paper proposes SCOPE (Selective Conformal Optimized Pairwise Evaluation), a framework for selective pairwise judging with finite-sample statistical guarantees. Under exchangeability, SCOPE calibrates an acceptance threshold such that the error rate among non-abstained judgments is at most a user-specified level $\alpha$. To provide SCOPE with a bias-neutral uncertainty signal, we introduce Bidirectional Preference Entropy (BPE), which queries the judge under both response positions, aggregates the implied preference probabilities to enforce invariance to response order, and converts the aggregated probability into an entropy-based uncertainty score. Across MT-Bench, RewardBench, and Chatbot Arena, BPE improves uncertainty quality over standard confidence proxies, providing a stronger selection signal that enables SCOPE to consistently meet the target risk level while retaining good coverage across judge scales. In particular, at $\alpha = 0.10$, SCOPE consistently satisfies the risk bound across all benchmarks and judge scales (empirical risk $\approx 0.097$ to $0.099$), while retaining substantial coverage, reaching $0.89$ on RewardBench with Qwen-14B and $0.98$ on RewardBench with Qwen-32B. Compared to naïve baselines, SCOPE accepts up to $2.4\times$ more judgments on MT-Bench with Qwen-7B under the same target risk constraint, demonstrating that BPE enables reliable and high-coverage LLM-based evaluation.
摘要：大型语言模型（LLM）越来越多地被用作法官，以取代成对评估中昂贵的人类偏好标签。尽管具有实用性，法学硕士法官仍然容易出现错误校准和系统性偏差。本文提出了 SCOPE（选择性共形优化成对评估），一种具有有限样本统计保证的选择性成对判断框架。在可交换性下，SCOPE 校准一个接受阈值，使得非弃权判决之间的错误率至多为用户指定的水平 $\alpha$。为了向 SCOPE 提供偏差中性的不确定性信号，我们引入了双向偏好熵（BPE），它在两个响应位置下询问法官，聚合隐含的偏好概率以强制响应顺序的不变性，并将聚合概率转换为基于熵的不确定性得分。在 MT-Bench、RewardBench 和 Chatbot Arena 中，BPE 比标准置信度代理提高了不确定性质量，提供了更强的选择信号，使 SCOPE 能够始终满足目标风险水平，同时在法官范围内保持良好的覆盖范围。特别是，在 $\alpha = 0.10$ 时，\textsc{Scope} 始终满足所有基准和判断尺度的风险界限（经验风险 $\约 0.097$ 至 $0.099$），同时保留大量覆盖范围，在 Qwen-14B 的 RewardBench 上达到 0.89$，在 Qwen-32B 的 RewardBench 上达到 0.98$。与朴素基线相比，在相同的目标风险约束下，\textsc{Scope} 在使用 Qwen-7B 的 MT-Bench 上接受的判断增加了高达 $2.4\times$，这表明 BPE 能够实现可靠且高覆盖率的基于 LLM 的评估。
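
The two steps in the abstract, computing an order-invariant uncertainty score and then choosing an acceptance threshold on held-out data, can be sketched as below. Several details are assumptions made for illustration: averaging the two position-swapped probabilities as the aggregation, using binary entropy as the score, and picking the threshold by a plain empirical search. The empirical search does not by itself carry the finite-sample guarantee the paper claims; the abstract does not give the actual calibration rule.

```python
import math

def bpe_uncertainty(p_a_shown_first: float, p_a_shown_second: float) -> float:
    """Entropy (in bits) of the judge's preference for response A,
    averaged over both presentation orders (assumed aggregation)."""
    p = 0.5 * (p_a_shown_first + p_a_shown_second)
    p = min(max(p, 1e-12), 1.0 - 1e-12)  # avoid log(0)
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def calibrate_threshold(uncertainties, correct, alpha):
    """Largest threshold t such that, on the calibration set, judgments with
    uncertainty <= t have an empirical error rate of at most alpha.
    Returns None if no threshold qualifies."""
    pairs = sorted(zip(uncertainties, correct))
    best, errors = None, 0
    for n, (u, ok) in enumerate(pairs, start=1):
        errors += 0 if ok else 1
        # Only evaluate at the end of a run of tied uncertainty values.
        if n < len(pairs) and pairs[n][0] == u:
            continue
        if errors / n <= alpha:
            best = u
    return best

# A judge that flips its answer when the order is swapped is maximally uncertain.
print(bpe_uncertainty(0.9, 0.1))

threshold = calibrate_threshold([0.1, 0.2, 0.3, 0.9], [True, True, True, False], alpha=0.10)
print(threshold)
```

At test time, a judgment would be accepted when its uncertainty is at or below the calibrated threshold, and the judge would abstain otherwise.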

Title: Semantic Chunking and the Entropy of Natural Language

Authors: Weishun Zhong, Doron Sivan, Tankut Can, Mikhail Katkov, Misha Tsodyks
Subjects: cs.CL, cond-mat.dis-nn, cond-mat.stat-mech, cs.AI
Abstract URL: https://arxiv.org/abs/2602.13194
Pdf URL: https://arxiv.org/pdf/2602.13194
Copy Paste: [[2602.13194]] Semantic Chunking and the Entropy of Natural Language(https://arxiv.org/abs/2602.13194)
Keywords: language model, llm
Abstract: The entropy rate of printed English is famously estimated to be about one bit per character, a benchmark that modern large language models (LLMs) have only recently approached. This entropy rate implies that English contains nearly 80 percent redundancy relative to the five bits per character expected for random text. We introduce a statistical model that attempts to capture the intricate multi-scale structure of natural language, providing a first-principles account of this redundancy level. Our model describes a procedure of self-similarly segmenting text into semantically coherent chunks down to the single-word level. The semantic structure of the text can then be hierarchically decomposed, allowing for analytical treatment. Numerical experiments with modern LLMs and open datasets suggest that our model quantitatively captures the structure of real texts at different levels of the semantic hierarchy. The entropy rate predicted by our model agrees with the estimated entropy rate of printed English. Moreover, our theory further reveals that the entropy rate of natural language is not fixed but should increase systematically with the semantic complexity of corpora, which are captured by the only free parameter in our model.
摘要：众所周知，印刷英语的熵率估计约为每个字符一位，这是现代大型语言模型 (LLM) 最近才接近的基准。此熵率意味着相对于随机文本每个字符预期的 5 位，英语包含近 80% 的冗余。我们引入了一种统计模型，试图捕捉自然语言复杂的多尺度结构，提供这种冗余水平的第一性原理解释。我们的模型描述了将文本自相似地分割成语义连贯的块到单个单词级别的过程。然后可以对文本的语义结构进行分层分解，以便进行分析处理。现代法学硕士和开放数据集的数值实验表明，我们的模型定量地捕获了语义层次结构不同级别的真实文本的结构。我们的模型预测的熵率与印刷英语的估计熵率一致。此外，我们的理论进一步揭示，自然语言的熵率不是固定的，而是应该随着语料库的语义复杂性而系统地增加，这是由我们模型中唯一的自由参数捕获的。
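
The "nearly 80 percent redundancy" figure follows from comparing the estimated entropy rate with the entropy of uniformly random text: redundancy = 1 - H / H_max. The short check below reproduces the arithmetic. The 27-symbol alphabet (26 letters plus space) is an assumption used here to show where "five bits per character" roughly comes from; the abstract only states the rounded value.

```python
import math

def redundancy(entropy_rate_bits: float, max_bits: float) -> float:
    """Fraction of each character that is predictable: 1 - H / H_max."""
    return 1.0 - entropy_rate_bits / max_bits

h_english = 1.0                    # about one bit per character (from the abstract)
h_random_rounded = 5.0             # five bits per character (from the abstract)
h_random_27 = math.log2(27)        # assumed alphabet: 26 letters + space

print(f"log2(27) = {h_random_27:.3f} bits")
print(f"redundancy vs 5 bits:       {redundancy(h_english, h_random_rounded):.3f}")
print(f"redundancy vs log2(27) bits: {redundancy(h_english, h_random_27):.3f}")
```

Both baselines give a redundancy close to 0.8, which matches the abstract's "nearly 80 percent".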