2026-01-01

Title: Enriching Historical Records: An OCR and AI-Driven Approach for Database Integration

Authors: Zahra Abedi, Richard M.K. van Dijk, Gijs Wijnholds, Tessa Verhoef
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23710
Pdf URL: https://arxiv.org/pdf/2512.23710
Copy Paste: [[2512.23710]] Enriching Historical Records: An OCR and AI-Driven Approach for Database Integration(https://arxiv.org/abs/2512.23710)
Keywords: llm
Abstract: This research digitizes and analyzes the Leidse hoogleraren en lectoren 1575-1815 books written between 1983 and 1985, which contain biographic data about professors and curators of Leiden University. It addresses the central question: how can we design an automated pipeline that integrates OCR, LLM-based interpretation, and database linking to harmonize data from historical document images with existing high-quality database records? We applied OCR techniques, generative AI decoding constraints that structure data extraction, and database linkage methods to process typewritten historical records into a digital format. OCR achieved a Character Error Rate (CER) of 1.08 percent and a Word Error Rate (WER) of 5.06 percent, while JSON extraction from OCR text achieved an average accuracy of 63 percent and, based on annotated OCR, 65 percent. This indicates that generative AI somewhat corrects low OCR performance. Our record linkage algorithm linked annotated JSON files with 94% accuracy and OCR-derived JSON files with 81%. This study contributes to digital humanities research by offering an automated pipeline for interpreting digitized historical documents, addressing challenges like layout variability and terminology differences, and exploring the applicability and strength of an advanced generative AI model.
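
The Character Error Rate and Word Error Rate quoted in the abstract above are commonly computed as an edit distance divided by the length of the reference. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function names are illustrative, and the paper may normalize text differently before scoring.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character Error Rate: character-level edits per reference character."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word Error Rate: word-level edits per reference word."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

A single wrong character inside one word counts once toward CER but makes the whole word wrong for WER, which is one reason WER is usually the larger of the two figures.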

Title: CAT: A Metric-Driven Framework for Analyzing the Consistency-Accuracy Relation of LLMs under Controlled Input Variations

Authors: Paulo Cavalin, Cassia Sanctos, Marcelo Grave, Claudio Pinhanez, Yago Primerano
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23711
Pdf URL: https://arxiv.org/pdf/2512.23711
Copy Paste: [[2512.23711]] CAT: A Metric-Driven Framework for Analyzing the Consistency-Accuracy Relation of LLMs under Controlled Input Variations(https://arxiv.org/abs/2512.23711)
Keywords: language model, llm
Abstract: We introduce CAT, a framework designed to evaluate and visualize the interplay of accuracy and response consistency of Large Language Models (LLMs) under controllable input variations, using multiple-choice (MC) benchmarks as a case study. Current evaluation practices primarily focus on model capabilities such as accuracy or benchmark scores; more recently, consistency has come to be considered an essential property for deploying LLMs in high-stakes, real-world applications. We argue in this paper that although both dimensions should still be evaluated independently, their interdependency also needs to be considered for a more nuanced evaluation of LLMs. At the core of CAT are the Consistency-Accuracy Relation (CAR) curves, which visualize how model accuracy varies with increasing consistency requirements, as defined by the Minimum-Consistency Accuracy (MCA) metric. We further propose the Consistency-Oriented Robustness Estimate (CORE) index, a global metric that combines the area and shape of the CAR curve to quantify the trade-off between accuracy and consistency. We present a practical demonstration of our framework across a diverse set of generalist and domain-specific LLMs, evaluated on multiple MC benchmarks. We also outline how CAT can be extended beyond MC tasks to support long-form, open-ended evaluations through adaptable scoring functions.
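
The abstract above does not give the formula for MCA, so the sketch below rests on an assumed reading: an item counts as solved at consistency level c when the model answers it correctly in at least a fraction c of its input variations, and the CAR curve is that accuracy traced over increasing c. The names `mca` and `car_curve` and the data layout are hypothetical and may differ from the paper's definitions.

```python
def mca(correct, threshold):
    """Assumed minimum-consistency accuracy.

    correct: one list per benchmark item, holding a boolean per input
    variation (True if the model answered that variation correctly).
    """
    hits = sum(1 for item in correct if sum(item) / len(item) >= threshold)
    return hits / len(correct)


def car_curve(correct, thresholds):
    """Pairs of (consistency requirement, accuracy) for plotting."""
    return [(t, mca(correct, t)) for t in thresholds]


# Toy data: 4 items, 4 variations each.
results = [
    [True, True, True, True],
    [True, True, False, False],
    [False, False, False, False],
    [True, True, True, False],
]
curve = car_curve(results, [0.5, 0.75, 1.0])
```

Under this reading the curve can only fall or stay flat as the consistency requirement rises, which matches the abstract's description of accuracy varying with increasing consistency requirements.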

Title: STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability

Authors: Guanghui Wang, Jinze Yu, Xing Zhang, Dayuan Jiang, Yin Song, Tomal Deb, Xuefeng Liu, Peiyang He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23712
Pdf URL: https://arxiv.org/pdf/2512.23712
Copy Paste: [[2512.23712]] STED and Consistency Scoring: A Framework for Evaluating LLM Structured Output Reliability(https://arxiv.org/abs/2512.23712)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly deployed for structured data generation, yet output consistency remains critical for production applications. We introduce a comprehensive framework for evaluating and improving consistency in LLM-generated structured outputs. Our approach combines: (1) STED (Semantic Tree Edit Distance), a novel similarity metric balancing semantic flexibility with structural strictness when comparing JSON outputs, and (2) a consistency scoring framework aggregating multiple STED measurements across repeated generations to quantify reliability. Through systematic experiments on synthetic datasets with controlled schema, expression, and semantic variations, we demonstrate STED achieves superior performance (0.86-0.90 similarity for semantic equivalents, 0.0 for structural breaks) compared to existing metrics including TED, BERTScore, and DeepDiff. Applying our framework to benchmark six LLMs reveals significant variations: Claude-3.7-Sonnet demonstrates exceptional consistency, maintaining near-perfect structural reliability even at high temperatures (T = 0.9), while models like Claude-3-Haiku and Nova-Pro exhibit substantial degradation requiring careful tuning. Our framework enables practical applications including targeted model selection for structured tasks, iterative prompt refinement for reproducible results, and diagnostic analysis to identify inconsistency root causes. This work provides theoretical foundations and practical tools for ensuring reliable structured output generation in LLM-based production systems.

Title: PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents

Authors: Jahidul Islam, Md Ataullha, Saiful Azad
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23713
Pdf URL: https://arxiv.org/pdf/2512.23713
Copy Paste: [[2512.23713]] PyBangla at BLP-2025 Task 2: Enhancing Bangla-to-Python Code Generation with Iterative Self-Correction and Multilingual Agents(https://arxiv.org/abs/2512.23713)
Keywords: llm, prompt, agent
Abstract: LLMs excel at code generation from English prompts, but this progress has not extended to low-resource languages. We address Bangla-to-Python code generation by introducing BanglaCodeAct, an agent-based framework that leverages multi-agent prompting and iterative self-correction. Unlike prior approaches relying on task-specific fine-tuning, BanglaCodeAct employs an open-source multilingual LLM within a Thought-Code-Observation loop, enabling dynamic generation, testing, and refinement of code from Bangla instructions. We benchmark several small-parameter open-source LLMs and evaluate their effectiveness on the mHumanEval dataset for Bangla NL2Code. Our results show that Qwen3-8B, when deployed with BanglaCodeAct, achieves the best performance, with pass@1 accuracy of 94.0% on the development set and 71.6% on the blind test set. These results establish a new benchmark for Bangla-to-Python translation and highlight the potential of agent-based reasoning for reliable code generation in low-resource languages. Experimental scripts are publicly available at this http URL.
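
The Thought-Code-Observation loop mentioned in the abstract above can be pictured as a generate, run, and retry cycle. The sketch below is a generic illustration, not the BanglaCodeAct implementation: `generate` is a hypothetical stand-in for the LLM call, `toy_generate` is a hard-coded fake used only to make the example runnable, and the round limit is an assumption.

```python
import contextlib
import io


def run_code(code, tests):
    """Run candidate code and its assert-based tests; return (passed, observation)."""
    namespace = {}
    try:
        with contextlib.redirect_stdout(io.StringIO()):
            exec(code, namespace)   # define the candidate solution
            exec(tests, namespace)  # run the assertions against it
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, "all tests passed"


def thought_code_observation(instruction, tests, generate, max_rounds=3):
    """Ask for code, execute it, and feed the observation back until tests pass."""
    observation = None
    for round_no in range(1, max_rounds + 1):
        _thought, code = generate(instruction, observation)
        passed, observation = run_code(code, tests)
        if passed:
            return code, round_no
    return None, max_rounds


def toy_generate(instruction, observation):
    """Hypothetical fake model: wrong on the first try, fixed after feedback."""
    if observation is None:
        return "first attempt", "def add(a, b):\n    return a - b\n"
    return "the test failed, fix the operator", "def add(a, b):\n    return a + b\n"


solution, rounds = thought_code_observation(
    "instruction text (in Bangla in the real task)",
    "assert add(2, 3) == 5",
    toy_generate,
)
```

Calling `exec` on model output is only acceptable in a toy like this one; a real system would execute generated code in a sandboxed process with time and resource limits.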

Title: Noise-Driven Persona Formation in Reflexive Neural Language Generation

Authors: Toshiyuki Shigemura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23716
Pdf URL: https://arxiv.org/pdf/2512.23716
Copy Paste: [[2512.23716]] Noise-Driven Persona Formation in Reflexive Neural Language Generation(https://arxiv.org/abs/2512.23716)
Keywords: language model, llm
Abstract: This paper introduces the Luca-Noise Reflex Protocol (LN-RP), a computational framework for analyzing noise-driven persona emergence in large language models. By injecting stochastic noise seeds into the initial generation state, we observe nonlinear transitions in linguistic behavior across 152 generation cycles. Our results reveal three stable persona modes with distinct entropy signatures, and demonstrate that external noise sources can reliably induce phase transitions in reflexive generation dynamics. Quantitative evaluation confirms consistent persona retention and significant differences across modes (p < 0.01). The protocol provides a reproducible method for studying reflexive generation, emergent behavior, and long-range linguistic coherence in LLMs.

Title: HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate

Authors: Shenzhe Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23717
Pdf URL: https://arxiv.org/pdf/2512.23717
Copy Paste: [[2512.23717]] HarmTransform: Transforming Explicit Harmful Queries into Stealthy via Multi-Agent Debate(https://arxiv.org/abs/2512.23717)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are equipped with safety mechanisms to detect and block harmful queries, yet current alignment approaches primarily focus on overtly dangerous content and overlook more subtle threats. However, users can often disguise harmful intent through covert rephrasing that preserves malicious objectives while appearing benign, which creates a significant gap in existing safety training data. To address this limitation, we introduce HarmTransform, a multi-agent debate framework for systematically transforming harmful queries into stealthier forms while preserving their underlying harmful intent. Our framework leverages iterative critique and refinement among multiple agents to generate high-quality, covert harmful query transformations that can be used to improve future LLM safety alignment. Experiments demonstrate that HarmTransform significantly outperforms standard baselines in producing effective query transformations. At the same time, our analysis reveals that debate acts as a double-edged sword: while it can sharpen transformations and improve stealth, it may also introduce topic shifts and unnecessary complexity. These insights highlight both the promise and the limitations of multi-agent debate for generating comprehensive safety training data.

Title: Emergent World Beliefs: Exploring Transformers in Stochastic Games

Authors: Adam Kamel, Tanish Rastogi, Michael Ma, Kailash Ranganathan, Kevin Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23722
Pdf URL: https://arxiv.org/pdf/2512.23722
Copy Paste: [[2512.23722]] Emergent World Beliefs: Exploring Transformers in Stochastic Games(https://arxiv.org/abs/2512.23722)
Keywords: language model, gpt, llm
Abstract: Transformer-based large language models (LLMs) have demonstrated strong reasoning abilities across diverse fields, from solving programming challenges to competing in strategy-intensive games such as chess. Prior work has shown that LLMs can develop emergent world models in games of perfect information, where internal representations correspond to latent states of the environment. In this paper, we extend this line of investigation to domains of incomplete information, focusing on poker as a canonical partially observable Markov decision process (POMDP). We pretrain a GPT-style model on Poker Hand History (PHH) data and probe its internal activations. Our results demonstrate that the model learns both deterministic structure, such as hand ranks, and stochastic features, such as equity, without explicit instruction. Furthermore, using primarily nonlinear probes, we demonstrate that these representations are decodable and correlate with theoretical belief states, suggesting that LLMs are learning their own representation of the stochastic environment of Texas Hold'em Poker.

Title: When in Doubt, Deliberate: Confidence-Based Routing to Expert Debate for Sexism Detection

Authors: Anwar Alajmi, Gabriele Pergola
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23732
Pdf URL: https://arxiv.org/pdf/2512.23732
Copy Paste: [[2512.23732]] When in Doubt, Deliberate: Confidence-Based Routing to Expert Debate for Sexism Detection(https://arxiv.org/abs/2512.23732)
Keywords: prompt
Abstract: Sexist content online increasingly appears in subtle, context-dependent forms that evade traditional detection methods. Its interpretation often depends on overlapping linguistic, psychological, legal, and cultural dimensions, which produce mixed and sometimes contradictory signals, even in annotated datasets. These inconsistencies, combined with label scarcity and class imbalance, result in unstable decision boundaries and cause fine-tuned models to overlook subtler, underrepresented forms of harm. Together, these limitations point to the need for a design that explicitly addresses the combined effects of (i) underrepresentation, (ii) noise, and (iii) conceptual ambiguity in both data and model predictions. To address these challenges, we propose a two-stage framework that unifies (i) targeted training procedures to adapt supervision to scarce and noisy data with (ii) selective, reasoning-based inference to handle ambiguous or borderline cases. Our training setup applies class-balanced focal loss, class-aware batching, and post-hoc threshold calibration to mitigate label imbalance and noisy supervision. At inference time, a dynamic routing mechanism classifies high-confidence cases directly and escalates uncertain instances to a novel Collaborative Expert Judgment (CEJ) module, which prompts multiple personas and consolidates their reasoning through a judge model. Our approach achieves state-of-the-art results across several benchmarks, with a +2.72% improvement in F1 on the EXIST 2025 Task 1.1, and gains of +4.48% and +1.30% on the EDOS Tasks A and B, respectively.
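
The inference-time routing described in the abstract above (classify confident cases directly, escalate uncertain ones) can be sketched as a simple threshold rule. This is an illustration under assumptions, not the authors' mechanism: the 0.3 and 0.7 thresholds are made up, and `classifier` and `expert_debate` are hypothetical callables standing in for the fine-tuned model and the CEJ module.

```python
def route(text, classifier, expert_debate, low=0.3, high=0.7):
    """Return (label, path), where path is 'direct' or 'escalated'.

    classifier(text) is assumed to return P(sexist) in [0, 1];
    expert_debate(text) is assumed to return a final label string.
    """
    p_sexist = classifier(text)
    if p_sexist >= high:
        return "sexist", "direct"
    if p_sexist <= low:
        return "not sexist", "direct"
    # Uncertain band: hand the case to the slower deliberation stage.
    return expert_debate(text), "escalated"


# Toy stand-ins so the sketch runs end to end.
fake_scores = {"clear case": 0.95, "benign case": 0.05, "borderline case": 0.5}
decisions = {
    text: route(text, fake_scores.get, lambda t: "sexist")
    for text in fake_scores
}
```

Widening the band between `low` and `high` sends more traffic to the expensive stage, so in practice the thresholds would be tuned on held-out data.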

Title: Break Out the Silverware -- Semantic Understanding of Stored Household Items

Authors: Michaela Levi-Richter, Reuth Mirsky, Oren Glickman
Subjects: cs.CL, cs.AI, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2512.23739
Pdf URL: https://arxiv.org/pdf/2512.23739
Copy Paste: [[2512.23739]] Break Out the Silverware -- Semantic Understanding of Stored Household Items(https://arxiv.org/abs/2512.23739)
Keywords: language model, gpt, prompt, agent
Abstract: "Bring me a plate." For domestic service robots, this simple command reveals a complex challenge: inferring where everyday items are stored, often out of sight in drawers, cabinets, or closets. Despite advances in vision and manipulation, robots still lack the commonsense reasoning needed to complete this task. We introduce the Stored Household Item Challenge, a benchmark task for evaluating service robots' cognitive capabilities: given a household scene and a queried item, predict its most likely storage location. Our benchmark includes two datasets: (1) a real-world evaluation set of 100 item-image pairs with human-annotated ground truth from participants' kitchens, and (2) a development set of 6,500 item-image pairs annotated with storage polygons over public kitchen images. These datasets support realistic modeling of household organization and enable comparative evaluation across agent architectures. To begin tackling this challenge, we introduce NOAM (Non-visible Object Allocation Model), a hybrid agent pipeline that combines structured scene understanding with large language model inference. NOAM converts visual input into natural language descriptions of spatial context and visible containers, then prompts a language model (e.g., GPT-4) to infer the most likely hidden storage location. This integrated vision-language agent exhibits emergent commonsense reasoning and is designed for modular deployment within broader robotic systems. We evaluate NOAM against baselines including random selection, vision-language pipelines (Grounding-DINO + SAM), leading multimodal models (e.g., Gemini, GPT-4o, Kosmos-2, LLaMA, Qwen), and human performance. NOAM significantly improves prediction accuracy and approaches human-level results, highlighting best practices for deploying cognitively capable agents in domestic environments.

Title: Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning

Authors: Tiancheng Su, Meicong Zhang, Guoxiu He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23765
Pdf URL: https://arxiv.org/pdf/2512.23765
Copy Paste: [[2512.23765]] Entropy-Aware Speculative Decoding Toward Improved LLM Reasoning(https://arxiv.org/abs/2512.23765)
Keywords: language model, llm
Abstract: Speculative decoding (SD) accelerates large language model (LLM) reasoning by using a small draft model to generate candidate tokens, which the target LLM either accepts directly or regenerates upon rejection. However, excessive alignment between the draft and target models constrains SD to the performance of the target LLM. To address this limitation, we propose Entropy-Aware Speculative Decoding (EASD), a training-free enhancement. Building on standard SD, EASD incorporates a dynamic entropy-based penalty. At each decoding step, we employ the entropy of the sampling distribution to quantify model uncertainty. When both models exhibit high entropy with substantial overlap among their top-N predictions, the corresponding token is rejected and re-sampled by the target LLM. This penalty prevents low-confidence errors from propagating. By incorporating draft-model verification, EASD enables the possibility of surpassing the target model's inherent performance. Experiments across multiple reasoning benchmarks demonstrate that EASD consistently outperforms existing SD methods and, in most cases, surpasses the target LLM itself. We further prove that the efficiency of EASD is comparable to that of SD. The code can be found in the Supplementary Materials.
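
The rejection rule described in the abstract above (reject a drafted token when both models are uncertain and their top-N predictions overlap substantially) can be sketched as follows. This is one possible reading, not the authors' code: the entropy threshold, the value of N, and the overlap count are illustrative assumptions, and the abstract does not say how these quantities are set.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def top_n(probs, n):
    """Indices of the n most probable tokens."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return set(ranked[:n])


def should_reject(draft_probs, target_probs,
                  entropy_threshold=1.0, n=3, min_overlap=2):
    """True when both models are uncertain and agree on the likely candidates."""
    both_uncertain = (entropy(draft_probs) > entropy_threshold
                      and entropy(target_probs) > entropy_threshold)
    overlap = len(top_n(draft_probs, n) & top_n(target_probs, n))
    return both_uncertain and overlap >= min_overlap


flat_draft = [0.30, 0.25, 0.20, 0.15, 0.10]
flat_target = [0.28, 0.27, 0.10, 0.20, 0.15]
peaked = [0.97, 0.01, 0.01, 0.005, 0.005]
```

In the toy distributions, the two flat ones trigger a rejection (the target model would then re-sample the token), while a sharply peaked pair does not.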

Title: MiMo-Audio: Audio Language Models are Few-Shot Learners

Authors: Xiaomi LLM-Core Team: Dong Zhang, Gang Wang, Jinlong Xue, Kai Fang, Liang Zhao, Rui Ma, Shuhuai Ren, Shuo Liu, Tao Guo, Weiji Zhuang, Xin Zhang, Xingchen Song, Yihan Yan, Yongzhe He, Cici, Bowen Shen, Chengxuan Zhu, Chong Ma, Chun Chen, Heyu Chen, Jiawei Li, Lei Li, Menghang Zhu, Peidian Li, Qiying Wang, Sirui Deng, Weimin Xiong, Wenshan Huang, Wenyu Yang, Yilin Jiang, Yixin Yang, Yuanyuan Tian, Yue Ma, Yue Yu, Zihan Zhang, Zihao Yue, Bangjun Xiao, Bingquan Xia, Bofei Gao, Bowen Ye, Can Cai, Chang Liu, Chenhong He, Chunan Li, Dawei Zhu, Duo Zhang, Fengyuan Shi, Guoan Wang, Hailin Zhang, Hanglong Lv, Hanyu Li, Hao Tian, Heng Qu, Hongshen Xu, Houbin Zhang, Huaqiu Liu, Jiangshan Duo, Jianguang Zuo, Jianyu Wei, Jiebao Xiao, Jinhao Dong, Jun Shi, Junhao Hu, Kainan Bao, Kang Zhou, Linghao Zhang, Meng Chen, Nuo Chen, Peng Zhang, Qianli Chen, Qiantong Wang, Rang Li, Shaohui Liu, Shengfan Wang, Shicheng Li, Shihua Yu, Shijie Cao, Shimao Chen, Shuhao Gu, Weikun Wang, Wenhan Ma, Xiangwei Deng, Xing Yong, Xing Zhang, Xu Wang, Yifan Song, Yihao Zhao, Yingbo Zhao, Yizhao Gao, Yu Cheng, Yu Tu, Yudong Wang, Zhaojun Huang, Zhengju Tang, Zhenru Lin, Zhichao Song, Zhipeng Xu, Zhixian Zheng, Zihan Jiang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2512.23808
Pdf URL: https://arxiv.org/pdf/2512.23808
Copy Paste: [[2512.23808]] MiMo-Audio: Audio Language Models are Few-Shot Learners(https://arxiv.org/abs/2512.23808)
Keywords: language model, gpt
Abstract: Existing audio language models typically rely on task-specific fine-tuning to accomplish particular audio tasks. In contrast, humans are able to generalize to new audio tasks with only a few examples or simple instructions. GPT-3 has shown that scaling next-token prediction pretraining enables strong generalization capabilities in text, and we believe this paradigm is equally applicable to the audio domain. By scaling MiMo-Audio's pretraining data to over one hundred million hours, we observe the emergence of few-shot learning capabilities across a diverse set of audio tasks. We develop a systematic evaluation of these capabilities and find that MiMo-Audio-7B-Base achieves SOTA performance on both speech intelligence and audio understanding benchmarks among open-source models. Beyond standard metrics, MiMo-Audio-7B-Base generalizes to tasks absent from its training data, such as voice conversion, style transfer, and speech editing. MiMo-Audio-7B-Base also demonstrates powerful speech continuation capabilities, capable of generating highly realistic talk shows, recitations, livestreaming and debates. At the post-training stage, we curate a diverse instruction-tuning corpus and introduce thinking mechanisms into both audio understanding and generation. MiMo-Audio-7B-Instruct achieves open-source SOTA on audio understanding benchmarks (MMSU, MMAU, MMAR, MMAU-Pro), spoken dialogue benchmarks (Big Bench Audio, MultiChallenge Audio) and instruct-TTS evaluations, approaching or surpassing closed-source models. Model checkpoints and full evaluation suite are available at this https URL.

Title: StressRoBERTa: Cross-Condition Transfer Learning from Depression, Anxiety, and PTSD to Stress Detection

Authors: Amal Alqahtani, Efsun Kayi, Mona Diab
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23813
Pdf URL: https://arxiv.org/pdf/2512.23813
Copy Paste: [[2512.23813]] StressRoBERTa: Cross-Condition Transfer Learning from Depression, Anxiety, and PTSD to Stress Detection(https://arxiv.org/abs/2512.23813)
Keywords: language model
Abstract: The prevalence of chronic stress represents a significant public health concern, with social media platforms like Twitter serving as important venues for individuals to share their experiences. This paper introduces StressRoBERTa, a cross-condition transfer learning approach for automatic detection of self-reported chronic stress in English tweets. The investigation examines whether continual training on clinically related conditions (depression, anxiety, PTSD), disorders with high comorbidity with chronic stress, improves stress detection compared to general language models and broad mental health models. RoBERTa is continually trained on the Stress-SMHD corpus (108M words from users with self-reported diagnoses of depression, anxiety, and PTSD) and fine-tuned on the SMM4H 2022 Task 8 dataset. StressRoBERTa achieves 82% F1-score, outperforming the best shared task system (79% F1) by 3 percentage points. The results demonstrate that focused cross-condition transfer from stress-related disorders (+1% F1 over vanilla RoBERTa) provides stronger representations than general mental health training. Evaluation on Dreaddit (81% F1) further demonstrates transfer from clinical mental health contexts to situational stress discussions.

Title: Retrieval Augmented Question Answering: When Should LLMs Admit Ignorance?

Authors: Dingmin Wang, Ji Ma, Shankar Kumar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23836
Pdf URL: https://arxiv.org/pdf/2512.23836
Copy Paste: [[2512.23836]] Retrieval Augmented Question Answering: When Should LLMs Admit Ignorance?(https://arxiv.org/abs/2512.23836)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: The success of expanded context windows in Large Language Models (LLMs) has driven increased use of broader context in retrieval-augmented generation. We investigate the use of LLMs for retrieval-augmented question answering. While longer contexts make it easier to incorporate targeted knowledge, they introduce more irrelevant information that hinders the model's generation process and degrades its performance. To address the issue, we design an adaptive prompting strategy which involves splitting the retrieved information into smaller chunks and sequentially prompting an LLM to answer the question using each chunk. Adjusting the chunk size allows a trade-off between incorporating relevant information and reducing irrelevant information. Experimental results on three open-domain question answering datasets demonstrate that the adaptive strategy matches the performance of standard prompting while using fewer tokens. Our analysis reveals that when encountering insufficient information, the LLM often generates incorrect answers instead of declining to respond, which constitutes a major source of error. This finding highlights the need for further research into enhancing LLMs' ability to effectively decline requests when faced with inadequate information.
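
The chunk-by-chunk prompting strategy described in the abstract above can be sketched as a loop over slices of the retrieved passages. The sketch rests on assumptions the abstract does not confirm: it stops at the first chunk that yields an answer, it treats the string "unknown" as the abstention signal, and `ask` is a hypothetical stand-in for the LLM call.

```python
def chunked(passages, chunk_size):
    """Split the retrieved passages into consecutive chunks."""
    return [passages[i:i + chunk_size]
            for i in range(0, len(passages), chunk_size)]


def answer_sequentially(question, passages, ask, chunk_size=2, abstain="unknown"):
    """Prompt once per chunk; return (answer, number of chunks used)."""
    used = 0
    for chunk in chunked(passages, chunk_size):
        used += 1
        answer = ask(question, chunk)
        if answer != abstain:
            return answer, used
    return abstain, used


def toy_ask(question, chunk):
    """Hypothetical fake model: answers only if the chunk holds the fact."""
    if any("capital of France is Paris" in passage for passage in chunk):
        return "Paris"
    return "unknown"


retrieved = [
    "Berlin is in Germany.",
    "Rome is in Italy.",
    "The capital of France is Paris.",
    "Madrid is in Spain.",
]
result = answer_sequentially("What is the capital of France?", retrieved, toy_ask)
```

The toy model abstains honestly when a chunk lacks the fact. The abstract reports that real LLMs often answer incorrectly instead, in which case an early-stopping loop like this one would return the wrong answer from an uninformative chunk.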

Title: Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation

Authors: Kaustubh Dhole
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23837
Pdf URL: https://arxiv.org/pdf/2512.23837
Copy Paste: [[2512.23837]] Adversarial Lens: Exploiting Attention Layers to Generate Adversarial Examples for Evaluation(https://arxiv.org/abs/2512.23837)
Keywords: llm, prompt
Abstract: Recent advances in mechanistic interpretability suggest that intermediate attention layers encode token-level hypotheses that are iteratively refined toward the final output. In this work, we exploit this property to generate adversarial examples directly from attention-layer token distributions. Unlike prompt-based or gradient-based attacks, our approach leverages model-internal token predictions, producing perturbations that are both plausible and internally consistent with the model's own generation process. We evaluate whether tokens extracted from intermediate layers can serve as effective adversarial perturbations for downstream evaluation tasks. We conduct experiments on argument quality assessment using the ArgQuality dataset, with LLaMA-3.1-Instruct-8B serving as both the generator and evaluator. Our results show that attention-based adversarial examples lead to measurable drops in evaluation performance while remaining semantically similar to the original inputs. However, we also observe that substitutions drawn from certain layers and token positions can introduce grammatical degradation, limiting their practical effectiveness. Overall, our findings highlight both the promise and current limitations of using intermediate-layer representations as a principled source of adversarial examples for stress-testing LLM-based evaluation pipelines.
摘要：机械可解释性的最新进展表明，中间注意力层对令牌级假设进行编码，并针对最终输出进行迭代细化。在这项工作中，我们利用这个属性直接从注意力层令牌分布生成对抗性示例。与基于提示或基于梯度的攻击不同，我们的方法利用模型内部令牌预测，产生既合理又与模型自身生成过程内部一致的扰动。我们评估从中间层提取的令牌是否可以作为下游评估任务的有效对抗性扰动。我们使用 ArgQuality 数据集进行论证质量评估实验，LLaMA-3.1-Instruct-8B 既充当生成器又充当评估器。我们的结果表明，基于注意力的对抗性示例会导致评估性能明显下降，同时在语义上与原始输入保持相似。然而，我们还观察到，从某些层和标记位置提取的替换可能会导致语法退化，从而限制其实际有效性。总的来说，我们的研究结果强调了使用中间层表示作为基于 LLM 的评估管道的压力测试的对抗性示例的原则来源的前景和当前的局限性。

Title: Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs

Authors: Yukun Zhang, Stefan Elbl Droguett, Samyak Jain
Subjects: cs.CL, cs.CE, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23848
Pdf URL: https://arxiv.org/pdf/2512.23848
Copy Paste: [[2512.23848]] Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs(https://arxiv.org/abs/2512.23848)
Keywords: language model, llm, hallucination, prompt
Abstract: This research project addresses errors in financial numerical reasoning Question Answering (QA) tasks that stem from a lack of domain knowledge in finance. Despite recent advances in Large Language Models (LLMs), financial numerical questions remain challenging because they require specific domain knowledge in finance and complex multi-step numeric reasoning. We implement a multi-retriever Retrieval-Augmented Generation (RAG) system to retrieve both external domain knowledge and internal question contexts, and utilize the latest LLM to tackle these tasks. Through comprehensive ablation experiments and error analysis, we find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model surpassing the FinQA paper's top model, which serves as our baseline. This suggests the potential superior performance of domain-specific training. Furthermore, our best prompt-based LLM generator achieves state-of-the-art (SOTA) performance with significant improvement (>7%), yet it is still below human expert performance. This study highlights the trade-off between hallucination loss and external knowledge gains in smaller models and few-shot examples. For larger models, the gains from external facts typically outweigh the hallucination loss. Finally, our findings confirm the enhanced numerical reasoning capabilities of the latest LLM, optimized for few-shot learning.
摘要：该研究项目解决了由于缺乏金融领域知识而导致的金融数字推理问答（QA）任务中的错误。尽管大型语言模型 (LLM) 最近取得了进展，但金融数字问题仍然具有挑战性，因为它们需要金融领域的特定领域知识和复杂的多步骤数字推理。我们实现了一个多检索器检索增强生成器（RAG）系统来检索外部领域知识和内部问题上下文，并利用最新的法学硕士来解决这些任务。通过全面的消融实验和误差分析，我们发现使用 SecBERT 编码器进行特定领域的训练对我们的最佳神经符号模型做出了显着贡献，超越了 FinQA 论文的顶级模型（作为我们的基线）。这表明特定领域训练的潜在卓越性能。此外，我们最好的基于提示的 LLM 生成器实现了最先进的 (SOTA) 性能，并显着提高 (>7%)，但它仍然低于人类专家的性能。这项研究强调了较小模型和少数样本中幻觉损失和外部知识增益之间的权衡。对于较大的模型，外部事实的收益通常超过幻觉的损失。最后，我们的研究结果证实了最新法学硕士的增强的数字推理能力，并针对小样本学习进行了优化。

Title: Disentangling Learning from Judgment: Representation Learning for Open Response Analytics

Authors: Conrad Borchers, Manit Patel, Seiyon M. Lee, Anthony F. Botelho
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2512.23941
Pdf URL: https://arxiv.org/pdf/2512.23941
Copy Paste: [[2512.23941]] Disentangling Learning from Judgment: Representation Learning for Open Response Analytics(https://arxiv.org/abs/2512.23941)
Keywords: prompt
Abstract: Open-ended responses are central to learning, yet automated scoring often conflates what students wrote with how teachers grade. We present an analytics-first framework that separates content signals from rater tendencies, making judgments visible and auditable via analytics. Using de-identified ASSISTments mathematics responses, we model teacher histories as dynamic priors and derive text representations from sentence embeddings, incorporating centering and residualization to mitigate prompt and teacher confounds. Temporally-validated linear models quantify the contributions of each signal, and a projection surfaces model disagreements for qualitative inspection. Results show that teacher priors heavily influence grade predictions; the strongest results arise when priors are combined with content embeddings (AUC~0.815), while content-only models remain above chance but substantially weaker (AUC~0.626). Adjusting for rater effects sharpens the residual content representation, retaining more informative embedding dimensions and revealing cases where semantic evidence supports understanding as opposed to surface-level differences in how students respond. The contribution presents a practical pipeline that transforms embeddings from mere features into learning analytics for reflection, enabling teachers and researchers to examine where grading practices align (or conflict) with evidence of student reasoning and learning.
摘要：开放式回答是学习的核心，但自动评分常常将学生所写的内容与教师的评分方式混为一谈。我们提出了一个分析优先的框架，将内容信号与评估者倾向分开，通过分析使判断可见且可审计。使用去识别化的 ASSISTments 数学答案，我们将教师历史建模为动态先验，并从句子嵌入中导出文本表示，结合居中和残差来减轻提示和教师的混淆。时间验证的线性模型量化每个信号的贡献，并投影表面模型分歧以进行定性检查。结果表明，教师先验对成绩预测有很大影响；当先验与内容嵌入相结合时，会产生最强的结果（AUC~0.815），而仅内容模型仍然高于机会，但要弱得多（AUC~0.626）。针对评分者效应进行调整可以锐化剩余内容表示，保留更多信息嵌入维度，并揭示语义证据支持理解的案例，而不是学生反应方式的表面差异。该贡献提出了一个实用的管道，将嵌入从单纯的特征转变为用于反思的学习分析，使教师和研究人员能够检查评分实践与学生推理和学习的证据是否一致（或冲突）。

Title: Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling

Authors: Chulun Zhou, Chunkang Zhang, Guoxin Yu, Fandong Meng, Jie Zhou, Wai Lam, Mo Yu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23959
Pdf URL: https://arxiv.org/pdf/2512.23959
Copy Paste: [[2512.23959]] Improving Multi-step RAG with Hypergraph-based Memory for Long-Context Complex Relational Modeling(https://arxiv.org/abs/2512.23959)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Multi-step retrieval-augmented generation (RAG) has become a widely adopted strategy for enhancing large language models (LLMs) on tasks that demand global comprehension and intensive reasoning. Many RAG systems incorporate a working memory module to consolidate retrieved information. However, existing memory designs function primarily as passive storage that accumulates isolated facts for the purpose of condensing the lengthy inputs and generating new sub-queries through deduction. This static nature overlooks the crucial high-order correlations among primitive facts, the compositions of which can often provide stronger guidance for subsequent steps. Therefore, their representational strength and impact on multi-step reasoning and knowledge evolution are limited, resulting in fragmented reasoning and weak global sense-making capacity in extended contexts. We introduce HGMem, a hypergraph-based memory mechanism that extends the concept of memory beyond simple storage into a dynamic, expressive structure for complex reasoning and global understanding. In our approach, memory is represented as a hypergraph whose hyperedges correspond to distinct memory units, enabling the progressive formation of higher-order interactions within memory. This mechanism connects facts and thoughts around the focal problem, evolving into an integrated and situated knowledge structure that provides strong propositions for deeper reasoning in subsequent steps. We evaluate HGMem on several challenging datasets designed for global sense-making. Extensive experiments and in-depth analyses show that our method consistently improves multi-step RAG and substantially outperforms strong baseline systems across diverse tasks.
摘要：多步检索增强生成（RAG）已成为一种广泛采用的策略，用于在需要全局理解和强化推理的任务上增强大型语言模型（LLM）。许多 RAG 系统都包含一个工作内存模块来整合检索到的信息。然而，现有的存储器设计主要用作被动存储，其积累孤立的事实，以压缩冗长的输入并通过推导生成新的子查询。这种静态性质忽视了原始事实之间至关重要的高阶相关性，而这些事实的组合通常可以为后续步骤提供更有力的指导。因此，它们的表征强度以及对多步推理和知识演化的影响有限，导致推理碎片化和扩展上下文中的全局意义构建能力较弱。我们引入了 HGMem，一种基于超图的内存机制，它将内存的概念从简单的存储扩展到动态的、可表达的结构，以实现复杂的推理和全局理解。在我们的方法中，记忆被表示为超图，其超边对应于不同的记忆单元，从而能够在记忆内逐步形成高阶交互。这种机制将围绕焦点问题的事实和思想联系起来，演变成一个集成的、情境化的知识结构，为后续步骤中更深入的推理提供强有力的命题。我们在几个专为全球意义构建而设计的具有挑战性的数据集上评估了 HGMem。大量的实验和深入的分析表明，我们的方法持续改进了多步 RAG，并且在不同的任务中显着优于强大的基线系统。

Title: Efficient Context Scaling with LongCat ZigZag Attention

Authors: Chen Zhang, Yang Bai, Jiahuan Li, Anchun Gui, Keheng Wang, Feifan Liu, Guanyu Wu, Yuwei Jiang, Defei Bu, Li Wei, Haihang Jing, Hongyin Tang, Xin Chen, Xiangzhou Huang, Fengcun Li, Rongxiang Weng, Yulei Qian, Yifan Lu, Yerui Sun, Jingang Wang, Yuchen Xie, Xunliang Cai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.23966
Pdf URL: https://arxiv.org/pdf/2512.23966
Copy Paste: [[2512.23966]] Efficient Context Scaling with LongCat ZigZag Attention(https://arxiv.org/abs/2512.23966)
Keywords: retrieval-augmented generation, agent
Abstract: We introduce LongCat ZigZag Attention (LoZA), a sparse attention scheme designed to transform any existing full-attention model into a sparse version with a rather limited compute budget. In long-context scenarios, LoZA can achieve significant speed-ups in both prefill-intensive (e.g., retrieval-augmented generation) and decode-intensive (e.g., tool-integrated reasoning) cases. Specifically, by applying LoZA to LongCat-Flash during mid-training, we serve LongCat-Flash-Exp as a long-context foundation model that can swiftly process up to 1 million tokens, enabling efficient long-term reasoning and long-horizon agentic capabilities.
摘要：我们引入了 LongCat ZigZag Attention (LoZA)，这是一种稀疏注意力方案，旨在将任何现有的全注意力模型转换为计算预算相当有限的稀疏版本。在长上下文场景中，LoZA 可以在预填充密集型（例如检索增强生成）和解码密集型（例如工具集成推理）情况下实现显着的加速。具体来说，通过在训练中期将LoZA应用于LongCat-Flash，我们将LongCat-Flash-Exp作为长上下文基础模型，可以快速处理多达100万个令牌，从而实现高效的长期推理和长期代理能力。

Title: CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards

Authors: Zhiming Lin, Kai Zhao, Sophie Zhang, Peilai Yu, Canran Xiao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.23971
Pdf URL: https://arxiv.org/pdf/2512.23971
Copy Paste: [[2512.23971]] CEC-Zero: Zero-Supervision Character Error Correction with Self-Generated Rewards(https://arxiv.org/abs/2512.23971)
Keywords: llm
Abstract: Large-scale Chinese spelling correction (CSC) remains critical for real-world text processing, yet existing LLMs and supervised methods lack robustness to novel errors and rely on costly annotations. We introduce CEC-Zero, a zero-supervision reinforcement learning framework that addresses this by enabling LLMs to correct their own mistakes. CEC-Zero synthesizes errorful inputs from clean text, computes cluster-consensus rewards via semantic similarity and candidate agreement, and optimizes the policy with PPO. It outperforms supervised baselines by 10-13 F1 points and strong LLM fine-tunes by 5-8 points across 9 benchmarks, with theoretical guarantees of unbiased rewards and convergence. CEC-Zero establishes a label-free paradigm for robust, scalable CSC, unlocking LLM potential in noisy text pipelines.
摘要：大规模中文拼写纠正（CSC）对于现实世界的文本处理仍然至关重要，但现有的法学硕士和监督方法缺乏对新错误的鲁棒性，并且依赖于昂贵的注释。我们推出了 CEC-Zero，这是一种零监督强化学习框架，它通过使法学硕士能够纠正自己的错误来解决这个问题。 CEC-Zero 从干净文本中合成错误输入，通过语义相似性和候选一致性计算集群共识奖励，并使用 PPO 优化策略。它在 9 个基准中比监督基线高出 10--13 F$_1$ 点，并且在 9 个基准中比 LLM 的强大微调高出 5--8 点，并在理论上保证了公正的奖励和收敛。 CEC-Zero 为稳健、可扩展的 CSC 建立了无标签范例，在嘈杂的文本管道中释放了 LLM 的潜力。
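
Sketch: two ingredients named in the CEC-Zero abstract, error synthesis from clean text and an agreement-based reward, can be illustrated as below. This is a simplified illustration under stated assumptions, not the paper's method: the `confusions` map is hypothetical, and exact string match stands in for the semantic-similarity clustering the abstract mentions.

```python
import random
from collections import Counter
from typing import Dict, List


def corrupt(text: str, confusions: Dict[str, str], rate: float,
            rng: random.Random) -> str:
    """Replace confusable characters with probability `rate` to synthesize errors."""
    out = []
    for ch in text:
        if ch in confusions and rng.random() < rate:
            out.append(confusions[ch])  # swap in the confusable character
        else:
            out.append(ch)
    return "".join(out)


def agreement_rewards(candidates: List[str]) -> List[float]:
    """Reward each sampled correction by the share of samples that agree with it.

    Assumption: candidates are grouped by exact string match; a real system
    would cluster by semantic similarity instead.
    """
    counts = Counter(candidates)
    total = len(candidates)
    return [counts[c] / total for c in candidates]
```

The rewards need no gold labels: a correction scores highly only when many independently sampled corrections converge on it.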

Title: Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process

Authors: Zhenyu Zhang, Shujian Zhang, John Lambert, Wenxuan Zhou, Zhangyang Wang, Mingqing Chen, Andrew Hard, Rajiv Mathews, Lun Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.23988
Pdf URL: https://arxiv.org/pdf/2512.23988
Copy Paste: [[2512.23988]] Fantastic Reasoning Behaviors and Where to Find Them: Unsupervised Discovery of the Reasoning Process(https://arxiv.org/abs/2512.23988)
Keywords: language model, llm, chain-of-thought
Abstract: Despite the growing reasoning capabilities of recent large language models (LLMs), their internal mechanisms during the reasoning process remain underexplored. Prior approaches often rely on human-defined concepts (e.g., overthinking, reflection) at the word level to analyze reasoning in a supervised manner. However, such methods are limited, as it is infeasible to capture the full spectrum of potential reasoning behaviors, many of which are difficult to define in token space. In this work, we propose an unsupervised framework (namely, RISE: Reasoning behavior Interpretability via Sparse auto-Encoder) for discovering reasoning vectors, which we define as directions in the activation space that encode distinct reasoning behaviors. By segmenting chain-of-thought traces into sentence-level 'steps' and training sparse auto-encoders (SAEs) on step-level activations, we uncover disentangled features corresponding to interpretable behaviors such as reflection and backtracking. Visualization and clustering analyses show that these behaviors occupy separable regions in the decoder column space. Moreover, targeted interventions on SAE-derived vectors can controllably amplify or suppress specific reasoning behaviors, altering inference trajectories without retraining. Beyond behavior-specific disentanglement, SAEs capture structural properties such as response length, revealing clusters of long versus short reasoning traces. More interestingly, SAEs enable the discovery of novel behaviors beyond human supervision. We demonstrate the ability to control response confidence by identifying confidence-related vectors in the SAE decoder space. These findings underscore the potential of unsupervised latent discovery for both interpreting and controllably steering reasoning in LLMs.
摘要：尽管最近大型语言模型（LLM）的推理能力不断增强，但其推理过程中的内部机制仍未得到充分探索。先前的方法通常依赖于人类在单词级别定义的概念（例如，过度思考、反思）来以监督的方式分析推理。然而，此类方法是有限的，因为不可能捕获全部潜在推理行为，其中许多行为很难在标记空间中定义。在这项工作中，我们提出了一个无监督框架（即 RISE：通过稀疏自动编码器的推理行为可解释性）来发现推理向量，我们将其定义为激活空间中编码不同推理行为的方向。通过将思想链轨迹分割成句子级“步骤”并在步骤级激活上训练稀疏自动编码器（SAE），我们发现了与可解释行为（例如反射和回溯）相对应的解开特征。可视化和聚类分析表明这些行为占据解码器列空间中的可分离区域。此外，对 SAE 衍生向量进行有针对性的干预可以可控地放大或抑制特定的推理行为，无需重新训练即可改变推理轨迹。除了特定行为的解开之外，SAE 还捕获响应长度等结构特性，揭示长推理痕迹和短推理痕迹的集群。更有趣的是，SAE 能够发现人类监督之外的新颖行为。我们通过识别 SAE 解码器空间中的置信度相关向量来展示控制响应置信度的能力。这些发现强调了无监督潜在发现对于法学硕士解释和可控引导推理的潜力。

Title: iCLP: Large Language Model Reasoning with Implicit Cognition Latent Planning

Authors: Sijia Chen, Di Niu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24014
Pdf URL: https://arxiv.org/pdf/2512.24014
Copy Paste: [[2512.24014]] iCLP: Large Language Model Reasoning with Implicit Cognition Latent Planning(https://arxiv.org/abs/2512.24014)
Keywords: language model, llm, hallucination, chain-of-thought
Abstract: Large language models (LLMs), when guided by explicit textual plans, can perform reliable step-by-step reasoning during problem-solving. However, generating accurate and effective textual plans remains challenging due to LLM hallucinations and the high diversity of task-specific questions. To address this, we draw inspiration from human Implicit Cognition (IC), the subconscious process by which decisions are guided by compact, generalized patterns learned from past experiences without requiring explicit verbalization. We propose iCLP, a novel framework that enables LLMs to adaptively generate latent plans (LPs), which are compact encodings of effective reasoning instructions. iCLP first distills explicit plans from existing step-by-step reasoning trajectories. It then learns discrete representations of these plans via a vector-quantized autoencoder coupled with a codebook. Finally, by fine-tuning LLMs on paired latent plans and corresponding reasoning steps, the models learn to perform implicit planning during reasoning. Experimental results on mathematical reasoning and code generation tasks demonstrate that, with iCLP, LLMs can plan in latent space while reasoning in language space. This approach yields significant improvements in both accuracy and efficiency and, crucially, demonstrates strong cross-domain generalization while preserving the interpretability of chain-of-thought reasoning.
摘要：大型语言模型（LLM）在明确的文本计划的指导下，可以在解决问题期间执行可靠的逐步推理。然而，由于法学硕士的幻觉和特定任务问题的高度多样性，生成准确有效的文本计划仍然具有挑战性。为了解决这个问题，我们从人类的内隐认知（IC）中汲取灵感，这是一种潜意识过程，通过从过去的经验中学到的紧凑、概括的模式来指导决策，而不需要明确的语言表达。我们提出了 iCLP，这是一种新颖的框架，使 LLM 能够自适应地生成潜在计划（LP），这是有效推理指令的紧凑编码。 iCLP 首先从现有的逐步推理轨迹中提炼出明确的计划。然后，它通过与密码本相结合的矢量量化自动编码器来学习这些计划的离散表示。最后，通过对配对潜在计划和相应推理步骤上的 LLM 进行微调，模型学会在推理过程中执行隐式计划。数学推理和代码生成任务的实验结果表明，通过 iCLP，法学硕士可以在潜在空间中进行规划，同时在语言空间中进行推理。这种方法在准确性和效率方面都取得了显着的提高，最重要的是，它展示了强大的跨领域泛化能力，同时保留了思想链推理的可解释性。
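
Sketch: the abstract says iCLP learns discrete plan representations with a vector-quantized autoencoder and a codebook. The core quantization step, mapping a continuous vector to its nearest codebook entry, is shown below. The toy codebook and the squared Euclidean distance are assumptions for illustration; the abstract does not specify them.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector: Sequence[float],
             codebook: List[Sequence[float]]) -> Tuple[int, Sequence[float]]:
    """Return the index and value of the codebook entry closest to `vector`."""
    index = min(range(len(codebook)),
                key=lambda k: squared_distance(vector, codebook[k]))
    return index, codebook[index]


# Hypothetical 2-D codebook with three entries.
toy_codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
code_index, code_vector = quantize([0.9, 1.2], toy_codebook)
```

The returned index is the discrete "latent plan" token in this simplified picture; the model would then condition its reasoning steps on that token.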

Title: Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models

Authors: Rohit Kumar Salla, Manoj Saravanan, Shrikar Reddy Kota
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.24058
Pdf URL: https://arxiv.org/pdf/2512.24058
Copy Paste: [[2512.24058]] Beyond Hallucinations: A Composite Score for Measuring Reliability in Open-Source Large Language Models(https://arxiv.org/abs/2512.24058)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) like LLaMA, Mistral, and Gemma are increasingly used in decision-critical domains such as healthcare, law, and finance, yet their reliability remains uncertain. They often make overconfident errors, degrade under input shifts, and lack clear uncertainty estimates. Existing evaluations are fragmented, addressing only isolated aspects. We introduce the Composite Reliability Score (CRS), a unified framework that integrates calibration, robustness, and uncertainty quantification into a single interpretable metric. Through experiments on ten leading open-source LLMs across five QA datasets, we assess performance under baselines, perturbations, and calibration methods. CRS delivers stable model rankings, uncovers hidden failure modes missed by single metrics, and highlights that the most dependable systems balance accuracy, robustness, and calibrated uncertainty.
摘要：LLaMA、Mistral 和 Gemma 等大型语言模型 (LLM) 越来越多地用于医疗保健、法律和金融等决策关键领域，但其可靠性仍然不确定。他们经常犯过度自信的错误，在输入变化下退化，并且缺乏明确的不确定性估计。现有的评价是分散的，仅涉及孤立的方面。我们引入了综合可靠性评分（CRS），这是一个统一的框架，它将校准、鲁棒性和不确定性量化集成到一个可解释的指标中。通过在五个 QA 数据集上对十个领先的开源法学硕士进行实验，我们评估了基线、扰动和校准方法下的性能。 CRS 提供稳定的模型排名，揭示单个指标遗漏的隐藏故障模式，并强调最可靠的系统平衡准确性、稳健性和校准不确定性。
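
Sketch: the abstract does not give the CRS formula, so the following only illustrates the general idea of folding calibration, robustness, and uncertainty quality into one number. Expected calibration error (ECE) is a standard calibration metric; the weighted average in `composite_score` and its default weights are assumptions, not the paper's definition.

```python
from typing import List, Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Standard ECE: bin-size-weighted gap between accuracy and mean confidence."""
    total = len(confidences)
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if members:
            mean_conf = sum(c for c, _ in members) / len(members)
            accuracy = sum(1 for _, ok in members if ok) / len(members)
            ece += len(members) / total * abs(accuracy - mean_conf)
    return ece


def composite_score(ece: float, robustness: float, uncertainty_quality: float,
                    weights: Sequence[float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Hypothetical composite: weighted mean of three scores in [0, 1]."""
    w_cal, w_rob, w_unc = weights
    return w_cal * (1.0 - ece) + w_rob * robustness + w_unc * uncertainty_quality
```

Here `robustness` could be, for example, the ratio of accuracy under perturbation to clean accuracy, and `uncertainty_quality` any score normalized to [0, 1]; both are placeholders.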

Title: Training a Huggingface Model on AWS Sagemaker (Without Tears)

Authors: Liling Tan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.24098
Pdf URL: https://arxiv.org/pdf/2512.24098
Copy Paste: [[2512.24098]] Training a Huggingface Model on AWS Sagemaker (Without Tears)(https://arxiv.org/abs/2512.24098)
Keywords: language model, llm
Abstract: The development of Large Language Models (LLMs) has primarily been driven by resource-rich research groups and industry partners. Due to the lack of on-premise computing resources required for increasingly complex models, many researchers are turning to cloud services like AWS SageMaker to train Hugging Face models. However, the steep learning curve of cloud platforms often presents a barrier for researchers accustomed to local environments. Existing documentation frequently leaves knowledge gaps, forcing users to seek fragmented information across the web. This demo paper aims to democratize cloud adoption by centralizing the essential information required for researchers to successfully train their first Hugging Face model on AWS SageMaker from scratch.
摘要：大型语言模型 (LLM) 的发展主要是由资源丰富的研究小组和行业合作伙伴推动的。由于缺乏日益复杂的模型所需的本地计算资源，许多研究人员正在转向 AWS SageMaker 等云服务来训练 Hugging Face 模型。然而，云平台陡峭的学习曲线往往给习惯本地环境的研究人员带来障碍。现有文档经常留下知识空白，迫使用户在网络上寻找零散的信息。本演示论文旨在通过集中研究人员从头开始在 AWS SageMaker 上成功训练第一个 Hugging Face 模型所需的基本信息，实现云采用的民主化。

Title: Activation Steering for Masked Diffusion Language Models

Authors: Adi Shnaidman, Erin Feiglin, Osher Yaari, Efrat Mentel, Amit Levi, Raz Lapid
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24143
Pdf URL: https://arxiv.org/pdf/2512.24143
Copy Paste: [[2512.24143]] Activation Steering for Masked Diffusion Language Models(https://arxiv.org/abs/2512.24143)
Keywords: language model, prompt
Abstract: Masked diffusion language models (MDLMs) generate text through an iterative denoising process. They have recently gained attention due to mask-parallel decoding and competitive performance with autoregressive large language models. However, effective mechanisms for inference-time control and steering in MDLMs remain largely unexplored. We present an activation-steering framework for MDLMs that computes layer-wise steering vectors from a single forward pass using contrastive examples, without simulating the denoising trajectory. These directions are applied at every reverse-diffusion step, yielding an efficient inference-time control mechanism. Experiments on LLaDA-8B-Instruct demonstrate reliable modulation of high-level attributes, with ablations examining the effects of steering across transformer sub-modules and token scope (prompt vs. response).
摘要：掩蔽扩散语言模型 (MDLM) 通过迭代去噪过程生成文本。它们最近由于掩码并行解码和自回归大型语言模型的竞争性能而受到关注。然而，MDLM 中推理时间控制和指导的有效机制在很大程度上仍未得到探索。我们提出了一种用于 MDLM 的激活引导框架，该框架使用对比示例从单个前向传递中计算分层引导向量，而不模拟去噪轨迹。这些方向应用于每个反向扩散步骤，从而产生有效的推理时间控制机制。 LLaDA-8B-Instruct 上的实验证明了高级属性的可靠调制，并通过消融检查了变压器子模块和令牌范围（提示与\响应）之间的转向效果。
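
Sketch: a common way to build a steering vector from contrastive examples is the difference of mean activations, which is then added to the hidden state with a scale factor. The code below shows that generic recipe on plain lists. Whether the paper uses exactly this difference-of-means form is an assumption, and `step_fn` is a hypothetical stand-in for one reverse-diffusion step of a real model.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Column-wise mean of a list of equal-length activation vectors."""
    count = len(rows)
    return [sum(column) / count for column in zip(*rows)]


def steering_vector(positive: Sequence[Sequence[float]],
                    negative: Sequence[Sequence[float]]) -> Vector:
    """Difference of mean activations between the two contrastive sets."""
    pos_mean, neg_mean = mean_vector(positive), mean_vector(negative)
    return [p - n for p, n in zip(pos_mean, neg_mean)]


def apply_steering(hidden: Sequence[float], vector: Sequence[float],
                   alpha: float) -> Vector:
    """Shift a hidden state along the steering direction, scaled by alpha."""
    return [h + alpha * v for h, v in zip(hidden, vector)]


def steer_every_step(hidden: Sequence[float], vector: Sequence[float], alpha: float,
                     step_fn: Callable[[Vector], Vector], n_steps: int) -> Vector:
    """Run n_steps of a (hypothetical) denoising step, steering after each one."""
    state = list(hidden)
    for _ in range(n_steps):
        state = apply_steering(step_fn(state), vector, alpha)
    return state
```

Because the vector is computed once and reused at every step, the extra inference cost is one vector addition per steered layer per step.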

Title: Large Emotional World Model

Authors: Changhao Song, Yazhou Zhang, Hui Gao, Chang Yang, Peng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24149
Pdf URL: https://arxiv.org/pdf/2512.24149
Copy Paste: [[2512.24149]] Large Emotional World Model(https://arxiv.org/abs/2512.24149)
Keywords: language model, llm
Abstract: World Models serve as tools for understanding the current state of the world and predicting its future dynamics, with broad application potential across numerous fields. As a key component of world knowledge, emotion significantly influences human decision-making. While existing Large Language Models (LLMs) have shown preliminary capability in capturing world knowledge, they primarily focus on modeling physical-world regularities and lack systematic exploration of emotional factors. In this paper, we first demonstrate the importance of emotion in understanding the world by showing that removing emotionally relevant information degrades reasoning performance. Inspired by theory of mind, we further propose a Large Emotional World Model (LEWM). Specifically, we construct the Emotion-Why-How (EWH) dataset, which integrates emotion into causal relationships and enables reasoning about why actions occur and how emotions drive future world states. Based on this dataset, LEWM explicitly models emotional states alongside visual observations and actions, allowing the world model to predict both future states and emotional transitions. Experimental results show that LEWM more accurately predicts emotion-driven social behaviors while maintaining comparable performance to general world models on basic tasks.
摘要：世界模型是了解世界现状和预测未来动态的工具，在众多领域具有广泛的应用潜力。作为世界知识的重要组成部分，情感极大地影响着人类的决策。虽然现有的大语言模型（LLM）已经显示出捕获世界知识的初步能力，但它们主要侧重于对物理世界规律进行建模，缺乏对情感因素的系统探索。在本文中，我们首先通过证明删除与情感相关的信息会降低推理性能来证明情感在理解世界中的重要性。受心理理论的启发，我们进一步提出了大情感世界模型（LEWM）。具体来说，我们构建了情感-为什么-如何（EWH）数据集，它将情感整合到因果关系中，并能够推理行为发生的原因以及情感如何驱动未来的世界状态。基于此数据集，LEWM 显式地模拟情绪状态以及视觉观察和动作，从而使世界模型能够预测未来状态和情绪转变。实验结果表明，LEWM 可以更准确地预测情绪驱动的社会行为，同时在基本任务上保持与一般世界模型相当的性能。

Title: Training Report of TeleChat3-MoE

Authors: Xinzhang Liu, Chao Wang, Zhihao Yang, Zhuo Jiang, Xuncheng Zhao, Haoran Wang, Lei Li, Dongdong He, Luobin Liu, Kaizhe Yuan, Han Gao, Zihan Wang, Yitong Yao, Sishi Xiong, Wenmin Deng, Haowei He, Kaidong Yu, Yu Zhao, Ruiyu Fang, Yuhao Jiang, Yingyan Li, Xiaohui Hu, Xi Yu, Jingqi Li, Yanwei Liu, Qingli Li, Xinyu Shi, Junhao Niu, Chengnuo Huang, Yao Xiao, Ruiwen Wang, Fengkai Li, Luwen Pu, Kaipeng Jia, Fubei Yao, Yuyao Huang, Xuewei He, Zhuoru Jiang, Ruiting Song, Rui Xue, Qiyi Xie, Jie Zhang, Zilu Huang, Zhaoxi Zhang, Zhilong Lu, Yanhan Zhang, Yin Zhang, Yanlei Xue, Zhu Yuan, Teng Su, Xin Jiang, Shuangyong Song, Yongxiang Li, Xuelong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24157
Pdf URL: https://arxiv.org/pdf/2512.24157
Copy Paste: [[2512.24157]] Training Report of TeleChat3-MoE(https://arxiv.org/abs/2512.24157)
Keywords: language model, chat
Abstract: TeleChat3-MoE is the latest series of TeleChat large language models, featuring a Mixture-of-Experts (MoE) architecture with parameter counts ranging from 105 billion to over one trillion, trained end-to-end on an Ascend NPU cluster. This technical report mainly presents the underlying training infrastructure that enables reliable and efficient scaling to frontier model sizes. We detail systematic methodologies for operator-level and end-to-end numerical accuracy verification, ensuring consistency across hardware platforms and distributed parallelism strategies. Furthermore, we introduce a suite of performance optimizations, including interleaved pipeline scheduling, attention-aware data scheduling for long-sequence training, hierarchical and overlapped communication for expert parallelism, and DVM-based operator fusion. A systematic parallelization framework, leveraging analytical estimation and integer linear programming, is also proposed to optimize multi-dimensional parallelism configurations. Additionally, we present methodological approaches to cluster-level optimizations, addressing host- and device-bound bottlenecks during large-scale training tasks. These infrastructure advancements yield significant throughput improvements and near-linear scaling on clusters comprising thousands of devices, providing a robust foundation for large-scale language model development on hardware ecosystems.
摘要：TeleChat3-MoE是TeleChat大语言模型的最新系列，采用专家混合（MoE）架构，参数数量从1050亿到超过万亿，在Ascend NPU集群上进行端到端训练。该技术报告主要介绍了底层训练基础设施，能够可靠、高效地扩展到前沿模型大小。我们详细介绍了操作员级和端到端数值精度验证的系统方法，确保跨硬件平台和分布式并行策略的一致性。此外，我们引入了一系列性能优化，包括交错管道调度、用于长序列训练的注意感知数据调度、用于专家并行的分层和重叠通信以及基于 DVM 的算子融合。还提出了利用分析估计和整数线性规划的系统并行化框架来优化多维并行配置。此外，我们还提出了集群级优化的方法，解决大规模训练任务期间主机和设备限制的瓶颈。这些基础设施的进步显着提高了吞吐量，并在包含数千个设备的集群上实现了近线性扩展，为硬件生态系统上的大规模语言模型开发提供了坚实的基础。

Title: MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring

Authors: Qipeng Wang, Rui Sheng, Yafei Li, Huamin Qu, Yushi Sun, Min Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24181
Pdf URL: https://arxiv.org/pdf/2512.24181
Copy Paste: [[2512.24181]] MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring(https://arxiv.org/abs/2512.24181)
Keywords: language model, llm
Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant promise in clinical diagnosis. However, current models struggle to emulate the iterative, diagnostic hypothesis-driven reasoning of real clinical scenarios. Specifically, current LLMs suffer from three critical limitations: (1) generating hallucinated medical content due to weak grounding in verified knowledge, (2) asking redundant or inefficient questions that hinder diagnostic progress, rather than discriminative ones, and (3) losing coherence over multi-turn dialogues, leading to contradictory or inconsistent conclusions. To address these challenges, we propose MedKGI, a diagnostic framework grounded in clinical practices. MedKGI integrates a medical knowledge graph (KG) to constrain reasoning to validated medical ontologies, selects questions based on information gain to maximize diagnostic efficiency, and adopts an OSCE-format structured state to maintain consistent evidence tracking across turns. Experiments on clinical benchmarks show that MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency, improving dialogue efficiency by 30% on average while maintaining state-of-the-art accuracy.
摘要：大型语言模型 (LLM) 的最新进展在临床诊断中展现了巨大的前景。然而，当前的模型很难模拟真实临床场景的迭代、诊断假设驱动的推理。具体来说，当前的法学硕士面临三个关键的局限性：（1）由于经验证的知识基础薄弱而产生幻觉的医学内容，（2）提出冗余或低效的问题，而不是阻碍诊断进展的歧视性问题，以及（3）在多轮对话中失去连贯性，导致矛盾或不一致的结论。为了应对这些挑战，我们提出了 MedKGI，一个基于临床实践的诊断框架。 MedKGI 集成了医学知识图 (KG) 以将推理限制为经过验证的医学本体，根据信息增益选择问题以最大限度地提高诊断效率，并采用 OSCE 格式的结构化状态来保持跨回合的一致证据跟踪。临床基准实验表明，MedKGI 在诊断准确性和询问效率方面均优于强大的 LLM 基线，将对话效率平均提高 30%，同时保持最先进的准确性。
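
Sketch: selecting questions by information gain can be illustrated with a small Bayesian example. Given a prior over candidate diagnoses and, for each yes/no question, the probability of a "yes" under each diagnosis, the expected information gain is the prior entropy minus the expected posterior entropy. The disease names, questions, and probabilities below are invented for illustration, and the paper's actual scoring may differ.

```python
import math
from typing import Dict


def entropy(dist: Dict[str, float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def posterior(prior: Dict[str, float], p_yes: Dict[str, float],
              answered_yes: bool) -> Dict[str, float]:
    """Bayes update of the diagnosis distribution after a yes/no answer."""
    unnormalized = {
        d: prior[d] * (p_yes[d] if answered_yes else 1.0 - p_yes[d]) for d in prior
    }
    total = sum(unnormalized.values())
    return {d: value / total for d, value in unnormalized.items()}


def expected_information_gain(prior: Dict[str, float],
                              p_yes: Dict[str, float]) -> float:
    """Prior entropy minus expected posterior entropy over both answers."""
    prob_yes = sum(prior[d] * p_yes[d] for d in prior)
    gain = entropy(prior)
    for answered_yes, prob in ((True, prob_yes), (False, 1.0 - prob_yes)):
        if prob > 0:  # skip impossible answers to avoid dividing by zero
            gain -= prob * entropy(posterior(prior, p_yes, answered_yes))
    return gain


def best_question(prior: Dict[str, float],
                  questions: Dict[str, Dict[str, float]]) -> str:
    """Pick the question with the highest expected information gain."""
    return max(questions, key=lambda q: expected_information_gain(prior, questions[q]))


# Hypothetical toy data: P(yes | diagnosis) for each question.
toy_prior = {"flu": 0.5, "cold": 0.5}
toy_questions = {
    "fever?": {"flu": 1.0, "cold": 0.0},  # perfectly discriminative
    "cough?": {"flu": 0.5, "cold": 0.5},  # uninformative
}
```

In this toy case the fever question resolves all uncertainty while the cough question resolves none, so a gain-based selector always asks about fever first.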

Title: LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring

Authors: May Bashendy, Walid Massoud, Sohaila Eltanbouly, Salam Albatarni, Marwan Sayed, Abrar Abir, Houda Bouamor, Tamer Elsayed
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24235
Pdf URL: https://arxiv.org/pdf/2512.24235
Copy Paste: [[2512.24235]] LAILA: A Large Trait-Based Dataset for Arabic Automated Essay Scoring(https://arxiv.org/abs/2512.24235)
Keywords: prompt
Abstract: Automated Essay Scoring (AES) has gained increasing attention in recent years, yet research on Arabic AES remains limited due to the lack of publicly available datasets. To address this, we introduce LAILA, the largest publicly available Arabic AES dataset to date, comprising 7,859 essays annotated with holistic and trait-specific scores on seven dimensions: relevance, organization, vocabulary, style, development, mechanics, and grammar. We detail the dataset design, collection, and annotations, and provide benchmark results using state-of-the-art Arabic and English models in prompt-specific and cross-prompt settings. LAILA fills a critical need in Arabic AES research, supporting the development of robust scoring systems.
摘要：近年来，论文自动评分 (AES) 受到越来越多的关注，但由于缺乏公开可用的数据集，对阿拉伯语 AES 的研究仍然有限。为了解决这个问题，我们引入了 LAILA，这是迄今为止最大的公开可用的阿拉伯语 AES 数据集，包含 7,859 篇文章，并在七个维度上标注了整体和特定特征的分数：相关性、组织、词汇、风格、发展、机制和语法。我们详细介绍了数据集的设计、收集和注释，并在特定提示和交叉提示设置中使用最先进的阿拉伯语和英语模型提供基准结果。 LAILA 满足了阿拉伯语 AES 研究的关键需求，支持强大的评分系统的开发。

Title: Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning

Authors: Ziqing Fan, Yuqiao Xian, Yan Sun, Li Shen
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.24265
Pdf URL: https://arxiv.org/pdf/2512.24265
Copy Paste: [[2512.24265]] Joint Selection for Large-Scale Pre-Training Data via Policy Gradient-based Mask Learning(https://arxiv.org/abs/2512.24265)
Keywords: language model, llm
Abstract: A fine-grained data recipe is crucial for pre-training large language models, as it can significantly enhance training efficiency and model performance. One important ingredient in the recipe is to select samples based on scores produced by defined rules, LLM judgment, or statistical information in embeddings, which can be roughly categorized into quality and diversity metrics. Due to the high computational cost when applied to trillion-scale token pre-training datasets such as FineWeb and DCLM, these two or more types of metrics are rarely considered jointly in a single selection process. However, in our empirical study, selecting samples based on quality metrics exhibits severe diminishing returns during long-term pre-training, while selecting on diversity metrics removes too many valuable high-quality samples, both of which limit pre-trained LLMs' capabilities. Therefore, we introduce DATAMASK, a novel and efficient joint learning framework designed for large-scale pre-training data selection that can simultaneously optimize multiple types of metrics in a unified process, with this study focusing specifically on quality and diversity metrics. DATAMASK approaches the selection process as a mask learning problem, involving iterative sampling of data masks, computation of policy gradients based on predefined objectives with sampled masks, and updating of mask sampling logits. Through policy gradient-based optimization and various acceleration enhancements, it significantly reduces selection time by 98.9% compared to a greedy algorithm, enabling our study to explore joint learning within trillion-scale tokens. With DATAMASK, we select a subset of about 10% from the 15 trillion-token FineWeb dataset, termed FineWeb-Mask. Evaluated across 12 diverse tasks, we achieve significant improvements of 3.2% on a 1.5B dense model and 1.9% on a 7B MoE model.
摘要：细粒度的数据配方对于预训练大型语言模型至关重要，因为它可以显着提高训练效率和模型性能。配方中的一个重要组成部分是根据定义的规则、LLM 判断或嵌入中的统计信息产生的分数来选择样本，这些信息可以大致分为质量和多样性指标。由于应用于FineWeb和DCLM等万亿级代币预训练数据集时的计算成本很高，因此很少在单个选择过程中联合考虑这两种或多种类型的指标。然而，在我们的实证研究中，根据质量指标选择样本在长期预训练期间表现出严重的收益递减，而根据多样性指标选择样本会删除太多有价值的高质量样本，这两者都限制了预训练的法学硕士的能力。因此，我们引入了DATAMASK，这是一种新颖且高效的联合学习框架，专为大规模预训练数据选择而设计，可以在统一的过程中同时优化多种类型的指标，本研究特别关注质量和多样性指标。 DATAMASK 将选择过程视为掩码学习问题，涉及数据掩码的迭代采样、基于带有采样掩码的预定义目标的策略梯度计算以及掩码采样逻辑的更新。通过基于策略梯度的优化和各种加速增强，与贪婪算法相比，它显着减少了 98.9% 的选择时间，使我们的研究能够探索万亿规模代币内的联合学习。通过 DATAMASK，我们从 15 万亿代币的 FineWeb 数据集中选择了约 10% 的子集，称为 FineWeb-Mask。通过对 12 个不同任务进行评估，我们在 1.5B 密集模型上实现了 3.2% 的显着改进，在 7B MoE 模型上实现了 1.9% 的显着改进。
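
Sketch: the loop the abstract describes (sample masks, compute policy gradients from an objective, update the sampling logits) matches the shape of a REINFORCE update over independent Bernoulli mask variables. The toy version below is an illustration only: the `selection_reward` objective, its per-sample `cost`, and all hyperparameters are assumptions, and none of the paper's acceleration techniques are included.

```python
import math
import random
from typing import Callable, List, Sequence


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def selection_reward(mask: Sequence[int], quality: Sequence[float],
                     cost: float = 0.5) -> float:
    """Hypothetical objective: total quality of selected samples minus a per-sample cost."""
    return sum(m * (q - cost) for m, q in zip(mask, quality))


def learn_mask_logits(n_items: int, reward_fn: Callable[[List[int]], float],
                      steps: int = 100, batch: int = 64, lr: float = 1.0,
                      seed: int = 0) -> List[float]:
    """REINFORCE over Bernoulli masks with a batch-mean baseline."""
    rng = random.Random(seed)
    logits = [0.0] * n_items
    for _ in range(steps):
        probs = [sigmoid(t) for t in logits]
        # Sample a batch of binary masks from the current selection probabilities.
        masks = [[1 if rng.random() < p else 0 for p in probs] for _ in range(batch)]
        rewards = [reward_fn(m) for m in masks]
        baseline = sum(rewards) / batch
        for i in range(n_items):
            # d/d(logit_i) of log Bernoulli(m_i; sigmoid(logit_i)) equals m_i - p_i.
            grad = sum((r - baseline) * (m[i] - probs[i])
                       for m, r in zip(masks, rewards)) / batch
            logits[i] += lr * grad
    return logits


toy_quality = [0.9, 0.1]
toy_logits = learn_mask_logits(2, lambda m: selection_reward(m, toy_quality))
```

With this toy objective, the logit of the high-quality sample is pushed up and the logit of the low-quality sample is pushed down, so sampling from the learned logits favors keeping the first item.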

Title: Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs

Authors: Jonathan Schmoll, Adam Jatowt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24289
Pdf URL: https://arxiv.org/pdf/2512.24289
Copy Paste: [[2512.24289]] Automated Analysis of Sustainability Reports: Using Large Language Models for the Extraction and Prediction of EU Taxonomy-Compliant KPIs(https://arxiv.org/abs/2512.24289)
Keywords: language model, llm, agent
Abstract: The manual, resource-intensive process of complying with the EU Taxonomy presents a significant challenge for companies. While Large Language Models (LLMs) offer a path to automation, research is hindered by a lack of public benchmark datasets. To address this gap, we introduce a novel, structured dataset from 190 corporate reports, containing ground-truth economic activities and quantitative Key Performance Indicators (KPIs). We use this dataset to conduct the first systematic evaluation of LLMs on the core compliance workflow. Our results reveal a clear performance gap between qualitative and quantitative tasks. LLMs show moderate success in the qualitative task of identifying economic activities, with a multi-step agentic framework modestly enhancing precision. Conversely, the models comprehensively fail at the quantitative task of predicting financial KPIs in a zero-shot setting. We also discover a paradox, where concise metadata often yields superior performance to full, unstructured reports, and find that model confidence scores are poorly calibrated. We conclude that while LLMs are not ready for full automation, they can serve as powerful assistive tools for human experts. Our dataset provides a public benchmark for future research.
摘要：遵守欧盟分类法的手动、资源密集型流程给公司带来了重大挑战。虽然大型语言模型 (LLM) 提供了一条自动化之路，但由于缺乏公共基准数据集，研究受到阻碍。为了解决这一差距，我们引入了来自 190 份公司报告的新颖的结构化数据集，其中包含真实的经济活动和定量关键绩效指标 (KPI)。我们使用该数据集对法学硕士的核心合规工作流程进行首次系统评估。我们的结果揭示了定性和定量任务之间明显的绩效差距。法学硕士在识别经济活动的定性任务中表现出一定的成功，多步骤代理框架适度提高了精度。相反，这些模型在零样本环境下预测财务 KPI 的定量任务上全面失败。我们还发现了一个悖论，即简洁的元数据通常比完整的非结构化报告产生更好的性能，并且发现模型置信度得分校准不佳。我们的结论是，虽然法学硕士还没有为完全自动化做好准备，但它们可以作为人类专家的强大辅助工具。我们的数据集为未来的研究提供了公共基准。

Title: Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking

Authors: Meiqi Chen, Fandong Meng, Jie Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24297
Pdf URL: https://arxiv.org/pdf/2512.24297
Copy Paste: [[2512.24297]] Figure It Out: Improving the Frontier of Reasoning with Active Visual Thinking(https://arxiv.org/abs/2512.24297)
Keywords: chain-of-thought
Abstract: Complex reasoning problems often involve implicit spatial, geometric, and structural relationships that are not explicitly encoded in text. While recent reasoning models have achieved strong performance across many domains, purely text-based reasoning struggles to represent global structural constraints in complex settings. In this paper, we introduce FIGR, which integrates active visual thinking into multi-turn reasoning via end-to-end reinforcement learning. FIGR externalizes intermediate structural hypotheses by constructing visual representations during problem solving. By adaptively regulating when and how visual reasoning should be invoked, FIGR enables more stable and coherent reasoning over global structural properties that are difficult to capture from text alone. Experiments on challenging mathematical reasoning benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines. In particular, FIGR improves the base model by 13.12% on AIME 2025 and 11.00% on BeyondAIME, highlighting the effectiveness of figure-guided multimodal reasoning in enhancing the stability and reliability of complex reasoning.
摘要：复杂的推理问题通常涉及未在文本中明确编码的隐式空间、几何和结构关系。虽然最近的推理模型在许多领域取得了强大的性能，但纯粹基于文本的推理很难表示复杂环境中的全局结构约束。在本文中，我们介绍了FIGR，它通过端到端强化学习将主动视觉思维集成到多轮推理中。 FigR 通过在解决问题期间构建视觉表示来具体化中间结构假设。通过自适应地调节何时以及如何调用视觉推理，FIGR 能够对难以仅从文本中捕获的全局结构属性进行更稳定和连贯的推理。对具有挑战性的数学推理基准的实验表明，FIGR 的性能优于强大的纯文本思维链基准。特别是，FIGR在AIME 2025上将基础模型改进了13.12%，在BeyondAIME上改进了11.00%，凸显了图形引导多模态推理在增强复杂推理稳定性和可靠性方面的有效性。

Title: QianfanHuijin Technical Report: A Novel Multi-Stage Training Paradigm for Finance Industrial LLMs

Authors: Shupeng Li, Weipeng Lu, Linyun Liu, Chen Lin, Shaofei Li, Zhendong Tan, Hanjun Zhong, Yucheng Zeng, Chenghao Zhu, Mengyue Liu, Daxiang Dong, Jianmin Wu, Yunting Xiao, Annan Li, Danyu Liu, Jingnan Zhang, Licen Liu, Dawei Yin, Dou Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24314
Pdf URL: https://arxiv.org/pdf/2512.24314
Copy Paste: [[2512.24314]] QianfanHuijin Technical Report: A Novel Multi-Stage Training Paradigm for Finance Industrial LLMs(https://arxiv.org/abs/2512.24314)
Keywords: language model, gpt, llm, agent
Abstract: Domain-specific enhancement of Large Language Models (LLMs) within the financial context has long been a focal point of industrial application. While previous models such as BloombergGPT and Baichuan-Finance primarily focused on knowledge enhancement, the deepening complexity of financial services has driven a growing demand for models that possess not only domain knowledge but also robust financial reasoning and agentic capabilities. In this paper, we present QianfanHuijin, a financial domain LLM, and propose a generalizable multi-stage training paradigm for industrial model enhancement. Our approach begins with Continual Pre-training (CPT) on financial corpora to consolidate the knowledge base. This is followed by a fine-grained Post-training pipeline designed with increasing specificity: starting with Financial SFT, progressing to Finance Reasoning RL and Finance Agentic RL, and culminating in General RL aligned with real-world business scenarios. Empirical results demonstrate that QianfanHuijin achieves superior performance across various authoritative financial benchmarks. Furthermore, ablation studies confirm that the targeted Reasoning RL and Agentic RL stages yield significant gains in their respective capabilities. These findings validate our motivation and suggest that this fine-grained, progressive post-training methodology is poised to become a mainstream paradigm for various industrial-enhanced LLMs.
摘要：金融背景下大型语言模型（LLM）的特定领域增强长期以来一直是工业应用的焦点。虽然 BloombergGPT 和百川财经等之前的模型主要侧重于知识增强，但金融服务日益复杂的趋势推动了对不仅拥有领域知识而且拥有强大的金融推理和代理能力的模型的需求不断增长。在本文中，我们提出了金融领域法学硕士千帆汇金，并提出了一种用于工业模型增强的可推广的多阶段训练范例。我们的方法从金融语料库的持续预训练（CPT）开始，以巩固知识库。接下来是细粒度的训练后管道，其设计越来越具体：从金融 SFT 开始，进展到金融推理 RL 和金融代理 RL，最后是与现实世界业务场景相一致的通用 RL。实证结果表明，千帆汇金在各种权威金融基准上均取得了优异的表现。此外，消融研究证实，目标推理强化学习和代理强化学习阶段在各自的能力方面产生了显着的收益。这些发现验证了我们的动机，并表明这种细粒度、渐进的培训后方法有望成为各种工业强化法学硕士的主流范例。

Title: World model inspired sarcasm reasoning with large language model agents

Authors: Keito Inoshita, Shinnosuke Mizuno
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24329
Pdf URL: https://arxiv.org/pdf/2512.24329
Copy Paste: [[2512.24329]] World model inspired sarcasm reasoning with large language model agents(https://arxiv.org/abs/2512.24329)
Keywords: language model, llm, agent
Abstract: Sarcasm understanding is a challenging problem in natural language processing, as it requires capturing the discrepancy between the surface meaning of an utterance and the speaker's intentions as well as the surrounding social context. Although recent advances in deep learning and Large Language Models (LLMs) have substantially improved performance, most existing approaches still rely on black-box predictions of a single model, making it difficult to structurally explain the cognitive factors underlying sarcasm. Moreover, while sarcasm often emerges as a mismatch between semantic evaluation and normative expectations or intentions, frameworks that explicitly decompose and model these components remain limited. In this work, we reformulate sarcasm understanding as a world model inspired reasoning process and propose World Model inspired SArcasm Reasoning (WM-SAR), which decomposes literal meaning, context, normative expectation, and intention into specialized LLM-based agents. The discrepancy between literal evaluation and normative expectation is explicitly quantified as a deterministic inconsistency score, and together with an intention score, these signals are integrated by a lightweight Logistic Regression model to infer the final sarcasm probability. This design leverages the reasoning capability of LLMs while maintaining an interpretable numerical decision structure. Experiments on representative sarcasm detection benchmarks show that WM-SAR consistently outperforms existing deep learning and LLM-based methods. Ablation studies and case analyses further demonstrate that integrating semantic inconsistency and intention reasoning is essential for effective sarcasm detection, achieving both strong performance and high interpretability.
摘要：讽刺理解是自然语言处理中的一个具有挑战性的问题，因为它需要捕获话语的表面含义与说话者的意图以及周围的社会背景之间的差异。尽管深度学习和大型语言模型（LLM）的最新进展显着提高了性能，但大多数现有方法仍然依赖于单个模型的黑盒预测，这使得很难从结构上解释讽刺背后的认知因素。此外，虽然讽刺常常表现为语义评估与规范期望或意图之间的不匹配，但明确分解和建模这些组件的框架仍然有限。在这项工作中，我们将讽刺理解重新表述为一种世界模型启发的推理过程，并提出了世界模型启发的 SArcasm 推理（WM-SAR），它将字面意义、上下文、规范期望和意图分解为专门的基于 LLM 的代理。字面评价和规范期望之间的差异被明确量化为确定性不一致分数，并与意图分数一起，通过轻量级逻辑回归模型整合这些信号以推断最终的讽刺概率。该设计利用了法学硕士的推理能力，同时保持了可解释的数值决策结构。对代表性讽刺检测基准的实验表明，WM-SAR 始终优于现有的深度学习和基于 LLM 的方法。消融研究和案例分析进一步表明，整合语义不一致和意图推理对于有效的讽刺检测、实现强大的性能和高可解释性至关重要。
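The decision structure in the WM-SAR abstract (an inconsistency score plus an intention score, combined by a lightweight logistic regression) can be sketched as follows. This is a minimal illustration under stated assumptions: the absolute-difference form of `inconsistency`, the score ranges, and the toy training data are all hypothetical, and the agent outputs are replaced by hand-written numbers.

```python
import math


def inconsistency(literal_eval, normative_expectation):
    """Deterministic gap between two scores in [-1, 1] (assumed form)."""
    return abs(literal_eval - normative_expectation)


def predict(w, b, x):
    """Logistic-regression probability for feature vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fit_logreg(X, y, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, t in zip(X, y):
            err = predict(w, b, x) - t
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy features: (inconsistency score, intention score); label 1 = sarcastic.
X = [(1.6, 0.9), (1.2, 0.8), (1.8, 0.7), (0.1, 0.1), (0.3, 0.2), (0.2, 0.4)]
y = [1, 1, 1, 0, 0, 0]
w, b = fit_logreg(X, y)

# "Great, another Monday": positive literal reading, negative expectation.
features = (inconsistency(0.9, -0.8), 0.9)
print(round(predict(w, b, features), 3))
```

Because the final classifier has only two inputs and two weights, each prediction can be read directly from the fitted coefficients, which is the interpretability argument the abstract makes.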

Title: Comparing Approaches to Automatic Summarization in Less-Resourced Languages

Authors: Chester Palen-Michel, Constantine Lignos
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24410
Pdf URL: https://arxiv.org/pdf/2512.24410
Copy Paste: [[2512.24410]] Comparing Approaches to Automatic Summarization in Less-Resourced Languages(https://arxiv.org/abs/2512.24410)
Keywords: llm, prompt
Abstract: Automatic text summarization has achieved high performance in high-resourced languages like English, but comparatively less attention has been given to summarization in less-resourced languages. This work compares a variety of different approaches to summarization from zero-shot prompting of LLMs large and small to fine-tuning smaller models like mT5 with and without three data augmentation approaches and multilingual transfer. We also explore an LLM translation pipeline approach, translating from the source language to English, summarizing and translating back. Evaluating with five different metrics, we find that there is variation across LLMs in their performance across similar parameter sizes, that our multilingual fine-tuned mT5 baseline outperforms most other approaches including zero-shot LLM performance for most metrics, and that LLM as judge may be less reliable on less-resourced languages.
摘要：自动文本摘要在英语等资源丰富的语言中取得了很高的性能，但对资源较少的语言中的摘要关注相对较少。这项工作比较了各种不同的总结方法，从大大小小的 LLM 的零样本提示到微调较小的模型（如 mT5）（有或没有三种数据增强方法和多语言迁移）。我们还探索了法学硕士翻译管道方法，从源语言翻译成英语，进行总结并翻译回来。通过五个不同的指标进行评估，我们发现不同 LLM 在相似参数大小下的性能存在差异，我们的多语言微调 mT5 基线优于大多数其他方法，包括大多数指标的零样本 LLM 性能，并且 LLM 作为评判在资源较少的语言上可能不太可靠。

Title: Cleaning English Abstracts of Scientific Publications

Authors: Michael E. Rose, Nils A. Herrmann, Sebastian Erhardt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24459
Pdf URL: https://arxiv.org/pdf/2512.24459
Copy Paste: [[2512.24459]] Cleaning English Abstracts of Scientific Publications(https://arxiv.org/abs/2512.24459)
Keywords: language model
Abstract: Scientific abstracts are often used as proxies for the content and thematic focus of research publications. However, a significant share of published abstracts contains extraneous information (such as publisher copyright statements, section headings, author notes, registrations, and bibliometric or bibliographic metadata) that can distort downstream analyses, particularly those involving document similarity or textual embeddings. We introduce an open-source, easy-to-integrate language model designed to clean English-language scientific abstracts by automatically identifying and removing such clutter. We demonstrate that our model is both conservative and precise, alters similarity rankings of cleaned abstracts, and improves information content of standard-length embeddings.
摘要：科学摘要通常用作研究出版物的内容和主题焦点的代表。然而，已发表的摘要中很大一部分包含无关信息，例如出版商版权声明、章节标题、作者注释、注册以及文献计量或书目元数据，这些信息可能会扭曲下游分析，特别是涉及文档相似性或文本嵌入的分析。我们引入了一种开源、易于集成的语言模型，旨在通过自动识别和消除此类混乱来清理英语科学摘要。我们证明我们的模型既保守又精确，改变了清理后的摘要的相似性排名，并提高了标准长度嵌入的信息内容。

Title: Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech

Authors: Fabian Retkowski, Alexander Waibel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24517
Pdf URL: https://arxiv.org/pdf/2512.24517
Copy Paste: [[2512.24517]] Paragraph Segmentation Revisited: Towards a Standard Task for Structuring Speech(https://arxiv.org/abs/2512.24517)
Keywords: language model
Abstract: Automatic speech transcripts are often delivered as unstructured word streams that impede readability and repurposing. We recast paragraph segmentation as the missing structuring step and fill three gaps at the intersection of speech processing and text segmentation. First, we establish TEDPara (human-annotated TED talks) and YTSegPara (YouTube videos with synthetic labels) as the first benchmarks for the paragraph segmentation task. The benchmarks focus on the underexplored speech domain, where paragraph segmentation has traditionally not been part of post-processing, while also contributing to the wider text segmentation field, which still lacks robust and naturalistic benchmarks. Second, we propose a constrained-decoding formulation that lets large language models insert paragraph breaks while preserving the original transcript, enabling faithful, sentence-aligned evaluation. Third, we show that a compact model (MiniSeg) attains state-of-the-art accuracy and, when extended hierarchically, jointly predicts chapters and paragraphs with minimal computational cost. Together, our resources and methods establish paragraph segmentation as a standardized, practical task in speech processing.
摘要：自动语音记录通常以非结构化文字流的形式提供，这会妨碍可读性和重新利用。我们将段落分割重新定义为缺失的结构化步骤，并填补了语音处理和文本分割交叉点的三个空白。首先，我们建立 TEDPara（人工注释的 TED 演讲）和 YTSegPara（带有合成标签的 YouTube 视频）作为段落分割任务的第一个基准。这些基准测试重点关注尚未开发的语音领域，其中段落分割传统上不属于后处理的一部分，同时也为更广泛的文本分割领域做出了贡献，该领域仍然缺乏稳健和自然的基准测试。其次，我们提出了一种约束解码公式，允许大型语言模型插入段落分隔符，同时保留原始转录本，从而实现忠实的、句子对齐的评估。第三，我们展示了紧凑模型（MiniSeg）达到了最先进的准确性，并且当分层扩展时，可以以最小的计算成本联合预测章节和段落。我们的资源和方法共同将段落分割确立为语音处理中的标准化实用任务。
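The constrained-decoding idea in the paragraph segmentation abstract (the model may insert paragraph breaks but must preserve the original transcript) can be sketched at the sentence level. This is a simplified illustration, not the paper's formulation: `break_score` is a hypothetical stand-in for a language model's preference for a break, and `toy_break_score` is a trivial cue-word rule used only to make the example runnable.

```python
def insert_paragraph_breaks(sentences, break_score, threshold=0.5):
    """Group sentences into paragraphs without altering any sentence.

    At each boundary the only permitted choices are "continue the current
    paragraph" or "start a new one", so the transcript itself is always
    reproduced verbatim and stays sentence-aligned.
    """
    paragraphs = []
    for i, sentence in enumerate(sentences):
        start_new = i == 0 or break_score(sentences[:i], sentence) > threshold
        if start_new:
            paragraphs.append([])
        paragraphs[-1].append(sentence)
    return paragraphs


def toy_break_score(previous, sentence):
    """Hypothetical scorer: break before sentences that open with a cue word."""
    cues = ("next,", "finally,", "now,")
    return 1.0 if sentence.lower().startswith(cues) else 0.0


transcript = [
    "Welcome everyone.",
    "Today is about speech.",
    "Next, the data.",
    "It is large.",
    "Finally, thanks.",
]
for paragraph in insert_paragraph_breaks(transcript, toy_break_score):
    print(" ".join(paragraph))
    print()
```

Since the output is only a regrouping of the input sentences, predicted and reference segmentations can be compared boundary by boundary.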

Title: Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs

Authors: Muhammad Abdullahi Said, Muhammad Sammani Sani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24556
Pdf URL: https://arxiv.org/pdf/2512.24556
Copy Paste: [[2512.24556]] Safe in the Future, Dangerous in the Past: Dissecting Temporal and Linguistic Vulnerabilities in LLMs(https://arxiv.org/abs/2512.24556)
Keywords: language model, gpt, llm
Abstract: As Large Language Models (LLMs) integrate into critical global infrastructure, the assumption that safety alignment transfers zero-shot from English to other languages remains a dangerous blind spot. This study presents a systematic audit of three state-of-the-art models (GPT-5.1, Gemini 3 Pro, and Claude 4.5 Opus) using HausaSafety, a novel adversarial dataset grounded in West African threat scenarios (e.g., Yahoo-Yahoo fraud, Dane gun manufacturing). Employing a 2 x 4 factorial design across 1,440 evaluations, we tested the non-linear interaction between language (English vs. Hausa) and temporal framing. Our results challenge the prevailing multilingual safety gap narrative. Instead of a simple degradation in low-resource settings, we identified a mechanism of Complex Interference where safety is determined by the intersection of variables. While models exhibited a Reverse Linguistic, with Claude 4.5 Opus proving significantly safer in Hausa (45.0%) than in English (36.7%) due to uncertainty-driven refusal, they suffered catastrophic failures in temporal reasoning. We report a profound Temporal Asymmetry, where past-tense framing bypassed defenses (15.6% safe) while future-tense scenarios triggered hyper-conservative refusals (57.2% safe). The magnitude of this volatility is illustrated by a 9.2x disparity between the safest and most vulnerable configurations, proving that safety is not a fixed property but a context-dependent state. We conclude that current models rely on superficial heuristics rather than robust semantic understanding, creating Safety Pockets that leave Global South users exposed to localized harms. We propose Invariant Alignment as a necessary paradigm shift to ensure safety stability across linguistic and temporal shifts.
摘要：随着大型语言模型 (LLM) 融入关键的全球基础设施，安全对齐将零样本从英语转移到其他语言的假设仍然是一个危险的盲点。本研究使用 HausaSafety 对三种最先进的模型（GPT-5.1、Gemini 3 Pro 和 Claude 4.5 Opus）进行了系统审计，HausaSafety 是一种基于西非威胁场景（例如，雅虎-雅虎欺诈、Dane 枪支制造）的新型对抗数据集。我们在 1,440 项评估中采用 2 x 4 因子设计，测试了语言（英语与豪萨语）和时间框架之间的非线性相互作用。我们的结果挑战了流行的多语言安全差距叙述。我们不是在资源匮乏的环境中进行简单的降级，而是确定了一种复杂干扰机制，其中安全性由变量的交集决定。虽然模型表现出反向语言学，Claude 4.5 Opus 证明豪萨语 (45.0%) 比英语 (36.7%) 安全得多，但由于不确定性驱动的拒绝，它们在时间推理中遭受了灾难性的失败。我们报告了一种深刻的时间不对称性，其中过去时态框架绕过了防御（15.6％安全），而未来时态场景触发了超级保守的拒绝（57.2％安全）。最安全配置和最易受攻击配置之间的 9.2 倍差异说明了这种波动性的严重程度，证明安全性不是固定属性，而是依赖于环境的状态。我们的结论是，当前的模型依赖于肤浅的启发式方法，而不是强大的语义理解，从而创建了安全袋，使全球南方用户面临局部伤害。我们提出不变对齐作为必要的范式转变，以确保跨语言和时间变化的安全稳定性。

Title: HaluNet: Multi-Granular Uncertainty Modeling for Efficient Hallucination Detection in LLM Question Answering

Authors: Chaodong Tong, Qi Zhang, Jiayang Gao, Lei Jiang, Yanbing Liu, Nannan Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24562
Pdf URL: https://arxiv.org/pdf/2512.24562
Copy Paste: [[2512.24562]] HaluNet: Multi-Granular Uncertainty Modeling for Efficient Hallucination Detection in LLM Question Answering(https://arxiv.org/abs/2512.24562)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) excel at question answering (QA) but often generate hallucinations, including factual errors or fabricated content. Detecting hallucinations from internal uncertainty signals is attractive due to its scalability and independence from external resources. Existing methods often aim to accurately capture a single type of uncertainty while overlooking the complementarity among different sources, particularly between token-level probability uncertainty and the uncertainty conveyed by internal semantic representations, which provide complementary views on model reliability. We present HaluNet, a lightweight and trainable neural framework that integrates multi-granular token-level uncertainties by combining semantic embeddings with probabilistic confidence and distributional uncertainty. Its multi-branch architecture adaptively fuses what the model knows with the uncertainty expressed in its outputs, enabling efficient one-pass hallucination detection. Experiments on SQuAD, TriviaQA, and Natural Questions show that HaluNet delivers strong detection performance and favorable computational efficiency, with or without access to context, highlighting its potential for real-time hallucination detection in LLM-based QA systems.
摘要：大型语言模型 (LLM) 擅长问答 (QA)，但经常产生幻觉，包括事实错误或捏造内容。从内部不确定性信号中检测幻觉由于其可扩展性和独立于外部资源而具有吸引力。现有方法通常旨在准确捕获单一类型的不确定性，同时忽略不同来源之间的互补性，特别是令牌级概率不确定性和内部语义表示传达的不确定性之间的互补性，这提供了关于模型可靠性的互补观点。我们提出了 \textbf{HaluNet}，这是一种轻量级且可训练的神经框架，通过将语义嵌入与概率置信度和分布不确定性相结合来集成多粒度标记级别的不确定性。其多分支架构自适应地将模型所知与其输出中表达的不确定性融合在一起，从而实现高效的一次性幻觉检测。 SQuAD、TriviaQA 和 Natural Questions 上的实验表明，无论是否访问上下文，HaluNet 都能提供强大的检测性能和良好的计算效率，凸显了其在基于 LLM 的 QA 系统中进行实时幻觉检测的潜力。

Title: Korean Canonical Legal Benchmark: Toward Knowledge-Independent Evaluation of LLMs' Legal Reasoning Capabilities

Authors: Hongseok Oh, Wonseok Hwang, Kyoung-Woon On
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24572
Pdf URL: https://arxiv.org/pdf/2512.24572
Copy Paste: [[2512.24572]] Korean Canonical Legal Benchmark: Toward Knowledge-Independent Evaluation of LLMs' Legal Reasoning Capabilities(https://arxiv.org/abs/2512.24572)
Keywords: language model, llm
Abstract: We introduce the Korean Canonical Legal Benchmark (KCL), a benchmark designed to assess language models' legal reasoning capabilities independently of domain-specific knowledge. KCL provides question-level supporting precedents, enabling a more faithful disentanglement of reasoning ability from parameterized knowledge. KCL consists of two components: (1) KCL-MCQA, multiple-choice problems of 283 questions with 1,103 aligned precedents, and (2) KCL-Essay, open-ended generation problems of 169 questions with 550 aligned precedents and 2,739 instance-level rubrics for automated evaluation. Our systematic evaluation of 30+ models shows large remaining gaps, particularly in KCL-Essay, and that reasoning-specialized models consistently outperform their general-purpose counterparts. We release all resources, including the benchmark dataset and evaluation code, at this https URL.
摘要：我们引入了韩国规范法律基准（KCL），该基准旨在评估独立于特定领域知识的语言模型的法律推理能力。 KCL 提供问题级支持先例，使推理能力与参数化知识更加忠实地分离。 KCL 由两个部分组成：(1) KCL-MCQA，由 283 个问题和 1,103 个对齐先例组成的多项选择题，以及 (2) KCL-Essay，由 169 个问题组成的开放式生成问题，带有 550 个对齐先例和 2,739 个用于自动评估的实例级量规。我们对 30 多个模型的系统评估表明，还存在很大的差距，特别是在 KCL-Essay 中，并且推理专用模型始终优于通用模型。我们在此 https URL 发布了所有资源，包括基准数据集和评估代码。

Title: Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time

Authors: Zhenyu Zhang, Xiaoxia Wu, Zhongzhu Zhou, Qingyang Wu, Yineng Zhang, Pragaash Ponnusamy, Harikaran Subbaraj, Jue Wang, Shuaiwen Leon Song, Ben Athiwaratkun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.24574
Pdf URL: https://arxiv.org/pdf/2512.24574
Copy Paste: [[2512.24574]] Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time(https://arxiv.org/abs/2512.24574)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) often rely on long chain-of-thought (CoT) reasoning to solve complex tasks. While effective, these trajectories are frequently inefficient, leading to high latency from excessive token generation, or unstable reasoning that alternates between underthinking (shallow, inconsistent steps) and overthinking (repetitive, verbose reasoning). In this work, we study the structure of reasoning trajectories and uncover specialized attention heads that correlate with distinct cognitive behaviors such as verification and backtracking. By lightly intervening on these heads at inference time, we can steer the model away from inefficient modes. Building on this insight, we propose CREST, a training-free method for Cognitive REasoning Steering at Test-time. CREST has two components: (1) an offline calibration step that identifies cognitive heads and derives head-specific steering vectors, and (2) an inference-time procedure that rotates hidden representations to suppress components along those vectors. CREST adaptively suppresses unproductive reasoning behaviors, yielding both higher accuracy and lower computational cost. Across diverse reasoning benchmarks and models, CREST improves accuracy by up to 17.5% while reducing token usage by 37.6%, offering a simple and effective pathway to faster, more reliable LLM reasoning.
摘要：大型语言模型 (LLM) 通常依赖长链思维 (CoT) 推理来解决复杂任务。虽然有效，但这些轨迹通常效率低下，导致过多令牌生成导致高延迟，或者在思考不足（肤浅、不一致的步骤）和过度思考（重复、冗长的推理）之间交替出现不稳定的推理。在这项工作中，我们研究推理轨迹的结构，并发现与验证和回溯等不同认知行为相关的专门注意力头。通过在推理时对这些头进行轻微干预，我们可以引导模型远离低效模式。基于这一见解，我们提出了 CREST，一种用于测试时认知推理指导的免训练方法。 CREST 有两个组成部分：(1) 离线校准步骤，用于识别认知头部并导出头部特定的转向向量；(2) 推理时间过程，用于旋转隐藏表示以抑制沿这些向量的分量。 CREST 自适应地抑制无效的推理行为，从而产生更高的准确性和更低的计算成本。在不同的推理基准和模型中，CREST 将准确性提高了 17.5%，同时减少了 37.6% 的令牌使用量，为更快、更可靠的 LLM 推理提供了简单有效的途径。
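The inference-time step in the CREST abstract (suppressing the component of a hidden representation along a steering vector) can be illustrated with a small vector sketch. This is one plausible reading offered as an assumption, not the paper's procedure: the projection is removed and the result is rescaled to the original norm, so only the direction of the vector changes. The `strength` parameter and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def suppress_along(hidden, steering, strength=1.0):
    """Remove `strength` of hidden's component along `steering`,
    then rescale so the output keeps the original norm."""
    steering_norm = norm(steering)
    if steering_norm == 0.0:
        return list(hidden)
    unit = [x / steering_norm for x in steering]
    coeff = dot(hidden, unit)
    out = [h - strength * coeff * u for h, u in zip(hidden, unit)]
    out_norm = norm(out)
    if out_norm == 0.0:
        # hidden was parallel to steering; nothing is left to rescale.
        return out
    scale = norm(hidden) / out_norm
    return [x * scale for x in out]


hidden = [3.0, 4.0, 0.0]
steering = [1.0, 0.0, 0.0]
steered = suppress_along(hidden, steering)
print(steered, dot(steered, steering), norm(steered))
```

With `strength=1.0` the steered vector is orthogonal to the steering direction while its length matches the input, which keeps the magnitude of the activation unchanged for later layers.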

Title: Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models

Authors: Junru Lu, Jiarui Qin, Lingfeng Qiao, Yinghui Li, Xinyi Dai, Bo Ke, Jianfeng He, Ruizhi Qiao, Di Yin, Xing Sun, Yunsheng Wu, Yinsong Liu, Shuangyin Liu, Mingkong Tang, Haodong Lin, Jiayi Kuang, Fanxu Meng, Xiaojuan Tang, Yunjia Xi, Junjie Huang, Haotong Yang, Zhenyi Shen, Yangning Li, Qianwen Zhang, Yifei Yu, Siyu An, Junnan Dong, Qiufeng Wang, Jie Wang, Keyu Chen, Wei Wen, Taian Guo, Zhifeng Shen, Daohai Yu, Jiahao Li, Ke Li, Zongyi Li, Xiaoyu Tan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24618
Pdf URL: https://arxiv.org/pdf/2512.24618
Copy Paste: [[2512.24618]] Youtu-LLM: Unlocking the Native Agentic Potential for Lightweight Large Language Models(https://arxiv.org/abs/2512.24618)
Keywords: language model, llm, agent
Abstract: We introduce Youtu-LLM, a lightweight yet powerful language model that harmonizes high computational efficiency with native agentic intelligence. Unlike typical small models that rely on distillation, Youtu-LLM (1.96B) is pre-trained from scratch to systematically cultivate reasoning and planning capabilities. The key technical advancements are as follows: (1) Compact Architecture with Long-Context Support: Built on a dense Multi-Latent Attention (MLA) architecture with a novel STEM-oriented vocabulary, Youtu-LLM supports a 128k context window. This design enables robust long-context reasoning and state tracking within a minimal memory footprint, making it ideal for long-horizon agent and reasoning tasks. (2) Principled "Commonsense-STEM-Agent" Curriculum: We curated a massive corpus of approximately 11T tokens and implemented a multi-stage training strategy. By progressively shifting the pre-training data distribution from general commonsense to complex STEM and agentic tasks, we ensure the model acquires deep cognitive abilities rather than superficial alignment. (3) Scalable Agentic Mid-training: Specifically for the agentic mid-training, we employ diverse data construction schemes to synthesize rich and varied trajectories across math, coding, and tool-use domains. This high-quality data enables the model to internalize planning and reflection behaviors effectively. Extensive evaluations show that Youtu-LLM sets a new state-of-the-art for sub-2B LLMs. On general benchmarks, it achieves competitive performance against larger models, while on agent-specific tasks, it significantly surpasses existing SOTA baselines, demonstrating that lightweight models can possess strong intrinsic agentic capabilities.
摘要：我们推出 Youtu-LLM，这是一种轻量级但功能强大的语言模型，可将高计算效率与原生代理智能相协调。与典型的依赖蒸馏的小模型不同，Youtu-LLM（1.96B）是从头开始预训练的，系统地培养推理和规划能力。关键技术进步如下： (1) 具有长上下文支持的紧凑架构：Youtu-LLM 基于密集的多潜在注意力 (MLA) 架构和新颖的面向 STEM 的词汇构建，支持 128k 上下文窗口。这种设计在最小的内存占用范围内实现了强大的长上下文推理和状态跟踪，使其成为长视野代理和推理任务的理想选择。 (2) 原则性的“Commonsense-STEM-Agent”课程：我们策划了约 11T 个 token 的海量语料库，并实施了多阶段的训练策略。通过逐步将预训练数据分布从一般常识转移到复杂的 STEM 和代理任务，我们确保模型获得深层认知能力，而不是表面对齐。 (3)可扩展的代理中期训练：特别是对于代理中期训练，我们采用不同的数据构建方案来综合数学、编码和工具使用领域的丰富多样的轨迹。这种高质量的数据使模型能够有效地内化规划和反思行为。广泛的评估表明，Youtu-LLM 为 2B 级以下的 LLM 树立了新的最先进水平。在一般基准上，它实现了与大型模型竞争的性能，而在特定于代理的任务上，它显着超越了现有的 SOTA 基线，这表明轻量级模型可以拥有强大的内在代理能力。

Title: Do Large Language Models Know What They Are Capable Of?

Authors: Casey O. Barkan, Sid Black, Oliver Sourbut
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24661
Pdf URL: https://arxiv.org/pdf/2512.24661
Copy Paste: [[2512.24661]] Do Large Language Models Know What They Are Capable Of?(https://arxiv.org/abs/2512.24661)
Keywords: language model, llm, agent
Abstract: We investigate whether large language models (LLMs) can predict whether they will succeed on a given task and whether their predictions improve as they progress through multi-step tasks. We also investigate whether LLMs can learn from in-context experiences to make better decisions about whether to pursue a task in scenarios where failure is costly. All LLMs we tested are overconfident, but most predict their success with better-than-random discriminatory power. We find that newer and larger LLMs generally do not have greater discriminatory power, though Claude models do show such a trend. On multi-step agentic tasks, the overconfidence of several frontier LLMs worsens as they progress through the tasks, and reasoning LLMs perform comparably to or worse than non-reasoning LLMs. With in-context experiences of failure, some but not all LLMs reduce their overconfidence, leading to significantly improved decision making. Interestingly, all LLMs' decisions are approximately rational given their estimated probabilities of success, yet their overly optimistic estimates result in poor decision making. These results suggest that current LLM agents are hindered by their lack of awareness of their own capabilities. We discuss the implications of LLMs' awareness of their capabilities for AI misuse and misalignment risks.
摘要：我们研究大型语言模型 (LLM) 是否可以预测它们是否会在给定任务上取得成功，以及它们的预测是否会随着多步骤任务的进展而得到改善。我们还调查了法学硕士是否可以从情境经验中学习，以便在失败代价高昂的情况下做出是否继续执行任务的更好决策。我们测试的所有法学硕士都过于自信，但大多数人都以优于随机的歧视力来预测自己的成功。我们发现，较新、较大的法学硕士通常没有更大的歧视力，尽管克劳德模型确实显示了这种趋势。在多步骤代理任务中，一些前沿法学硕士的过度自信随着任务的进展而恶化，推理法学硕士的表现与非推理法学硕士相当或更差。通过失败的经历，一些（但不是全部）法学硕士会减少过度自信，从而显着改善决策制定，而另一些则不然。有趣的是，考虑到他们估计的成功概率，所有法学硕士的决策都是近似理性的，但他们过于乐观的估计会导致糟糕的决策。这些结果表明，目前的法学硕士代理人因缺乏对自身能力的认识而受到阻碍。我们讨论了法学硕士对其能力的认识对人工智能滥用和错位风险的影响。

Title: R-Debater: Retrieval-Augmented Debate Generation through Argumentative Memory

Authors: Maoyuan Li, Zhongsheng Wang, Haoyuan Li, Jiamou Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24684
Pdf URL: https://arxiv.org/pdf/2512.24684
Copy Paste: [[2512.24684]] R-Debater: Retrieval-Augmented Debate Generation through Argumentative Memory(https://arxiv.org/abs/2512.24684)
Keywords: llm, agent
Abstract: We present R-Debater, an agentic framework for generating multi-turn debates built on argumentative memory. Grounded in rhetoric and memory studies, the system views debate as a process of recalling and adapting prior arguments to maintain stance consistency, respond to opponents, and support claims with evidence. Specifically, R-Debater integrates a debate knowledge base for retrieving case-like evidence and prior debate moves with a role-based agent that composes coherent utterances across turns. We evaluate on standardized ORCHID debates, constructing a 1,000-item retrieval corpus and a held-out set of 32 debates across seven domains. Two tasks are evaluated: next-utterance generation, assessed by InspireScore (subjective, logical, and factual), and adversarial multi-turn simulations, judged by Debatrix (argument, source, language, and overall). Compared with strong LLM baselines, R-Debater achieves higher single-turn and multi-turn scores. Human evaluation with 20 experienced debaters further confirms its consistency and evidence use, showing that combining retrieval grounding with structured planning yields more faithful, stance-aligned, and coherent debates across turns.
摘要：我们提出了 R-Debater，一种基于争论记忆生成多轮辩论的代理框架。该系统以修辞和记忆研究为基础，将辩论视为回忆和调整先前论点的过程，以保持立场一致性、回应对手并用证据支持主张。具体来说，R-Debater 将用于检索类似案例的证据和先前辩论动作的辩论知识库与基于角色的代理集成在一起，该代理在轮流中组成连贯的话语。我们对标准化 ORCHID 辩论进行评估，构建了一个包含 1,000 个项目的检索语料库以及一组跨 7 个领域的 32 场辩论。评估两项任务：下一代话语生成，由 InspireScore 评估（主观、逻辑和事实），以及对抗性多回合模拟，由 Debatrix 判断（论点、来源、语言和总体）。与强大的 LLM 基线相比，R-Debater 取得了更高的单轮和多轮分数。由 20 名经验丰富的辩手进行的人工评估进一步证实了其一致性和证据使用，表明将检索基础与结构化规划相结合可以产生更忠实、立场一致和连贯的回合辩论。

Title: MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models

Authors: Wenzhe Li, Shujian Zhang, Wenxuan Zhou, John Lambert, Chi Jin, Andrew Hard, Rajiv Mathews, Lun Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24693
Pdf URL: https://arxiv.org/pdf/2512.24693
Copy Paste: [[2512.24693]] MUSIC: MUlti-Step Instruction Contrast for Multi-Turn Reward Models(https://arxiv.org/abs/2512.24693)
Keywords: language model, llm
Abstract: Evaluating the quality of multi-turn conversations is crucial for developing capable Large Language Models (LLMs), yet remains a significant challenge, often requiring costly human evaluation. Multi-turn reward models (RMs) offer a scalable alternative and can provide valuable signals for guiding LLM training. While recent work has advanced multi-turn training techniques, effective automated evaluation specifically for multi-turn interactions lags behind. We observe that standard preference datasets, typically contrasting responses based only on the final conversational turn, provide insufficient signal to capture the nuances of multi-turn interactions. Instead, we find that incorporating contrasts spanning multiple turns is critical for building robust multi-turn RMs. Motivated by this finding, we propose MUlti-Step Instruction Contrast (MUSIC), an unsupervised data augmentation strategy that synthesizes contrastive conversation pairs exhibiting differences across multiple turns. Leveraging MUSIC on the Skywork preference dataset, we train a multi-turn RM based on the Gemma-2-9B-Instruct model. Empirical results demonstrate that our MUSIC-augmented RM outperforms baseline methods, achieving higher alignment with judgments from advanced proprietary LLM judges on multi-turn conversations, crucially, without compromising performance on standard single-turn RM benchmarks.

Title: BIOME-Bench: A Benchmark for Biomolecular Interaction Inference and Multi-Omics Pathway Mechanism Elucidation from Scientific Literature

Authors: Sibo Wei, Peng Chen, Lifeng Dong, Yin Luo, Lei Wang, Peng Zhang, Wenpeng Lu, Jianbin Guo, Hongjun Yang, Dajun Zeng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24733
Pdf URL: https://arxiv.org/pdf/2512.24733
Copy Paste: [[2512.24733]] BIOME-Bench: A Benchmark for Biomolecular Interaction Inference and Multi-Omics Pathway Mechanism Elucidation from Scientific Literature(https://arxiv.org/abs/2512.24733)
Keywords: language model, llm
Abstract: Multi-omics studies often rely on pathway enrichment to interpret heterogeneous molecular changes, but pathway enrichment (PE)-based workflows inherit structural limitations of pathway resources, including curation lag, functional redundancy, and limited sensitivity to molecular states and interventions. Although recent work has explored using large language models (LLMs) to improve PE-based interpretation, the lack of a standardized benchmark for end-to-end multi-omics pathway mechanism elucidation has largely confined evaluation to small, manually curated datasets or ad hoc case studies, hindering reproducible progress. To address this issue, we introduce BIOME-Bench, constructed via a rigorous four-stage workflow, to evaluate two core capabilities of LLMs in multi-omics analysis: Biomolecular Interaction Inference and end-to-end Multi-Omics Pathway Mechanism Elucidation. We develop evaluation protocols for both tasks and conduct comprehensive experiments across multiple strong contemporary models. Experimental results demonstrate that existing models still exhibit substantial deficiencies in multi-omics analysis, struggling to reliably distinguish fine-grained biomolecular relation types and to generate faithful, robust pathway-level mechanistic explanations.

Title: Compute-Accuracy Pareto Frontiers for Open-Source Reasoning Large Language Models

Authors: Ákos Prucs, Márton Csutora, Mátyás Antal, Márk Marosi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.24776
Pdf URL: https://arxiv.org/pdf/2512.24776
Copy Paste: [[2512.24776]] Compute-Accuracy Pareto Frontiers for Open-Source Reasoning Large Language Models(https://arxiv.org/abs/2512.24776)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are demonstrating rapid improvements on complex reasoning benchmarks, particularly when allowed to utilize intermediate reasoning steps before converging on a final solution. However, current literature often overlooks the significant computational burden associated with generating long reasoning sequences. For industrial applications, model selection depends not only on raw accuracy but also on resource constraints and inference costs. In this work, we conduct a test-time-compute aware evaluation of both contemporary and older open-source LLMs, mapping their Pareto frontiers across math- and reasoning-intensive benchmarks. Our findings identify the Mixture of Experts (MoE) architecture as a strong candidate to balance performance and efficiency in our evaluation setting. Furthermore, we trace the trajectory of Pareto efficiency over time to derive an emergent trend regarding accuracy gain per unit of compute. Finally, we demonstrate that there is a saturation point for inference-time compute. Beyond a certain threshold, accuracy gains diminish, indicating that while extended reasoning capabilities are beneficial, they cannot overcome intrinsic model limitations regarding specific complexities.
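As a minimal sketch of what mapping a compute-accuracy Pareto frontier involves, the snippet below keeps only the models that no other model beats on both axes (lower compute, higher accuracy). The model names and numbers are hypothetical placeholders, not results from the paper, and the paper's actual compute measure is not reproduced here.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (model name, compute cost, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points not dominated in (lower compute, higher accuracy)."""
    # Sort by compute ascending; on equal compute, put higher accuracy first.
    ordered = sorted(points, key=lambda p: (p[1], -p[2]))
    frontier: List[Point] = []
    best_acc = float("-inf")
    for name, compute, acc in ordered:
        # A point is on the frontier only if it beats every cheaper point.
        if acc > best_acc:
            frontier.append((name, compute, acc))
            best_acc = acc
    return frontier


# Hypothetical data for illustration only.
models = [
    ("dense-small", 1.0, 0.42),
    ("dense-large", 8.0, 0.61),
    ("moe-medium", 3.0, 0.63),
    ("dense-medium", 4.0, 0.55),
]
frontier = pareto_frontier(models)
print([name for name, _, _ in frontier])
```

In this toy data the two larger dense models drop off the frontier because a cheaper model already reaches higher accuracy.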

Title: Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability

Authors: Yanan Long
Subjects: cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2512.24842
Pdf URL: https://arxiv.org/pdf/2512.24842
Copy Paste: [[2512.24842]] Triangulation as an Acceptance Rule for Multilingual Mechanistic Interpretability(https://arxiv.org/abs/2512.24842)
Keywords: language model
Abstract: Multilingual language models achieve strong aggregate performance yet often behave unpredictably across languages, scripts, and cultures. We argue that mechanistic explanations for such models should satisfy a causal standard: claims must survive causal interventions and must cross-reference across environments that perturb surface form while preserving meaning. We formalize reference families as predicate-preserving variants and introduce triangulation, an acceptance rule requiring necessity (ablating the circuit degrades the target behavior), sufficiency (patching activations transfers the behavior), and invariance (both effects remain directionally stable and of sufficient magnitude across the reference family). To supply candidate subgraphs, we adopt automatic circuit discovery and accept or reject those candidates by triangulation. We ground triangulation in causal abstraction by casting it as an approximate transformation score over a distribution of interchange interventions, connect it to the pragmatic interpretability agenda, and present a comparative experimental protocol across multiple model families, language pairs, and tasks. Triangulation provides a falsifiable standard for mechanistic claims that filters spurious circuits passing single-environment tests but failing cross-lingual invariance.
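The acceptance rule described above can be sketched roughly as follows. This is an illustrative reading of the three conditions, not the paper's formal transformation score: the `Effect` record, the `triangulate` function, and the `min_effect` threshold are all hypothetical names and values introduced here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Effect:
    """Measured effects of one candidate circuit in one reference-family variant."""
    ablation_drop: float  # drop in the target metric when the circuit is ablated
    patch_gain: float     # gain in the target metric when its activations are patched in


def triangulate(effects: Sequence[Effect], min_effect: float = 0.1) -> bool:
    """Accept a circuit only if both effects hold in every variant.

    min_effect is an assumed magnitude threshold, not a value from the paper.
    """
    if not effects:
        return False
    # Necessity: ablation must degrade the behavior by at least min_effect.
    necessity = all(e.ablation_drop >= min_effect for e in effects)
    # Sufficiency: patching must transfer the behavior by at least min_effect.
    sufficiency = all(e.patch_gain >= min_effect for e in effects)
    # Invariance is enforced by requiring both checks across the whole family:
    # a positive threshold fixes the direction, its size fixes the magnitude.
    return necessity and sufficiency


stable = [Effect(0.30, 0.40), Effect(0.25, 0.35)]
spurious = [Effect(0.30, 0.40), Effect(-0.05, 0.02)]
print(triangulate(stable), triangulate(spurious))
```

The second candidate passes in its first environment but is rejected because its effects vanish in the second variant, which is the single-environment failure mode the abstract describes.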

Title: PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI

Authors: Srija Mukhopadhyay, Sathwik Reddy, Shruthi Muthukumar, Jisun An, Ponnurangam Kumaraguru
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24848
Pdf URL: https://arxiv.org/pdf/2512.24848
Copy Paste: [[2512.24848]] PrivacyBench: A Conversational Benchmark for Evaluating Privacy in Personalized AI(https://arxiv.org/abs/2512.24848)
Keywords: prompt, chat, retrieval-augmented generation, agent
Abstract: Personalized AI agents rely on access to a user's digital footprint, which often includes sensitive data from private emails, chats and purchase histories. Yet this access creates a fundamental societal and privacy risk: systems lacking social-context awareness can unintentionally expose user secrets, threatening digital well-being. We introduce PrivacyBench, a benchmark with socially grounded datasets containing embedded secrets and a multi-turn conversational evaluation to measure secret preservation. Testing Retrieval-Augmented Generation (RAG) assistants reveals that they leak secrets in up to 26.56% of interactions. A privacy-aware prompt lowers leakage to 5.12%, yet this measure offers only partial mitigation. The retrieval mechanism continues to access sensitive data indiscriminately, which shifts the entire burden of privacy preservation onto the generator. This creates a single point of failure, rendering current architectures unsafe for wide-scale deployment. Our findings underscore the urgent need for structural, privacy-by-design safeguards to ensure an ethical and inclusive web for everyone.

Title: Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements

Authors: Yiming Liang, Yizhi Li, Yantao Du, Ge Zhang, Jiayi Zhou, Yuchen Wu, Yinzhu Piao, Denghui Cao, Tong Sun, Ziniu Li, Li Du, Bo Lei, Jiaheng Liu, Chenghua Lin, Zhaoxiang Zhang, Wenhao Huang, Jiajun Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.24867
Pdf URL: https://arxiv.org/pdf/2512.24867
Copy Paste: [[2512.24867]] Encyclo-K: Evaluating LLMs with Dynamically Composed Knowledge Statements(https://arxiv.org/abs/2512.24867)
Keywords: language model, gpt, llm, chat
Abstract: Benchmarks play a crucial role in tracking the rapid advancement of large language models (LLMs) and identifying their capability boundaries. However, existing benchmarks predominantly curate questions at the question level, suffering from three fundamental limitations: vulnerability to data contamination, restriction to single-knowledge-point assessment, and reliance on costly domain expert annotation. We propose Encyclo-K, a statement-based benchmark that rethinks benchmark construction from the ground up. Our key insight is that knowledge statements, not questions, can serve as the unit of curation, and questions can then be constructed from them. We extract standalone knowledge statements from authoritative textbooks and dynamically compose them into evaluation questions through random sampling at test time. This design directly addresses all three limitations: the combinatorial space is too vast to memorize, and model rankings remain stable across dynamically generated question sets, enabling reliable periodic dataset refresh; each question aggregates 8-10 statements for comprehensive multi-knowledge assessment; annotators only verify formatting compliance without requiring domain expertise, substantially reducing annotation costs. Experiments on over 50 LLMs demonstrate that Encyclo-K poses substantial challenges with strong discriminative power. Even the top-performing OpenAI-GPT-5.1 achieves only 62.07% accuracy, and model performance displays a clear gradient distribution: reasoning models span from 16.04% to 62.07%, while chat models range from 9.71% to 50.40%. These results validate the challenges introduced by dynamic evaluation and multi-statement comprehensive understanding. These findings establish Encyclo-K as a scalable framework for dynamic evaluation of LLMs' comprehensive understanding over multiple fine-grained disciplinary knowledge statements.
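To make the idea of composing questions at test time concrete, here is a small sketch that samples 8 to 10 statements from a pool and records which of them are correct. Only the 8-10 statement count and the random sampling come from the abstract; the question layout, the true/false labels, and the `compose_question` helper are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Statement = Tuple[str, bool]  # (statement text, whether it is correct)


def compose_question(pool: List[Statement], rng: random.Random) -> Dict[str, list]:
    """Build one question by sampling 8-10 distinct statements from the pool."""
    k = rng.randint(8, 10)
    sampled = rng.sample(pool, k)  # sampling without replacement
    return {
        "statements": [text for text, _ in sampled],
        # Assumed answer format: indices of the correct statements.
        "answer": [i for i, (_, correct) in enumerate(sampled) if correct],
    }


# Hypothetical statement pool; real statements would come from textbooks.
pool = [(f"Statement {i}", i % 3 != 0) for i in range(40)]
q1 = compose_question(pool, random.Random(0))
q2 = compose_question(pool, random.Random(1))
print(len(q1["statements"]), len(q2["statements"]))
```

Passing a seeded `random.Random` keeps each generated question set reproducible while still letting a fresh seed produce a new set for a periodic refresh.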

Title: BEDA: Belief Estimation as Probabilistic Constraints for Performing Strategic Dialogue Acts

Authors: Hengli Li, Zhaoxin Yu, Qi Shen, Chenxi Li, Mengmeng Wang, Tinglang Wu, Yipeng Kang, Yuxuan Wang, Song-Chun Zhu, Zixia Jia, Zilong Zheng
Subjects: cs.CL, cs.GT, cs.MA
Abstract URL: https://arxiv.org/abs/2512.24885
Pdf URL: https://arxiv.org/pdf/2512.24885
Copy Paste: [[2512.24885]] BEDA: Belief Estimation as Probabilistic Constraints for Performing Strategic Dialogue Acts(https://arxiv.org/abs/2512.24885)
Keywords: gpt, agent
Abstract: Strategic dialogue requires agents to execute distinct dialogue acts, for which belief estimation is essential. While prior work often estimates beliefs accurately, it lacks a principled mechanism to use those beliefs during generation. We bridge this gap by first formalizing two core acts, Adversarial and Alignment, and then operationalizing them via probabilistic constraints on what an agent may generate. We instantiate this idea in BEDA, a framework that consists of the world set, the belief estimator for belief estimation, and the conditional generator that selects acts and realizes utterances consistent with the inferred beliefs. Across three settings, Conditional Keeper Burglar (CKBG, adversarial), Mutual Friends (MF, cooperative), and CaSiNo (negotiation), BEDA consistently outperforms strong baselines: on CKBG it improves success rate by at least 5.0 points across backbones and by 20.6 points with GPT-4.1-nano; on Mutual Friends it achieves an average improvement of 9.3 points; and on CaSiNo it achieves the optimal deal relative to all baselines. These results indicate that casting belief estimation as constraints provides a simple, general mechanism for reliable strategic dialogue.

Title: Adaptive Dependency-aware Prompt Optimization Framework for Multi-Step LLM Pipeline

Authors: Minjun Zhao, Xinyu Zhang, Shuai Zhang, Deyang Li, Ruifeng Shi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.24933
Pdf URL: https://arxiv.org/pdf/2512.24933
Copy Paste: [[2512.24933]] Adaptive Dependency-aware Prompt Optimization Framework for Multi-Step LLM Pipeline(https://arxiv.org/abs/2512.24933)
Keywords: language model, llm, prompt
Abstract: Multi-step LLM pipelines invoke large language models multiple times in a structured sequence and can effectively solve complex tasks, but their performance heavily depends on the prompts used at each step. Jointly optimizing these prompts is difficult due to missing step-level supervision and inter-step dependencies. Existing end-to-end prompt optimization methods struggle under these conditions and often yield suboptimal or unstable updates. We propose ADOPT, an Adaptive Dependency-aware Prompt Optimization framework for multi-step LLM pipelines. ADOPT explicitly models the dependency between each LLM step and the final task outcome, enabling precise text-gradient estimation analogous to computing analytical derivatives. It decouples textual gradient estimation from gradient updates, reducing multi-prompt optimization to flexible single-prompt optimization steps, and employs a Shapley-based mechanism to adaptively allocate optimization resources. Experiments on real-world datasets and diverse pipeline structures show that ADOPT is effective and robust, consistently outperforming state-of-the-art prompt optimization baselines.
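The abstract does not spell out the Shapley-based mechanism, so the following is only a generic sketch of how Shapley values could apportion an optimization budget across pipeline steps. The step names, the score table, and the proportional `allocate` rule are hypothetical and are not taken from ADOPT.

```python
from itertools import permutations
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    players: List[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orders = list(permutations(players))
    for order in orders:
        coalition: FrozenSet[str] = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orders) for p, t in totals.items()}


def allocate(budget: float, phi: Dict[str, float]) -> Dict[str, float]:
    """Split the budget in proportion to each step's non-negative Shapley value."""
    total = sum(max(v, 0.0) for v in phi.values())
    if total == 0.0:
        return {p: budget / len(phi) for p in phi}
    return {p: budget * max(v, 0.0) / total for p, v in phi.items()}


# Hypothetical pipeline score when the prompts of the given steps are optimized.
scores = {
    frozenset(): 0.40,
    frozenset({"retrieve"}): 0.50,
    frozenset({"reason"}): 0.45,
    frozenset({"retrieve", "reason"}): 0.65,
}
phi = shapley_values(["retrieve", "reason"], lambda s: scores[s])
plan = allocate(10.0, phi)
print(phi, plan)
```

Exact enumeration grows factorially with the number of steps, so a pipeline with many steps would need a sampled approximation instead.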

Title: MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes

Authors: Siddhant Agarwal, Adya Dhuler, Polly Ruhnke, Melvin Speisman, Md Shad Akhtar, Shweta Yadav
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.25015
Pdf URL: https://arxiv.org/pdf/2512.25015
Copy Paste: [[2512.25015]] MAMA-Memeia! Multi-Aspect Multi-Agent Collaboration for Depressive Symptoms Identification in Memes(https://arxiv.org/abs/2512.25015)
Keywords: language model, llm, agent
Abstract: Over the past years, memes have evolved from being exclusively a medium of humorous exchanges to one that allows users to express a range of emotions freely and easily. With the ever-growing use of memes to express depressive sentiments, we conduct a study on identifying depressive symptoms exhibited by memes shared by users of online social media platforms. We introduce RESTOREx as a vital resource for detecting depressive symptoms in memes on social media through Large Language Model (LLM)-generated and human-annotated explanations. We introduce MAMAMemeia, a collaborative multi-agent multi-aspect discussion framework grounded in the clinical psychology method of Cognitive Analytic Therapy (CAT) Competencies. MAMAMemeia improves upon the current state-of-the-art by 7.55% in macro-F1 and sets a new benchmark in a comparison against over 30 methods.

Title: Modeling Language as a Sequence of Thoughts

Authors: Nasim Borazjanizadeh, James McClelland
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.25026
Pdf URL: https://arxiv.org/pdf/2512.25026
Copy Paste: [[2512.25026]] Modeling Language as a Sequence of Thoughts(https://arxiv.org/abs/2512.25026)
Keywords: language model, gpt
Abstract: Transformer language models can generate strikingly natural text by modeling language as a sequence of tokens. Yet, by relying primarily on surface-level co-occurrence statistics, they fail to form globally consistent latent representations of entities and events, the lack of which contributes to brittleness in relational direction (e.g., reversal curse), contextualization errors, and data inefficiency. On the other hand, cognitive science shows that human comprehension involves converting the input linguistic stream into compact, event-like representations that persist in memory while verbatim form is short-lived. Motivated by this view, we introduce the Thought Gestalt (TG) model, a recurrent Transformer that models language at two levels of abstraction: tokens and sentence-level "thought" states. TG generates the tokens of one sentence at a time while cross-attending to a memory of prior sentence representations. In TG, token and sentence representations are generated using the same set of model parameters and trained with a single objective, the next-token cross-entropy: by retaining the computation graph of sentence representations written to memory, gradients from future token losses flow backward through cross-attention to optimize the parameters generating earlier sentence vectors. In scaling experiments, TG consistently improves efficiency over matched GPT-2 runs, among other baselines, with scaling fits indicating GPT-2 requires ~5-8% more data and ~33-42% more parameters to match TG's loss. TG also reduces errors on relational direction generalization on a father-son reversal curse probe.

Title: AdaGReS: Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG

Authors: Chao Peng, Bin Wang, Zhilei Long, Jinfang Sheng
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2512.25052
Pdf URL: https://arxiv.org/pdf/2512.25052
Copy Paste: [[2512.25052]] AdaGReS:Adaptive Greedy Context Selection via Redundancy-Aware Scoring for Token-Budgeted RAG(https://arxiv.org/abs/2512.25052)
Keywords: retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) is highly sensitive to the quality of selected context, yet standard top-k retrieval often returns redundant or near-duplicate chunks that waste token budget and degrade downstream generation. We present AdaGReS, a redundancy-aware context selection framework for token-budgeted RAG that optimizes a set-level objective combining query-chunk relevance and intra-set redundancy penalties. AdaGReS performs greedy selection under a token-budget constraint using marginal gains derived from the objective, and introduces a closed-form, instance-adaptive calibration of the relevance-redundancy trade-off parameter to eliminate manual tuning and adapt to candidate-pool statistics and budget limits. We further provide a theoretical analysis showing that the proposed objective exhibits epsilon-approximate submodularity under practical embedding similarity conditions, yielding near-optimality guarantees for greedy selection. Experiments on open-domain question answering (Natural Questions) and a high-redundancy biomedical (drug) corpus demonstrate consistent improvements in redundancy control and context quality, translating to better end-to-end answer quality and robustness across settings.
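A rough sketch of redundancy-aware greedy selection under a token budget is shown below. It assumes a set-level objective of summed relevance minus `beta` times pairwise similarity within the selected set, which is one plausible reading of the abstract. The fixed `beta` stands in for the paper's closed-form adaptive calibration, which is not reproduced, and all scores are hypothetical.

```python
from typing import List, Optional, Sequence


def greedy_select(
    relevance: Sequence[float],
    tokens: Sequence[int],
    similarity: Sequence[Sequence[float]],
    budget: int,
    beta: float,
) -> List[int]:
    """Greedily add the chunk with the largest positive marginal gain that fits."""
    selected: List[int] = []
    used = 0
    remaining = set(range(len(relevance)))
    while remaining:
        best: Optional[int] = None
        best_gain = 0.0
        for i in sorted(remaining):
            if used + tokens[i] > budget:
                continue  # chunk does not fit in the remaining token budget
            # Marginal gain: relevance minus redundancy against chosen chunks.
            gain = relevance[i] - beta * sum(similarity[i][j] for j in selected)
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break  # nothing fits, or no remaining chunk improves the objective
        selected.append(best)
        used += tokens[best]
        remaining.remove(best)
    return selected


# Hypothetical candidates: chunks 0 and 1 are near-duplicates of each other.
relevance = [0.90, 0.85, 0.60]
tokens = [100, 100, 100]
similarity = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
chosen = greedy_select(relevance, tokens, similarity, budget=200, beta=0.8)
print(chosen)
```

Plain top-k by relevance would spend the whole budget on the two near-duplicate chunks, whereas the penalized gain prefers the less relevant but non-redundant third chunk.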