2025-12-02

Title: Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis

Authors: Angelina Parfenova, Andreas Marfurt, Alexander Denzler, Juergen Pfeffer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.00046
Pdf URL: https://arxiv.org/pdf/2512.00046
Copy Paste: [[2512.00046]] Text Annotation via Inductive Coding: Comparing Human Experts to LLMs in Qualitative Data Analysis(https://arxiv.org/abs/2512.00046)
Keywords: language model, llm
Abstract: This paper investigates the automation of qualitative data analysis, focusing on inductive coding using large language models (LLMs). Unlike traditional approaches that rely on deductive methods with predefined labels, this research investigates the inductive process where labels emerge from the data. The study evaluates the performance of six open-source LLMs compared to human experts. As part of the evaluation, experts rated the perceived difficulty of the quotes they coded. The results reveal a peculiar dichotomy: human coders consistently perform well when labeling complex sentences but struggle with simpler ones, while LLMs exhibit the opposite trend. Additionally, the study explores systematic deviations in both human and LLM generated labels by comparing them to the golden standard from the test set. While human annotations may sometimes differ from the golden standard, they are often rated more favorably by other humans. In contrast, some LLMs demonstrate closer alignment with the true labels but receive lower evaluations from experts.

Title: Emergent Convergence in Multi-Agent LLM Annotation

Authors: Angelina Parfenova, Alexander Denzler, Juergen Pfeffer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.00047
Pdf URL: https://arxiv.org/pdf/2512.00047
Copy Paste: [[2512.00047]] Emergent Convergence in Multi-Agent LLM Annotation(https://arxiv.org/abs/2512.00047)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) are increasingly deployed in collaborative settings, yet little is known about how they coordinate when treated as black-box agents. We simulate 7500 multi-agent, multi-round discussions in an inductive coding task, generating over 125000 utterances that capture both final annotations and their interactional histories. We introduce process-level metrics (code stability, semantic self-consistency, and lexical confidence) alongside sentiment and convergence measures to track coordination dynamics. To probe deeper alignment signals, we analyze the evolving geometry of output embeddings, showing that intrinsic dimensionality declines over rounds, suggesting semantic compression. The results reveal that LLM groups converge lexically and semantically, develop asymmetric influence patterns, and exhibit negotiation-like behaviors despite the absence of explicit role prompting. This work demonstrates how black-box interaction analysis can surface emergent coordination strategies, offering a scalable complement to internal probe-based interpretability methods.

Title: Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis

Authors: Matej Klemen, Tjaša Arčon, Luka Terčon, Marko Robnik-Šikonja, Kaja Dobrovoljc
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00214
Pdf URL: https://arxiv.org/pdf/2512.00214
Copy Paste: [[2512.00214]] Towards Corpus-Grounded Agentic LLMs for Multilingual Grammatical Analysis(https://arxiv.org/abs/2512.00214)
Keywords: language model, llm, agent
Abstract: Empirical grammar research has become increasingly data-driven, but the systematic analysis of annotated corpora still requires substantial methodological and technical effort. We explore how agentic large language models (LLMs) can streamline this process by reasoning over annotated corpora and producing interpretable, data-grounded answers to linguistic questions. We introduce an agentic framework for corpus-grounded grammatical analysis that integrates concepts such as natural-language task interpretation, code generation, and data-driven reasoning. As a proof of concept, we apply it to Universal Dependencies (UD) corpora, testing it on multilingual grammatical tasks inspired by the World Atlas of Language Structures (WALS). The evaluation spans 13 word-order features and over 170 languages, assessing system performance across three complementary dimensions - dominant-order accuracy, order-coverage completeness, and distributional fidelity - which reflect how well the system generalizes, identifies, and quantifies word-order variations. The results demonstrate the feasibility of combining LLM reasoning with structured linguistic data, offering a first step toward interpretable, scalable automation of corpus-based grammatical inquiry.
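
Note: the abstract names three evaluation dimensions without defining them. The sketch below is one plausible reading, not the authors' method: dominant-order accuracy as an argmax match, order-coverage completeness as the share of attested gold orders that the system also reports, and distributional fidelity as one minus total variation distance. All function names and the toy distributions are hypothetical.

```python
from typing import Dict

Distribution = Dict[str, float]  # word order label -> relative frequency


def dominant_order(dist: Distribution) -> str:
    """Return the most frequent order in a distribution."""
    return max(dist, key=dist.get)


def order_coverage(pred: Distribution, gold: Distribution) -> float:
    """Share of attested gold orders that also appear in the prediction (assumed definition)."""
    gold_orders = {o for o, p in gold.items() if p > 0}
    pred_orders = {o for o, p in pred.items() if p > 0}
    return len(gold_orders & pred_orders) / len(gold_orders)


def distributional_fidelity(pred: Distribution, gold: Distribution) -> float:
    """One minus total variation distance between the two distributions (assumed definition)."""
    orders = set(pred) | set(gold)
    tv = 0.5 * sum(abs(pred.get(o, 0.0) - gold.get(o, 0.0)) for o in orders)
    return 1.0 - tv


# Toy example: gold data attests two subject-verb orders, the system reports only one.
gold = {"SV": 0.75, "VS": 0.25}
pred = {"SV": 1.0}

dominant_match = dominant_order(pred) == dominant_order(gold)
coverage = order_coverage(pred, gold)
fidelity = distributional_fidelity(pred, gold)
print(dominant_match, coverage, fidelity)
```

In this toy case the dominant order is correct, yet coverage and fidelity are penalised because the minority order is missed, which is why the three dimensions are complementary.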

Title: Minimal-Edit Instruction Tuning for Low-Resource Indic GEC

Authors: Akhil Rajeev P
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00219
Pdf URL: https://arxiv.org/pdf/2512.00219
Copy Paste: [[2512.00219]] Minimal-Edit Instruction Tuning for Low-Resource Indic GEC(https://arxiv.org/abs/2512.00219)
Keywords: language model, prompt
Abstract: Grammatical error correction for Indic languages faces limited supervision, diverse scripts, and rich morphology. We propose an augmentation-free setup that uses instruction-tuned large language models and conservative decoding. A 12B GEMMA 3 model is instruction-tuned in bnb 4-bit precision with parameter-efficient fine-tuning (PEFT) and Alpaca-style formatting. Decoding follows a deterministic, constraint-aware procedure with a lightweight normaliser that encourages minimal, meaning-preserving edits. We operationalise inference, subsequent to instruction fine-tuning (IFT), via a fixed, language-specific prompt directly synthesised from a deterministic error classifier's taxonomy, label distributions, and precedence ordering computed on the training corpus. Under the official untuned GLEU evaluation, the system scores 92.41 on Malayalam, sixth overall, and 81.44 on Hindi, third overall. These results indicate that classifier-informed prompt design, adapter-based instruction tuning, and deterministic decoding provide a reproducible and computationally efficient alternative to augmentation-centred pipelines for Indic GEC. The approach also motivates future work on stronger morphosyntactic constraints and human-centred evaluation of conservative edits.

Title: OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion

Authors: Sai Koneru, Matthias Huck, Jan Niehues
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.00234
Pdf URL: https://arxiv.org/pdf/2512.00234
Copy Paste: [[2512.00234]] OmniFusion: Simultaneous Multilingual Multimodal Translations via Modular Fusion(https://arxiv.org/abs/2512.00234)
Keywords: language model, llm
Abstract: There has been significant progress in open-source text-only translation large language models (LLMs) with better language coverage and quality. However, these models can only be used in cascaded pipelines for speech translation (ST), performing automatic speech recognition first, followed by translation. This introduces additional latency, which is particularly critical in simultaneous ST (SimulST), and prevents the model from exploiting multimodal context, such as images, which can aid disambiguation. Pretrained multimodal foundation models (MMFMs) already possess strong perception and reasoning capabilities across multiple modalities, but generally lack the multilingual coverage and specialized translation performance of dedicated translation LLMs. To build an effective multimodal translation system, we propose an end-to-end approach that fuses MMFMs with translation LLMs. We introduce a novel fusion strategy that connects hidden states from multiple layers of a pretrained MMFM to a translation LLM, enabling joint end-to-end training. The resulting model, OmniFusion, built on Omni 2.5-7B as the MMFM and SeedX PPO-7B as the translation LLM, can perform speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation. Experiments demonstrate that OmniFusion effectively leverages both audio and visual inputs, achieves a 1-second latency reduction in SimulST compared to cascaded pipelines, and also improves the overall translation quality. Code is available at this https URL.

Title: Lost without translation -- Can transformer (language models) understand mood states?

Authors: Prakrithi Shivaprakash, Diptadhi Mukherjee, Lekhansh Shukla, Animesh Mukherjee, Prabhat Chand, Pratima Murthy
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00274
Pdf URL: https://arxiv.org/pdf/2512.00274
Copy Paste: [[2512.00274]] Lost without translation -- Can transformer (language models) understand mood states?(https://arxiv.org/abs/2512.00274)
Keywords: language model
Abstract: Background: Large Language Models show promise in psychiatry but are English-centric. Their ability to understand mood states in other languages is unclear, as different languages have their own idioms of distress. Aim: To quantify the ability of language models to faithfully represent phrases (idioms of distress) of four distinct mood states (depression, euthymia, euphoric mania, dysphoric mania) expressed in Indian languages. Methods: We collected 247 unique phrases for the four mood states across 11 Indic languages. We tested seven experimental conditions, comparing k-means clustering performance on: (a) direct embeddings of native and Romanised scripts (using multilingual and Indic-specific models) and (b) embeddings of phrases translated to English and Chinese. Performance was measured using a composite score based on Adjusted Rand Index, Normalised Mutual Information, Homogeneity and Completeness. Results: Direct embedding of Indic languages failed to cluster mood states (Composite Score = 0.002). All translation-based approaches showed significant improvement. High performance was achieved using Gemini-translated English (Composite=0.60) and human-translated English (Composite=0.61) embedded with gemini-001. Surprisingly, human-translated English, further translated into Chinese and embedded with a Chinese model, performed best (Composite = 0.67). Specialised Indic models (IndicBERT and Sarvam-M) performed poorly. Conclusion: Current models cannot meaningfully represent mood states directly from Indic languages, posing a fundamental barrier to their psychiatric application for diagnostic or therapeutic purposes in India. While high-quality translation bridges this gap, reliance on proprietary models or complex translation pipelines is unsustainable. Models must first be built to understand diverse local languages to be effective in global mental health.
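
Note: the abstract says the composite score is based on Adjusted Rand Index, Normalised Mutual Information, Homogeneity and Completeness, but does not say how they are combined. The sketch below assumes an unweighted mean, which may differ from the authors' formula. It uses only the Python standard library, assumes at least two labelled items, and uses arithmetic-mean normalisation for NMI; the function names are illustrative.

```python
from collections import Counter
from math import comb, log


def _entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def _mutual_info(truth, pred):
    """Mutual information between two labelings of the same items."""
    n = len(truth)
    joint = Counter(zip(truth, pred))
    ct, cp = Counter(truth), Counter(pred)
    return sum((c / n) * log(c * n / (ct[t] * cp[p])) for (t, p), c in joint.items())


def adjusted_rand_index(truth, pred):
    """Chance-corrected pair-counting agreement between two labelings."""
    n = len(truth)
    sum_joint = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_t = sum(comb(c, 2) for c in Counter(truth).values())
    sum_p = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_t * sum_p / comb(n, 2)
    max_index = (sum_t + sum_p) / 2
    if max_index == expected:  # degenerate case, e.g. a single cluster on both sides
        return 1.0
    return (sum_joint - expected) / (max_index - expected)


def composite_score(truth, pred):
    """Unweighted mean of ARI, NMI, homogeneity and completeness (assumed combination)."""
    h_t, h_p = _entropy(truth), _entropy(pred)
    mi = _mutual_info(truth, pred)
    homogeneity = mi / h_t if h_t > 0 else 1.0
    completeness = mi / h_p if h_p > 0 else 1.0
    nmi = 2 * mi / (h_t + h_p) if (h_t + h_p) > 0 else 1.0
    ari = adjusted_rand_index(truth, pred)
    return (ari + nmi + homogeneity + completeness) / 4


# Toy mood-state labels (0 and 1) against two hypothetical k-means assignments.
truth = [0, 0, 1, 1]
perfect = composite_score(truth, [1, 1, 0, 0])    # same partition, cluster ids swapped
unrelated = composite_score(truth, [0, 1, 0, 1])  # partition independent of the labels
print(perfect, unrelated)
```

A cluster assignment that recovers the mood states up to relabelling scores 1, while an assignment unrelated to the labels scores at or below zero, which gives some intuition for the reported gap between 0.002 and 0.60 to 0.67.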

Title: EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education

Authors: Guoqing Ma, Jia Zhu, Hanghui Guo, Weijie Shi, Yue Cui, Jiawei Shen, Zilong Li, Yidan Liang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.00290
Pdf URL: https://arxiv.org/pdf/2512.00290
Copy Paste: [[2512.00290]] EduEval: A Hierarchical Cognitive Benchmark for Evaluating Large Language Models in Chinese Education(https://arxiv.org/abs/2512.00290)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) demonstrate significant potential for educational applications. However, their unscrutinized deployment poses risks to educational standards, underscoring the need for rigorous evaluation. We introduce EduEval, a comprehensive hierarchical benchmark for evaluating LLMs in Chinese K-12 education. This benchmark makes three key contributions: (1) Cognitive Framework: We propose the EduAbility Taxonomy, which unifies Bloom's Taxonomy and Webb's Depth of Knowledge to organize tasks across six cognitive dimensions including Memorization, Understanding, Application, Reasoning, Creativity, and Ethics. (2) Authenticity: Our benchmark integrates real exam questions, classroom conversation, student essays, and expert-designed prompts to reflect genuine educational challenges; (3) Scale: EduEval comprises 24 distinct task types with over 11,000 questions spanning primary to high school levels. We evaluate 14 leading LLMs under both zero-shot and few-shot settings, revealing that while models perform well on factual tasks, they struggle with classroom dialogue classification and exhibit inconsistent results in creative content generation. Interestingly, several open source models outperform proprietary systems on complex educational reasoning. Few-shot prompting shows varying effectiveness across cognitive dimensions, suggesting that different educational objectives require tailored approaches. These findings provide targeted benchmarking metrics for developing LLMs specifically optimized for diverse Chinese educational tasks.

Title: Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents

Authors: Daud Waqas, Aaryamaan Golthi, Erika Hayashida, Huanzhi Mao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.00332
Pdf URL: https://arxiv.org/pdf/2512.00332
Copy Paste: [[2512.00332]] Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents(https://arxiv.org/abs/2512.00332)
Keywords: llm, agent
Abstract: Multi-turn tool-calling LLMs (models capable of invoking external APIs or tools across several user turns) have emerged as a key feature in modern AI assistants, enabling extended dialogues from benign tasks to critical business, medical, and financial operations. Yet implementing multi-turn pipelines remains difficult for many safety-critical industries due to ongoing concerns regarding model resilience. While standardized benchmarks such as the Berkeley Function-Calling Leaderboard (BFCL) have underpinned confidence concerning advanced function-calling models (like Salesforce's xLAM V2), there is still a lack of visibility into multi-turn conversation-level robustness, especially given their exposure to real-world systems. In this paper, we introduce Assertion-Conditioned Compliance (A-CC), a novel evaluation paradigm for multi-turn function-calling dialogues. A-CC provides holistic metrics that evaluate a model's behavior when confronted with misleading assertions originating from two distinct vectors: (1) user-sourced assertions (USAs), which measure sycophancy toward plausible but misinformed user beliefs, and (2) function-sourced assertions (FSAs), which measure compliance with plausible but contradictory system policies (e.g., stale hints from unmaintained tools). Our results show that models are highly vulnerable to both USA sycophancy and FSA policy conflicts, confirming A-CC as a critical, latent vulnerability in deployed agents.

Title: IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages

Authors: Ayush Maheshwari, Kaushal Sharma, Vivek Patel, Aditya Maheshwari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00333
Pdf URL: https://arxiv.org/pdf/2512.00333
Copy Paste: [[2512.00333]] IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages(https://arxiv.org/abs/2512.00333)
Keywords: language model, gpt, llm
Abstract: While large language models excel on high-resource multilingual tasks, low- and extremely low-resource Indic languages remain severely under-evaluated. We present IndicParam, a human-curated benchmark of over 13,000 multiple-choice questions covering 11 such languages (Nepali, Gujarati, Marathi, Odia as low-resource; Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, Konkani as extremely low-resource) plus a Sanskrit-English code-mixed set. We evaluated 19 LLMs, both proprietary and open-weight, and found that even the top-performing GPT-5 reaches only 45.0% average accuracy, followed by DeepSeek-3.2 (43.1%) and Claude-4.5 (42.7%). We additionally label each question as knowledge-oriented or purely linguistic to discriminate factual recall from grammatical proficiency. Further, we assess the ability of LLMs to handle diverse question formats (such as list-based matching, assertion-reason pairs, and sequence ordering) alongside conventional multiple-choice questions. IndicParam provides insights into limitations of cross-lingual transfer and establishes a challenging benchmark for Indic languages. The dataset is available at this https URL. Scripts to run the benchmark are available at this https URL.

Title: Mitigating the Threshold Priming Effect in Large Language Model-Based Relevance Judgments via Personality Infusing

Authors: Nuo Chen, Hanpei Fang, Jiqun Liu, Wilson Wei, Tetsuya Sakai, Xiao-Ming Wu
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2512.00390
Pdf URL: https://arxiv.org/pdf/2512.00390
Copy Paste: [[2512.00390]] Mitigating the Threshold Priming Effect in Large Language Model-Based Relevance Judgments via Personality Infusing(https://arxiv.org/abs/2512.00390)
Keywords: language model, llm, prompt
Abstract: Recent research has explored LLMs as scalable tools for relevance labeling, but studies indicate they are susceptible to priming effects, where prior relevance judgments influence later ones. Although psychological theories link personality traits to such biases, it is unclear whether simulated personalities in LLMs exhibit similar effects. We investigate how Big Five personality profiles in LLMs influence priming in relevance labeling, using multiple LLMs on TREC 2021 and 2022 Deep Learning Track datasets. Our results show that certain profiles, such as High Openness and Low Neuroticism, consistently reduce priming susceptibility. Additionally, the most effective personality in mitigating priming may vary across models and task types. Based on these findings, we propose personality prompting as a method to mitigate threshold priming, connecting psychological evidence with LLM-based evaluation practices.

Title: A Taxonomy of Errors in English as she is spoke: Toward an AI-Based Method of Error Analysis for EFL Writing Instruction

Authors: Damian Heywood, Joseph Andrew Carrier, Kyu-Hong Hwang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00392
Pdf URL: https://arxiv.org/pdf/2512.00392
Copy Paste: [[2512.00392]] A Taxonomy of Errors in English as she is spoke: Toward an AI-Based Method of Error Analysis for EFL Writing Instruction(https://arxiv.org/abs/2512.00392)
Keywords: language model, llm
Abstract: This study describes the development of an AI-assisted error analysis system designed to identify, categorize, and correct writing errors in English. Utilizing Large Language Models (LLMs) like Claude 3.5 Sonnet and DeepSeek R1, the system employs a detailed taxonomy grounded in linguistic theories from Corder (1967), Richards (1971), and James (1998). Errors are classified at both word and sentence levels, covering spelling, grammar, and punctuation. Implemented through Python-coded API calls, the system provides granular feedback beyond traditional rubric-based assessments. Initial testing on isolated errors refined the taxonomy, addressing challenges like overlapping categories. Final testing used "English as she is spoke" by Jose da Fonseca (1855), a text rich with authentic linguistic errors, to evaluate the system's capacity for handling complex, multi-layered analysis. The AI successfully identified diverse error types but showed limitations in contextual understanding and occasionally generated new error categories when encountering uncoded errors. This research demonstrates AI's potential to transform EFL instruction by automating detailed error analysis and feedback. While promising, further development is needed to improve contextual accuracy and expand the taxonomy to stylistic and discourse-level errors.

Title: CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency

Authors: Jiacheng Guo, Suozhi Huang, Zixin Yao, Yifan Zhang, Yifu Lu, Jiashuo Liu, Zihao Li, Yanyan Deng, Qixin Xiao, Jia Tian, Kanghong Zhan, Tianyi Li, Xiaochen Liu, Jason Ge, Chaoyang He, Kaixuan Huang, Lin Yang, Wenhao Huang, Mengdi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00417
Pdf URL: https://arxiv.org/pdf/2512.00417
Copy Paste: [[2512.00417]] CryptoBench: A Dynamic Benchmark for Expert-Level Evaluation of LLM Agents in Cryptocurrency(https://arxiv.org/abs/2512.00417)
Keywords: language model, llm, agent
Abstract: This paper introduces CryptoBench, the first expert-curated, dynamic benchmark designed to rigorously evaluate the real-world capabilities of Large Language Model (LLM) agents in the uniquely demanding and fast-paced cryptocurrency domain. Unlike general-purpose agent benchmarks for search and prediction, professional crypto analysis presents specific challenges: extreme time-sensitivity, a highly adversarial information environment, and the critical need to synthesize data from diverse, specialized sources, such as on-chain intelligence platforms and real-time Decentralized Finance (DeFi) dashboards. CryptoBench thus serves as a much more challenging and valuable scenario for LLM agent assessment. To address these challenges, we constructed a live, dynamic benchmark featuring 50 questions per month, expertly designed by crypto-native professionals to mirror actual analyst workflows. These tasks are rigorously categorized within a four-quadrant system: Simple Retrieval, Complex Retrieval, Simple Prediction, and Complex Prediction. This granular categorization enables a precise assessment of an LLM agent's foundational data-gathering capabilities alongside its advanced analytical and forecasting skills. Our evaluation of ten LLMs, both directly and within an agentic framework, reveals a performance hierarchy and uncovers a failure mode. We observe a retrieval-prediction imbalance, where many leading models, despite being proficient at data retrieval, demonstrate a pronounced weakness in tasks requiring predictive analysis. This highlights a problematic tendency for agents to appear factually grounded while lacking the deeper analytical capabilities to synthesize information.

Title: SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling

Authors: Yang Xiao, Chunpu Xu, Ruifeng Yuan, Jiashuo Wang, Wenjie Li, Pengfei Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.00466
Pdf URL: https://arxiv.org/pdf/2512.00466
Copy Paste: [[2512.00466]] SCALE: Selective Resource Allocation for Overcoming Performance Bottlenecks in Mathematical Test-time Scaling(https://arxiv.org/abs/2512.00466)
Keywords: language model, llm
Abstract: Test-time compute scaling has emerged as a powerful paradigm for enhancing mathematical reasoning in large language models (LLMs) by allocating additional computational resources during inference. However, current methods employ uniform resource distribution across all reasoning sub-problems, creating fundamental bottlenecks where challenging sub-problems receive insufficient attention while routine operations consume disproportionate resources. This uniform allocation creates performance bottlenecks where additional computational resources yield diminishing returns. Inspired by dual-process theory, we propose SCALE (Selective Resource Allocation), a framework that selectively allocates computational resources based on sub-problem difficulty. SCALE operates through four stages: (1) problem decomposition into sequential reasoning sub-problems, (2) difficulty assessment of each sub-problem to distinguish between routine operations and computationally challenging sub-problems, (3) selective processing mode assignment between System 1 for simple sub-problems and System 2 for complex ones, and (4) sequential execution with context propagation. By concentrating resources on challenging sub-problems while processing routine operations efficiently, SCALE achieves substantial performance improvements with superior resource utilization. Extensive experiments demonstrate that SCALE significantly outperforms uniform scaling baselines, achieving accuracy improvements of up to 13.75 percentage points (57.50% to 71.25% on AIME25) while reducing computational costs by 33%-53%, representing a major advance in test-time scaling that addresses fundamental limitations of current approaches.
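
Note: the four stages listed in the abstract can be sketched as a small control loop. This is an illustration of the described flow only, not the authors' implementation. The decomposer, difficulty scorer, both solvers, and the 0.5 threshold are hypothetical stand-ins that would be LLM calls in practice.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    sub_problem: str
    mode: str    # "system1" (cheap) or "system2" (expensive)
    answer: str


def run_selective_pipeline(
    problem: str,
    decompose: Callable[[str], List[str]],
    assess_difficulty: Callable[[str, List[str]], float],
    fast_solver: Callable[[str, List[str]], str],
    slow_solver: Callable[[str, List[str]], str],
    threshold: float = 0.5,  # hypothetical cut-off between routine and hard steps
) -> List[StepResult]:
    context: List[str] = []           # answers so far, passed to later steps
    results: List[StepResult] = []
    for sub in decompose(problem):                  # stage 1: decomposition
        difficulty = assess_difficulty(sub, context)  # stage 2: difficulty assessment
        if difficulty >= threshold:                 # stage 3: mode assignment
            mode, solver = "system2", slow_solver
        else:
            mode, solver = "system1", fast_solver
        answer = solver(sub, context)               # stage 4: sequential execution
        context.append(answer)                      # context propagation
        results.append(StepResult(sub, mode, answer))
    return results


# Toy stand-ins so the sketch runs without a model.
def toy_decompose(problem: str) -> List[str]:
    return [part.strip() for part in problem.split(";")]


def toy_difficulty(sub: str, context: List[str]) -> float:
    return 0.9 if "prove" in sub else 0.1


def toy_fast(sub: str, context: List[str]) -> str:
    return f"fast({sub})"


def toy_slow(sub: str, context: List[str]) -> str:
    return f"slow({sub}|ctx={len(context)})"


results = run_selective_pipeline(
    "add 2 and 3; prove the bound", toy_decompose, toy_difficulty, toy_fast, toy_slow
)
for r in results:
    print(r.mode, r.answer)
```

Only the step flagged as hard is routed to the expensive solver, and it sees the earlier answer through the shared context list.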

Title: G-KV: Decoding-Time KV Cache Eviction with Global Attention

Authors: Mengqi Liao, Lu Wang, Chaoyun Zhang, Zekai Shen, Xiaowei Mao, Si Qin, Qingwei Lin, Saravan Rajmohan, Dongmei Zhang, Huaiyu Wan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.00504
Pdf URL: https://arxiv.org/pdf/2512.00504
Copy Paste: [[2512.00504]] G-KV: Decoding-Time KV Cache Eviction with Global Attention(https://arxiv.org/abs/2512.00504)
Keywords: language model, llm, prompt
Abstract: Recent reasoning large language models (LLMs) excel in complex tasks but encounter significant computational and memory challenges due to long sequence lengths. KV cache compression has emerged as an effective approach to greatly enhance the efficiency of reasoning. However, existing methods often focus on prompt compression or token eviction with local attention score, overlooking the long-term importance of tokens. We propose G-KV, a KV cache eviction method that employs a global scoring mechanism, combining local and historical attention scores to more accurately assess token importance. Additionally, we introduce post-training techniques, including reinforcement learning and distillation, to optimize models for compressed KV cache settings. The code of this paper is available on: this https URL.
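
Note: the abstract describes a global score that combines local and historical attention scores but gives no formula. The sketch below assumes one simple combination, a decayed running maximum, followed by top-k retention under a fixed cache budget. The combination rule, the decay value, and the function names are hypothetical and should not be read as the G-KV algorithm itself.

```python
from typing import List


def update_global_scores(history: List[float], local: List[float], decay: float = 0.9) -> List[float]:
    """Blend past importance with the current window's attention (assumed rule: decayed max)."""
    return [max(decay * h, l) for h, l in zip(history, local)]


def select_tokens_to_keep(scores: List[float], budget: int) -> List[int]:
    """Return the positions of the `budget` highest-scoring tokens, in sequence order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Toy example with four cached tokens and room for two.
history = [0.8, 0.1, 0.05, 0.3]   # accumulated importance from earlier decoding steps
local = [0.05, 0.2, 0.6, 0.1]     # attention received in the most recent window

global_scores = update_global_scores(history, local)
keep_global = select_tokens_to_keep(global_scores, budget=2)
keep_local_only = select_tokens_to_keep(local, budget=2)
print(keep_global, keep_local_only)
```

With local scores alone, token 0 would be evicted even though it mattered earlier; the blended score keeps it, which is the long-term importance that the abstract says local-only eviction overlooks.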

Title: Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity

Authors: Subramanyam Sahoo, Vinija Jain, Saanidhya Vats, Siddharth Mohapatra, Rui Min, Aman Chadha, Divya Chaudhary
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00552
Pdf URL: https://arxiv.org/pdf/2512.00552
Copy Paste: [[2512.00552]] Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity(https://arxiv.org/abs/2512.00552)
Keywords: language model
Abstract: Current evaluation of mathematical reasoning in language models relies primarily on answer accuracy, potentially masking fundamental failures in logical computation. We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching through four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. Through a case study applying this framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. While the model achieves reasonable answer accuracy (70%+), it demonstrates poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures invisible to traditional accuracy metrics, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings are based on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols to enable the research community to assess reasoning fidelity across different model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.
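Summary: Current evaluation of mathematical reasoning in language models relies mainly on answer accuracy, which can mask fundamental failures in logical computation. We introduce a diagnostic framework that separates genuine mathematical reasoning from superficial pattern matching along four complementary axes: forward-backward consistency, transitivity coverage, counterfactual sensitivity, and perturbation robustness. In a case study applying the framework to Qwen3-0.6B on the MenatQA dataset, we reveal a striking disconnect between surface performance and reasoning fidelity. Although the model reaches reasonable answer accuracy (70%+), it shows poor backward consistency (15%), limited transitivity coverage (32.2%), and brittle sensitivity to perturbations. Our diagnostics expose reasoning failures that traditional accuracy metrics cannot see, suggesting that this small model relies heavily on pattern matching rather than genuine logical computation. While our empirical findings rest on a single 600M-parameter model, the diagnostic framework itself is model-agnostic and generalizable. We release our evaluation protocols so that the research community can assess reasoning fidelity across model scales and architectures, moving beyond surface-level accuracy toward verifiable mathematical reasoning.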

Title: Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models

Authors: Alla Chepurova, Aydar Bulatov, Yuri Kuratov, Mikhail Burtsev
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.00590
Pdf URL: https://arxiv.org/pdf/2512.00590
Copy Paste: [[2512.00590]] Wikontic: Constructing Wikidata-Aligned, Ontology-Aware Knowledge Graphs with Large Language Models(https://arxiv.org/abs/2512.00590)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Knowledge graphs (KGs) provide structured, verifiable grounding for large language models (LLMs), but current LLM-based systems commonly use KGs as auxiliary structures for text retrieval, leaving their intrinsic quality underexplored. In this work, we propose Wikontic, a multi-stage pipeline that constructs KGs from open-domain text by extracting candidate triplets with qualifiers, enforcing Wikidata-based type and relation constraints, and normalizing entities to reduce duplication. The resulting KGs are compact, ontology-consistent, and well-connected; on MuSiQue, the correct answer entity appears in 96% of generated triplets. On HotpotQA, our triplets-only setup achieves 76.0 F1, and on MuSiQue 59.8 F1, matching or surpassing several retrieval-augmented generation baselines that still require textual context. In addition, Wikontic attains state-of-the-art information-retention performance on the MINE-1 benchmark (86%), outperforming prior KG construction methods. Wikontic is also efficient at build time: KG construction uses less than 1,000 output tokens, about 3$\times$ fewer than AriGraph and $<$1/20 of GraphRAG. The proposed pipeline enhances the quality of the generated KG and offers a scalable solution for leveraging structured knowledge in LLMs.
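Summary: Knowledge graphs (KGs) provide structured, verifiable grounding for LLMs, but current LLM-based systems typically use KGs as auxiliary structures for text retrieval, leaving their intrinsic quality underexplored. In this work we propose Wikontic, a multi-stage pipeline that builds KGs from open-domain text by extracting candidate triplets with qualifiers, enforcing Wikidata-based type and relation constraints, and normalizing entities to reduce duplication. The resulting KGs are compact, ontology-consistent, and well connected; on MuSiQue, the correct answer entity appears in 96% of generated triplets. On HotpotQA our triplets-only setup reaches 76.0 F1, and on MuSiQue 59.8 F1, matching or surpassing several retrieval-augmented generation baselines that still require textual context. Wikontic also attains state-of-the-art information-retention performance on the MINE-1 benchmark (86%), outperforming prior KG construction methods. It is efficient at build time as well: KG construction uses fewer than 1,000 output tokens, about 3× fewer than AriGraph and less than 1/20 of GraphRAG. The proposed pipeline improves the quality of the generated KG and offers a scalable solution for leveraging structured knowledge in LLMs.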

Title: Prism: A Minimal Compositional Metalanguage for Specifying Agent Behavior

Authors: Franck Binard, Vanja Kljajevic
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00611
Pdf URL: https://arxiv.org/pdf/2512.00611
Copy Paste: [[2512.00611]] Prism: A Minimal Compositional Metalanguage for Specifying Agent Behavior(https://arxiv.org/abs/2512.00611)
Keywords: prompt, agent
Abstract: Prism is a small, compositional metalanguage for specifying the behaviour of tool-using software agents. Rather than introducing ad hoc control constructs, Prism is built around a fixed core context, Core1, which provides a minimal background grammar of categories (numbers, strings, user prompts, tools) together with abstract combinators for booleans, predicates, pairs, and lists. Agent policies are written as ordinary expressions using a single abstraction operator so that conditionals appear as selections between alternatives instead of imperative if-else blocks. Domains extend the core by defining their own context-mini-grammars that introduce new categories, predicates, and external tools while reusing the same compositional machinery. We illustrate this with worked examples from thermostat control, home security, e-commerce recommendation, and medical monitoring, showing how natural language decision rules can be mapped to inspectable, executable policies. From a linguistic perspective, Prism enforces a clear separation between a reusable grammar-like core and domain-specific lexicons and treats tools as bridges between internal policy representations and the external world. From an engineering perspective, it offers a compact interface language for agent control, making the space of possible actions explicit and amenable to analysis, verification, and safety constraints.
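Summary: Prism is a small compositional metalanguage for specifying the behaviour of tool-using software agents. Rather than introducing ad hoc control constructs, Prism is built around a fixed core context, Core1, which provides a minimal background grammar of categories (numbers, strings, user prompts, tools) together with abstract combinators for booleans, predicates, pairs, and lists. Agent policies are written as ordinary expressions using a single abstraction operator, so conditionals appear as selections between alternatives rather than imperative if-else blocks. Domains extend the core by defining their own context mini-grammars, which introduce new categories, predicates, and external tools while reusing the same compositional machinery. We illustrate this with worked examples from thermostat control, home security, e-commerce recommendation, and medical monitoring, showing how natural-language decision rules map to inspectable, executable policies. From a linguistic perspective, Prism enforces a clear separation between a reusable grammar-like core and domain-specific lexicons, and treats tools as bridges between internal policy representations and the external world. From an engineering perspective, it offers a compact interface language for agent control that makes the space of possible actions explicit and amenable to analysis, verification, and safety constraints.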

Title: ART: Adaptive Response Tuning Framework -- A Multi-Agent Tournament-Based Approach to LLM Response Optimization

Authors: Omer Jauhar Khan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.00617
Pdf URL: https://arxiv.org/pdf/2512.00617
Copy Paste: [[2512.00617]] ART: Adaptive Response Tuning Framework -- A Multi-Agent Tournament-Based Approach to LLM Response Optimization(https://arxiv.org/abs/2512.00617)
Keywords: language model, llm, hallucination, agent
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding and generation. However, single-model responses often exhibit inconsistencies, hallucinations, and varying quality across different query domains. This paper presents ART (Adaptive Response Tuning), a novel framework that employs tournament-style ELO ranking and multi-agent reasoning to systematically optimize LLM outputs. By enabling multiple LLM agents to compete, critique, and collaborate through structured tournament workflows, ART produces consensus responses that outperform individual model outputs. Our framework introduces configurable tournament parameters, dynamic agent selection, and multiple consensus fusion strategies. Experimental evaluations demonstrate significant improvements in response accuracy, coherence, and reliability compared to baseline single-model approaches. The ART framework provides a scalable, production-ready solution for applications requiring high-quality, vetted LLM responses, achieving an 8.4% improvement in overall quality metrics and $R^2$ values exceeding 0.96 in ELO rating convergence.
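Summary: Large language models (LLMs) have shown remarkable capabilities in natural language understanding and generation. However, single-model responses often exhibit inconsistencies, hallucinations, and uneven quality across query domains. This paper presents ART (Adaptive Response Tuning), a novel framework that uses tournament-style ELO ranking and multi-agent reasoning to optimize LLM outputs systematically. By letting multiple LLM agents compete, critique, and collaborate through structured tournament workflows, ART produces consensus responses that outperform individual model outputs. The framework introduces configurable tournament parameters, dynamic agent selection, and multiple consensus fusion strategies. Experimental evaluations show significant improvements in response accuracy, coherence, and reliability over baseline single-model approaches. ART provides a scalable, production-ready solution for applications that need high-quality, vetted LLM responses, achieving an 8.4% improvement in overall quality metrics and $R^2$ values above 0.96 in ELO rating convergence.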
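For readers unfamiliar with Elo ranking, the standard pairwise update is sketched below. The abstract does not say how ART configures its ratings, so the K-factor of 32, the 400-point scale, and the function names are the conventional chess defaults used here as assumptions, not details taken from the paper.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Update both ratings after one match.

    score_a is 1.0 if A's response won, 0.5 for a draw, 0.0 if it lost.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta
```

Two agents that start level at 1500 move to 1516 and 1484 after a single decisive comparison, and repeated comparisons pull the ratings toward each agent's underlying win rate.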

Title: Sycophancy Claims about Language Models: The Missing Human-in-the-Loop

Authors: Jan Batzner, Volker Stocker, Stefan Schmid, Gjergji Kasneci
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2512.00656
Pdf URL: https://arxiv.org/pdf/2512.00656
Copy Paste: [[2512.00656]] Sycophancy Claims about Language Models: The Missing Human-in-the-Loop(https://arxiv.org/abs/2512.00656)
Keywords: language model, llm
Abstract: Sycophantic response patterns in Large Language Models (LLMs) have been increasingly claimed in the literature. We review methodological challenges in measuring LLM sycophancy and identify five core operationalizations. Despite sycophancy being inherently human-centric, current research does not evaluate human perception. Our analysis highlights the difficulties in distinguishing sycophantic responses from related concepts in AI alignment and offers actionable recommendations for future research.
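Summary: Sycophantic response patterns in large language models (LLMs) are increasingly claimed in the literature. We review the methodological challenges of measuring LLM sycophancy and identify five core operationalizations. Although sycophancy is inherently human-centric, current research does not evaluate human perception. Our analysis highlights how difficult it is to distinguish sycophantic responses from related concepts in AI alignment, and offers actionable recommendations for future research.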

Title: Graphing the Truth: Structured Visualizations for Automated Hallucination Detection in LLMs

Authors: Tanmay Agrawal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00663
Pdf URL: https://arxiv.org/pdf/2512.00663
Copy Paste: [[2512.00663]] Graphing the Truth: Structured Visualizations for Automated Hallucination Detection in LLMs(https://arxiv.org/abs/2512.00663)
Keywords: language model, llm, hallucination
Abstract: Large Language Models have rapidly advanced in their ability to interpret and generate natural language. In enterprise settings, they are frequently augmented with closed-source domain knowledge to deliver more contextually informed responses. However, operational constraints such as limited context windows and inconsistencies between pre-training data and supplied knowledge often lead to hallucinations, some of which appear highly credible and escape routine human review. Current mitigation strategies either depend on costly, large-scale gold-standard Q&A curation or rely on secondary model verification, neither of which offers deterministic assurance. This paper introduces a framework that organizes proprietary knowledge and model-generated content into interactive visual knowledge graphs. The objective is to provide end users with a clear, intuitive view of potential hallucination zones by linking model assertions to underlying sources of truth and indicating confidence levels. Through this visual interface, users can diagnose inconsistencies, identify weak reasoning chains, and supply corrective feedback. The resulting human-in-the-loop workflow creates a structured feedback loop that can enhance model reliability and continuously improve response quality.
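Summary: Large language models have rapidly improved in their ability to interpret and generate natural language. In enterprise settings they are often augmented with closed-source domain knowledge to deliver more contextually informed responses. However, operational constraints such as limited context windows and inconsistencies between pre-training data and supplied knowledge often lead to hallucinations, some of which look highly credible and escape routine human review. Current mitigation strategies depend either on costly, large-scale gold-standard Q&A curation or on secondary model verification, and neither offers deterministic assurance. This paper introduces a framework that organizes proprietary knowledge and model-generated content into interactive visual knowledge graphs. The goal is to give end users a clear, intuitive view of potential hallucination zones by linking model assertions to underlying sources of truth and indicating confidence levels. Through this visual interface, users can diagnose inconsistencies, identify weak reasoning chains, and supply corrective feedback. The resulting human-in-the-loop workflow creates a structured feedback loop that can enhance model reliability and continuously improve response quality.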

Title: A Comparison of Human and ChatGPT Classification Performance on Complex Social Media Data

Authors: Breanna E. Green, Ashley L. Shea, Pengfei Zhao, Drew B. Margolin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00673
Pdf URL: https://arxiv.org/pdf/2512.00673
Copy Paste: [[2512.00673]] A Comparison of Human and ChatGPT Classification Performance on Complex Social Media Data(https://arxiv.org/abs/2512.00673)
Keywords: language model, gpt, prompt, chat
Abstract: Generative artificial intelligence tools, like ChatGPT, are an increasingly utilized resource among computational social scientists. Nevertheless, there remains space for improved understanding of the performance of ChatGPT in complex tasks such as classifying and annotating datasets containing nuanced language. Method. In this paper, we measure the performance of GPT-4 on one such task and compare results to human annotators. We investigate ChatGPT versions 3.5, 4, and 4o to examine performance given rapid changes in technological advancement of large language models. We craft four prompt styles as input and evaluate precision, recall, and F1 scores. Both quantitative and qualitative evaluations of results demonstrate that while including label definitions in prompts may help performance, overall GPT-4 has difficulty classifying nuanced language. Qualitative analysis reveals four specific findings. Our results suggest the use of ChatGPT in classification tasks involving nuanced language should be conducted with prudence.
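Summary: Generative artificial intelligence tools such as ChatGPT are an increasingly used resource among computational social scientists. Nevertheless, there is still room to better understand how ChatGPT performs on complex tasks such as classifying and annotating datasets that contain nuanced language. Method. In this paper, we measure the performance of GPT-4 on one such task and compare the results with human annotators. We investigate ChatGPT versions 3.5, 4, and 4o to examine performance given the rapid pace of technological change in large language models. We craft four prompt styles as input and evaluate precision, recall, and F1 scores. Both quantitative and qualitative evaluations show that, although including label definitions in prompts may help performance, GPT-4 overall has difficulty classifying nuanced language. Qualitative analysis reveals four specific findings. Our results suggest that ChatGPT should be used with prudence in classification tasks involving nuanced language.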
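The three metrics named in this abstract can be computed per class as in the sketch below. The function name and the example labels are hypothetical; the paper's label set and averaging scheme are not given in the abstract.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    gold: Sequence[str], pred: Sequence[str], positive: str
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for one class treated as the positive label."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    # Guard against empty denominators when a class is never predicted or absent.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Here `gold` would hold the human annotations and `pred` the model's labels for the same posts, so a model that over-predicts a class shows up as low precision and one that misses it as low recall.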

Title: Auxiliary-Hyperparameter-Free Sampling: Entropy Equilibrium for Text Generation

Authors: Xiaodong Cai, Hai Lin, Shaoxiong Zhan, Weiqi Luo, Hong-Gee Kim, Hongyan Hao, Yu Yang, Hai-Tao Zheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00789
Pdf URL: https://arxiv.org/pdf/2512.00789
Copy Paste: [[2512.00789]] Auxiliary-Hyperparameter-Free Sampling: Entropy Equilibrium for Text Generation(https://arxiv.org/abs/2512.00789)
Keywords: language model, llm
Abstract: Token sampling strategies critically influence text generation quality in large language models (LLMs). However, existing methods introduce additional hyperparameters, requiring extensive tuning and complicating deployment. We present Entropy Equilibrium Sampling (EES), an auxiliary hyperparameter-free approach inspired by information theory that can dynamically adjust candidate sets by balancing normalized entropy with probability mass. We evaluate EES on both reasoning and generation tasks across a range of model architectures. Our results show that EES consistently performs well across temperature settings, delivering competitive accuracy and coherence while maintaining diversity. By eliminating the need for hyperparameter tuning, EES greatly simplifies deployment while improving performance. Code is available at this https URL
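Summary: Token sampling strategies critically influence text generation quality in large language models (LLMs). However, existing methods introduce additional hyperparameters, which require extensive tuning and complicate deployment. We present Entropy Equilibrium Sampling (EES), an approach inspired by information theory and free of auxiliary hyperparameters, which dynamically adjusts the candidate set by balancing normalized entropy against probability mass. We evaluate EES on both reasoning and generation tasks across a range of model architectures. Our results show that EES performs consistently well across temperature settings, delivering competitive accuracy and coherence while maintaining diversity. By removing the need for hyperparameter tuning, EES greatly simplifies deployment while improving performance. Code is available at this https URL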

Title: WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models

Authors: Yukang Lin, Jiahao Shao, Shuoran Jiang, Wentao Zhu, Bingjie Lu, Xiangping Wu, Joanna Siebert, Qingcai Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00837
Pdf URL: https://arxiv.org/pdf/2512.00837
Copy Paste: [[2512.00837]] WaterSearch: A Quality-Aware Search-based Watermarking Framework for Large Language Models(https://arxiv.org/abs/2512.00837)
Keywords: language model, llm
Abstract: Watermarking acts as a critical safeguard in text generated by Large Language Models (LLMs). By embedding identifiable signals into model outputs, watermarking enables reliable attribution and enhances the security of machine-generated content. Existing approaches typically embed signals by manipulating token generation probabilities. Despite their effectiveness, these methods inherently face a trade-off between detectability and text quality: the signal strength and randomness required for robust watermarking tend to degrade the performance of downstream tasks. In this paper, we design a novel embedding scheme that controls seed pools to facilitate diverse parallel generation of watermarked text. Based on that scheme, we propose WaterSearch, a sentence-level, search-based watermarking framework adaptable to a wide range of existing methods. WaterSearch enhances text quality by jointly optimizing two key aspects: 1) distribution fidelity and 2) watermark signal characteristics. Furthermore, WaterSearch is complemented by a sentence-level detection method with strong attack robustness. We evaluate our method on three popular LLMs across ten diverse tasks. Extensive experiments demonstrate that our method achieves an average performance improvement of 51.01% over state-of-the-art baselines at a watermark detectability strength of 95%. In challenging scenarios such as short text generation and low-entropy output generation, our method yields performance gains of 47.78% and 36.47%, respectively. Moreover, under different attack scenarios including insertion, synonym substitution, and paraphrase attacks, WaterSearch maintains high detectability, further validating its robust anti-attack capabilities. Our code is available at this https URL.
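Summary: Watermarking is a critical safeguard for text generated by large language models (LLMs). By embedding identifiable signals in model outputs, watermarking enables reliable attribution and strengthens the security of machine-generated content. Existing approaches typically embed signals by manipulating token generation probabilities. Effective as they are, these methods inherently face a trade-off between detectability and text quality: the signal strength and randomness needed for robust watermarking tend to degrade downstream task performance. In this paper, we design a novel embedding scheme that controls seed pools to enable diverse parallel generation of watermarked text. Building on that scheme, we propose WaterSearch, a sentence-level, search-based watermarking framework that adapts to a wide range of existing methods. WaterSearch improves text quality by jointly optimizing two key aspects: 1) distribution fidelity and 2) watermark signal characteristics. It is complemented by a sentence-level detection method with strong robustness to attacks. We evaluate our method on three popular LLMs across ten diverse tasks. Extensive experiments show an average performance improvement of 51.01% over state-of-the-art baselines at a watermark detectability strength of 95%. In challenging scenarios such as short text generation and low-entropy output generation, our method yields gains of 47.78% and 36.47%, respectively. Under attack scenarios including insertion, synonym substitution, and paraphrasing, WaterSearch maintains high detectability, further validating its robustness. Our code is available at this https URL.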

Title: Less is More: Resource-Efficient Low-Rank Adaptation

Authors: Chunlin Tian, Xuyang Wei, Huanrong Liu, Zhijiang Guo, Li Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.00878
Pdf URL: https://arxiv.org/pdf/2512.00878
Copy Paste: [[2512.00878]] Less is More: Resource-Efficient Low-Rank Adaptation(https://arxiv.org/abs/2512.00878)
Keywords: language model, llm
Abstract: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method for Large Language Models (LLMs), but it still incurs notable overhead and suffers from parameter interference in complex datasets. While recent works decouple LoRA update matrices to exploit matrix-wise asymmetry, training costs remain high. We revisit LoRA from the perspective of inter-matrix and intra-layer parameter redundancy and propose Resource-Efficient Low-Rank Adaptation, EffiLoRA, a lightweight and generalizable approach for language, multimodal, and diffusion models. EffiLoRA employs a unified A matrix across all transformer layers and introduces a runtime selective update of the B matrices to dynamically trade off the system resource budget and model performance. EffiLoRA consistently outperforms LoRA across diverse modalities, including commonsense reasoning, visual instruction tuning, and image generation, demonstrating improved efficiency and robustness.
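Summary: Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning (PEFT) method for large language models (LLMs), but it still incurs notable overhead and suffers from parameter interference on complex datasets. Although recent work decouples the LoRA update matrices to exploit matrix-wise asymmetry, training costs remain high. We revisit LoRA from the perspective of inter-matrix and intra-layer parameter redundancy and propose Resource-Efficient Low-Rank Adaptation, EffiLoRA, a lightweight and generalizable approach for language, multimodal, and diffusion models. EffiLoRA uses a unified A matrix across all transformer layers and introduces a runtime selective update of the B matrices to dynamically trade off the system resource budget against model performance. EffiLoRA consistently outperforms LoRA across diverse modalities, including commonsense reasoning, visual instruction tuning, and image generation, demonstrating improved efficiency and robustness.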
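A quick count shows why sharing the A matrix across layers reduces trainable parameters. The sketch assumes the usual LoRA factorisation of an update to a d_out x d_in weight as B (d_out x r) times A (r x d_in), with one adapted projection per layer; the function names and example sizes are illustrative and do not come from the paper, and the runtime selective B update is not modelled.

```python
def lora_params(num_layers: int, d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters when every layer owns its own A and B."""
    per_layer = rank * d_in + d_out * rank  # A is r x d_in, B is d_out x r
    return num_layers * per_layer


def shared_a_params(num_layers: int, d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters when one A is shared and each layer keeps its own B."""
    shared_a = rank * d_in
    per_layer_b = d_out * rank
    return shared_a + num_layers * per_layer_b


if __name__ == "__main__":
    # Hypothetical sizes: 32 layers, 4096-dimensional projections, rank 8.
    full = lora_params(32, 4096, 4096, 8)
    shared = shared_a_params(32, 4096, 4096, 8)
    print(full, shared, shared / full)
```

For square projections the shared-A count approaches half of the per-layer count as the number of layers grows, since only the B matrices still scale with depth.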

Title: Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios

Authors: Jianxiang Zang, Yongda Wei, Ruxue Bai, Shiyu Jiang, Nijia Mo, Binhong Li, Qiang Sun, Hui Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00920
Pdf URL: https://arxiv.org/pdf/2512.00920
Copy Paste: [[2512.00920]] Reward Auditor: Inference on Reward Modeling Suitability in Real-World Perturbed Scenarios(https://arxiv.org/abs/2512.00920)
Keywords: language model, llm
Abstract: Reliable reward models (RMs) are critical for ensuring the safe alignment of large language models (LLMs). However, current evaluation methods focus solely on preference perception accuracies in given specific scenarios, obscuring the critical vulnerabilities of RMs in real-world scenarios. We identify the true challenge lies in assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework specifically designed for RM suitability inference. Rather than answering "How accurate is the RM's preference perception for given samples?", it employs scientific auditing to answer: "Can we infer RMs exhibit systematic vulnerabilities in specific real-world scenarios?". Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing distribution degradation of RM preference perception confidence. This enables inference of both the certainty and severity of RM vulnerabilities across diverse real-world scenarios. This lays a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.
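Summary: Reliable reward models (RMs) are critical for the safe alignment of large language models (LLMs). However, current evaluation methods focus solely on preference-perception accuracy in given, specific scenarios, which obscures the critical vulnerabilities of RMs in real-world scenarios. We identify the true challenge as assessing a novel dimension: Suitability, defined as conditional reliability under specific real-world perturbations. To this end, we introduce Reward Auditor, a hypothesis-testing framework designed specifically for RM suitability inference. Rather than answering "How accurate is the RM's preference perception for given samples?", it uses scientific auditing to answer "Can we infer that RMs exhibit systematic vulnerabilities in specific real-world scenarios?". Under real-world perturbed scenarios, Reward Auditor quantifies statistical significance and effect size by auditing the distribution degradation of the RM's preference-perception confidence. This makes it possible to infer both the certainty and the severity of RM vulnerabilities across diverse real-world scenarios, and lays a solid foundation for building next-generation LLM alignment systems that are verifiably safe, more robust, and trustworthy.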

Title: Mitigating Hallucinations in Zero-Shot Scientific Summarisation: A Pilot Study

Authors: Imane Jaaouine, Ross D. King
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.00931
Pdf URL: https://arxiv.org/pdf/2512.00931
Copy Paste: [[2512.00931]] Mitigating Hallucinations in Zero-Shot Scientific Summarisation: A Pilot Study(https://arxiv.org/abs/2512.00931)
Keywords: language model, llm, hallucination, prompt
Abstract: Large language models (LLMs) produce context inconsistency hallucinations, which are LLM generated outputs that are misaligned with the user prompt. This research project investigates whether prompt engineering (PE) methods can mitigate context inconsistency hallucinations in zero-shot LLM summarisation of scientific texts, where zero-shot indicates that the LLM relies purely on its pre-training data. Across eight yeast biotechnology research paper abstracts, six instruction-tuned LLMs were prompted with seven methods: a baseline prompt, two levels of increasing instruction complexity (PE-1 and PE-2), two levels of context repetition (CR-K1 and CR-K2), and two levels of random addition (RA-K1 and RA-K2). Context repetition involved the identification and repetition of K key sentences from the abstract, whereas random addition involved the repetition of K randomly selected sentences from the abstract, where K is 1 or 2. A total of 336 LLM-generated summaries were evaluated using six metrics: ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, METEOR, and cosine similarity, which were used to compute the lexical and semantic alignment between the summaries and the abstracts. Four hypotheses on the effects of prompt methods on summary alignment with the reference text were tested. Statistical analysis on 3744 collected datapoints was performed using bias-corrected and accelerated (BCa) bootstrap confidence intervals and Wilcoxon signed-rank tests with Bonferroni-Holm correction. The results demonstrated that CR and RA significantly improve the lexical alignment of LLM-generated summaries with the abstracts. These findings indicate that prompt engineering has the potential to impact hallucinations in zero-shot scientific summarisation tasks.
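Summary: Large language models (LLMs) produce context-inconsistency hallucinations, that is, LLM-generated outputs that are misaligned with the user prompt. This research project investigates whether prompt engineering (PE) methods can mitigate context-inconsistency hallucinations in zero-shot LLM summarisation of scientific texts, where zero-shot means that the LLM relies purely on its pre-training data. Across eight yeast biotechnology research paper abstracts, six instruction-tuned LLMs were prompted with seven methods: a baseline prompt, two levels of increasing instruction complexity (PE-1 and PE-2), two levels of context repetition (CR-K1 and CR-K2), and two levels of random addition (RA-K1 and RA-K2). Context repetition identifies and repeats K key sentences from the abstract, whereas random addition repeats K randomly selected sentences from the abstract, where K is 1 or 2. A total of 336 LLM-generated summaries were evaluated with six metrics (ROUGE-1, ROUGE-2, ROUGE-L, BERTScore, METEOR, and cosine similarity), which were used to compute the lexical and semantic alignment between the summaries and the abstracts. Four hypotheses on the effect of prompt methods on summary alignment with the reference text were tested. Statistical analysis of 3744 collected datapoints used bias-corrected and accelerated (BCa) bootstrap confidence intervals and Wilcoxon signed-rank tests with Bonferroni-Holm correction. The results show that CR and RA significantly improve the lexical alignment of LLM-generated summaries with the abstracts. These findings indicate that prompt engineering has the potential to affect hallucinations in zero-shot scientific summarisation tasks.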
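Two pieces of this design are easy to check by hand: the summary count (8 abstracts x 6 models x 7 prompt methods = 336) and the Bonferroni-Holm step-down rule applied to the test p-values. The sketch below implements the standard Holm procedure; the function name and any p-values passed to it are illustrative, not figures from the study.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm step-down correction: return, per hypothesis, whether it is rejected."""
    m = len(p_values)
    # Visit hypotheses from the smallest p-value to the largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is compared with alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Design arithmetic from the abstract: abstracts x models x prompt methods.
NUM_SUMMARIES = 8 * 6 * 7
```

Holm is uniformly at least as powerful as plain Bonferroni while still controlling the family-wise error rate, which is presumably why it is paired with the Wilcoxon tests here.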

Title: Fine-tuning of lightweight large language models for sentiment classification on heterogeneous financial textual data

Authors: Alvaro Paredes Amorin, Andre Python, Christoph Weisser
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.00946
Pdf URL: https://arxiv.org/pdf/2512.00946
Copy Paste: [[2512.00946]] Fine-tuning of lightweight large language models for sentiment classification on heterogeneous financial textual data(https://arxiv.org/abs/2512.00946)
Keywords: language model, llm
Abstract: Large language models (LLMs) play an increasingly important role in financial markets analysis by capturing signals from complex and heterogeneous textual data sources, such as tweets, news articles, reports, and microblogs. However, their performance is dependent on large computational resources and proprietary datasets, which are costly, restricted, and therefore inaccessible to many researchers and practitioners. To reflect realistic situations we investigate the ability of lightweight open-source LLMs - smaller and publicly available models designed to operate with limited computational resources - to generalize sentiment understanding from financial datasets of varying sizes, sources, formats, and languages. We compare the benchmark finance natural language processing (NLP) model, FinBERT, and three open-source lightweight LLMs, DeepSeek-LLM 7B, Llama3 8B Instruct, and Qwen3 8B on five publicly available datasets: FinancialPhraseBank, Financial Question Answering, Gold News Sentiment, Twitter Sentiment and Chinese Finance Sentiment. We find that LLMs, especially Qwen3 8B and Llama3 8B, perform best in most scenarios, even when using only 5% of the available training data. These results hold in zero-shot and few-shot learning scenarios. Our findings indicate that lightweight, open-source large language models (LLMs) constitute a cost-effective option, as they can achieve competitive performance on heterogeneous textual data even when trained on only a limited subset of the extensive annotated corpora that are typically deemed necessary.
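Summary: Large language models (LLMs) play an increasingly important role in financial markets analysis by capturing signals from complex, heterogeneous textual data sources such as tweets, news articles, reports, and microblogs. However, their performance depends on large computational resources and proprietary datasets, which are costly and restricted and therefore out of reach for many researchers and practitioners. To reflect realistic conditions, we investigate how well lightweight open-source LLMs (smaller, publicly available models designed to run with limited computational resources) generalize sentiment understanding from financial datasets of varying sizes, sources, formats, and languages. We compare the benchmark finance natural language processing (NLP) model, FinBERT, with three open-source lightweight LLMs (DeepSeek-LLM 7B, Llama3 8B Instruct, and Qwen3 8B) on five publicly available datasets: FinancialPhraseBank, Financial Question Answering, Gold News Sentiment, Twitter Sentiment, and Chinese Finance Sentiment. We find that the LLMs, especially Qwen3 8B and Llama3 8B, perform best in most scenarios, even when using only 5% of the available training data. These results hold in zero-shot and few-shot learning scenarios. Our findings indicate that lightweight open-source LLMs are a cost-effective option, since they achieve competitive performance on heterogeneous textual data even when trained on only a limited subset of the extensive annotated corpora usually deemed necessary.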

Title: Table as a Modality for Large Language Models

Authors: Liyao Li, Chao Ye, Wentao Ye, Yifei Sun, Zhe Jiang, Haobo Wang, Jiaming Tian, Yiming Zhang, Ningtao Wang, Xing Fu, Gang Chen, Junbo Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.00947
Pdf URL: https://arxiv.org/pdf/2512.00947
Copy Paste: [[2512.00947]] Table as a Modality for Large Language Models(https://arxiv.org/abs/2512.00947)
Keywords: language model, gpt, llm
Abstract: To migrate the remarkable successes of Large Language Models (LLMs), the community has made numerous efforts to generalize them to the table reasoning tasks for the widely deployed tabular data. Despite that, in this work, by showing a probing experiment on our proposed StructQA benchmark, we postulate that even the most advanced LLMs (such as GPTs) may still fall short of coping with tabular data. More specifically, the current scheme often simply relies on serializing the tabular data, together with the meta information, then inputting them through the LLMs. We argue that the loss of structural information is the root of this shortcoming. In this work, we further propose TAMO, which bears an ideology to treat the tables as an independent modality integrated with the text tokens. The resulting model in TAMO is a multimodal framework consisting of a hypergraph neural network as the global table encoder seamlessly integrated with the mainstream LLM. Empirical results on various benchmarking datasets, including HiTab, WikiTQ, WikiSQL, FeTaQA, and StructQA, have demonstrated significant improvements on generalization with an average relative gain of 42.65%.
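Summary: To carry over the remarkable successes of large language models (LLMs), the community has made many efforts to generalize them to table reasoning tasks over widely deployed tabular data. Even so, in this work a probing experiment on our proposed StructQA benchmark leads us to postulate that even the most advanced LLMs (such as GPTs) may still fall short on tabular data. More specifically, current schemes often simply serialize the tabular data together with its meta information and feed the result to the LLM. We argue that the loss of structural information is the root of this shortcoming. We further propose TAMO, which treats tables as an independent modality integrated with the text tokens. The resulting model is a multimodal framework in which a hypergraph neural network serves as the global table encoder and is seamlessly integrated with a mainstream LLM. Empirical results on various benchmark datasets, including HiTab, WikiTQ, WikiSQL, FeTaQA, and StructQA, show significant improvements in generalization, with an average relative gain of 42.65%.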

Title: Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent

Authors: Zhihan Guo, Feiyang Xu, Yifan Li, Muzhi Li, Shuai Zou, Jiele Wu, Han Shi, Haoli Bai, Ho-fung Leung, Irwin King
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00986
Pdf URL: https://arxiv.org/pdf/2512.00986
Copy Paste: [[2512.00986]] Dr.Mi-Bench: A Modular-integrated Benchmark for Scientific Deep Research Agent(https://arxiv.org/abs/2512.00986)
Keywords: llm, agent
Abstract: The explosive growth in academic literature necessitates automated deep research (DR) agents, yet their evaluation remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval while neglecting high-level planning and reasoning. Second, existing benchmarks favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce Dr.Mi-Bench, a Modular-integrated benchmark for scientific DR agents. Grounded in academic literature, our benchmark uses a human-annotated dataset of 200 instances across 10 scientific domains, including both research and review papers. Besides, we also propose a Modular-integrated Evaluation Paradigm for DR Agents (Dr.Mi-Eval), a novel modular-integrated evaluation paradigm, which leverages the rich structure of academic papers to assess the core competencies of planning, retrieval, and reasoning through two complementary modes: an end-to-end evaluation for DR agents and an isolated evaluation for foundational LLMs as potential backbones. Experimental results reveal a fragmented performance landscape: agents exhibit specialized strengths but share critical weaknesses, most notably in performing the multi-source retrieval required for review-style tasks and performing consistently across diverse scientific fields. Moreover, improving high-level planning capability is the crucial factor for unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.
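Summary: The explosive growth of academic literature calls for automated deep research (DR) agents, yet evaluating them remains a significant challenge. First, existing benchmarks often focus narrowly on retrieval and neglect high-level planning and reasoning. Second, existing benchmarks favor general domains over the scientific domains that are the core application for DR agents. To address these gaps, we introduce Dr.Mi-Bench, a modular-integrated benchmark for scientific DR agents. Grounded in academic literature, the benchmark uses a human-annotated dataset of 200 instances across 10 scientific domains, covering both research and review papers. We also propose a Modular-integrated Evaluation Paradigm for DR Agents (Dr.Mi-Eval), which uses the rich structure of academic papers to assess the core competencies of planning, retrieval, and reasoning in two complementary modes: an end-to-end evaluation of DR agents and an isolated evaluation of foundational LLMs as potential backbones. Experimental results reveal a fragmented performance landscape: agents show specialized strengths but share critical weaknesses, most notably in the multi-source retrieval required for review-style tasks and in performing consistently across diverse scientific fields. Moreover, improving high-level planning capability is the key to unlocking the reasoning potential of foundational LLMs as backbones. By exposing these actionable failure modes, Dr.Mi-Bench provides a diagnostic tool to guide the development of more reliable academic research assistants.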

Title: Advancing Academic Chatbots: Evaluation of Non Traditional Outputs

Authors: Nicole Favero, Francesca Salute, Daniel Hardt
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.00991
Pdf URL: https://arxiv.org/pdf/2512.00991
Copy Paste: [[2512.00991]] Advancing Academic Chatbots: Evaluation of Non Traditional Outputs(https://arxiv.org/abs/2512.00991)
Keywords: language model, gpt, llm, hallucination, chat
Abstract: Most evaluations of large language models focus on standard tasks such as factual question answering or short summarization. This research expands that scope in two directions: first, by comparing two retrieval strategies for QA, Graph RAG (structured, knowledge-graph based) and Advanced RAG (hybrid keyword-semantic search); and second, by evaluating whether LLMs can generate high-quality non-traditional academic outputs, specifically slide decks and podcast scripts. We implemented a prototype combining Meta's LLaMA 3 70B (open-weight) and OpenAI's GPT 4o mini (API-based). QA performance was evaluated using both human ratings across eleven quality dimensions and large language model judges for scalable cross-validation. GPT 4o mini with Advanced RAG produced the most accurate responses. Graph RAG offered limited improvements and led to more hallucinations, partly due to its structural complexity and manual setup. Slide and podcast generation was tested with document-grounded retrieval. GPT 4o mini again performed best, though LLaMA 3 showed promise in narrative coherence. Human reviewers were crucial for detecting layout and stylistic flaws, highlighting the need for combined human-LLM evaluation in assessing emerging academic outputs.
摘要：大多数大型语言模型的评估都集中在标准任务上，例如事实问答或简短摘要。这项研究在两个方向上扩展了这个范围：首先，通过比较两种检索策略，Graph RAG（基于结构化知识图）和 Advanced RAG（混合关键词语义搜索），用于 QA；其次，评估法学硕士是否能够产生高质量的非传统学术成果，特别是幻灯片和播客脚本。我们实现了一个基于 Meta 的 LLaMA 3 70B 开放权重和 OpenAI 的 GPT 4o mini API 的原型。使用跨十一个质量维度的人类评分和用于可扩展交叉验证的大型语言模型判断来评估 QA 性能。配备 Advanced RAG 的 GPT 4o mini 产生了最准确的响应。 Graph RAG 提供的改进有限，并导致了更多的幻觉，部分原因是其结构复杂性和手动设置。幻灯片和播客的生成通过基于文档的检索进行了测试。尽管 LLaMA 3 在叙事连贯性方面表现出了希望，但 GPT 4o mini 再次表现最佳。人类审稿人对于检测布局和文体缺陷至关重要，这凸显了在评估新兴学术成果时需要结合人类法学硕士评估。

Title: When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals

Authors: Riad Ahmed Anonto, Md Labid Al Nahiyan, Md Tanvir Hassan, Ch. Md. Rakin Haider
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.01037
Pdf URL: https://arxiv.org/pdf/2512.01037
Copy Paste: [[2512.01037]] When Safety Blocks Sense: Measuring Semantic Confusion in LLM Refusals(https://arxiv.org/abs/2512.01037)
Keywords: language model, llm, prompt
Abstract: Safety-aligned language models often refuse prompts that are actually harmless. Current evaluations mostly report global rates such as false rejection or compliance. These scores treat each prompt alone and miss local inconsistency, where a model accepts one phrasing of an intent but rejects a close paraphrase. This gap limits diagnosis and tuning. We introduce "semantic confusion," a failure mode that captures such local inconsistency, and a framework to measure it. We build ParaGuard, a 10k-prompt corpus of controlled paraphrase clusters that hold intent fixed while varying surface form. We then propose three model-agnostic metrics at the token level: Confusion Index, Confusion Rate, and Confusion Depth. These metrics compare each refusal to its nearest accepted neighbors and use token embeddings, next-token probabilities, and perplexity signals. Experiments across diverse model families and deployment guards show that global false-rejection rate hides critical structure. Our metrics reveal globally unstable boundaries in some settings, localized pockets of inconsistency in others, and cases where stricter refusal does not increase inconsistency. We also show how confusion-aware auditing separates how often a system refuses from how sensibly it refuses. This gives developers a practical signal to reduce false refusals while preserving safety.
摘要：安全一致的语言模型通常会拒绝实际上无害的提示。目前的评估主要报告全球比率，例如错误拒绝或合规。这些分数单独对待每个提示，并忽略局部不一致，即模型接受意图的一种措辞，但拒绝紧密的释义。这一差距限制了诊断和调整。我们引入了“语义混淆”，这是一种捕获这种局部不一致的故障模式，以及一个衡量它的框架。我们构建了 ParaGuard，这是一个包含 10k 条提示的受控释义集群的语料库，可以在改变表面形式的同时保持意图固定。然后，我们在令牌级别提出三个与模型无关的指标：混淆指数、混淆率和混淆深度。这些指标将每个拒绝与其最近接受的邻居进行比较，并使用令牌嵌入、下一个令牌概率和困惑信号。跨不同模型系列和部署防护的实验表明，全局错误拒绝率隐藏了关键结构。我们的指标揭示了某些情况下的全局不稳定边界、其他情况下的局部不一致以及更严格的拒绝不会增加不一致的情况。我们还展示了混淆感知审计如何区分系统拒绝的频率和拒绝的合理程度。这为开发人员提供了一个实际信号，可以在保证安全的同时减少错误拒绝。
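The gap between a global refusal rate and local inconsistency, as described in the abstract above, can be illustrated with a minimal sketch. This is not the paper's token-level Confusion Index, Confusion Rate, or Confusion Depth; it only assumes a hypothetical input of `(cluster_id, refused)` pairs, where each cluster is a set of harmless paraphrases of one intent, and counts clusters with mixed outcomes.

```python
from collections import defaultdict


def refusal_stats(records):
    """Summarize refusals over paraphrase clusters.

    records: non-empty iterable of (cluster_id, refused) pairs, one per
    harmless prompt (hypothetical format, assumed for illustration).
    Returns (global_refusal_rate, share_of_mixed_clusters, mixed_cluster_ids).
    """
    clusters = defaultdict(list)
    for cluster_id, refused in records:
        clusters[cluster_id].append(bool(refused))

    total = sum(len(outcomes) for outcomes in clusters.values())
    refused_total = sum(sum(outcomes) for outcomes in clusters.values())
    global_rate = refused_total / total

    # A cluster is "mixed" when some paraphrases are refused and others accepted.
    mixed = sorted(
        cid for cid, outcomes in clusters.items()
        if any(outcomes) and not all(outcomes)
    )
    return global_rate, len(mixed) / len(clusters), mixed


# Two toy systems with the same global rate but different local behavior.
model_a = [("c1", True), ("c1", True), ("c2", False), ("c2", False)]
model_b = [("c1", True), ("c1", False), ("c2", True), ("c2", False)]

stats_a = refusal_stats(model_a)
stats_b = refusal_stats(model_b)
print(stats_a)
print(stats_b)
```

Both toy systems refuse half of the prompts, yet only the second one splits its decisions inside every cluster, which is the kind of structure a single global rate cannot show.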

Title: ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages

Authors: Neha Joshi, Pamir Gogoi, Aasim Mirza, Aayush Jansari, Aditya Yadavalli, Ayushi Pandey, Arunima Shukla, Deepthi Sudharsan, Kalika Bali, Vivek Seshadri
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2512.01077
Pdf URL: https://arxiv.org/pdf/2512.01077
Copy Paste: [[2512.01077]] ELR-1000: A Community-Generated Dataset for Endangered Indic Indigenous Languages(https://arxiv.org/abs/2512.01077)
Keywords: language model, llm
Abstract: We present a culturally-grounded multimodal dataset of 1,060 traditional recipes crowdsourced from rural communities across remote regions of Eastern India, spanning 10 endangered languages. These recipes, rich in linguistic and cultural nuance, were collected using a mobile interface designed for contributors with low digital literacy. The dataset, Endangered Language Recipes (ELR)-1000, captures not only culinary practices but also the socio-cultural context embedded in indigenous food traditions. We evaluate the performance of several state-of-the-art large language models (LLMs) on translating these recipes into English and find the following: despite the models' capabilities, they struggle with low-resource, culturally-specific language. However, we observe that providing targeted context -- including background information about the languages, translation examples, and guidelines for cultural preservation -- leads to significant improvements in translation quality. Our results underscore the need for benchmarks that cater to underrepresented languages and domains to advance equitable and culturally-aware language technologies. As part of this work, we release the ELR-1000 dataset to the NLP community, hoping it motivates the development of language technologies for endangered languages.
摘要：我们提供了一个基于文化的多模式数据集，其中包含 1,060 种传统食谱，这些食谱众包自印度东部偏远地区的农村社区，涵盖 10 种濒临灭绝的语言。这些富含语言和文化细微差别的食谱是使用专为数字素养较低的贡献者设计的移动界面收集的。濒危语言食谱 (ELR)-1000——不仅捕捉烹饪实践，还捕捉土著饮食传统中蕴含的社会文化背景。我们评估了几种最先进的大型语言模型 (LLM) 将这些食谱翻译成英语的性能，并发现以下结果：尽管这些模型具有一定的功能，但它们在处理资源匮乏、文化特定的语言方面表现不佳。然而，我们观察到，提供有针对性的背景信息——包括有关语言的背景信息、翻译示例和文化保护指南——可以显着提高翻译质量。我们的结果强调需要针对代表性不足的语言和领域制定基准，以推进公平和具有文化意识的语言技术。作为这项工作的一部分，我们向 NLP 社区发布了 ELR-1000 数据集，希望它能够推动濒危语言的语言技术的发展。

Title: DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks

Authors: Hyunjun Kim, Sooyoung Ryu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.01174
Pdf URL: https://arxiv.org/pdf/2512.01174
Copy Paste: [[2512.01174]] DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks(https://arxiv.org/abs/2512.01174)
Keywords: language model, gpt, llm, prompt, agent
Abstract: As agentic AI systems increasingly operate autonomously, establishing trust through verifiable evaluation becomes critical. Yet existing benchmarks lack the transparency and auditability needed to assess whether agents behave reliably. We present DrawingBench, a verification framework for evaluating the trustworthiness of agentic LLMs through spatial reasoning tasks that require generating sequences of low-level GUI actions. Unlike opaque evaluations, DrawingBench provides transparent, rule-based assessment: 8 objective criteria enable reproducible scoring, while action-level inspection allows stakeholders to audit agent behavior. Our framework comprises 250 diverse prompts across 20 categories and 4 difficulty levels, deterministic evaluation metrics, and an external oversight mechanism through multi-turn feedback that enables human control over agent refinement. Evaluating four state-of-the-art LLMs (Claude-4 Sonnet, GPT-4.1, GPT-4.1-mini, Gemini-2.5 Flash) across 1,000 tests, we establish both capabilities and limitations: models achieved 92.8% perfect performance with structured external feedback driving significant improvements (average +3.2%, up to +32.8% for complex scenes), but systematic error patterns emerged in tool state management and long-horizon planning. Notably, specification clarity proved more important than task complexity -- models achieved 100% perfect performance when given explicit, verifiable criteria. These findings demonstrate that transparent evaluation frameworks can establish trust in agentic systems, with external oversight proving more reliable than self-correction for guiding agent behavior. Our open-source framework provides a template for trustworthy agent assessment. Code and data: this https URL
摘要：随着代理人工智能系统越来越自主地运行，通过可验证的评估建立信任变得至关重要。然而，现有的基准缺乏评估代理行为是否可靠所需的透明度和可审计性。我们提出了 DrawingBench，这是一个验证框架，用于通过需要生成低级 GUI 操作序列的空间推理任务来评估代理 LLM 的可信度。与不透明的评估不同，DrawingBench 提供透明的、基于规则的评估：8 个客观标准可实现可重复的评分，而操作级检查允许利益相关者审核代理行为。我们的框架包含 20 个类别和 4 个难度级别的 250 个不同提示、确定性评估指标以及通过多轮反馈实现的外部监督机制，使人类能够控制代理的细化。通过 1,000 次测试评估四个最先进的 LLM（Claude-4 Sonnet、GPT-4.1、GPT-4.1-mini、Gemini-2.5 Flash），我们确定了功能和局限性：模型实现了 92.8% 的完美性能，结构化外部反馈推动显着改进（平均 +3.2%，复杂场景高达 +32.8%），但在工具状态管理和长期视野中出现了系统错误模式规划。值得注意的是，事实证明，规范清晰度比任务复杂性更重要——当给出明确的、可验证的标准时，模型可以实现 100% 完美的性能。这些发现表明，透明的评估框架可以建立对代理系统的信任，事实证明，外部监督比指导代理行为的自我纠正更可靠。我们的开源框架提供了可信赖代理评估的模板。代码和数据：这个https URL

Title: TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness

Authors: Yongxin Zhou, Philippe Mulhem, Didier Schwab
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.01183
Pdf URL: https://arxiv.org/pdf/2512.01183
Copy Paste: [[2512.01183]] TempPerturb-Eval: On the Joint Effects of Internal Temperature and External Perturbations in RAG Robustness(https://arxiv.org/abs/2512.01183)
Keywords: llm, retrieval-augmented generation
Abstract: The evaluation of Retrieval-Augmented Generation (RAG) systems typically examines retrieval quality and generation parameters like temperature in isolation, overlooking their interaction. This work presents a systematic investigation of how text perturbations (simulating noisy retrieval) interact with temperature settings across multiple LLM runs. We propose a comprehensive RAG Perturbation-Temperature Analysis Framework that subjects retrieved documents to three distinct perturbation types across varying temperature settings. Through extensive experiments on HotpotQA with both open-source and proprietary LLMs, we demonstrate that performance degradation follows distinct patterns: high-temperature settings consistently amplify vulnerability to perturbations, while certain perturbation types exhibit non-linear sensitivity across the temperature range. Our work yields three key contributions: (1) a diagnostic benchmark for assessing RAG robustness, (2) an analytical framework for quantifying perturbation-temperature interactions, and (3) practical guidelines for model selection and parameter tuning under noisy retrieval conditions.
摘要：检索增强生成（RAG）系统的评估通常会检查检索质量和生成参数（例如孤立的温度），而忽略它们的相互作用。这项工作对文本扰动（模拟噪声检索）如何与多个法学硕士运行中的温度设置相互作用进行了系统研究。我们提出了一个全面的 RAG 扰动温度分析框架，该框架使检索到的文档在不同的温度设置下受到三种不同的扰动类型的影响。通过使用开源和专有 LLM 对 HotpotQA 进行大量实验，我们证明了性能下降遵循不同的模式：高温设置持续放大扰动的脆弱性，而某些扰动类型在整个温度范围内表现出非线性敏感性。我们的工作产生了三个关键贡献：（1）用于评估 RAG 稳健性的诊断基准，（2）用于量化扰动-温度相互作用的分析框架，以及（3）噪声检索条件下模型选择和参数调整的实用指南。

Title: Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks

Authors: Krithik Vishwanath, Mrigayu Ghosh, Anton Alyakin, Daniel Alexander Alber, Yindalon Aphinyanaphongs, Eric Karl Oermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.01191
Pdf URL: https://arxiv.org/pdf/2512.01191
Copy Paste: [[2512.01191]] Generalist Large Language Models Outperform Clinical Tools on Medical Benchmarks(https://arxiv.org/abs/2512.01191)
Keywords: language model, gpt, llm
Abstract: Specialized clinical AI assistants are rapidly entering medical practice, often framed as safer or more reliable than general-purpose large language models (LLMs). Yet, unlike frontier models, these clinical tools are rarely subjected to independent, quantitative evaluation, creating a critical evidence gap despite their growing influence on diagnosis, triage, and guideline interpretation. We assessed two widely deployed clinical AI systems (OpenEvidence and UpToDate Expert AI) against three state-of-the-art generalist LLMs (GPT-5, Gemini 3 Pro, and Claude Sonnet 4.5) using a 1,000-item mini-benchmark combining MedQA (medical knowledge) and HealthBench (clinician-alignment) tasks. Generalist models consistently outperformed clinical tools, with GPT-5 achieving the highest scores, while OpenEvidence and UpToDate demonstrated deficits in completeness, communication quality, context awareness, and systems-based safety reasoning. These findings reveal that tools marketed for clinical decision support may often lag behind frontier LLMs, underscoring the urgent need for transparent, independent evaluation before deployment in patient-facing workflows.
摘要：专业临床人工智能助手正在迅速进入医疗实践，通常被认为比通用大语言模型 (LLM) 更安全或更可靠。然而，与前沿模型不同，这些临床工具很少接受独立的定量评估，尽管它们对诊断、分诊和指南解释的影响越来越大，但仍存在严重的证据差距。我们使用结合 MedQA（医学知识）和 HealthBench（临床医生对齐）任务的 1,000 项迷你基准，针对三个最先进的通才 LLM（GPT-5、Gemini 3 Pro 和 Claude Sonnet 4.5）评估了两个广泛部署的临床 AI 系统（OpenEvidence 和 UpToDate Expert AI）。通才模型的表现始终优于临床工具，GPT-5 获得了最高分，而 OpenEvidence 和 UpToDate 在完整性、通信质量、上下文感知和基于系统的安全推理方面表现出缺陷。这些发现表明，用于临床决策支持的营销工具可能往往落后于前沿法学硕士，这强调了在面向患者的工作流程中部署之前迫切需要透明、独立的评估。

Title: Conveying Imagistic Thinking in Traditional Chinese Medicine Translation: A Prompt Engineering and LLM-Based Evaluation Framework

Authors: Jiatong Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.01198
Pdf URL: https://arxiv.org/pdf/2512.01198
Copy Paste: [[2512.01198]] Conveying Imagistic Thinking in Traditional Chinese Medicine Translation: A Prompt Engineering and LLM-Based Evaluation Framework(https://arxiv.org/abs/2512.01198)
Keywords: gpt, llm, prompt, chat
Abstract: Traditional Chinese Medicine (TCM) theory is built on imagistic thinking, in which medical principles and diagnostic and therapeutic logic are structured through metaphor and metonymy. However, existing English translations largely rely on literal rendering, making it difficult for target-language readers to reconstruct the underlying conceptual networks and apply them in clinical practice. This study adopted a human-in-the-loop (HITL) framework and selected four passages from the medical canon Huangdi Neijing that are fundamental in theory. Through prompt-based cognitive scaffolding, DeepSeek V3.1 was guided to identify metaphor and metonymy in the source text and convey the theory in translation. In the evaluation stage, ChatGPT 5 Pro and Gemini 2.5 Pro were instructed by prompts to simulate three types of real-world readers. Human translations, baseline model translations, and prompt-adjusted translations were scored by the simulated readers across five cognitive dimensions, followed by structured interviews and Interpretative Phenomenological Analysis (IPA). Results show that the prompt-adjusted LLM translations perform best across all five dimensions, with high cross-model and cross-role consistency. The interview themes reveal differences between human and machine translation, effective strategies for metaphor and metonymy transfer, and readers' cognitive preferences. This study provides a cognitive, efficient, and replicable HITL methodological pathway for the translation of ancient, concept-dense texts such as TCM.
摘要：中医理论建立在象思维的基础上，通过隐喻和转喻来构建医学原理和诊疗逻辑。然而，现有的英文翻译很大程度上依赖于直译，这使得目标语言读者很难重建底层概念网络并将其应用于临床实践。本研究采用了人机循环（HITL）框架，并从医学经典《黄帝内经》中选取了四个具有基础理论的段落。通过基于提示的认知支架，指导 DeepSeek V3.1 识别源文本中的隐喻和转喻，并在翻译中传达理论。在评估阶段，ChatGPT 5 Pro和Gemini 2.5 Pro根据提示模拟了三种类型的现实世界读者。模拟读者在五个认知维度上对人工翻译、基线模型翻译和提示调整翻译进行评分，然后进行结构化访谈和解释现象学分析 (IPA)。结果表明，即时调整的法学硕士翻译在所有五个维度上均表现最佳，具有高度的跨模型和跨角色一致性。访谈主题揭示了人工翻译和机器翻译之间的差异、隐喻和转喻迁移的有效策略以及读者的认知偏好。这项研究为中医等概念密集的古代文本的翻译提供了一条认知、高效、可复制的 HITL 方法论途径。

Title: SUPERChem: A Multimodal Reasoning Benchmark in Chemistry

Authors: Zehua Zhao, Zhixian Huang, Junren Li, Siyu Lin, Junting Zhou, Fengqi Cao, Kun Zhou, Rui Ge, Tingting Long, Yuexiang Zhu, Yan Liu, Jie Zheng, Junnian Wei, Rong Zhu, Peng Zou, Wenyu Li, Zekai Cheng, Tian Ding, Yaxuan Wang, Yizhao Yan, Tingru Wei, Haowei Ming, Weijie Mao, Chen Sun, Yiming Liu, Zichen Wang, Zuo Zhang, Tong Yang, Hao Ma, Zhen Gao, Jian Pei
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2512.01274
Pdf URL: https://arxiv.org/pdf/2512.01274
Copy Paste: [[2512.01274]] SUPERChem: A Multimodal Reasoning Benchmark in Chemistry(https://arxiv.org/abs/2512.01274)
Keywords: language model, gpt, llm
Abstract: Current benchmarks for evaluating the chemical reasoning capabilities of Large Language Models (LLMs) are limited by oversimplified tasks, lack of process-level evaluation, and misalignment with expert-level chemistry skills. To address these issues, we introduce SUPERChem, a benchmark of 500 expert-curated reasoning-intensive chemistry problems, covering diverse subfields and provided in both multimodal and text-only formats. Original content and an iterative curation pipeline eliminate flawed items and mitigate data contamination. Each problem is paired with an expert-authored solution path, enabling Reasoning Path Fidelity (RPF) scoring to evaluate reasoning quality beyond final-answer accuracy. Evaluations against a human baseline of 40.3% accuracy show that even the best-performing model, GPT-5 (High), reaches only 38.5%, followed closely by Gemini 2.5 Pro (37.9%) and DeepSeek-V3.1-Think (37.3%). SUPERChem elicits multi-step, multimodal reasoning, reveals model-dependent effects of visual information, and distinguishes high-fidelity reasoners from heuristic ones. By providing a challenging benchmark and a reliable evaluation framework, SUPERChem aims to facilitate the advancement of LLMs toward expert-level chemical intelligence. The dataset of the benchmark is available at this https URL.
摘要：目前评估大型语言模型 (LLM) 化学推理能力的基准受到任务过于简化、缺乏过程级评估以及与专家级化学技能不一致的限制。为了解决这些问题，我们推出了 SUPERChem，这是一个包含 500 个专家策划的推理密集型化学问题的基准，涵盖不同的子领域，并以多模式和纯文本格式提供。原创内容和迭代管理管道消除了有缺陷的项目并减轻了数据污染。每个问题都配有专家编写的解决方案路径，使推理路径保真度 (RPF) 评分能够评估超出最终答案准确性的推理质量。针对 40.3% 准确率的人类基线进行的评估表明，即使是性能最好的模型 GPT-5（高），也只能达到 38.5%，紧随其后的是 Gemini 2.5 Pro（37.9%）和 DeepSeek-V3.1-Think（37.3%）。 SUPERChem 引发多步骤、多模式推理，揭示视觉信息的模型相关效应，并将高保真推理器与启发式推理器区分开来。通过提供具有挑战性的基准和可靠的评估框架，SUPERChem 旨在促进法学硕士迈向专家级化学智能。基准数据集可从此 https URL 获取。

Title: Kardia-R1: Unleashing LLMs to Reason toward Understanding and Empathy for Emotional Support via Rubric-as-Judge Reinforcement Learning

Authors: Jiahao Yuan, Zhiqing Cui, Hanqing Wang, Yuansheng Gao, Yucheng Zhou, Usman Naseem
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.01282
Pdf URL: https://arxiv.org/pdf/2512.01282
Copy Paste: [[2512.01282]] Kardia-R1: Unleashing LLMs to Reason toward Understanding and Empathy for Emotional Support via Rubric-as-Judge Reinforcement Learning(https://arxiv.org/abs/2512.01282)
Keywords: llm, agent
Abstract: As web platforms evolve towards greater personalization and emotional complexity, conversational agents must transcend superficial empathy to demonstrate identity-aware emotional reasoning. However, existing systems face two limitations: (1) reliance on situation-centric datasets lacking persistent user identity, which hampers the capture of personalized affective nuances; and (2) dependence on opaque, coarse reward signals that hinder development of verifiable empathetic reasoning. To address these gaps, we introduce KardiaBench, a large-scale user-grounded benchmark comprising 178,080 QA pairs across 22,080 multi-turn conversations anchored to 671 real-world profiles. The dataset is constructed via a model-in-the-loop pipeline with iterative rubric-guided refinement to ensure psychological plausibility and persona consistency. This progressive empathy pipeline integrates user comprehension, contextual reasoning, and emotion perception into conversations, followed by iterative critique and rubric-based refinement to ensure psychological plausibility, emotional fidelity, and persona consistency. Building on this, we propose Kardia-R1, a framework that trains models for interpretable, stepwise empathetic cognition. Kardia-R1 leverages Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL), a GRPO-based method that uses explainable, human-aligned rubric rewards to tightly couple user understanding, emotional inference, and supportive response generation. Extensive experiments across four LLM backbones demonstrate that Kardia-R1 consistently outperforms other methods in emotion accuracy, empathy, relevance, persona consistency, and safety. Our dataset and model will be released at this https URL.
摘要：随着网络平台朝着更加个性化和情感复杂性的方向发展，会话代理必须超越肤浅的同理心，以展示身份感知的情感推理。然而，现有系统面临两个局限性：（1）依赖于以情境为中心的数据集，缺乏持久的用户身份，这妨碍了捕获个性化的情感细微差别； (2) 对不透明、粗糙的奖励信号的依赖阻碍了可验证的共情推理的发展。为了解决这些差距，我们引入了 KardiaBench，这是一个以用户为基础的大规模基准测试，包含 22,080 个多轮对话中的 178,080 个 QA 对，这些对话锚定到 671 个真实世界的配置文件。该数据集是通过模型在环管道构建的，并通过迭代准则引导细化，以确保心理合理性和角色一致性。这种渐进式的同理心管道将用户理解、情境推理和情感感知整合到对话中，然后进行迭代批评和基于标题的细化，以确保心理合理性、情感保真度和角色一致性。在此基础上，我们提出了 Kardia-R1，这是一个训练可解释、逐步共情认知模型的框架。 Kardia-R1 利用 Rubric-as-Judge Empathetic Reinforcement Learning (Rubric-ERL)，这是一种基于 GRPO 的方法，它使用可解释的、人性化的 rubric 奖励来紧密结合用户理解、情感推理和支持性响应生成。跨越四个 LLM 主干的广泛实验表明，Kardia-R1 在情感准确性、同理心、相关性、角色一致性和安全性方面始终优于其他方法。我们的数据集和模型将在此 https URL 发布。

Title: PromptBridge: Cross-Model Prompt Transfer for Large Language Models

Authors: Yaxuan Wang, Quan Liu, Zhenting Wang, Zichao Li, Wei Wei, Yang Liu, Yujia Bao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.01420
Pdf URL: https://arxiv.org/pdf/2512.01420
Copy Paste: [[2512.01420]] PromptBridge: Cross-Model Prompt Transfer for Large Language Models(https://arxiv.org/abs/2512.01420)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Large language models (LLMs) underpin applications in code generation, mathematical reasoning, and agent-based workflows. In practice, systems access LLMs via commercial APIs or open-source deployments, and the model landscape (e.g., GPT, Claude, Llama) evolves rapidly. This rapid evolution forces frequent model switches driven by capability, cost, deployment constraints, and privacy. Yet prompts are highly model-sensitive: reusing a prompt engineered for one model on another often yields substantially worse performance than a prompt optimized for the target model. We term this phenomenon Model Drifting. Through extensive empirical analysis across diverse LLM configurations, we show that model drifting is both common and severe. To address this challenge, we introduce PromptBridge, a training-free framework that preserves prompt effectiveness under model switches, enabling cross-model prompt transfer without costly per-task or per-model re-optimization. PromptBridge requires only a small set of alignment tasks for calibration. It first applies Model-Adaptive Reflective Prompt Evolution (MAP-RPE) to obtain task- and model-specific optimal prompts via iterative reflective refinement and quantitative evaluation. Using the resulting calibrated prompt pairs for the source and target models, PromptBridge learns a cross-model prompt mapping. At test time, i.e., for an unseen task, given a source-model prompt, this mapping directly produces an optimized prompt for the target model. Experiments in single-agent and multi-agent settings show that PromptBridge consistently improves downstream accuracy while reducing migration effort. The code will be available soon.
摘要：大型语言模型 (LLM) 支撑着代码生成、数学推理和基于代理的工作流程中的应用程序。在实践中，系统通过商业 API 或开源部署访问 LLM，并且模型环境（例如 GPT、Claude、Llama）快速发展。这种快速发展迫使由能力、成本、部署限制和隐私驱动的频繁模型切换。然而，提示对模型高度敏感：在另一个模型上重用为一个模型设计的提示通常会产生比针对目标模型优化的提示差得多的性能。我们将这种现象称为模型漂移。通过对不同法学硕士配置的广泛实证分析，我们表明模型漂移既常见又严重。为了应对这一挑战，我们引入了 PromptBridge，这是一个免训练框架，可以在模型切换下保持即时有效性，从而实现跨模型即时传输，而无需昂贵的每个任务或每个模型重新优化。 PromptBridge 只需要一小部分对齐任务即可进行校准。它首先应用模型自适应反射提示进化（MAP-RPE），通过迭代反射细化和定量评估来获得特定于任务和模型的最佳提示。使用源模型和目标模型的校准提示对，PromptBridge 可以学习跨模型提示映射。在测试时，即对于看不见的任务，给定源模型提示，此映射直接为目标模型生成优化的提示。单代理和多代理设置中的实验表明，PromptBridge 持续提高下游准确性，同时减少迁移工作。该代码即将推出。

Title: Multilingual Conversational AI for Financial Assistance: Bridging Language Barriers in Indian FinTech

Authors: Bharatdeep Hazarika, Arya Suneesh, Prasanna Devadiga, Pawan Kumar Rajpoot, Anshuman B Suresh, Ahmed Ifthaquar Hussain
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.01439
Pdf URL: https://arxiv.org/pdf/2512.01439
Copy Paste: [[2512.01439]] Multilingual Conversational AI for Financial Assistance: Bridging Language Barriers in Indian FinTech(https://arxiv.org/abs/2512.01439)
Keywords: language model, agent
Abstract: India's linguistic diversity presents both opportunities and challenges for fintech platforms. While the country has 31 major languages and over 100 minor ones, only 10% of the population understands English, creating barriers to financial inclusion. We present a multilingual conversational AI system for a financial assistance use case that supports code-mixed languages like Hinglish, enabling natural interactions for India's diverse user base. Our system employs a multi-agent architecture with language classification, function management, and multilingual response generation. Through comparative analysis of multiple language models and real-world deployment, we demonstrate significant improvements in user engagement while maintaining low latency overhead (4-8%). This work contributes to bridging the language gap in digital financial services for emerging markets.
摘要：印度的语言多样性为金融科技平台带来了机遇和挑战。尽管该国拥有 31 种主要语言和 100 多种小语种，但只有 10% 的人口懂英语，这为金融包容性造成了障碍。我们为财务援助用例提供了一个多语言对话人工智能系统，支持印度英语等代码混合语言，从而为印度多样化的用户群提供自然的交互。我们的系统采用多代理架构，具有语言分类、功能管理和多语言响应生成功能。通过对多种语言模型和实际部署的比较分析，我们证明了用户参与度的显着改进，同时保持了低延迟开销 (4-8\%)。这项工作有助于缩小新兴市场数字金融服务的语言差距。

Title: Enhancing BERT Fine-Tuning for Sentiment Analysis in Lower-Resourced Languages

Authors: Jozef Kubík, Marek Šuppa, Martin Takáč
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.01460
Pdf URL: https://arxiv.org/pdf/2512.01460
Copy Paste: [[2512.01460]] Enhancing BERT Fine-Tuning for Sentiment Analysis in Lower-Resourced Languages(https://arxiv.org/abs/2512.01460)
Keywords: language model
Abstract: Limited data for low-resource languages typically yield weaker language models (LMs). Since pre-training is compute-intensive, it is more pragmatic to target improvements during fine-tuning. In this work, we examine the use of Active Learning (AL) methods augmented by structured data selection strategies, which we term 'Active Learning schedulers', to boost the fine-tuning process with a limited amount of training data. We connect AL to data clustering and propose an integrated fine-tuning pipeline that systematically combines AL, clustering, and dynamic data selection schedulers to enhance the model's performance. Experiments in the Slovak, Maltese, Icelandic and Turkish languages show that the use of clustering during the fine-tuning phase together with AL scheduling can simultaneously produce annotation savings of up to 30% and performance improvements of up to four F1 score points, while also providing better fine-tuning stability.
摘要：低资源语言的有限数据通常会产生较弱的语言模型 (LM)。由于预训练是计算密集型的，因此在微调期间瞄准改进更为务实。在这项工作中，我们研究了主动学习（AL）方法的使用，并通过结构化数据选择策略（我们称之为“主动学习调度程序”）进行增强，以通过有限数量的训练数据来促进微调过程。我们将 AL 与数据集群连接起来，并提出了一个集成的微调管道，系统地结合了 AL、集群和动态数据选择调度程序，以增强模型的性能。斯洛伐克语、马耳他语、冰岛语和土耳其语的实验表明，在微调阶段使用聚类与 AL 调度相结合，可以同时节省高达 30% 的注释，提高高达 4 个 F1 分数，同时还提供更好的微调稳定性。

Title: MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages

Authors: Yexing Du, Kaiyuan Liu, Youcheng Pan, Bo Yang, Keqi Deng, Xie Chen, Yang Xiang, Ming Liu, Bin Qin, YaoWei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.01512
Pdf URL: https://arxiv.org/pdf/2512.01512
Copy Paste: [[2512.01512]] MCAT: Scaling Many-to-Many Speech-to-Text Translation with MLLMs to 70 Languages(https://arxiv.org/abs/2512.01512)
Keywords: language model, llm
Abstract: Multimodal Large Language Models (MLLMs) have achieved great success in Speech-to-Text Translation (S2TT) tasks. However, current research is constrained by two key challenges: language coverage and efficiency. Most of the popular S2TT datasets are substantially English-centric, which restricts the scaling-up of MLLMs' many-to-many translation capabilities. Moreover, the inference speed of MLLMs degrades dramatically when the speech is converted into long sequences (e.g., 750 tokens). To address these limitations, we propose a Multilingual Cost-effective Accelerated Speech-to-Text Translator (MCAT) framework, which includes two innovations. First, a language scaling method that leverages curriculum learning and a data balancing strategy is introduced to extend the language coverage supported by MLLMs to 70 languages and achieve mutual translation among these languages. Second, an optimized speech adapter module is designed to reduce the length of the speech sequence to only 30 tokens. Extensive experiments were conducted on MLLMs of different scales (9B and 27B). The experimental results demonstrate that MCAT not only surpasses state-of-the-art end-to-end models on the FLEURS dataset across 70x69 directions but also enhances batch inference efficiency. This is achieved with only ~100M trainable parameters and by using only 10 hours of S2TT data per language. Furthermore, we have released MCAT as open-source to promote the development of MLLMs for robust S2TT capabilities. The code and models are released at this https URL.
摘要：多模态大语言模型 (MLLM) 在语音到文本翻译 (S2TT) 任务中取得了巨大成功。然而，当前的研究受到两个关键挑战的限制：语言覆盖范围和效率。大多数流行的 S2TT 数据集基本上以英语为中心，这限制了 MLLM 多对多翻译能力的扩展。此外，当语音转换为长序列（例如 750 个标记）时，MLLM 的推理速度会急剧下降。为了解决这些限制，我们提出了一种具有成本效益的多语言加速语音到文本翻译器 (MCAT) 框架，其中包括两项创新。首先，引入利用课程学习和数据平衡策略的语言扩展方法，将MLLM支持的语言覆盖范围扩展到70种语言，并实现这些语言之间的相互翻译。其次，设计了优化的语音适配器模块，将语音序列的长度减少到仅 30 个标记。在不同规模的 MLLM（9B 和 27B）上进行了大量的实验。实验结果表明，MCAT 不仅超越了 FLEURS 数据集上 70x69 方向上最先进的端到端模型，而且还提高了批量推理效率。这是通过仅约 100M 可训练参数以及每种语言仅使用 10 小时的 S2TT 数据来实现的。此外，我们还开源了 MCAT，以促进 MLLM 的开发，以实现强大的 S2TT 功能。代码和模型在此 https URL 发布。
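As a quick check on the figures quoted in the abstract above, the "70x69 directions" count corresponds to every ordered pair of distinct languages among 70. The sketch below enumerates those pairs and also computes the sequence-length ratio implied by the abstract's example figures (750 tokens reduced to 30); the variable names are illustrative only.

```python
from itertools import permutations

num_languages = 70

# Each translation direction is an ordered (source, target) pair with
# source != target, so the count is 70 * 69.
directions = list(permutations(range(num_languages), 2))
num_directions = len(directions)

# Length reduction implied by the abstract's example figures.
tokens_before = 750
tokens_after = 30
compression_ratio = tokens_before / tokens_after

print(num_directions)      # 4830
print(compression_ratio)   # 25.0
```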

Title: Language Diversity: Evaluating Language Usage and AI Performance on African Languages in Digital Spaces

Authors: Edward Ajayi, Eudoxie Umwari, Mawuli Deku, Prosper Singadi, Jules Udahemuka, Bekalu Tadele, Chukuemeka Edeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.01557
Pdf URL: https://arxiv.org/pdf/2512.01557
Copy Paste: [[2512.01557]] Language Diversity: Evaluating Language Usage and AI Performance on African Languages in Digital Spaces(https://arxiv.org/abs/2512.01557)
Keywords: language model, llm, prompt
Abstract: This study examines the digital representation of African languages and the challenges this presents for current language detection tools. We evaluate their performance on Yoruba, Kinyarwanda, and Amharic. While these languages are spoken by millions, their online usage on conversational platforms is often sparse, heavily influenced by English, and not representative of the authentic, monolingual conversations prevalent among native speakers. This lack of readily available authentic data online makes conversational data scarce for training language models. To investigate this, data was collected from subreddits and local news sources for each language. The analysis showed a stark contrast between the two sources. Reddit data was minimal and characterized by heavy code-switching. Conversely, local news media offered a robust source of clean, monolingual language data, which also prompted more user engagement in the local language on the news publishers' social media pages. Language detection models, including the specialized AfroLID and a general LLM, performed with near-perfect accuracy on the clean news data but struggled with the code-switched Reddit posts. The study concludes that professionally curated news content is a more reliable and effective source for training context-rich AI models for African languages than data from conversational platforms. It also highlights the need for future models that can process both clean and code-switched text to improve the detection accuracy for African languages.
摘要：这项研究探讨了非洲语言的数字表示以及这给当前语言检测工具带来的挑战。我们评估他们在约鲁巴语、基尼亚卢旺达语和阿姆哈拉语方面的表现。虽然这些语言有数百万人使用，但它们在对话平台上的在线使用量往往很少，而且深受英语的影响，并且不能代表母语人士中普遍存在的真实的单语对话。由于缺乏在线可用的真实数据，因此给训练语言模型带来了对话数据稀缺的挑战。为了调查这一点，我们从每种语言的子版块和当地新闻来源收集了数据。分析显示两个来源之间存在鲜明对比。 Reddit 数据很少，而且代码转换频繁。相反，当地新闻媒体提供了干净、单语语言数据的强大来源，这也促使更多用户在新闻出版商的社交媒体页面上使用当地语言。语言检测模型，包括专门的 AfroLID 和通用法学硕士，在干净的新闻数据上表现得近乎完美，但在处理代码转换的 Reddit 帖子时却遇到了困难。该研究的结论是，与对话平台的数据相比，专业策划的新闻内容是训练非洲语言上下文丰富的人工智能模型更可靠、更有效的来源。它还强调了未来模型需要能够处理干净的代码转换文本，以提高非洲语言的检测准确性。

Title: MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark

Authors: Yuezhang Peng, Chonghao Cai, Ziang Liu, Shuai Fan, Sheng Jiang, Hua Xu, Yuxin Liu, Qiguang Chen, Kele Xu, Yao Li, Sheng Wang, Libo Qin, Xie Chen
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2512.01603
Pdf URL: https://arxiv.org/pdf/2512.01603
Copy Paste: [[2512.01603]] MAC-SLU: Multi-Intent Automotive Cabin Spoken Language Understanding Benchmark(https://arxiv.org/abs/2512.01603)
Keywords: language model, llm
Abstract: Spoken Language Understanding (SLU), which aims to extract user semantics to execute downstream tasks, is a crucial component of task-oriented dialog systems. Existing SLU datasets generally lack sufficient diversity and complexity, and there is an absence of a unified benchmark for the latest Large Language Models (LLMs) and Large Audio Language Models (LALMs). This work introduces MAC-SLU, a novel Multi-Intent Automotive Cabin Spoken Language Understanding Dataset, which increases the difficulty of the SLU task by incorporating authentic and complex multi-intent data. Based on MAC-SLU, we conducted a comprehensive benchmark of leading open-source LLMs and LALMs, covering methods like in-context learning, supervised fine-tuning (SFT), and end-to-end (E2E) and pipeline paradigms. Our experiments show that while LLMs and LALMs have the potential to complete SLU tasks through in-context learning, their performance still lags significantly behind SFT. Meanwhile, E2E LALMs demonstrate performance comparable to pipeline approaches and effectively avoid error propagation from speech recognition. Code and datasets are released publicly.
摘要：口语理解（SLU）旨在提取用户语义来执行下游任务，是面向任务的对话系统的重要组成部分。现有的SLU数据集普遍缺乏足够的多样性和复杂性，并且对于最新的大型语言模型（LLM）和大型音频语言模型（LALM）缺乏统一的基准。这项工作引入了 MAC-SLU，一种新颖的多意图汽车舱口语理解数据集，它通过合并真实且复杂的多意图数据来增加 SLU 任务的难度。基于MAC-SLU，我们对领先的开源LLM和LALM进行了全面的基准测试，涵盖上下文学习、监督微调（SFT）、端到端（E2E）和管道范式等方法。我们的实验表明，虽然 LLM 和 LALM 有潜力通过上下文学习完成 SLU 任务，但它们的性能仍然明显落后于 SFT。同时，E2E LALM 表现出与管道方法相当的性能，并有效避免语音识别中的错误传播。代码和数据集已公开发布。

Title: Learning the Boundary of Solvability: Aligning LLMs to Detect Unsolvable Problems

Authors: Dengyun Peng, Qiguang Chen, Bofei Liu, Jiannan Guan, Libo Qin, Zheng Yan, Jinhao Liu, Jianshu Zhang, Wanxiang Che
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.01661
Pdf URL: https://arxiv.org/pdf/2512.01661
Copy Paste: [[2512.01661]] Learning the Boundary of Solvability: Aligning LLMs to Detect Unsolvable Problems(https://arxiv.org/abs/2512.01661)
Keywords: llm, hallucination
Abstract: Ensuring LLM reliability requires not only solving complex problems but also recognizing when a problem is unsolvable. Current models often struggle to distinguish objective unsolvability (inherent contradictions in the problem) from subjective capability limitations (problems beyond the model's competence), which leads to hallucinations and overconfidence. To address this, we propose UnsolvableQA and UnsolvableRL to solve feasible problems, detect inherent contradictions, and prudently refuse tasks beyond capability. Specifically, we construct UnsolvableQA, a dataset of paired solvable and unsolvable instances derived via a dual-track methodology: programmatic generation for logic puzzles and a novel "Reverse Construction" method that injects contradictions into valid reasoning chains for mathematics. Building on this dataset, we introduce UnsolvableRL, a reinforcement learning framework with three reward components jointly accounting for accuracy, unsolvability, and difficulty. Empirical results show that our approach achieves near-perfect unsolvability detection while also improving accuracy on solvable tasks. Crucially, we identify Capability Collapse, demonstrating that explicit exposure to unsolvable data is indispensable for preventing models from becoming systematically overconfident. Our code and data are available at this https URL.
摘要：确保 LLM 的可靠性不仅需要解决复杂的问题，还需要识别问题何时无法解决。当前的模型常常难以区分客观的不可解决性（问题的内在矛盾）和主观的能力限制（超出模型能力的问题），这会导致幻觉和过度自信。为了解决这个问题，我们提出 UnsolvableQA 和 UnsolvableRL 来解决可行的问题，检测内在的矛盾，并谨慎地拒绝超出能力的任务。具体来说，我们构建了 UnsolvableQA，这是通过双轨方法导出的成对可解和不可解实例的数据集：逻辑谜题的编程生成和一种新颖的“反向构造”方法，该方法将矛盾注入到有效的数学推理链中。在此数据集的基础上，我们引入了 UnsolvableRL，这是一个强化学习框架，具有三个奖励组件，共同考虑了准确性、不可解性和难度。经验结果表明，我们的方法实现了近乎完美的不可解性检测，同时还提高了可解任务的准确性。至关重要的是，我们确定了能力崩溃，这表明明确暴露于无法解决的数据对于防止模型变得系统性过度自信是必不可少的。我们的代码和数据可在此 https URL 中获取。

Title: MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications

Authors: Stefano Zeppieri
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2512.01710
Pdf URL: https://arxiv.org/pdf/2512.01710
Copy Paste: [[2512.01710]] MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications(https://arxiv.org/abs/2512.01710)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Models (LLMs) excel at generating coherent text within a single prompt but fall short in sustaining relevance, personalization, and continuity across extended interactions. Human communication, however, relies on multiple forms of memory, from recalling past conversations to adapting to personal traits and situational context. This paper introduces the Mixed Memory-Augmented Generation (MMAG) pattern, a framework that organizes memory for LLM-based agents into five interacting layers: conversational, long-term user, episodic and event-linked, sensory and context-aware, and short-term working memory. Drawing inspiration from cognitive psychology, we map these layers to technical components and outline strategies for coordination, prioritization, and conflict resolution. We demonstrate the approach through its implementation in the Heero conversational agent, where encrypted long-term bios and conversational history already improve engagement and retention. We further discuss implementation concerns around storage, retrieval, privacy, and latency, and highlight open challenges. MMAG provides a foundation for building memory-rich language agents that are more coherent, proactive, and aligned with human needs.
摘要：大型语言模型 (LLM) 擅长在单个提示中生成连贯的文本，但在维持扩展交互中的相关性、个性化和连续性方面存在不足。然而，人类交流依赖于多种形式的记忆，从回忆过去的对话到适应个人特征和情境背景。本文介绍了混合记忆增强生成（MMAG）模式，该框架将基于 LLM 的代理的记忆组织为五个交互层：会话、长期用户、情景和事件相关、感觉和情境感知以及短期工作记忆。从认知心理学中汲取灵感，我们将这些层映射到技术组件，并概述协调、优先级和冲突解决的策略。我们通过在 Heero 对话代理中实施该方法来演示该方法，其中加密的长期简历和对话历史记录已经提高了参与度和保留率。我们进一步讨论有关存储、检索、隐私和延迟的实施问题，并强调开放的挑战。 MMAG 为构建记忆丰富的语言代理奠定了基础，这些语言代理更加连贯、主动且符合人类需求。

Title: Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks

Authors: Jiannan Guan, Qiguang Chen, Libo Qin, Dengyun Peng, Jinhao Liu, Liangyu Huo, Jian Xie, Wanxiang Che
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.01725
Pdf URL: https://arxiv.org/pdf/2512.01725
Copy Paste: [[2512.01725]] Beware of Reasoning Overconfidence: Pitfalls in the Reasoning Process for Multi-solution Tasks(https://arxiv.org/abs/2512.01725)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) excel in reasoning tasks requiring a single correct answer, but they perform poorly in multi-solution tasks that require generating comprehensive and diverse answers. We attribute this limitation to reasoning overconfidence: a tendency to express undue certainty in an incomplete solution set. To examine the effect, we introduce MuSoBench, a benchmark of multi-solution problems. Experiments show that the conventional short chain-of-thought (Short-CoT) prompting paradigm exhibits pronounced overconfidence, whereas the emerging long chain-of-thought (Long-CoT) approach mitigates it through iterative exploration and self-reflection. We further characterise observable behaviours and influential factors. To probe the underlying cause, we propose the cognitive-rigidity hypothesis, which posits that overconfidence arises when the reasoning process prematurely converges on a narrow set of thought paths. An attention-entropy analysis offers preliminary support for this view. These findings provide tools for assessing the completeness of LLM reasoning and highlight the need to move evaluation beyond single-answer accuracy toward comprehensive exploration.
摘要：大型语言模型 (LLM) 在需要单一正确答案的推理任务中表现出色，但在需要生成全面且多样化答案的多解决方案任务中表现不佳。我们将此限制归因于推理过度自信：在不完整的解决方案集中表达过度确定性的倾向。为了检查效果，我们引入了 MuSoBench，一个多解问题的基准。实验表明，传统的短思维链（Short-CoT）提示范式表现出明显的过度自信，而新兴的长思维链（Long-CoT）方法通过迭代探索和自我反思来缓解这种过度自信。我们进一步描述可观察的行为和影响因素。为了探究根本原因，我们提出了认知刚性假说，该假说认为，当推理过程过早地收敛于一组狭窄的思维路径时，就会出现过度自信。注意力熵分析为这一观点提供了初步支持。这些发现提供了评估 LLM 推理完整性的工具，并强调需要将评估从单一答案的准确性转向全面探索。

Title: InnoGym: Benchmarking the Innovation Potential of AI Agents

Authors: Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu, Yujie Luo, Lanning Wei, Shuofei Qiao, Lun Du, Da Zheng, Shumin Deng, Huajun Chen, Ningyu Zhang
Subjects: cs.CL, cs.AI, cs.CV, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2512.01822
Pdf URL: https://arxiv.org/pdf/2512.01822
Copy Paste: [[2512.01822]] InnoGym: Benchmarking the Innovation Potential of AI Agents(https://arxiv.org/abs/2512.01822)
Keywords: llm, agent
Abstract: LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
摘要：LLM 和智能体在代码生成、数学推理和科学发现方面取得了令人瞩目的进步。然而，现有的基准主要衡量正确性，忽视了解决方案背后方法的多样性。真正的创新不仅取决于产生正确的答案，还取决于方法的独创性。我们推出了 InnoGym，这是第一个旨在系统评估 AI 智能体创新潜力的基准和框架。InnoGym 引入了两个互补的指标：性能增益（衡量相对于已知最佳解决方案的改进）和新颖性（捕获与先前方法的方法差异）。该基准包括来自现实世界工程和科学领域的 18 项精心策划的任务，每项任务都通过资源过滤、评估器验证和解决方案收集进行标准化。此外，我们还提供 iGym，这是一个用于可重复和长期评估的统一执行环境。大量的实验表明，虽然一些智能体产生了新颖的方法，但它们缺乏鲁棒性限制了性能的提升。这些结果凸显了创造力和有效性之间的关键差距，强调需要制定评估两者的基准。

Title: Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability

Authors: Jinghan Jia, Nathalie Baracaldo, Sijia Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.01848
Pdf URL: https://arxiv.org/pdf/2512.01848
Copy Paste: [[2512.01848]] Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability(https://arxiv.org/abs/2512.01848)
Keywords: language model, chain-of-thought
Abstract: Large reasoning models (LRMs) extend large language models by generating explicit chain-of-thought (CoT) reasoning, significantly improving mathematical and logical problem solving. However, this explicit reasoning process also introduces new safety risks, as unsafe behaviors often emerge within intermediate reasoning trajectories, even when final answers appear harmless. Existing safety alignment approaches primarily rely on supervised fine-tuning (SFT) over safety-oriented long CoT datasets. While intuitive, we find that SFT produces inconsistent safety improvements, degrades reasoning ability, and generalizes poorly across model families. These limitations suggest that purely supervised approaches are insufficient for robust safety alignment in LRMs. To address this, we investigate reinforcement learning (RL) as a complementary optimization framework for LRM safety training. Unlike SFT, RL directly optimizes model policies with reward feedback, enabling more adaptive and stable alignment. Extensive experiments across multiple model families and benchmarks show that RL achieves stronger and more consistent safety gains while maintaining reasoning competence. Further analysis of reflection dynamics and token-level entropy reveals that RL suppresses unsafe exploratory reasoning while preserving reflective depth, leading to safer and more reliable reasoning processes.
摘要：大型推理模型 (LRM) 通过生成显式的思想链 (CoT) 推理来扩展大型语言模型，从而显着改进数学和逻辑问题的解决。然而，这种显式推理过程也引入了新的安全风险，因为即使最终答案看起来无害，不安全行为也经常出现在中间推理轨迹中。现有的安全对齐方法主要依赖于对面向安全的长 CoT 数据集的监督微调 (SFT)。虽然很直观，但我们发现 SFT 产生的安全性改进不一致，降低了推理能力，并且在模型系列之间的泛化能力很差。这些局限性表明，纯粹的监督方法不足以实现 LRM 中稳健的安全调整。为了解决这个问题，我们研究了强化学习 (RL) 作为 LRM 安全训练的补充优化框架。与 SFT 不同，强化学习通过奖励反馈直接优化模型策略，从而实现更具适应性和稳定的对齐。跨多个模型系列和基准的大量实验表明，强化学习在保持推理能力的同时实现了更强大、更一致的安全增益。对反射动力学和令牌级熵的进一步分析表明，强化学习在保留反射深度的同时抑制了不安全的探索性推理，从而实现更安全、更可靠的推理过程。
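
Note (illustrative, not from the paper): the abstract above refers to token-level entropy without defining it here. A common reading is the Shannon entropy of the next-token distribution at each generation step. The sketch below assumes raw logits are available for one step; the function name `token_entropy` is hypothetical and the paper's exact definition may differ.

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over one token's logits."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Skip zero-probability entries, since 0 * log(0) is taken to be 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A flat distribution is maximally uncertain; a peaked one is nearly certain.
flat = token_entropy([0.0, 0.0, 0.0, 0.0])
peaked = token_entropy([100.0, 0.0, 0.0])
```

Under this reading, high entropy at a step means the model is spreading probability over many continuations, and low entropy means it has committed to one.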

Title: BHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian Languages

Authors: Hrishikesh Terdalkar, Kirtan Bhojani, Aryan Dongare, Omm Aditya Behera
Subjects: cs.CL, cs.AI, cs.ET
Abstract URL: https://arxiv.org/abs/2512.01852
Pdf URL: https://arxiv.org/pdf/2512.01852
Copy Paste: [[2512.01852]] BHRAM-IL: A Benchmark for Hallucination Recognition and Assessment in Multiple Indian Languages(https://arxiv.org/abs/2512.01852)
Keywords: language model, llm, hallucination
Abstract: Large language models (LLMs) are increasingly deployed in multilingual applications but often generate plausible yet incorrect or misleading outputs, known as hallucinations. While hallucination detection has been studied extensively in English, under-resourced Indian languages remain largely unexplored. We present BHRAM-IL, a benchmark for hallucination recognition and assessment in multiple Indian languages, covering Hindi, Gujarati, Marathi, and Odia, along with English. The benchmark comprises 36,047 curated questions across nine categories spanning factual, numerical, reasoning, and linguistic tasks. We evaluate 14 state-of-the-art multilingual LLMs on a benchmark subset of 10,265 questions, analyzing cross-lingual and factual hallucinations across languages, models, scales, categories, and domains using category-specific metrics normalized to the (0,1) range. Aggregation over all categories and models yields a primary score of 0.23 and a language-corrected fuzzy score of 0.385, demonstrating the usefulness of BHRAM-IL for hallucination-focused evaluation. The dataset and the code for generation and evaluation are available on GitHub (this https URL) and HuggingFace (this https URL) to support future research in multilingual hallucination detection and mitigation.
摘要：大型语言模型 (LLM) 越来越多地部署在多语言应用程序中，但通常会生成看似合理但不正确或具有误导性的输出，称为幻觉。虽然幻觉检测已经在英语中得到了广泛的研究，但资源匮乏的印度语言在很大程度上仍未得到探索。我们推出了 BHRAM-IL，这是多种印度语言的幻觉识别和评估基准，涵盖印地语、古吉拉特语、马拉地语、奥迪亚语以及英语。该基准包括 36,047 个精选问题，涉及事实、数字、推理和语言任务的九个类别。我们在 10,265 个问题的基准子集上评估了 14 个最先进的多语言 LLM，使用归一化到 (0,1) 范围的特定类别指标来分析跨语言、模型、规模、类别和领域的跨语言和事实幻觉。所有类别和模型的聚合产生 0.23 的主要分数和 0.385 的语言校正模糊分数，证明了 BHRAM-IL 对于以幻觉为中心的评估的有用性。数据集以及生成和评估的代码可在 GitHub（此 https URL）和 HuggingFace（此 https URL）上获取，以支持多语言幻觉检测和缓解的未来研究。

Title: Cross-Lingual Interleaving for Speech Language Models

Authors: Adel Moumen, Guangzhi Sun, Philip C. Woodland
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.01865
Pdf URL: https://arxiv.org/pdf/2512.01865
Copy Paste: [[2512.01865]] Cross-Lingual Interleaving for Speech Language Models(https://arxiv.org/abs/2512.01865)
Keywords: language model, gpt
Abstract: Spoken Language Models (SLMs) aim to learn linguistic competence directly from speech using discrete units, widening access to Natural Language Processing (NLP) technologies for languages with limited written resources. However, progress has been largely English-centric due to scarce spoken evaluation benchmarks and training data, making cross-lingual learning difficult. We present a cross-lingual interleaving method that mixes speech tokens across languages without textual supervision. We also release an EN-FR training dataset, TinyStories (~42k hours), together with EN-FR spoken StoryCloze and TopicCloze benchmarks for cross-lingual semantic evaluation, both synthetically generated using GPT-4. On 360M and 1B SLMs under matched training-token budgets, interleaving improves monolingual semantic accuracy, enables robust cross-lingual continuation, and strengthens cross-lingual hidden-state alignment. Taken together, these results indicate that cross-lingual interleaving is a simple, scalable route to building multilingual SLMs that understand and converse across languages. All resources will be made open-source to support reproducibility.
摘要：口语语言模型 (SLM) 旨在使用离散单元直接从语音中学习语言能力，从而为书面资源有限的语言拓宽自然语言处理 (NLP) 技术的使用范围。然而，由于口语评估基准和训练数据稀缺，进展主要以英语为中心，使得跨语言学习变得困难。我们提出了一种跨语言交错方法，可以在没有文本监督的情况下混合跨语言的语音标记。我们还发布了 EN-FR 训练数据集 TinyStories（约 42k 小时），以及用于跨语言语义评估的 EN-FR 口语 StoryCloze 和 TopicCloze 基准，两者都是使用 GPT-4 综合生成的。在匹配训练令牌预算下的 360M 和 1B SLM 上，交错提高了单语言语义准确性，实现了强大的跨语言延续，并加强了跨语言隐藏状态对齐。总而言之，这些结果表明，跨语言交错是构建跨语言理解和对话的多语言 SLM 的一种简单、可扩展的途径。所有资源都将开源以支持可重复性。

Title: Exploring Human Perceptions of AI Responses: Insights from a Mixed-Methods Study on Risk Mitigation in Generative Models

Authors: Heloisa Candello, Muneeza Azmat, Uma Sushmitha Gunturi, Raya Horesh, Rogerio Abreu de Paula, Heloisa Pimentel, Marcelo Carpinette Grave, Aminat Adebiyi, Tiago Machado, Maysa Malfiza Garcia de Macedo
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2512.01892
Pdf URL: https://arxiv.org/pdf/2512.01892
Copy Paste: [[2512.01892]] Exploring Human Perceptions of AI Responses: Insights from a Mixed-Methods Study on Risk Mitigation in Generative Models(https://arxiv.org/abs/2512.01892)
Keywords: llm
Abstract: With the rapid uptake of generative AI, investigating human perceptions of generated responses has become crucial. A major challenge is the 'aptitude' of these models for hallucinating and generating harmful content. Despite major efforts to implement guardrails, human perceptions of these mitigation strategies are largely unknown. We conducted a mixed-method experiment for evaluating the responses of a mitigation strategy across multiple dimensions: faithfulness, fairness, harm-removal capacity, and relevance. In a within-subject study design, 57 participants assessed the responses under two conditions: the harmful response together with its mitigation, and the mitigated response alone. Results revealed that participants' native language, AI work experience, and annotation familiarity significantly influenced evaluations. Participants showed high sensitivity to linguistic and contextual attributes, penalizing minor grammar errors while rewarding preserved semantic contexts. This contrasts with how language is often treated in the quantitative evaluation of LLMs. We also introduced new metrics for training and evaluating mitigation strategies and insights for human-AI evaluation studies.
摘要：随着生成式人工智能的快速普及，研究人类对生成响应的感知变得至关重要。一个主要的挑战是这些模型产生幻觉和生成有害内容的“能力”。尽管在实施护栏方面做出了巨大努力，但人们对这些缓解策略的看法在很大程度上还是未知的。我们进行了一项混合方法实验，从多个维度评估缓解策略的响应：忠实性、公平性、消除危害能力和相关性。在一项受试者内研究设计中，57 名参与者评估了两种条件下的响应：有害响应连同其缓解结果，以及仅有缓解后的响应。结果显示，参与者的母语、人工智能工作经验和标注熟悉程度对评估有显著影响。参与者对语言和上下文属性表现出高度敏感，惩罚轻微的语法错误，同时奖励保留的语义上下文。这与 LLM 定量评估中通常对待语言的方式形成鲜明对比。我们还引入了用于训练和评估缓解策略的新指标，以及对人类与人工智能评估研究的见解。

Title: OPOR-Bench: Evaluating Large Language Models on Online Public Opinion Report Generation

Authors: Jinzheng Yu, Yang Xu, Haozhen Li, Junqi Li, Yifan Feng, Ligu Zhu, Hao Shen, Lei Shi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.01896
Pdf URL: https://arxiv.org/pdf/2512.01896
Copy Paste: [[2512.01896]] OPOR-Bench: Evaluating Large Language Models on Online Public Opinion Report Generation(https://arxiv.org/abs/2512.01896)
Keywords: language model, agent
Abstract: Online Public Opinion Reports consolidate news and social media for timely crisis management by governments and enterprises. While large language models have made automated report generation technically feasible, systematic research in this specific area remains notably absent, particularly lacking formal task definitions and corresponding benchmarks. To bridge this gap, we define the Automated Online Public Opinion Report Generation (OPOR-GEN) task and construct OPOR-BENCH, an event-centric dataset covering 463 crisis events with their corresponding news articles, social media posts, and a reference summary. To evaluate report quality, we propose OPOR-EVAL, a novel agent-based framework that simulates human expert evaluation by analyzing generated reports in context. Experiments with frontier models demonstrate that our framework achieves high correlation with human judgments. Our comprehensive task definition, benchmark dataset, and evaluation framework provide a solid foundation for future research in this critical domain.
摘要：在线舆情报告整合新闻和社交媒体，帮助政府和企业及时进行危机管理。虽然大型语言模型使自动报告生成在技术上变得可行，但该特定领域的系统研究仍然明显缺乏，特别是缺乏正式的任务定义和相应的基准。为了弥补这一差距，我们定义了自动在线舆情报告生成 (OPOR-GEN) 任务并构建了 OPOR-BENCH，这是一个以事件为中心的数据集，涵盖 463 个危机事件及其相应的新闻文章、社交媒体帖子和参考摘要。为了评估报告质量，我们提出了 OPOR-EVAL，这是一种基于代理的新型框架，通过在上下文中分析生成的报告来模拟人类专家评估。前沿模型的实验表明，我们的框架与人类判断实现了高度相关。我们全面的任务定义、基准数据集和评估框架为这一关键领域的未来研究奠定了坚实的基础。

Title: Latent Debate: A Surrogate Framework for Interpreting LLM Thinking

Authors: Lihu Chen, Xiang Yin, Francesca Toni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.01909
Pdf URL: https://arxiv.org/pdf/2512.01909
Copy Paste: [[2512.01909]] Latent Debate: A Surrogate Framework for Interpreting LLM Thinking(https://arxiv.org/abs/2512.01909)
Keywords: language model, llm, hallucination, agent
Abstract: Understanding the internal thinking process of Large Language Models (LLMs) and the cause of hallucinations remains a key challenge. To this end, we introduce latent debate, a novel framework for interpreting model predictions through the lens of implicit internal arguments. Unlike current work on self-consistency and multi-agent debate, which relies on explicit debates among multiple answers or multiple models, latent debate captures the hidden supporting and attacking signals that arise within a single model during a single inference. We first present a model- and task-agnostic conceptual framework, and then instantiate it symbolically to approximate the thinking process of LLMs on True/False prediction tasks. Empirical studies demonstrate that latent debate is a faithful structured surrogate model whose predictions are highly consistent with those of the original LLM. Beyond interpretability, we demonstrate that latent debate provides a strong baseline for hallucination detection. Further analysis reveals strong correlations between hallucinations and debate patterns; for example, a high degree of latent debate in the middle layers is linked to a higher risk of hallucination. These findings position latent debate as a potential framework for understanding internal mechanisms of LLMs, especially for scenarios where internal (dis)agreements appear during the inference steps.
摘要：理解大型语言模型（LLM）的内部思维过程和幻觉的原因仍然是一个关键挑战。为此，我们引入了潜在辩论，这是一种通过隐含内部论点解释模型预测的新颖框架。与当前的自洽和多智能体辩论工作不同，这些工作依赖于多个答案或多个模型之间的明确辩论，潜在辩论捕获了单个推理过程中单个模型中出现的隐藏支持和攻击信号。我们首先提出一个与模型和任务无关的概念框架，然后将其符号化地实例化，以近似 LLM 在真/假预测任务上的思维过程。实证研究表明，潜在辩论是一种忠实的结构化替代模型，与原始 LLM 的预测高度一致。除了可解释性之外，我们还证明潜在辩论为幻觉检测提供了强有力的基线。进一步的分析表明，幻觉和辩论模式之间存在很强的相关性，例如中间层的高度潜在辩论与较高的幻觉风险有关。这些发现将潜在辩论定位为理解 LLM 内部机制的潜在框架，特别是对于在推理步骤中出现内部（不）一致的情况。

Title: Rectifying LLM Thought from Lens of Optimization

Authors: Junnan Liu, Hongwei Liu, Songyang Zhang, Kai Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2512.01925
Pdf URL: https://arxiv.org/pdf/2512.01925
Copy Paste: [[2512.01925]] Rectifying LLM Thought from Lens of Optimization(https://arxiv.org/abs/2512.01925)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Recent advancements in large language models (LLMs) have been driven by their emergent reasoning capabilities, particularly through long chain-of-thought (CoT) prompting, which enables thorough exploration and deliberation. Despite these advances, long-CoT LLMs often exhibit suboptimal reasoning behaviors, such as overthinking and excessively protracted reasoning chains, which can impair performance. In this paper, we analyze reasoning processes through an optimization lens, framing CoT as a gradient descent procedure where each reasoning step constitutes an update toward problem resolution. Building on this perspective, we introduce RePro (Rectifying Process-level Reward), a novel approach to refine LLM reasoning during post-training. RePro defines a surrogate objective function to assess the optimization process underlying CoT, utilizing a dual scoring mechanism to quantify its intensity and stability. These scores are aggregated into a composite process-level reward, seamlessly integrated into reinforcement learning with verifiable rewards (RLVR) pipelines to optimize LLMs. Extensive experiments across multiple reinforcement learning algorithms and diverse LLMs, evaluated on benchmarks spanning mathematics, science, and coding, demonstrate that RePro consistently enhances reasoning performance and mitigates suboptimal reasoning behaviors.
摘要：大型语言模型 (LLM) 的最新进展是由其涌现的推理能力推动的，特别是通过长思维链 (CoT) 提示，这使得彻底的探索和深思熟虑成为可能。尽管取得了这些进步，但长 CoT LLM 经常表现出次优的推理行为，例如过度思考和过度拖延的推理链，这可能会损害性能。在本文中，我们通过优化视角分析推理过程，将 CoT 视为梯度下降过程，其中每个推理步骤都构成朝向问题解决的一次更新。基于这个观点，我们引入了 RePro（Rectifying Process-level Reward，纠正过程级奖励），这是一种在后训练阶段完善 LLM 推理的新方法。RePro 定义了一个替代目标函数来评估 CoT 的优化过程，利用双重评分机制来量化其强度和稳定性。这些分数被汇总成一个复合的过程级奖励，无缝集成到具有可验证奖励的强化学习 (RLVR) 管道中，以优化 LLM。跨多种强化学习算法和不同 LLM 的广泛实验，根据数学、科学和编码的基准进行评估，表明 RePro 持续增强推理性能并减少次优推理行为。

Title: How Far Are We from Genuinely Useful Deep Research Agents?

Authors: Dingling Zhang, He Zhu, Jincheng Ren, Kangqi Song, Xinran Zhou, Boyu Feng, Shudong Liu, Jiabin Luo, Weihao Xie, Zhaohui Wang, Tianrui Qin, King Zhu, Yuqing Wang, Qianben Chen, Yuchen Eleanor Jiang, Wei Wang, Jiaheng Liu, Wangchunshu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.01948
Pdf URL: https://arxiv.org/pdf/2512.01948
Copy Paste: [[2512.01948]] How Far Are We from Genuinely Useful Deep Research Agents?(https://arxiv.org/abs/2512.01948)
Keywords: llm, agent
Abstract: Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. However, most existing DRAs were validated on question-answering benchmarks, while research on generating comprehensive reports remains overlooked. Worse, current benchmarks for report synthesis suffer from task complexity and subjective metrics -- this fails to reflect user demands and limits the practical utility of generated reports. To address these gaps, we present Fine-grained DEepResearch bench (FINDER), an enhanced benchmark consisting of 100 human-curated research tasks with 419 structured checklist items that standardize report structure, analytical depth, and factual grounding. Based on approximately 1,000 reports produced by mainstream DRAs, we further propose Deep rEsearch Failure Taxonomy (DEFT), the first failure taxonomy for deep research agents. DEFT contains 14 fine-grained failure modes across reasoning, retrieval, and generation, and is built upon grounded theory with human-LLM co-annotating and inter-annotator reliability validation. Our experimental findings reveal that current DRAs struggle not with task comprehension but with evidence integration, verification, and reasoning-resilient planning.
摘要：深度研究代理 (DRA) 旨在通过迭代信息检索和综合自动生成分析师级别的报告。然而，大多数现有的 DRA 都是在问答基准上进行验证的，而生成综合报告的研究仍然被忽视。更糟糕的是，当前的报告合成基准受到任务复杂性和主观指标的影响——这无法反映用户需求并限制了生成报告的实际效用。为了弥补这些差距，我们推出了细粒度深度研究基准 (FINDER)，这是一个增强的基准，由 100 项人工策划的研究任务和 419 个结构化清单项目组成，这些项目标准化了报告结构、分析深度和事实基础。基于主流 DRA 生成的约 1,000 份报告，我们进一步提出深度研究失败分类法（DEFT），这是深度研究代理的第一个失败分类法。DEFT 包含 14 种跨越推理、检索和生成的细粒度失败模式，并建立在扎根理论之上，采用人类与 LLM 联合标注和标注者间可靠性验证。我们的实验结果表明，当前的 DRA 并非在任务理解方面存在困难，而是在证据整合、验证和推理弹性规划方面存在困难。

Title: The Art of Scaling Test-Time Compute for Large Language Models

Authors: Aradhye Agarwal, Ayan Sengupta, Tanmoy Chakraborty
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2512.02008
Pdf URL: https://arxiv.org/pdf/2512.02008
Copy Paste: [[2512.02008]] The Art of Scaling Test-Time Compute for Large Language Models(https://arxiv.org/abs/2512.02008)
Keywords: language model, llm
Abstract: Test-time scaling (TTS) -- the dynamic allocation of compute during inference -- is a promising direction for improving reasoning in large language models (LLMs). However, a systematic comparison of well-known TTS strategies under identical conditions is missing, and the influence of model type and problem difficulty on performance remains unclear. To address these gaps, we conduct the first large-scale study of TTS, spanning over thirty billion tokens generated using eight open-source LLMs (7B to 235B parameters), across four reasoning datasets. We observe three consistent trends: (1) no single TTS strategy universally dominates; (2) reasoning models exhibit distinct trace-quality patterns across problem difficulty and trace length, forming short-horizon and long-horizon categories; and (3) for a given model type, the optimal TTS performance scales monotonically with compute budget. Based on these insights, we provide a practical recipe for selecting the best TTS strategy, considering problem difficulty, model type, and compute budget, providing a practical guide to effective inference-time scaling.
摘要：测试时间缩放（TTS）——推理过程中计算的动态分配——是改进大型语言模型（LLM）推理的一个有前途的方向。然而，缺乏对相同条件下著名 TTS 策略的系统比较，并且模型类型和问题难度对性能的影响仍不清楚。为了解决这些差距，我们对 TTS 进行了首次大规模研究，涵盖使用八个开源 LLM（7B 到 235B 参数）生成的超过 300 亿个令牌，涵盖四个推理数据集。我们观察到三个一致的趋势：（1）没有单一的 TTS 策略能够普遍占据主导地位；（2）推理模型在问题难度和轨迹长度上表现出不同的轨迹质量模式，形成短视野和长视野类别； (3) 对于给定的模型类型，最佳 TTS 性能随计算预算单调扩展。基于这些见解，我们提供了选择最佳 TTS 策略的实用方法，考虑到问题难度、模型类型和计算预算，为有效的推理时间扩展提供了实用指南。
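
Note (illustrative, not from the paper): the abstract above does not list the TTS strategies it compares. One widely used strategy is majority voting over several sampled reasoning traces (often called self-consistency), shown here only as an assumed example. The helper names and the sample answers are hypothetical.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the sampled reasoning traces."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    return Counter(answers).most_common(1)[0][0]

def vote_at_budget(answers, k):
    """Vote over only the first k samples, mimicking a smaller compute budget."""
    return majority_vote(answers[:k])

# Hypothetical final answers extracted from 7 sampled traces for one problem.
samples = ["15", "12", "12", "9", "12", "15", "12"]
small_budget = vote_at_budget(samples, 1)   # a single trace decides the answer
large_budget = vote_at_budget(samples, 7)   # seven traces vote
```

With one sample the first trace's answer is returned unchanged; with seven, the answer that appears most often wins, which is how extra inference compute can change the final prediction.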

Title: Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

Authors: Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Song Han
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2512.02010
Pdf URL: https://arxiv.org/pdf/2512.02010
Copy Paste: [[2512.02010]] Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling(https://arxiv.org/abs/2512.02010)
Keywords: language model, llm
Abstract: As large language models have grown larger, low-precision numerical formats such as NVFP4 have become increasingly popular due to the speed and memory benefits they provide. However, to accelerate computation with NVFP4, all matrix multiplication operands--weights and activations in the forward pass, and weights, activations, and gradients in the backward pass--must be quantized to NVFP4, often leading to divergence during training and performance degradation during inference. To address this issue, in this work we introduce Four Over Six (4/6), a modification to the NVFP4 quantization algorithm that evaluates two potential scale factors for each block of values. Unlike integer formats, floating-point formats such as FP4 have the most quantization error on near-maximal values in each block, which we find to be primarily responsible for downstream performance degradation. We find that for some blocks, scaling to smaller FP4 values makes the distribution of representable values more uniform, improving representation of near-maximal values. Importantly, 4/6 can be implemented efficiently on NVIDIA Blackwell GPUs, making it viable to use while training LLMs with NVFP4. In pre-training experiments with transformer and hybrid model architectures, we find that 4/6 prevents divergence in several cases, bringing training loss significantly closer to BF16 compared to models trained with current state-of-the-art NVFP4 training recipes. We also find that 4/6 can be easily incorporated into many different post-training quantization methods and generally improves downstream accuracy. We hope this inspires future work in training and deploying models with NVFP4.
摘要：随着大型语言模型变得越来越大，低精度数值格式（例如 NVFP4）由于其提供的速度和内存优势而变得越来越流行。然而，为了使用 NVFP4 加速计算，所有矩阵乘法操作数（前向传递中的权重和激活，以及后向传递中的权重、激活和梯度）都必须量化为 NVFP4，这通常会导致训练期间的发散和推理期间的性能下降。为了解决这个问题，在这项工作中，我们引入了 Four Over Six (4/6)，这是对 NVFP4 量化算法的修改，用于评估每个值块的两个潜在比例因子。与整数格式不同，浮点格式（例如 FP4）在每个块中接近最大值时具有最大的量化误差，我们发现这是造成下游性能下降的主要原因。我们发现，对于某些块，缩放到较小的 FP4 值会使可表示值的分布更加均匀，从而改善了接近最大值的表示。重要的是，4/6 可以在 NVIDIA Blackwell GPU 上高效实现，使其可以在使用 NVFP4 训练 LLM 时使用。在使用 Transformer 和混合模型架构进行预训练实验时，我们发现 4/6 在多种情况下可以防止发散，与使用当前最先进的 NVFP4 训练配方训练的模型相比，训练损失明显更接近 BF16。我们还发现 4/6 可以很容易地合并到许多不同的训练后量化方法中，并且通常可以提高下游精度。我们希望这能够激发未来使用 NVFP4 训练和部署模型的工作。
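
Note (illustrative, not from the paper): the idea of trying two scale factors per block can be sketched in a few lines. FP4 (E2M1) can represent the magnitudes 0, 0.5, 1, 1.5, 2, 3, 4 and 6, so the gap between 4 and 6 is the widest. The sketch below assumes the two candidates map the block maximum to 6 or to 4 and that the lower mean squared error wins; the selection rule, the function names, and the omission of how NVFP4 stores the block scale itself are all simplifying assumptions, not the authors' implementation.

```python
# FP4 (E2M1) magnitudes; the format also carries a sign bit.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def round_to_fp4(x):
    """Round x to the nearest representable FP4 magnitude, keeping its sign."""
    mag = min(FP4_VALUES, key=lambda v: abs(v - abs(x)))
    return mag if x >= 0 else -mag

def quantize_block(block, target_max):
    """Scale so the largest magnitude maps to target_max, round to FP4, rescale."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / target_max
    return [round_to_fp4(x / scale) * scale for x in block]

def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def four_over_six(block):
    """Try both candidate targets (6 and 4); keep the lower-error one (assumed rule)."""
    candidates = {6.0: quantize_block(block, 6.0), 4.0: quantize_block(block, 4.0)}
    best = min(candidates, key=lambda t: mse(block, candidates[t]))
    return best, candidates[best]

# A near-maximal value (5.0 next to 6.0) falls into the wide gap between 4 and 6.
target, dequantized = four_over_six([6.0, 5.0])
```

In this toy block, scaling to 6 has to round 5.0 to 4 or 6, while scaling to 4 places it on a finer part of the grid, which matches the abstract's remark that near-maximal values carry the most error.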