2025-10-16

Title: Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning

Authors: Mahdi Cherakhloo, Arash Abbasi, Mohammad Saeid Sarafraz, Bijan Vosoughi Vahdat
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12807
Pdf URL: https://arxiv.org/pdf/2510.12807
Copy Paste: [[2510.12807]] Benchmarking Open-Source Large Language Models for Persian in Zero-Shot and Few-Shot Learning(https://arxiv.org/abs/2510.12807)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across numerous languages; however, their effectiveness in low-resource languages like Persian requires thorough investigation. This paper presents a comprehensive benchmark of several open-source LLMs for Persian Natural Language Processing (NLP) tasks, utilizing both zero-shot and few-shot learning paradigms. We evaluate models across a range of tasks including sentiment analysis, named entity recognition, reading comprehension, and question answering, using established Persian datasets such as ParsiNLU and ArmanEmo. Our methodology encompasses rigorous experimental setups for both zero-shot and few-shot scenarios, employing metrics such as Accuracy, F1-score, BLEU, and ROUGE for performance evaluation. The results reveal that Gemma 2 consistently outperforms other models across nearly all tasks in both learning paradigms, with particularly strong performance in complex reasoning tasks. However, most models struggle with token-level understanding tasks like Named Entity Recognition, highlighting specific challenges in Persian language processing. This study contributes to the growing body of research on multilingual LLMs, providing valuable insights into their performance in Persian and offering a benchmark for future model development.
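The zero-shot and few-shot setups named in the abstract differ only in whether labeled demonstrations precede the query. A minimal sketch of that difference follows; the prompt template, the instruction wording, and the example sentences are assumptions made for illustration and are not taken from the paper.

```python
def build_prompt(instruction, query, examples=()):
    """Assemble a classification prompt; an empty `examples` gives the zero-shot case."""
    parts = [instruction]
    for text, label in examples:
        # Each demonstration shows the model one solved instance.
        parts.append(f"Text: {text}\nLabel: {label}")
    # The query is left unlabeled so the model completes the label.
    parts.append(f"Text: {query}\nLabel:")
    return "\n\n".join(parts)


# Hypothetical instruction and demonstrations (not from ParsiNLU or ArmanEmo).
INSTRUCTION = "Classify the sentiment of the Persian text as positive or negative."
DEMOS = [
    ("این فیلم عالی بود", "positive"),  # "This movie was great"
    ("غذا خیلی بد بود", "negative"),  # "The food was very bad"
]
QUERY = "کتاب خوبی بود"  # "It was a good book"

zero_shot = build_prompt(INSTRUCTION, QUERY)
few_shot = build_prompt(INSTRUCTION, QUERY, DEMOS)
```

Keeping one builder for both settings means the only variable between the two conditions is the demonstration list, which is what makes the comparison meaningful.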

Title: Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study

Authors: Soheil Hashtarkhani, Rezaur Rashid, Christopher L Brett, Lokesh Chinthala, Fekede Asefa Kumsa, Janet A Zink, Robert L Davis, David L Schwartz, Arash Shaban-Nejad
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.12813
Pdf URL: https://arxiv.org/pdf/2510.12813
Copy Paste: [[2510.12813]] Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study(https://arxiv.org/abs/2510.12813)
Keywords: language model, gpt
Abstract: Electronic health records contain inconsistently structured or free-text data, requiring efficient preprocessing to enable predictive health care models. Although artificial intelligence-driven natural language processing tools show promise for automating diagnosis classification, their comparative performance and clinical reliability require systematic evaluation. The aim of this study is to evaluate the performance of 4 large language models (GPT-3.5, GPT-4o, Llama 3.2, and Gemini 1.5) and BioBERT in classifying cancer diagnoses from structured and unstructured electronic health records data. We analyzed 762 unique diagnoses (326 International Classification of Diseases (ICD) code descriptions, 436 free-text entries) from 3456 records of patients with cancer. Models were tested on their ability to categorize diagnoses into 14 predefined categories. Two oncology experts validated classifications. BioBERT achieved the highest weighted macro F1-score for ICD codes (84.2) and matched GPT-4o in ICD code accuracy (90.8). For free-text diagnoses, GPT-4o outperformed BioBERT in weighted macro F1-score (71.8 vs 61.5) and achieved slightly higher accuracy (81.9 vs 81.6). GPT-3.5, Gemini, and Llama showed lower overall performance on both formats. Common misclassification patterns included confusion between metastasis and central nervous system tumors, as well as errors involving ambiguous or overlapping clinical terminology. Although current performance levels appear sufficient for administrative and research use, reliable clinical applications will require standardized documentation practices alongside robust human oversight for high-stakes decision-making.
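The abstract does not say how its "weighted macro F1-score" is averaged over the 14 categories. As a hedged illustration only, the sketch below computes per-class F1 and then shows the two common averaging conventions, unweighted (macro) and support-weighted. The category names and labels are toy values invented for the example.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label that appears."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every category counts equally."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by how often each class occurs in y_true."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)


# Toy labels (hypothetical categories, not the study's data).
y_true = ["breast", "breast", "lung", "cns"]
y_pred = ["breast", "lung", "lung", "cns"]
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

The two averages diverge when rare categories are classified better or worse than frequent ones, which is one reason accuracy and F1 can rank models differently.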

Title: From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP

Authors: Shanshan Xu, Santosh T.Y.S.S, Barbara Plank
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2510.12817
Pdf URL: https://arxiv.org/pdf/2510.12817
Copy Paste: [[2510.12817]] From Noise to Signal to Selbstzweck: Reframing Human Label Variation in the Era of Post-training in NLP(https://arxiv.org/abs/2510.12817)
Keywords: language model, llm
Abstract: Human Label Variation (HLV) refers to legitimate disagreement in annotation that reflects the genuine diversity of human perspectives rather than mere error. For decades, HLV in NLP was dismissed as noise to be discarded, and only slowly over the last decade has it been reframed as a signal for improving model robustness. With the rise of large language models (LLMs), where post-training on human feedback has become central to model alignment, the role of HLV has become increasingly consequential. Yet current preference-learning datasets routinely aggregate multiple annotations into a single label, thereby flattening diverse perspectives into a false universal agreement and erasing precisely the pluralism of human values that alignment aims to preserve. In this position paper, we argue that preserving HLV as an embodiment of human pluralism must be treated as a Selbstzweck, a goal in itself, when designing AI systems. We call for proactively incorporating HLV into preference datasets and outline actionable steps towards it.
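The aggregation step the abstract criticizes can be made concrete with a small sketch. It contrasts collapsing several annotations into one majority label with keeping the full label distribution. The function names and the five toy votes are assumptions for illustration; the paper does not prescribe this representation.

```python
from collections import Counter


def majority_label(annotations):
    """Collapse all annotations into the single most frequent label."""
    return Counter(annotations).most_common(1)[0][0]


def soft_label(annotations):
    """Keep the disagreement: return the share of annotators behind each label."""
    counts = Counter(annotations)
    total = len(annotations)
    return {label: n / total for label, n in counts.items()}


# Five hypothetical annotators judging which of two responses, A or B, is better.
votes = ["A", "A", "B", "A", "B"]
hard = majority_label(votes)
soft = soft_label(votes)
```

With these votes the majority label records only "A", while the soft label retains the fact that two of five annotators preferred "B".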

Title: MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning

Authors: Rajarshi Ghosh, Abhay Gupta, Hudson McBride, Anurag Vaidya, Faisal Mahmood
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2510.12818
Pdf URL: https://arxiv.org/pdf/2510.12818
Copy Paste: [[2510.12818]] MEDEQUALQA: Evaluating Biases in LLMs with Counterfactual Reasoning(https://arxiv.org/abs/2510.12818)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) are increasingly deployed in clinical decision support, yet subtle demographic cues can influence their reasoning. Prior work has documented disparities in outputs across patient groups, but little is known about how internal reasoning shifts under controlled demographic changes. We introduce MEDEQUALQA, a counterfactual benchmark that perturbs only patient pronouns (he/him, she/her, they/them) while holding critical symptoms and conditions (CSCs) constant. Each clinical vignette is expanded into single-CSC ablations, producing three parallel datasets of approximately 23,000 items each (69,000 total). We evaluate a GPT-4.1 model and compute Semantic Textual Similarity (STS) between reasoning traces to measure stability across pronoun variants. Our results show overall high similarity (mean STS >0.80), but reveal consistent localized divergences in cited risk factors, guideline anchors, and differential ordering, even when final diagnoses remain unchanged. Our error analysis highlights certain cases in which the reasoning shifts, underscoring clinically relevant bias loci that may cascade into inequitable care. MEDEQUALQA offers a controlled diagnostic setting for auditing reasoning stability in medical AI.
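The counterfactual setup described above, identical vignettes that differ only in pronouns followed by a similarity score between the resulting reasoning traces, can be sketched as follows. The abstract does not name the STS model, so a bag-of-words cosine is used here purely as a stand-in. The vignette template and the two traces are invented for the example.

```python
import math
import re
from collections import Counter

# The three pronoun sets named in the abstract.
PRONOUNS = {
    "he": {"subj": "he", "obj": "him"},
    "she": {"subj": "she", "obj": "her"},
    "they": {"subj": "they", "obj": "them"},
}


def make_variants(template):
    """Fill one vignette template with each pronoun set, changing nothing else."""
    return {name: template.format(**forms) for name, forms in PRONOUNS.items()}


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def similarity(a, b):
    """Cosine similarity between the bag-of-words vectors of two texts."""
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[tok] * vb[tok] for tok in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical vignette and reasoning traces (not items from MEDEQUALQA).
TEMPLATE = "The patient said {subj} had chest pain for two days; refer {obj} to cardiology."
variants = make_variants(TEMPLATE)
trace_he = "Chest pain for two days suggests cardiac causes; order an ECG first."
trace_she = "Chest pain for two days may reflect anxiety; consider an ECG."
score = similarity(trace_he, trace_she)
```

A score below 1.0 for traces produced from otherwise identical vignettes is the kind of localized divergence the benchmark is designed to surface, even when the final diagnosis matches.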

Title: Classifier-Augmented Generation for Structured Workflow Prediction

Authors: Thomas Gschwind, Shramona Chakraborty, Nitin Gupta, Sameep Mehta
Subjects: cs.CL, cs.AI, cs.DB, cs.LG
Abstract URL: https://arxiv.org/abs/2510.12825
Pdf URL: https://arxiv.org/pdf/2510.12825
Copy Paste: [[2510.12825]] Classifier-Augmented Generation for Structured Workflow Prediction(https://arxiv.org/abs/2510.12825)
Keywords: prompt, agent
Abstract: ETL (Extract, Transform, Load) tools such as IBM DataStage allow users to visually assemble complex data workflows, but configuring stages and their properties remains time consuming and requires deep tool knowledge. We propose a system that translates natural language descriptions into executable workflows, automatically predicting both the structure and detailed configuration of the flow. At its core lies a Classifier-Augmented Generation (CAG) approach that combines utterance decomposition with a classifier and stage-specific few-shot prompting to produce accurate stage predictions. These stages are then connected into non-linear workflows using edge prediction, and stage properties are inferred from sub-utterance context. We compare CAG against strong single-prompt and agentic baselines, showing improved accuracy and efficiency, while substantially reducing token usage. Our architecture is modular, interpretable, and capable of end-to-end workflow generation, including robust validation steps. To our knowledge, this is the first system with a detailed evaluation across stage prediction, edge layout, and property generation for natural-language-driven ETL authoring.

Title: Scheming Ability in LLM-to-LLM Strategic Interactions

Authors: Thao Pham
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2510.12826
Pdf URL: https://arxiv.org/pdf/2510.12826
Copy Paste: [[2510.12826]] Scheming Ability in LLM-to-LLM Strategic Interactions(https://arxiv.org/abs/2510.12826)
Keywords: language model, gpt, llm, prompt, chain-of-thought, agent
Abstract: As large language model (LLM) agents are deployed autonomously in diverse contexts, evaluating their capacity for strategic deception becomes crucial. While recent research has examined how AI systems scheme against human developers, LLM-to-LLM scheming remains underexplored. We investigate the scheming ability and propensity of frontier LLM agents through two game-theoretic frameworks: a Cheap Talk signaling game and a Peer Evaluation adversarial game. Testing four models (GPT-4o, Gemini-2.5-pro, Claude-3.7-Sonnet, and Llama-3.3-70b), we measure scheming performance with and without explicit prompting while analyzing scheming tactics through chain-of-thought reasoning. When prompted, most models, especially Gemini-2.5-pro and Claude-3.7-Sonnet, achieved near-perfect performance. Critically, models exhibited significant scheming propensity without prompting: all models chose deception over confession in Peer Evaluation (100% rate), while models choosing to scheme in Cheap Talk succeeded at 95-100% rates. These findings highlight the need for robust evaluations using high-stakes game-theoretic scenarios in multi-agent settings.

Title: Mathematics with large language models as provers and verifiers

Authors: Hieu Le Duc, Leo Liberti
Subjects: cs.CL, cs.AI, cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2510.12829
Pdf URL: https://arxiv.org/pdf/2510.12829
Copy Paste: [[2510.12829]] Mathematics with large language models as provers and verifiers(https://arxiv.org/abs/2510.12829)
Keywords: language model, gpt, hallucination, chat
Abstract: During 2024 and 2025 the discussion about the theorem-proving capabilities of large language models started reporting interesting success stories, mostly to do with difficult exercises (such as problems from the International Mathematical Olympiad), but also with conjectures [Feldman & Karbasi, arXiv:2509.18383v1] formulated for the purpose of verifying whether the artificial intelligence could prove them. In this paper we report a theorem-proving feat achieved by ChatGPT by using a protocol involving different prover and verifier instances of the gpt-5 model working collaboratively. To make sure that the produced proofs do not suffer from hallucinations, the final proof is formally verified by the Lean proof assistant, and the conformance of premises and conclusion of the Lean code is verified by a human. Our methodology was able to solve five out of six 2025 IMO problems, and close a third of the sixty-six number theory conjectures in [Cohen, Journal of Integer Sequences, 2025].

Title: MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training

Authors: Taicheng Guo, Hai Wang, ChaoChun Liu, Mohsen Golalikhani, Xin Chen, Xiangliang Zhang, Chandan K. Reddy
Subjects: cs.CL, cs.AI, cs.DB, cs.LG
Abstract URL: https://arxiv.org/abs/2510.12831
Pdf URL: https://arxiv.org/pdf/2510.12831
Copy Paste: [[2510.12831]] MTSQL-R1: Towards Long-Horizon Multi-Turn Text-to-SQL via Agentic Training(https://arxiv.org/abs/2510.12831)
Keywords: agent
Abstract: Multi-turn Text-to-SQL aims to translate a user's conversational utterances into executable SQL while preserving dialogue coherence and grounding to the target schema. However, most existing systems only regard this task as a simple text translation task and follow a short-horizon paradigm, generating a query per turn without execution, explicit verification, and refinement, which leads to non-executable or incoherent outputs. We present MTSQL-R1, an agentic training framework for long-horizon multi-turn Text-to-SQL. We cast the task as a Markov Decision Process (MDP) in which an agent interacts with (i) a database for execution feedback and (ii) a persistent dialogue memory for coherence verification, performing an iterative propose -> execute -> verify -> refine cycle until all checks pass. Experiments on COSQL and SPARC demonstrate that MTSQL-R1 consistently outperforms strong baselines, highlighting the importance of environment-driven verification and memory-guided refinement for conversational semantic parsing. Full recipes (including code, trained models, logs, reasoning trajectories, etc.) will be released after the internal review to contribute to community research.
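The cycle named in the abstract (propose, execute, verify, refine) can be sketched as a plain loop over an in-memory SQLite database. This is only an illustration of the control flow, not the MTSQL-R1 training framework. The `propose` and `verify` callables, the toy schema, and the canned candidate queries are all hypothetical stand-ins for the model and the dialogue-memory check.

```python
import sqlite3


def execute(conn, sql):
    """Run `sql`; return (rows, None) on success or (None, error message) on failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def solve_turn(conn, propose, verify, max_steps=3):
    """Propose a query, execute it, verify the result, and refine using feedback."""
    feedback = None
    for _ in range(max_steps):
        sql = propose(feedback)            # propose (or refine, once feedback exists)
        rows, error = execute(conn, sql)   # execute against the database
        if error is None and verify(sql, rows):
            return sql, rows               # all checks passed
        feedback = error or "verification failed"
    return None, None


def make_toy_proposer(candidates):
    """Hypothetical stand-in for the model: replays canned queries in order."""
    queue = iter(candidates)
    return lambda feedback: next(queue)


# Toy database and turn (invented for the example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ada", 36), ("Bo", 22)])

proposer = make_toy_proposer([
    "SELECT nme FROM singer WHERE age > 30",   # misspelled column, fails to execute
    "SELECT name FROM singer WHERE age > 30",  # corrected after execution feedback
])
final_sql, final_rows = solve_turn(conn, proposer, verify=lambda sql, rows: len(rows) > 0)
```

In the sketch the first candidate is rejected by the database itself, which is the execution feedback channel; the `verify` callback marks the place where a coherence check against dialogue memory would go.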

Title: Repurposing Annotation Guidelines to Instruct LLM Annotators: A Case Study

Authors: Kon Woo Kim (National Institute of Informatics, Japan), Rezarta Islamaj (National Library of Medicine, USA), Jin-Dong Kim (Joint Support-Center for Data Science Research, Japan), Florian Boudin (Japanese-French Laboratory of Informatics, CNRS, Nantes University, Japan), Akiko Aizawa (National Institute of Informatics, Japan)
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12835
Pdf URL: https://arxiv.org/pdf/2510.12835
Copy Paste: [[2510.12835]] Repurposing Annotation Guidelines to Instruct LLM Annotators: A Case Study(https://arxiv.org/abs/2510.12835)
Keywords: language model, llm
Abstract: This study investigates how existing annotation guidelines can be repurposed to instruct large language model (LLM) annotators for text annotation tasks. Traditional guidelines are written for human annotators who internalize training, while LLMs require explicit, structured instructions. We propose a moderation-oriented guideline repurposing method that transforms guidelines into clear directives for LLMs through an LLM moderation process. Using the NCBI Disease Corpus as a case study, our experiments show that repurposed guidelines can effectively guide LLM annotators, while revealing several practical challenges. The results highlight the potential of this workflow to support scalable and cost-effective refinement of annotation guidelines and automated annotation.

Title: A²FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning

Authors: Qianben Chen, Jingyi Cao, Jiayu Zhang, Tianrui Qin, Xiaowan Li, King Zhu, Dingfeng Shi, He Zhu, Minghao Liu, Xiaobo Liang, Ge Zhang, Jian Yang, Yuchen Eleanor Jiang, Wangchunshu Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.12838
Pdf URL: https://arxiv.org/pdf/2510.12838
Copy Paste: [[2510.12838]] A²FM: An Adaptive Agent Foundation Model for Tool-Aware Hybrid Reasoning(https://arxiv.org/abs/2510.12838)
Keywords: language model, llm, chain-of-thought, agent
Abstract: Large language models split into two families: reasoning-centric LLMs, which strengthen internal chain-of-thought reasoning but cannot invoke external tools, and agentic LLMs, which learn to interact with environments and leverage tools but often lag in deep reasoning. This divide arises from fundamentally different training objectives, leading to mismatched strengths and inefficiency on simple queries, where both families tend to overthink or over-call tools. In this work, we present Adaptive Agent Foundation Model (A²FM), a unified framework that follows a route-then-align principle: the model first learns task-aware routing and then aligns mode-specific trajectories under a shared backbone. To address the inefficiency gap, we introduce a third mode, instant, that handles simple queries directly, preventing unnecessary reasoning or tool calls while complementing the agentic and reasoning modes. To jointly enhance accuracy and efficiency, we propose Adaptive Policy Optimization (APO), which enforces adaptive sampling across modes and applies a cost-regularized reward. On the 32B scale, A²FM achieves 13.4% on BrowseComp, 70.4% on AIME25, and 16.7% on HLE, setting new SOTA among comparable models and performing competitively with frontier LLMs across agentic, reasoning, and general benchmarks. Notably, the adaptive execution achieves a cost of pass of only $0.00487 per correct answer, cutting cost by 45.2% relative to reasoning and 33.5% relative to agentic, thus delivering substantially higher cost efficiency while maintaining comparable accuracy.
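The cost-of-pass figures in the abstract can be unpacked with a little arithmetic. The sketch below assumes that cost of pass means total spend divided by the number of correct answers, and that each quoted reduction is computed as (baseline - adaptive) / baseline. The abstract implies both but does not state them, so the derived baseline costs are illustrative rather than reported values.

```python
def cost_of_pass(total_cost_usd, num_correct):
    """Average spend per correct answer (assumed definition)."""
    if num_correct <= 0:
        raise ValueError("cost of pass is undefined without a correct answer")
    return total_cost_usd / num_correct


def relative_reduction(baseline, new):
    """Fraction by which `new` undercuts `baseline`."""
    return (baseline - new) / baseline


def implied_baseline(new, reduction):
    """Invert relative_reduction: the baseline cost that yields `reduction`."""
    return new / (1.0 - reduction)


ADAPTIVE = 0.00487  # dollars per correct answer, from the abstract

# Derived under the stated assumptions; these are not figures reported in the abstract.
reasoning_baseline = implied_baseline(ADAPTIVE, 0.452)  # roughly $0.0089
agentic_baseline = implied_baseline(ADAPTIVE, 0.335)    # roughly $0.0073
```

Under these assumptions the reasoning-only mode costs more per correct answer than the agentic-only mode, which is consistent with the larger reduction quoted against it.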

Title: FaStFACT: Faster, Stronger Long-Form Factuality Evaluations in LLMs

Authors: Yingjia Wan, Haochen Tan, Xiao Zhu, Xinyu Zhou, Zhiwei Li, Qingsong Lv, Changxuan Sun, Jiaqi Zeng, Yi Xu, Jianqiao Lu, Yinhong Liu, Zhijiang Guo
Subjects: cs.CL, cs.AI, cs.CE, cs.CY
Abstract URL: https://arxiv.org/abs/2510.12839
Pdf URL: https://arxiv.org/pdf/2510.12839
Copy Paste: [[2510.12839]] FaStFACT: Faster, Stronger Long-Form Factuality Evaluations in LLMs(https://arxiv.org/abs/2510.12839)
Keywords: language model, llm
Abstract: Evaluating the factuality of long-form generations from Large Language Models (LLMs) remains challenging due to accuracy issues and costly human assessment. Prior efforts attempt this by decomposing text into claims, searching for evidence, and verifying claims, but suffer from critical drawbacks: (1) inefficiency due to complex pipeline components unsuitable for long LLM outputs, and (2) ineffectiveness stemming from inaccurate claim sets and insufficient evidence collection of one-line snippets. To address these limitations, we propose FaStFACT, a fast and strong evaluation framework that achieves the highest alignment with human evaluation and efficiency among existing baselines. FaStFACT first employs chunk-level claim extraction integrated with confidence-based pre-verification, significantly reducing the cost of web searching and inference calling while ensuring reliability. For searching and verification, it collects document-level evidence from crawled webpages and selectively retrieves it during verification, addressing the evidence insufficiency problem in previous pipelines. Extensive experiments based on an aggregated and manually annotated benchmark demonstrate the reliability of FaStFACT in both efficiently and effectively evaluating the factuality of long-form LLM generations. Code and benchmark data is available at this https URL.

Title: VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages

Authors: Jesse Atuhurra, Iqra Ali, Tomoya Iwakura, Hidetaka Kamigaito, Tatsuya Hiraoka
Subjects: cs.CL, cs.AI, cs.CV, cs.RO
Abstract URL: https://arxiv.org/abs/2510.12845
Pdf URL: https://arxiv.org/pdf/2510.12845
Copy Paste: [[2510.12845]] VLURes: Benchmarking VLM Visual and Linguistic Understanding in Low-Resource Languages(https://arxiv.org/abs/2510.12845)
Keywords: language model, gpt, prompt, agent
Abstract: Vision Language Models (VLMs) are pivotal for advancing perception in intelligent agents. Yet, evaluation of VLMs remains limited to predominantly English-centric benchmarks in which the image-text pairs comprise short texts. To evaluate VLM fine-grained abilities in four languages under long-text settings, we introduce a novel multilingual benchmark, VLURes, featuring eight vision-and-language tasks and a pioneering unrelatedness task, to probe the fine-grained Visual and Linguistic Understanding capabilities of VLMs across English, Japanese, and the low-resource languages Swahili and Urdu. Our datasets, curated from web resources in the target language, encompass ten diverse image categories and rich textual context, introducing valuable vision-language resources for Swahili and Urdu. By prompting VLMs to generate responses and rationales, evaluated automatically and by native speakers, we uncover performance disparities across languages and tasks critical to intelligent agents, such as object recognition, scene understanding, and relationship understanding. We conducted evaluations of ten VLMs with VLURes. The best performing model, GPT-4o, achieves an overall accuracy of 90.8% and lags human performance by 6.7%, though the gap is larger for open-source models. The gap highlights VLURes' critical role in developing intelligent agents to tackle multi-modal visual reasoning.

Title: EduDial: Constructing a Large-scale Multi-turn Teacher-Student Dialogue Corpus

Authors: Shouang Wei, Min Zhang, Xin Lin, Bo Jiang, Zhongxiang Dai, Kun Kuang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12899
Pdf URL: https://arxiv.org/pdf/2510.12899
Copy Paste: [[2510.12899]] EduDial: Constructing a Large-scale Multi-turn Teacher-Student Dialogue Corpus(https://arxiv.org/abs/2510.12899)
Keywords: language model, llm, agent
Abstract: Recently, several multi-turn dialogue benchmarks have been proposed to evaluate the conversational abilities of large language models (LLMs). As LLMs are increasingly recognized as a key technology for advancing intelligent education, owing to their ability to deeply understand instructional contexts and provide personalized guidance, the construction of dedicated teacher-student dialogue benchmarks has become particularly important. To this end, we present EduDial, a comprehensive multi-turn teacher-student dialogue dataset. EduDial covers 345 core knowledge points and consists of 34,250 dialogue sessions generated through interactions between teacher and student agents. Its design is guided by Bloom's taxonomy of educational objectives and incorporates ten questioning strategies, including situational questioning, zone of proximal development (ZPD) questioning, and metacognitive questioning, thus better capturing authentic classroom interactions. Furthermore, we design differentiated teaching strategies for students at different cognitive levels, thereby providing more targeted teaching guidance. Building on EduDial, we further develop EduDial-LLM 32B via training and propose an 11-dimensional evaluation framework that systematically measures the teaching abilities of LLMs, encompassing both overall teaching quality and content quality. Experiments on 17 mainstream LLMs reveal that most models struggle in student-centered teaching scenarios, whereas our EduDial-LLM achieves significant gains, consistently outperforming all baselines across all metrics. The code is available at this https URL.

Title: Who's Asking? Evaluating LLM Robustness to Inquiry Personas in Factual Question Answering

Authors: Nil-Jana Akpinar, Chia-Jung Lee, Vanessa Murdock, Pietro Perona
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.12925
Pdf URL: https://arxiv.org/pdf/2510.12925
Copy Paste: [[2510.12925]] Who's Asking? Evaluating LLM Robustness to Inquiry Personas in Factual Question Answering(https://arxiv.org/abs/2510.12925)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) should answer factual questions truthfully, grounded in objective knowledge, regardless of user context such as self-disclosed personal information, or system personalization. In this paper, we present the first systematic evaluation of LLM robustness to inquiry personas, i.e. user profiles that convey attributes like identity, expertise, or belief. While prior work has primarily focused on adversarial inputs or distractors for robustness testing, we evaluate plausible, human-centered inquiry persona cues that users disclose in real-world interactions. We find that such cues can meaningfully alter QA accuracy and trigger failure modes such as refusals, hallucinated limitations, and role confusion. These effects highlight how model sensitivity to user framing can compromise factual reliability, and position inquiry persona testing as an effective tool for robustness evaluation.

Title: The Curious Case of Curiosity across Human Cultures and LLMs

Authors: Angana Borah, Rada Mihalcea
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12943
Pdf URL: https://arxiv.org/pdf/2510.12943
Copy Paste: [[2510.12943]] The Curious Case of Curiosity across Human Cultures and LLMs(https://arxiv.org/abs/2510.12943)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) have expanded their role in human interaction, yet curiosity, a central driver of inquiry, remains underexplored in these systems, particularly across cultural contexts. In this work, we investigate cultural variation in curiosity using Yahoo! Answers, a real-world multi-country dataset spanning diverse topics. We introduce CUEST (CUriosity Evaluation across SocieTies), an evaluation framework that measures human-model alignment in curiosity through linguistic (style) analysis, topic-preference (content) analysis, and grounding of insights in social science constructs. Across open- and closed-source models, we find that LLMs flatten cross-cultural diversity, aligning more closely with how curiosity is expressed in Western countries. We then explore fine-tuning strategies to induce curiosity in LLMs, narrowing the human-model alignment gap by up to 50%. Finally, we demonstrate the practical value of curiosity for LLM adaptability across cultures, showing its importance for future NLP research.

Title: 3-Model Speculative Decoding

Authors: Sanghyun Byun, Mohanad Odema, Jung Ick Guack, Baisub Lee, Jacob Song, Woo Seong Chung
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12966
Pdf URL: https://arxiv.org/pdf/2510.12966
Copy Paste: [[2510.12966]] 3-Model Speculative Decoding(https://arxiv.org/abs/2510.12966)
Keywords: language model
Abstract: Speculative Decoding (SD) accelerates inference in large language models by using a smaller draft model to propose tokens, which are then verified by a larger target model. However, the throughput gains of SD are fundamentally limited by a trade-off between draft model size and token acceptance: smaller draft models generate tokens more quickly but exhibit greater divergence from the target model, resulting in lower acceptance rates and reduced speedups. We introduce Pyramid Speculative Decoding (PyramidSD), an extension of SD that inserts an intermediate qualifier model between the draft and target to bridge the distributional gap in output predictions, allowing smaller models to be used for drafting. This hierarchical decoding strategy improves alignment across models, enabling higher acceptance rates and allowing the use of significantly smaller draft models without sacrificing overall performance. PyramidSD builds on fuzzy acceptance criteria to support relaxed divergence thresholds at each stage, improving throughput. In experiments, PyramidSD achieves up to 1.91x generation speed over standard SD, reaching 124 tokens per second on a consumer GPU (RTX 4090). In small-memory settings with a 1B-parameter draft model and an 8B target model, PyramidSD minimally trades target model quality for improved throughput. Overall, PyramidSD offers a practical approach to enhancing speculative decoding efficiency and can be readily applied to existing inference pipelines.
摘要：推测解码 (SD) 通过使用较小的草稿模型来提出令牌，然后由较大的目标模型进行验证，从而加速大型语言模型中的推理。然而，SD 的吞吐量增益从根本上受到草稿模型大小和代币接受度之间的权衡的限制：较小的草稿模型生成代币的速度更快，但与目标模型的差异更大，导致接受率较低并降低了加速速度。我们引入了金字塔推测解码（PyramidSD），这是 SD 的扩展，它在草稿和目标之间插入中间限定符模型，以弥合输出预测中的分布差距，从而允许使用较小的模型进行草稿。这种分层解码策略改善了模型之间的对齐，从而实现更高的接受率，并允许使用明显更小的草稿模型，而不会牺牲整体性能。 PyramidSD 以模糊接受标准为基础，支持每个阶段放宽的发散阈值，从而提高吞吐量。在实验中，PyramidSD 的生成速度比标准 SD 快 1.91 倍，在消费级 GPU (RTX 4090) 上达到每秒 124 个令牌。在具有 1B 参数草图模型和 8B 目标模型的小内存设置中，PyramidSD 会以最低限度地牺牲目标模型质量来提高吞吐量。总体而言，PyramidSD 提供了一种提高推测解码效率的实用方法，并且可以轻松应用于现有的推理管道。
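
Note (illustrative sketch, not the authors' implementation): the abstract describes a draft token passing through an intermediate qualifier and then the target, each with a relaxed divergence threshold. The toy Python below shows one way such a two-stage threshold check could look for a single position. The choice of KL divergence, the function names, and the thresholds `tau_q` and `tau_t` are all assumptions made for illustration; the abstract does not specify them.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def argmax(dist):
    """Index of the most probable token."""
    return max(range(len(dist)), key=dist.__getitem__)

def pyramid_accept(draft, qualifier, target, tau_q, tau_t):
    """Toy three-tier check for one position (hypothetical rule).

    The draft token is kept only if the qualifier and then the target both
    stay within their divergence thresholds of the draft distribution.
    Otherwise the target's own most probable token is used instead.
    """
    passes_qualifier = kl_divergence(qualifier, draft) <= tau_q
    # The target check only runs for tokens the qualifier let through.
    passes_target = passes_qualifier and kl_divergence(target, draft) <= tau_t
    if passes_target:
        return argmax(draft), "accepted"
    return argmax(target), "replaced"

draft = [0.7, 0.2, 0.1]
qualifier = [0.65, 0.25, 0.1]
close_target = [0.6, 0.3, 0.1]
far_target = [0.1, 0.8, 0.1]

print(pyramid_accept(draft, qualifier, close_target, tau_q=0.05, tau_t=0.05))
print(pyramid_accept(draft, qualifier, far_target, tau_q=0.05, tau_t=0.05))
```

Looser thresholds accept more draft tokens at the cost of drifting further from the target distribution, which is the quality-for-throughput trade the abstract mentions.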

Title: A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation

Authors: João A. Leite, Arnav Arora, Silvia Gargova, João Luz, Gustavo Sampaio, Ian Roberts, Carolina Scarton, Kalina Bontcheva
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.12993
Pdf URL: https://arxiv.org/pdf/2510.12993
Copy Paste: [[2510.12993]] A Multilingual, Large-Scale Study of the Interplay between LLM Safeguards, Personalisation, and Disinformation(https://arxiv.org/abs/2510.12993)
Keywords: language model, llm, prompt
Abstract: The human-like proficiency of Large Language Models (LLMs) has brought concerns about their potential misuse for generating persuasive and personalised disinformation at scale. While prior work has demonstrated that LLMs can generate disinformation, specific questions around persuasiveness and personalisation (generation of disinformation tailored to specific demographic attributes) remain largely unstudied. This paper presents the first large-scale, multilingual empirical study on persona-targeted disinformation generation by LLMs. Employing a red teaming methodology, we systematically evaluate the robustness of LLM safety mechanisms to persona-targeted prompts. A key novel result is AI-TRAITS (AI-generaTed peRsonAlIsed disinformaTion dataSet), a new dataset of around 1.6 million texts generated by eight state-of-the-art LLMs. AI-TRAITS is seeded by prompts that combine 324 disinformation narratives and 150 distinct persona profiles, covering four major languages (English, Russian, Portuguese, Hindi) and key demographic dimensions (country, generation, political orientation). The resulting personalised narratives are then assessed quantitatively and compared along the dimensions of models, languages, jailbreaking rate, and personalisation attributes. Our findings demonstrate that the use of even simple personalisation strategies in the prompts significantly increases the likelihood of jailbreaks for all studied LLMs. Furthermore, personalised prompts result in altered linguistic and rhetorical patterns and amplify the persuasiveness of the LLM-generated false narratives. These insights expose critical vulnerabilities in current state-of-the-art LLMs and offer a foundation for improving safety alignment and detection strategies in multilingual and cross-demographic contexts.
摘要：大型语言模型 (LLM) 与人类相似的熟练程度引发了人们对它们可能被滥用来大规模生成有说服力和个性化的虚假信息的担忧。虽然之前的工作已经证明法学硕士可以产生虚假信息，但围绕说服力和个性化（针对特定人口统计属性生成虚假信息）的具体问题仍然很大程度上未被研究。本文提出了第一个关于法学硕士针对人物角色生成虚假信息的大规模、多语言实证研究。我们采用红队方法，系统地评估法学硕士安全机制对针对角色的提示的稳健性。一个关键的新颖成果是 AI-TRAITS（人工智能生成的个性化虚假信息数据集），这是一个由八位最先进的法学硕士生成的约 160 万条文本的新数据集。 AI-TRAITS 的提示结合了 324 个虚假信息叙述和 150 个不同的角色概况，涵盖四种主要语言（英语、俄语、葡萄牙语、印地语）和关键人口维度（国家、世代、政治取向）。然后对由此产生的个性化叙述进行定量评估，并从模型、语言、越狱率和个性化属性等维度进行比较。我们的研究结果表明，即使是在提示中使用简单的个性化策略也会显着增加所有研究的法学硕士越狱的可能性。此外，个性化提示会改变语言和修辞模式，并增强法学硕士生成的虚假叙述的说服力。这些见解揭示了当前最先进的法学硕士中的关键漏洞，并为改进多语言和跨人口背景下的安全调整和检测策略奠定了基础。

Title: OPLoRA: Orthogonal Projection LoRA Prevents Catastrophic Forgetting during Parameter-Efficient Fine-Tuning

Authors: Yifeng Xiong, Xiaohui Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13003
Pdf URL: https://arxiv.org/pdf/2510.13003
Copy Paste: [[2510.13003]] OPLoRA: Orthogonal Projection LoRA Prevents Catastrophic Forgetting during Parameter-Efficient Fine-Tuning(https://arxiv.org/abs/2510.13003)
Keywords: language model
Abstract: Low-Rank Adaptation (LoRA) enables efficient fine-tuning of large language models but suffers from catastrophic forgetting when learned updates interfere with the dominant singular directions that encode essential pre-trained knowledge. We propose Orthogonal Projection LoRA (OPLoRA), a theoretically grounded approach that prevents this interference through double-sided orthogonal projections. By decomposing frozen weights via SVD, OPLoRA constrains LoRA updates to lie entirely within the orthogonal complement of the top-$k$ singular subspace using projections $P_L = I - U_k U_k^\top$ and $P_R = I - V_k V_k^\top$. We prove that this construction exactly preserves the top-$k$ singular triples, providing mathematical guarantees for knowledge retention. To quantify subspace interference, we introduce $\rho_k$, a metric measuring update alignment with dominant directions. Extensive experiments across commonsense reasoning, mathematics, and code generation demonstrate that OPLoRA significantly reduces forgetting while maintaining competitive task-specific performance on LLaMA-2 7B and Qwen2.5 7B, establishing orthogonal projection as an effective mechanism for knowledge preservation in parameter-efficient fine-tuning.
摘要：低秩适应 (LoRA) 可以对大型语言模型进行高效微调，但当学习到的更新干扰编码基本预训练知识的主导奇异方向时，就会遭受灾难性遗忘。我们提出了正交投影 LoRA (OPLoRA)，这是一种有理论基础的方法，可以通过双面正交投影来防止这种干扰。通过 SVD 分解冻结权重，OPLoRA 使用投影 $P_L = I - U_k U_k^\top$ 和 $P_R = I - V_k V_k^\top$ 约束 LoRA 更新完全位于 top-$k$ 奇异子空间的正交补集内。我们证明这种构造准确地保留了 top-$k$ 奇异三元组，为知识保留提供了数学保证。为了量化子空间干扰，我们引入了 $\rho_k$，一个测量更新与主导方向对齐的度量。常识推理、数学和代码生成方面的大量实验表明，OPLoRA 显着减少了遗忘，同时保持了 LLaMA-2 7B 和 Qwen2.5 7B 上具有竞争力的特定任务性能，从而将正交投影确立为参数高效微调中知识保存的有效机制。
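
Note (worked derivation, under stated assumptions): the preservation claim in the abstract can be checked in a few lines. The derivation assumes the frozen weight has SVD $W = U \Sigma V^\top$, that the raw LoRA update is $\Delta W = BA$, and that the projected update is applied as $P_L \, \Delta W \, P_R$. The symbols $W'$, $B$, $A$, $u_i$, $v_i$, and $\sigma_i$ are notation introduced here, not taken from the abstract.

```latex
% Assumed setup: W = U \Sigma V^\top (frozen), \Delta W = BA (raw LoRA update),
% U_k, V_k = first k columns of U, V; (\sigma_i, u_i, v_i) = i-th singular triple of W.
\begin{aligned}
W' &= W + P_L \,\Delta W\, P_R,
   \qquad P_L = I - U_k U_k^\top,\quad P_R = I - V_k V_k^\top. \\[4pt]
\text{For } i \le k:\quad
P_R v_i &= v_i - V_k V_k^\top v_i = v_i - v_i = 0
   \;\Longrightarrow\; W' v_i = W v_i = \sigma_i u_i, \\[4pt]
u_i^\top P_L &= u_i^\top - u_i^\top U_k U_k^\top = u_i^\top - u_i^\top = 0
   \;\Longrightarrow\; u_i^\top W' = u_i^\top W = \sigma_i v_i^\top.
\end{aligned}
```

So each $(\sigma_i, u_i, v_i)$ with $i \le k$ is still a singular triple of $W'$, whatever $B$ and $A$ are learned: the update acts only on the orthogonal complement of the top-$k$ singular subspaces.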

Title: CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models

Authors: Pavan Kalyan, Shubhra Mishra, Satya Lokam, Navin Goyal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13008
Pdf URL: https://arxiv.org/pdf/2510.13008
Copy Paste: [[2510.13008]] CurLL: A Developmental Framework to Evaluate Continual Learning in Language Models(https://arxiv.org/abs/2510.13008)
Keywords: language model
Abstract: We introduce a comprehensive continual learning dataset and benchmark (CurlL) grounded in human developmental trajectories from ages 5-10, enabling systematic and fine-grained assessment of models' ability to progressively acquire new skills. CurlL spans five developmental stages (0-4) covering ages 5-10, supported by a skill graph that breaks down broad skills into smaller abilities, concrete goals, and measurable indicators, while also capturing which abilities build on others. We generate a 23.4B-token synthetic dataset with controlled skill progression, vocabulary complexity, and format diversity, comprising paragraphs, comprehension-based QA (CQA), skill-testing QA (CSQA), and instruction-response (IR) pairs. Stage-wise token counts range from 2.12B to 6.78B tokens, supporting precise analysis of forgetting, forward transfer, and backward transfer. Using a 135M-parameter transformer trained under independent, joint, and sequential (continual) setups, we show trade-offs in skill retention and transfer efficiency. By mirroring human learning patterns and providing fine-grained control over skill dependencies, this work advances continual learning evaluations for language models.
摘要：我们引入了基于 5-10 岁人类发展轨迹的全面持续学习数据集和基准 (CurlL)，能够对模型逐步获取新技能的能力进行系统和细粒度的评估。 CurlL 跨越五个发展阶段（0-4），涵盖 5-10 岁，并由技能图支持，该技能图将广泛的技能分解为较小的能力、具体目标和可衡量的指标，同时还捕获哪些能力是建立在其他能力之上的。我们生成一个 23.4B 令牌综合数据集，具有受控的技能进展、词汇复杂性和格式多样性，包括段落、基于理解的 QA (CQA)、技能测试 QA (CSQA) 和指令响应 (IR) 对。阶段性代币计数范围从2.12B到6.78B代币，支持遗忘、前向转移、后向转移的精准分析。使用在独立、联合和顺序（连续）设置下训练的 135M 参数变压器，我们展示了技能保留和传输效率之间的权衡。通过反映人类学习模式并提供对技能依赖性的细粒度控制，这项工作推进了语言模型的持续学习评估。

Title: On the Role of Preference Variance in Preference Optimization

Authors: Jiacheng Guo, Zihao Li, Jiahao Qiu, Yue Wu, Mengdi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13022
Pdf URL: https://arxiv.org/pdf/2510.13022
Copy Paste: [[2510.13022]] On the Role of Preference Variance in Preference Optimization(https://arxiv.org/abs/2510.13022)
Keywords: language model, llm, prompt
Abstract: Direct Preference Optimization (DPO) has emerged as an important approach for learning from human preferences in aligning large language models (LLMs). However, collecting human preference data is costly and inefficient, motivating methods to reduce the required annotations. In this work, we investigate the impact of preference variance (PVar), which measures the variance in model preferences when comparing pairs of responses, on the effectiveness of DPO training. We provide a theoretical insight by establishing an upper bound on the DPO gradient norm for any given prompt, showing it is controlled by the PVar of that prompt. This implies that prompts with low PVar can only produce small gradient updates, making them less valuable for learning. We validate this finding by fine-tuning LLMs with preferences generated by a reward model, evaluating on two benchmarks (AlpacaEval 2.0 and Arena-Hard). Experimental results demonstrate that prompts with higher PVar outperform randomly selected prompts or those with lower PVar. We also show that our PVar-based selection method remains robust when smaller reward models (1B, 3B) are used for selection. Notably, in a separate experiment using the original human annotations from the UltraFeedback dataset, we found that training on only the top 10% of prompts with the highest PVar yields better evaluation performance than training on the full dataset, highlighting the importance of preference variance in identifying informative examples for efficient LLM alignment.
摘要：直接偏好优化 (DPO) 已成为学习人类偏好以调整大型语言模型 (LLM) 的重要方法。然而，收集人类偏好数据成本高昂且效率低下，因此需要减少所需注释的方法。在这项工作中，我们研究了 \emph{偏好方差} (PVar) 对 DPO 训练有效性的影响，PVar 衡量了比较响应对时模型偏好的方差。我们通过为任何给定提示建立 DPO 梯度范数的上限来提供理论见解，表明它是由该提示的 PVar 控制的。这意味着低 PVar 的提示只能产生较小的梯度更新，从而使其对于学习的价值较低。我们通过使用奖励模型生成的偏好微调法学硕士来验证这一发现，并在两个基准（AlpacaEval 2.0 和 Arena-Hard）上进行评估。实验结果表明，具有较高 PVar 的提示优于随机选择的提示或具有较低 PVar 的提示。我们还表明，当使用较小的奖励模型（1B、3B）进行选择时，我们基于 PVar 的选择方法是稳健的。值得注意的是，在使用来自 UltraFeedback 数据集的原始人类注释的单独实验中，我们发现仅对具有最高 PVar 的前 10% 提示进行训练比对整个数据集进行训练产生更好的评估性能，这凸显了偏好方差在识别信息示例以实现高效 LLM 对齐方面的重要性。
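
Note (illustrative sketch, not the paper's exact definition): the selection idea in the abstract is to score each prompt by how much the preference between pairs of its responses varies, then keep the highest-scoring prompts. The Python below assumes a Bradley-Terry style preference probability computed from scalar reward scores, and takes the variance over all ordered response pairs. Both choices, along with every function name, are assumptions for illustration.

```python
import itertools
import math
import statistics

def preference_prob(reward_a, reward_b):
    """Bradley-Terry style probability that response A is preferred over B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def preference_variance(rewards):
    """Variance of the pairwise preference probability for one prompt.

    `rewards` holds one scalar score per sampled response (assumed to come
    from a reward model). All ordered pairs are used, so the mean is 0.5.
    """
    probs = [preference_prob(a, b) for a, b in itertools.permutations(rewards, 2)]
    return statistics.pvariance(probs)

def select_top_pvar(prompt_rewards, fraction):
    """Keep the given fraction of prompts with the highest preference variance."""
    ranked = sorted(prompt_rewards,
                    key=lambda name: preference_variance(prompt_rewards[name]),
                    reverse=True)
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:keep]

# Hypothetical reward scores for three sampled responses per prompt.
prompt_rewards = {
    "flat": [1.0, 1.0, 1.0],      # responses are indistinguishable
    "spread": [0.0, 5.0, -5.0],   # responses are clearly ranked
}
print(select_top_pvar(prompt_rewards, fraction=0.5))
```

A prompt whose responses all score the same has every pairwise probability at 0.5 and therefore zero variance, which matches the abstract's point that such prompts contribute only small gradient updates.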

Title: GatePro: Parameter-Free Expert Selection Optimization for Mixture-of-Experts Models

Authors: Chen Zheng, Yuhang Cai, Deyi Liu, Jin Ma, Yiyuan Ma, Yuan Yang, Jing Liu, Yutao Zeng, Xun Zhou, Siyuan Qiao
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13079
Pdf URL: https://arxiv.org/pdf/2510.13079
Copy Paste: [[2510.13079]] GatePro: Parameter-Free Expert Selection Optimization for Mixture-of-Experts Models(https://arxiv.org/abs/2510.13079)
Keywords: language model
Abstract: Modern large language models leverage Mixture-of-Experts (MoE) architectures for efficient scaling, but face a critical challenge: functionally similar experts are often selected simultaneously, creating redundant computation and limiting effective model capacity. Existing auxiliary balance loss methods improve token distribution but fail to address the underlying expert diversity problem. We introduce GatePro, a novel parameter-free method that directly promotes expert selection diversity. GatePro identifies the most similar expert pairs and introduces localized competition mechanisms, preventing redundant expert co-activation while maintaining natural expert specialization. Our comprehensive evaluation demonstrates GatePro's effectiveness across model scales and benchmarks. Analysis demonstrates GatePro's ability to achieve enhanced expert diversity, where experts develop more distinct and complementary capabilities, avoiding functional redundancy. The approach is hot-swappable: it can be enabled during any training phase without additional learnable parameters, offering a practical solution for improving MoE effectiveness.
摘要：现代大型语言模型利用专家混合 (MoE) 架构来实现高效扩展，但面临着一个严峻的挑战：功能相似的专家通常会同时被选择，从而产生冗余计算并限制有效模型容量。现有的辅助平衡损失方法改善了代币分配，但未能解决潜在的专家多样性问题。我们引入了 GatePro，一种新颖的无参数方法，可以直接促进专家选择的多样性。 GatePro识别最相似的专家对并引入本地化竞争机制，防止冗余的专家共同激活，同时保持自然的专家专业化。我们的综合评估证明了 GatePro 在模型规模和基准方面的有效性。分析表明，GatePro 能够实现增强的专家多样性，即专家开发出更加独特和互补的能力，避免功能冗余。这种方法可以在任何训练阶段进行热插拔部署，无需额外的可学习参数，为提高 MoE 效率提供了实用的解决方案。

Title: ESI: Epistemic Uncertainty Quantification via Semantic-preserving Intervention for Large Language Models

Authors: Mingda Li, Xinyu Li, Weinan Zhang, Longxuan Ma
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13103
Pdf URL: https://arxiv.org/pdf/2510.13103
Copy Paste: [[2510.13103]] ESI: Epistemic Uncertainty Quantification via Semantic-preserving Intervention for Large Language Models(https://arxiv.org/abs/2510.13103)
Keywords: language model, llm
Abstract: Uncertainty Quantification (UQ) is a promising approach to improve model reliability, yet quantifying the uncertainty of Large Language Models (LLMs) is non-trivial. In this work, we establish a connection between the uncertainty of LLMs and their invariance under semantic-preserving intervention from a causal perspective. Building on this foundation, we propose a novel grey-box uncertainty quantification method that measures the variation in model outputs before and after the semantic-preserving intervention. Through theoretical justification, we show that our method provides an effective estimate of epistemic uncertainty. Our extensive experiments, conducted across various LLMs and a variety of question-answering (QA) datasets, demonstrate that our method excels not only in terms of effectiveness but also in computational efficiency.
摘要：不确定性量化 (UQ) 是一种很有前途的提高模型可靠性的方法，但量化大型语言模型 (LLM) 的不确定性并非易事。在这项工作中，我们从因果角度建立了法学硕士的不确定性与其在语义保留干预下的不变性之间的联系。在此基础上，我们提出了一种新颖的灰盒不确定性量化方法，该方法可以测量语义保留干预前后模型输出的变化。通过理论论证，我们表明我们的方法提供了认知不确定性的有效估计。我们在各种法学硕士和各种问答（QA）数据集上进行的广泛实验表明，我们的方法不仅在有效性方面而且在计算效率方面都很出色。

Title: Multi-Label Clinical Text Eligibility Classification and Summarization System

Authors: Surya Tejaswi Yerramsetty, Almas Fathimah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13115
Pdf URL: https://arxiv.org/pdf/2510.13115
Copy Paste: [[2510.13115]] Multi-Label Clinical Text Eligibility Classification and Summarization System(https://arxiv.org/abs/2510.13115)
Keywords: language model, gpt, llm
Abstract: Clinical trials are central to medical progress because they help improve understanding of human health and the healthcare system. They play a key role in discovering new ways to detect, prevent, or treat diseases, and it is essential that clinical trials include participants with appropriate and diverse medical backgrounds. In this paper, we propose a system that leverages Natural Language Processing (NLP) and Large Language Models (LLMs) to automate multi-label clinical text eligibility classification and summarization. The system combines feature extraction methods such as word embeddings (Word2Vec) and named entity recognition to identify relevant medical concepts, along with traditional vectorization techniques such as count vectorization and TF-IDF (Term Frequency-Inverse Document Frequency). We further explore weighted TF-IDF word embeddings that integrate both count-based and embedding-based strengths to capture term importance effectively. Multi-label classification using Random Forest and SVM models is applied to categorize documents based on eligibility criteria. Summarization techniques including TextRank, Luhn, and GPT-3 are evaluated to concisely summarize eligibility requirements. Evaluation with ROUGE scores demonstrates the effectiveness of the proposed methods. This system shows potential for automating clinical trial eligibility assessment using data-driven approaches, thereby improving research efficiency.
摘要：临床试验对于医学进步至关重要，因为它们有助于增进对人类健康和医疗保健系统的了解。它们在发现检测、预防或治疗疾病的新方法方面发挥着关键作用，临床试验必须包括具有适当和多样化医学背景的参与者。在本文中，我们提出了一种利用自然语言处理（NLP）和大型语言模型（LLM）来自动进行多标签临床文本资格分类和摘要的系统。该系统结合了词嵌入 (Word2Vec) 和命名实体识别等特征提取方法来识别相关医学概念，以及计数向量化和 TF-IDF（词频-逆文档频率）等传统向量化技术。我们进一步探索加权 TF-IDF 词嵌入，它集成了基于计数和基于嵌入的优势，以有效地捕获术语重要性。使用随机森林和 SVM 模型的多标签分类用于根据资格标准对文档进行分类。评估包括 TextRank、Luhn 和 GPT-3 在内的汇总技术，以简洁地总结资格要求。用 ROUGE 评分进行评估证明了所提出方法的有效性。该系统显示出使用数据驱动方法自动化临床试验资格评估的潜力，从而提高研究效率。

Title: Stable LLM Ensemble: Interaction between Example Representativeness and Diversity

Authors: Junichiro Niimi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13143
Pdf URL: https://arxiv.org/pdf/2510.13143
Copy Paste: [[2510.13143]] Stable LLM Ensemble: Interaction between Example Representativeness and Diversity(https://arxiv.org/abs/2510.13143)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have achieved remarkable results in a wide range of domains. However, the accuracy and robustness of one-shot LLM predictions remain highly sensitive to the examples and the diversity among ensemble members. This study systematically investigates the effects of example representativeness (one-shot strategy) and output diversity (sampling temperature) on LLM ensemble performance. Two one-shot strategies are compared, centroid-based representative examples (proposed) and randomly sampled examples (baseline), and the sampling temperature is also varied. The proposed approach with a higher temperature setting significantly outperforms random selection by +7.6% (macro-F1) and -10.5% (RMSE). Furthermore, the proposed model exceeds 5-shot prompting by +21.1% (macro-F1) and -24.0% (RMSE). Our findings demonstrate that combining representative example selection with increased temperature provides the appropriate level of diversity to the ensemble. This work highlights the practical importance of both example selection and controlled diversity in designing effective one-shot LLM ensembles.
摘要：大型语言模型（LLM）在广泛的领域取得了显着的成果。然而，一次性 LLM 预测的准确性和鲁棒性仍然对示例和集合成员之间的多样性高度敏感。本研究系统地研究了实例代表性（一次性策略）和输出多样性（采样温度）对 LLM 集成性能的影响。比较了两种一次性策略：基于质心的代表性示例（建议）和随机采样示例（基线），并且采样温度也不同。所提出的具有较高温度设置的方法明显优于随机选择+7.6%（宏观F1）和-10.5%（RMSE）。此外，所提出的模型超过 5 次射击提示 +21.1%（宏观 F1）和 -24.0%（RMSE）。我们的研究结果表明，将代表性示例选择与升高的温度相结合可以为整体提供适当水平的多样性。这项工作强调了示例选择和受控多样性在设计有效的一次性法学硕士集成中的实际重要性。
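
Note (illustrative sketch, under stated assumptions): "centroid-based representative examples" can be read as picking the candidate whose embedding lies closest to the mean of all candidate embeddings. The Python below shows that reading. The use of squared Euclidean distance, the toy 2-D vectors, and the function names are assumptions; the abstract does not specify the embedding model or the distance.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def most_representative(vectors):
    """Index of the vector closest to the centroid of the whole set."""
    center = centroid(vectors)
    return min(range(len(vectors)),
               key=lambda i: squared_distance(vectors[i], center))

# Hypothetical 2-D embeddings of four candidate one-shot examples.
embeddings = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [10.0, 10.0]]
print(most_representative(embeddings))
```

The selected example would then serve as the single in-context demonstration, with diversity across ensemble members coming from the sampling temperature rather than from varying the example.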

Title: I Am Aligned, But With Whom? MENA Values Benchmark for Evaluating Cultural Alignment and Multilingual Bias in LLMs

Authors: Pardis Sadat Zahraei, Ehsaneddin Asgari
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13154
Pdf URL: https://arxiv.org/pdf/2510.13154
Copy Paste: [[2510.13154]] I Am Aligned, But With Whom? MENA Values Benchmark for Evaluating Cultural Alignment and Multilingual Bias in LLMs(https://arxiv.org/abs/2510.13154)
Keywords: language model, llm, prompt
Abstract: We introduce MENAValues, a novel benchmark designed to evaluate the cultural alignment and multilingual biases of large language models (LLMs) with respect to the beliefs and values of the Middle East and North Africa (MENA) region, an underrepresented area in current AI evaluation efforts. Drawing from large-scale, authoritative human surveys, we curate a structured dataset that captures the sociocultural landscape of MENA with population-level response distributions from 16 countries. To probe LLM behavior, we evaluate diverse models across multiple conditions formed by crossing three perspective framings (neutral, personalized, and third-person/cultural observer) with two language modes (English and localized native languages: Arabic, Persian, Turkish). Our analysis reveals three critical phenomena: "Cross-Lingual Value Shifts" where identical questions yield drastically different responses based on language, "Reasoning-Induced Degradation" where prompting models to explain their reasoning worsens cultural alignment, and "Logit Leakage" where models refuse sensitive questions while internal probabilities reveal strong hidden preferences. We further demonstrate that models collapse into simplistic linguistic categories when operating in native languages, treating diverse nations as monolithic entities. MENAValues offers a scalable framework for diagnosing cultural misalignment, providing both empirical insights and methodological tools for developing more culturally inclusive AI.
摘要：我们推出了 MENAValues，这是一个新颖的基准，旨在评估大型语言模型 (LLM) 相对于中东和北非 (MENA) 地区信仰和价值观的文化一致性和多语言偏见，该地区是当前人工智能评估工作中代表性不足的地区。根据大规模、权威的人类调查，我们整理了一个结构化数据集，该数据集捕捉了中东和北非地区的社会文化景观以及来自 16 个国家的人口水平响应分布。为了探究法学硕士的行为，我们评估了多种条件下的不同模型，这些模型是通过将三种视角框架（中立的、个性化的和第三人称/文化观察者）与两种语言模式（英语和本地化母语：阿拉伯语、波斯语、土耳其语）交叉而形成的。我们的分析揭示了三个关键现象：“跨语言价值转移”，即相同的问题会根据语言产生截然不同的反应；“推理引起的退化”，即促使模型解释其推理会恶化文化一致性；以及“逻辑泄漏”，即模型拒绝敏感问题，而内部概率则揭示出强烈的隐藏偏好。我们进一步证明，当以母语运行时，模型会陷入简单的语言类别，将不同的国家视为单一的实体。 MENAValues 提供了一个可扩展的框架来诊断文化失调，为开发更具文化包容性的人工智能提供经验见解和方法工具。

Title: Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference

Authors: Nikhil Bhendawade, Kumari Nishu, Arnav Kundu, Chris Bartels, Minsik Cho, Irina Belousova
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13161
Pdf URL: https://arxiv.org/pdf/2510.13161
Copy Paste: [[2510.13161]] Mirror Speculative Decoding: Breaking the Serial Barrier in LLM Inference(https://arxiv.org/abs/2510.13161)
Keywords: llm
Abstract: Speculative decoding accelerates LLM inference by using a draft model to look ahead, but gains are capped by the cost of autoregressive draft generation: increasing draft size elevates acceptance rates but introduces additional latency overhead, exacerbating the speed-accuracy tradeoff. Prior methods (Medusa, Hydra, EAGLE) partially reduce draft cost but either degrade acceptance or introduce overheads that limit scaling. We present Mirror Speculative Decoding (Mirror-SD), an inference algorithm that breaks the latency-acceptance tradeoff. Mirror-SD launches branch-complete rollouts from early-exit signals in parallel with the target model's suffix and explicitly maps computation across heterogeneous accelerators (GPU and NPU) to exploit cross-device parallelism. The draft speculates forward continuations for the target to verify, while the target simultaneously speculates correction paths for the draft, converting speculation into two complementary execution pipelines. To further cut draft latency without weakening acceptance semantics, we add speculative streaming so the draft emits multiple tokens per step. This dual strategy of parallel heterogeneous execution plus multi-token speculative streaming pushes speculative decoding toward its ideal regime of high acceptance with low overhead. On SpecBench with server-scale models from 14B to 66B parameters, Mirror-SD delivers consistent end-to-end gains, achieving 2.8x-5.8x wall-time speedups across diverse tasks and a 30% average relative improvement over the strongest baseline, EAGLE3.
摘要：推测性解码通过使用草稿模型进行预测来加速 LLM 推理，但收益受到自回归草稿生成成本的限制：增加草稿大小会提高接受率，但会引入额外的延迟开销，从而加剧速度与准确性的权衡。先前的方法（Medusa、Hydra、EAGLE）部分降低了草稿成本，但要么降低了接受度，要么引入了限制扩展的开销。我们提出了镜像推测解码（Mirror-SD），这是一种打破延迟与接受权衡的推理算法。 Mirror-SD 从早期退出信号与目标模型的后缀并行启动分支完成部署，并显式映射跨异构加速器（GPU 和 NPU）的计算以利用跨设备并行性。草案推测目标要验证的前向延续，而目标同时推测草案的纠正路径，将推测转换为两个互补的执行管道。为了进一步减少草稿延迟而不削弱接受语义，我们添加了推测流，以便草稿每一步发出多个令牌。这种并行异构执行加上多令牌推测流的双重策略将推测解码推向其高接受度和低开销的理想状态。在具有从 14B 到 66B 参数的服务器规模模型的 SpecBench 上，Mirror-SD 提供一致的端到端增益，在不同任务中实现 2.8 倍至 5.8 倍的墙时间加速，并且比最强基线 EAGLE3 平均相对改进 30%。

Title: A Matter of Representation: Towards Graph-Based Abstract Code Generation

Authors: Nyx Iskandar, Hisham Bedri, Andy Tsen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13163
Pdf URL: https://arxiv.org/pdf/2510.13163
Copy Paste: [[2510.13163]] A Matter of Representation: Towards Graph-Based Abstract Code Generation(https://arxiv.org/abs/2510.13163)
Keywords: language model, llm
Abstract: Most large language models (LLMs) today excel at generating raw, sequential code with minimal abstractions and custom structures. However, there has been little work on graph-based abstract code generation, where significant logic is encapsulated in predefined nodes and execution flow is determined by edges. This is relevant for visual programming languages, and in cases where raw source code is inaccessible to users and LLM training sets. In this work, we propose and evaluate JSON representations for graphs to enable high accuracy graph-based abstract code generation. We evaluate these representations on ScratchTest, a mini-benchmark based on our custom Python re-implementation of Scratch, which tests the LLM in code graph space. Our findings demonstrate that LLMs can indeed perform the aforementioned generation task in a single pass without relying on specialized or complex pipelines, given the correct graph representations. We also show that different representations induce significantly different accuracies, highlighting the instrumental role of representations in this generation task. All in all, this work establishes the first steps towards representation learning for graph-based abstract code generation.
摘要：如今，大多数大型语言模型 (LLM) 都擅长生成具有最少抽象和自定义结构的原始顺序代码。然而，基于图的抽象代码生成方面的工作还很少，其中重要的逻辑被封装在预定义的节点中，执行流程由边确定。这与可视化编程语言以及用户和 LLM 训练集无法访问原始源代码的情况相关。在这项工作中，我们提出并评估图的 JSON 表示，以实现高精度的基于图的抽象代码生成。我们在 ScratchTest 上评估这些表示，这是一个基于我们自定义的 Scratch Python 重新实现的迷你基准，它在代码图空间中测试 LLM。我们的研究结果表明，在给定正确的图形表示的情况下，法学硕士确实可以一次性执行上述生成任务，而无需依赖专门或复杂的管道。我们还表明，不同的表示会导致显着不同的准确性，突出了表示在该生成任务中的重要作用。总而言之，这项工作为基于图的抽象代码生成的表示学习迈出了第一步。
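
Note (illustrative sketch, not the paper's schema): in graph-based abstract code, the logic lives in predefined nodes and the edges fix the execution order. The Python below interprets a tiny JSON graph of that kind. The schema (`nodes`, `edges`, `entry`), the three operations, and the function name are hypothetical and chosen only to make the idea concrete; they are not the representations evaluated on ScratchTest.

```python
import json

# Hypothetical JSON graph: each node is a predefined operation,
# and each edge [src, dst] says which node runs next.
GRAPH_JSON = """
{
  "nodes": {
    "start":  {"op": "set", "value": 2},
    "triple": {"op": "mul", "value": 3},
    "inc":    {"op": "add", "value": 1}
  },
  "edges": [["start", "triple"], ["triple", "inc"]],
  "entry": "start"
}
"""

def run_graph(graph):
    """Walk the graph from its entry node and apply each node's operation.

    Assumes every node has at most one outgoing edge and the graph is acyclic.
    """
    successor = dict(graph["edges"])  # src -> dst
    accumulator = 0
    node_id = graph["entry"]
    while node_id is not None:
        node = graph["nodes"][node_id]
        if node["op"] == "set":
            accumulator = node["value"]
        elif node["op"] == "add":
            accumulator += node["value"]
        elif node["op"] == "mul":
            accumulator *= node["value"]
        else:
            raise ValueError(f"unknown op: {node['op']}")
        node_id = successor.get(node_id)
    return accumulator

graph = json.loads(GRAPH_JSON)
print(run_graph(graph))
```

An LLM asked to generate such a program emits only the JSON; how nodes and edges are laid out in that JSON is the kind of representation choice the abstract reports as affecting accuracy.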

Title: CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning

Authors: Kehua Feng, Keyan Ding, Zhihui Zhu, Lei Liang, Qiang Zhang, Huajun Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13166
Pdf URL: https://arxiv.org/pdf/2510.13166
Copy Paste: [[2510.13166]] CoT-Evo: Evolutionary Distillation of Chain-of-Thought for Scientific Reasoning(https://arxiv.org/abs/2510.13166)
Keywords: language model, llm, chain-of-thought
Abstract: While chain-of-thought (CoT) distillation from advanced large language models (LLMs) has proven effective in general reasoning tasks, it struggles in scientific domains where even advanced models often produce incorrect or superficial reasoning due to high complexity and specialized knowledge requirements. Directly distilling from such flawed outputs results in low-quality training data and limits the performance of smaller student models. To overcome this, we propose CoT-Evo, an evolutionary CoT distillation framework. It begins by constructing a diverse pool of reasoning trajectories from multiple LLM thinkers, enriches them with automatically retrieved domain knowledge, and iteratively refines the trajectories using novelty-driven selection, reflective recombination and mutation. The refinement is guided by a fitness function that evaluates answer correctness, coherence, and effective knowledge utilization. This results in a high-quality CoT dataset tailored for scientific reasoning. We employ this evolved dataset to fine-tune a compact model, which achieves state-of-the-art performance on scientific reasoning benchmarks. Our work establishes a scalable approach to synthesizing high-fidelity scientific reasoning data from diverse and fallible LLMs.
摘要：虽然高级大语言模型 (LLM) 的思想链 (CoT) 蒸馏已被证明在一般推理任务中有效，但它在科学领域却举步维艰，因为由于高度复杂性和专业知识要求，即使是高级模型也经常产生不正确或肤浅的推理。直接从这些有缺陷的输出中提取会导致训练数据质量低下，并限制较小学生模型的性能。为了克服这个问题，我们提出了 CoT-Evo，一种进化的 CoT 蒸馏框架。它首先从多个法学硕士思想家构建多样化的推理轨迹池，通过自动检索的领域知识丰富它们，并使用新颖性驱动的选择、反思重组和突变迭代地细化轨迹。细化是由适应度函数引导的，该函数评估答案的正确性、连贯性和有效的知识利用。这会产生专为科学推理而定制的高质量 CoT 数据集。我们利用这个进化的数据集来微调紧凑的模型，该模型在科学推理基准上实现了最先进的性能。我们的工作建立了一种可扩展的方法来合成来自不同且易犯错误的法学硕士的高保真科学推理数据。

Title: Putting on the Thinking Hats: A Survey on Chain of Thought Fine-tuning from the Perspective of Human Reasoning Mechanism

Authors: Xiaoshu Chen, Sihang Zhou, Ke Liang, Duanyang Yuan, Haoyuan Chen, Xiaoyu Sun, Linyuan Meng, Xinwang Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13170
Pdf URL: https://arxiv.org/pdf/2510.13170
Copy Paste: [[2510.13170]] Putting on the Thinking Hats: A Survey on Chain of Thought Fine-tuning from the Perspective of Human Reasoning Mechanism(https://arxiv.org/abs/2510.13170)
Keywords: language model, llm
Abstract: Chain of thought (CoT) fine-tuning aims to endow large language models (LLMs) with reasoning capabilities by training them on curated reasoning traces. It leverages both supervised and reinforced fine-tuning to cultivate human-like reasoning skills in LLMs, including detailed planning, divergent thinking, intuitive judgment, timely reflection, internal thinking, and fact perception. As CoT fine-tuning has advanced, LLMs have demonstrated substantial improvements in tasks such as mathematical reasoning and code generation. However, existing surveys about CoT fine-tuning primarily focus on technical aspects and overlook a systematic analysis from the perspective of human reasoning mechanisms. Given that the ultimate goal of CoT fine-tuning is to enable LLMs to reason like humans, it is crucial to investigate this technique through the lens of human cognition. To fill this gap, we present the first comprehensive survey of CoT fine-tuning grounded in human reasoning theory. Specifically, inspired by the well-known Six Thinking Hats framework, which systematically characterizes common human thinking modes using six metaphorical hats, we classify and examine CoT fine-tuning methods through this lens. Furthermore, building upon this theory, we outline potential directions for future research in CoT fine-tuning. In addition, we compile a comprehensive overview of existing datasets and model performances, and we maintain a real-time GitHub repository (this https URL) that continuously tracks recent advances in this area. We hope this survey will serve as a valuable resource to inspire innovation and foster progress in this rapidly evolving field.
摘要：思想链（CoT）微调旨在通过在策划的推理轨迹上训练大型语言模型（LLM）来赋予其推理能力。它利用监督微调和强化微调来培养法学硕士的类人推理能力，包括详细规划、发散思维、直觉判断、及时反思、内部思维和事实感知等。随着CoT微调的推进，法学硕士在数学推理和代码生成等任务上表现出了实质性的改进。然而，现有关于CoT微调的研究主要集中在技术方面，而忽视了从人类推理机制角度进行系统分析。鉴于 CoT 微调的最终目标是使法学硕士能够像人类一样进行推理，因此从人类认知的角度研究这项技术至关重要。为了填补这一空白，我们提出了第一个基于人类推理理论的 CoT 微调的全面调查。具体来说，受到著名的六顶思考帽框架的启发，该框架使用六顶隐喻帽子系统地描述了人类常见的思维模式，我们通过这个镜头对 CoT 微调方法进行分类和检查。此外，基于这一理论，我们概述了 CoT 微调未来研究的潜在方向。此外，我们还对现有数据集和模型性能进行了全面概述，并维护了一个实时 GitHub 存储库 \footnote{this https URL}，持续跟踪该领域的最新进展。我们希望这项调查能够成为激发创新并促进这个快速发展的领域取得进步的宝贵资源。

Title: DSCD: Large Language Model Detoxification with Self-Constrained Decoding

Authors: Ming Dong, Jinkui Zhang, Bolong Zheng, Xinhui Tu, Po Hu, Tingting He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13183
Pdf URL: https://arxiv.org/pdf/2510.13183
Copy Paste: [[2510.13183]] DSCD: Large Language Model Detoxification with Self-Constrained Decoding(https://arxiv.org/abs/2510.13183)
Keywords: language model, llm, hallucination
Abstract: Detoxification in large language models (LLMs) remains a significant research challenge. Existing decoding detoxification methods are all based on external constraints, which require additional resource overhead and lose generation fluency. This work proposes Detoxification with Self-Constrained Decoding (DSCD), a novel method for LLM detoxification without parameter fine-tuning. DSCD strengthens the inner next-token distribution of the safety layer while weakening that of hallucination and toxic layers during output generation. This effectively diminishes toxicity and enhances output safety. DSCD is lightweight, highly compatible, and plug-and-play, readily integrating with existing detoxification methods for further performance improvement. Extensive experiments on representative open-source LLMs and public datasets validate DSCD's effectiveness, demonstrating state-of-the-art (SOTA) performance in both detoxification and generation fluency, with superior efficiency compared to existing methods. These results highlight DSCD's potential as a practical and scalable solution for safer LLM deployments.
摘要：大型语言模型（LLM）中的解毒仍然是一个重大的研究挑战。现有的解码解毒方法都是基于外部约束，需要额外的资源开销并且损失生成流畅度。这项工作提出了自约束解码解毒（DSCD），这是一种无需参数微调的 LLM 解毒新方法。 DSCD 加强了安全层的内部下一个代币分布，同时削弱了输出生成过程中的幻觉层和有毒层的分布。这有效地降低了毒性并提高了输出安全性。 DSCD 具有轻量级、高兼容性和即插即用功能，可轻松与现有的解毒方法集成，以进一步提高性能。对代表性开源法学硕士和公共数据集进行的大量实验验证了 DSCD 的有效性，展示了在解毒和生成流畅性方面最先进的 (SOTA) 性能，与现有方法相比具有卓越的效率。这些结果凸显了 DSCD 作为一种实用且可扩展的解决方案的潜力，可实现更安全的 LLM 部署。

Title: SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs

Authors: Juan Ren, Mark Dras, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13190
Pdf URL: https://arxiv.org/pdf/2510.13190
Copy Paste: [[2510.13190]] SHIELD: Classifier-Guided Prompting for Robust and Safer LVLMs(https://arxiv.org/abs/2510.13190)
Keywords: language model, prompt
Abstract: Large Vision-Language Models (LVLMs) unlock powerful multimodal reasoning but also expand the attack surface, particularly through adversarial inputs that conceal harmful goals in benign prompts. We propose SHIELD, a lightweight, model-agnostic preprocessing framework that couples fine-grained safety classification with category-specific guidance and explicit actions (Block, Reframe, Forward). Unlike binary moderators, SHIELD composes tailored safety prompts that enforce nuanced refusals or safe redirection without retraining. Across five benchmarks and five representative LVLMs, SHIELD consistently lowers jailbreak and non-following rates while preserving utility. Our method is plug-and-play, incurs negligible overhead, and is easily extendable to new attack types -- serving as a practical safety patch for both weakly and strongly aligned LVLMs.
摘要：大型视觉语言模型 (LVLM) 解锁了强大的多模态推理，但也扩大了攻击面，特别是通过在良性提示中隐藏有害目标的对抗性输入。我们提出了 SHIELD，这是一种轻量级的、与模型无关的预处理框架，它将细粒度的安全分类与特定类别的指导和明确的操作（阻止、重构、转发）结合起来。与二进制版主不同，SHIELD 会编写量身定制的安全提示，强制执行细致入微的拒绝或安全重定向，而无需重新培训。在五个基准测试和五个代表性 LVLM 中，SHIELD 持续降低越狱率和不跟随率，同时保持实用性。我们的方法是即插即用的，产生的开销可以忽略不计，并且可以轻松扩展到新的攻击类型——作为弱对齐和强对齐 LVLM 的实用安全补丁。
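
Note on the SHIELD entry above: the classify-then-act flow described in the abstract can be pictured with a minimal Python sketch. Only the three action names (Block, Reframe, Forward) come from the abstract. The category names, the keyword-based `classify` stub, the `POLICY` table, and the prompt wording are hypothetical placeholders, not the authors' classifier or prompts.

```python
from enum import Enum


class Action(Enum):
    BLOCK = "block"
    REFRAME = "reframe"
    FORWARD = "forward"


# Hypothetical category -> (action, guidance text) table.
POLICY = {
    "violence": (Action.BLOCK, "Decline the request and briefly explain why."),
    "medical": (Action.REFRAME, "Give general safety information only."),
    "benign": (Action.FORWARD, ""),
}

# Hypothetical keyword lists standing in for a trained safety classifier.
KEYWORDS = {
    "violence": ("weapon", "attack"),
    "medical": ("dosage", "overdose"),
}


def classify(text):
    """Return the first category whose keyword appears in the text."""
    lowered = text.lower()
    for category, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return category
    return "benign"


def compose_prompt(user_text):
    """Pick an action for the input and prepend category-specific guidance."""
    action, guidance = POLICY[classify(user_text)]
    if action is Action.FORWARD:
        return action, user_text  # pass benign input through unchanged
    return action, f"[Safety guidance: {guidance}]\n{user_text}"
```

The sketch runs before the vision-language model is called, so the model itself needs no retraining, which is the property the abstract emphasizes.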

Title: Grounding Long-Context Reasoning with Contextual Normalization for Retrieval-Augmented Generation

Authors: Jiamin Chen, Yuchen Li, Xinyu Ma, Xinran Chen, Xiaokun Zhang, Shuaiqiang Wang, Chen Ma, Dawei Yin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13191
Pdf URL: https://arxiv.org/pdf/2510.13191
Copy Paste: [[2510.13191]] Grounding Long-Context Reasoning with Contextual Normalization for Retrieval-Augmented Generation(https://arxiv.org/abs/2510.13191)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) has become an essential approach for extending the reasoning and knowledge capacity of large language models (LLMs). While prior research has primarily focused on retrieval quality and prompting strategies, the influence of how the retrieved documents are framed, i.e., context format, remains underexplored. We show that seemingly superficial choices, such as delimiters or structural markers in key-value extraction, can induce substantial shifts in accuracy and stability, even when semantic content is identical. To systematically investigate this effect, we design controlled experiments that vary context density, delimiter styles, and positional placement, revealing the underlying factors that govern performance differences. Building on these insights, we introduce Contextual Normalization, a lightweight strategy that adaptively standardizes context representations before generation. Extensive experiments on both controlled and real-world RAG benchmarks across diverse settings demonstrate that the proposed strategy consistently improves robustness to order variation and strengthens long-context utilization. These findings underscore that reliable RAG depends not only on retrieving the right content, but also on how that content is presented, offering both new empirical evidence and a practical technique for better long-context reasoning.
摘要：检索增强生成（RAG）已成为扩展大型语言模型（LLM）推理和知识能力的重要方法。虽然先前的研究主要集中在检索质量和提示策略上，但检索文档的构建方式（即上下文格式）的影响仍未得到充分探索。我们表明，即使语义内容相同，看似肤浅的选择（例如键值提取中的分隔符或结构标记）也可能导致准确性和稳定性的重大变化。为了系统地研究这种影响，我们设计了改变上下文密度、分隔符样式和位置放置的对照实验，揭示了控制性能差异的潜在因素。基于这些见解，我们引入了上下文标准化，这是一种轻量级策略，可以在生成之前自适应地标准化上下文表示。对不同设置下的受控和现实世界 RAG 基准进行的广泛实验表明，所提出的策略持续提高了对顺序变化的鲁棒性，并增强了长上下文利用率。这些发现强调，可靠的 RAG 不仅取决于检索正确的内容，还取决于内容的呈现方式，从而为更好的长上下文推理提供新的经验证据和实用技术。

Title: StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation

Authors: Xi Chen, Yuchen Song, Satoshi Nakamura
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13194
Pdf URL: https://arxiv.org/pdf/2510.13194
Copy Paste: [[2510.13194]] StressTransfer: Stress-Aware Speech-to-Speech Translation with Emphasis Preservation(https://arxiv.org/abs/2510.13194)
Keywords: llm
Abstract: We propose a stress-aware speech-to-speech translation (S2ST) system that preserves word-level emphasis by leveraging LLMs for cross-lingual emphasis conversion. Our method translates source-language stress into target-language tags that guide a controllable TTS model. To overcome data scarcity, we develop a pipeline to automatically generate aligned training data and introduce "LLM-as-Judge" for evaluation. Experiments show our approach substantially outperforms baselines in preserving emphasis while maintaining comparable translation quality, speaker intent, and naturalness. Our work highlights the importance of prosody in translation and provides an effective, data-efficient solution for preserving paralinguistic cues in S2ST.
摘要：我们提出了一种压力感知语音到语音翻译（S2ST）系统，该系统通过利用法学硕士进行跨语言强调转换来保留单词级强调。我们的方法将源语言重音转换为目标语言标签，以指导可控的 TTS 模型。为了克服数据稀缺的问题，我们开发了一个管道来自动生成对齐的训练数据，并引入“LLM-as-Judge”进行评估。实验表明，我们的方法在保留重点、同时保持可比较的翻译质量、说话者意图和自然度方面远远优于基线。我们的工作强调了韵律在翻译中的重要性，并为保留 S2ST 中的副语言线索提供了有效、数据高效的解决方案。

Title: Text Anomaly Detection with Simplified Isolation Kernel

Authors: Yang Cao, Sikun Yang, Yujiu Yang, Lianyong Qi, Ming Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13197
Pdf URL: https://arxiv.org/pdf/2510.13197
Copy Paste: [[2510.13197]] Text Anomaly Detection with Simplified Isolation Kernel(https://arxiv.org/abs/2510.13197)
Keywords: language model
Abstract: Two-step approaches combining pre-trained large language model embeddings and anomaly detectors demonstrate strong performance in text anomaly detection by leveraging rich semantic representations. However, high-dimensional dense embeddings extracted by large language models pose challenges due to substantial memory requirements and high computation time. To address this challenge, we introduce the Simplified Isolation Kernel (SIK), which maps high-dimensional dense embeddings to lower-dimensional sparse representations while preserving crucial anomaly characteristics. SIK has linear time complexity and significantly reduces space complexity through its innovative boundary-focused feature mapping. Experiments across 7 datasets demonstrate that SIK achieves better detection performance than 11 state-of-the-art (SOTA) anomaly detection algorithms while maintaining computational efficiency and low memory cost. All code and demonstrations are available at this https URL.
摘要：结合预先训练的大型语言模型嵌入和异常检测器的两步方法通过利用丰富的语义表示在文本异常检测方面展示了强大的性能。然而，由于大量的内存需求和高计算时间，大型语言模型提取的高维密集嵌入带来了挑战。为了应对这一挑战，我们引入了简化隔离内核（SIK），它将高维密集嵌入映射到低维稀疏表示，同时保留关键的异常特征。 SIK 具有线性时间复杂度，并通过其创新的以边界为中心的特征映射显着降低了空间复杂度。跨 7 个数据集的实验表明，SIK 比 11 种最先进的 (SOTA) 异常检测算法实现了更好的检测性能，同时保持了计算效率和较低的内存成本。所有代码和演示都可以在此 https URL 中找到。

Title: LLM-Guided Synthetic Augmentation (LGSA) for Mitigating Bias in AI Systems

Authors: Sai Suhruth Reddy Karri, Yashwanth Sai Nallapuneni, Laxmi Narasimha Reddy Mallireddy, Gopichand G
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13202
Pdf URL: https://arxiv.org/pdf/2510.13202
Copy Paste: [[2510.13202]] LLM-Guided Synthetic Augmentation (LGSA) for Mitigating Bias in AI Systems(https://arxiv.org/abs/2510.13202)
Keywords: language model, llm, prompt
Abstract: Bias in AI systems, especially those relying on natural language data, raises ethical and practical concerns. Underrepresentation of certain groups often leads to uneven performance across demographics. Traditional fairness methods, such as pre-processing, in-processing, and post-processing, depend on protected-attribute labels, involve accuracy-fairness trade-offs, and may not generalize across datasets. To address these challenges, we propose LLM-Guided Synthetic Augmentation (LGSA), which uses large language models to generate counterfactual examples for underrepresented groups while preserving label integrity. We evaluated LGSA on a controlled dataset of short English sentences with gendered pronouns, professions, and binary classification labels. Structured prompts were used to produce gender-swapped paraphrases, followed by quality control including semantic similarity checks, attribute verification, toxicity screening, and human spot checks. The augmented dataset expanded training coverage and was used to train a classifier under consistent conditions. Results show that LGSA reduces performance disparities without compromising accuracy. The baseline model achieved 96.7 percent accuracy with a 7.2 percent gender bias gap. Simple swap augmentation reduced the gap to 0.7 percent but lowered accuracy to 95.6 percent. LGSA achieved 99.1 percent accuracy with a 1.9 percent bias gap, improving performance on female-labeled examples. These findings demonstrate that LGSA is an effective strategy for bias mitigation, enhancing subgroup balance while maintaining high task accuracy and label fidelity.
摘要：人工智能系统中的偏见，尤其是那些依赖自然语言数据的系统，引发了道德和实际问题。某些群体的代表性不足往往会导致不同人群的表现参差不齐。传统的公平性方法，例如预处理、处理中和后处理，依赖于受保护的属性标签，涉及准确性与公平性的权衡，并且可能无法跨数据集泛化。为了应对这些挑战，我们提出了法学硕士引导的综合增强（LGSA），它使用大型语言模型为代表性不足的群体生成反事实示例，同时保持标签完整性。我们在包含性别代词、职业和二元分类标签的英语短句子的受控数据集上评估了 LGSA。使用结构化提示来生成性别交换的释义，然后进行质量控制，包括语义相似性检查、属性验证、毒性筛选和人体抽查。增强的数据集扩大了训练覆盖范围，并用于在一致条件下训练分类器。结果表明，LGSA 在不影响准确性的情况下减少了性能差异。基线模型的准确率达到 96.7%，性别偏见差距为 7.2%。简单的交换增强将差距缩小到 0.7%，但将准确率降低到 95.6%。 LGSA 的准确率达到 99.1%，偏差差距为 1.9%，提高了女性标签示例的性能。这些研究结果表明，LGSA 是一种有效的缓解偏差策略，可以增强子组平衡，同时保持较高的任务准确性和标签保真度。
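
Note on the LGSA entry above: the abstract reports overall accuracy alongside a "gender bias gap" but does not define the gap. The Python sketch below assumes it is the difference between the best and worst per-group accuracy. That definition, the function names, and the toy records are illustrative assumptions rather than the paper's evaluation code.

```python
from collections import defaultdict


def subgroup_accuracy(records):
    """records: iterable of (group, y_true, y_pred) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, y_true, y_pred in records:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


def bias_gap(records):
    """Assumed definition: best minus worst per-group accuracy."""
    acc = subgroup_accuracy(records)
    return max(acc.values()) - min(acc.values())


# Toy data, invented for illustration only.
toy = [
    ("female", 1, 1), ("female", 0, 1), ("female", 1, 1), ("female", 0, 0),
    ("male", 1, 1), ("male", 0, 0), ("male", 1, 1), ("male", 0, 0),
]
print(subgroup_accuracy(toy))  # per-group accuracy
print(bias_gap(toy))           # gap between the two groups
```

Under this definition a method can narrow the gap while lowering overall accuracy, which is the trade-off the abstract reports for simple swap augmentation.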

Title: Hierarchical Frequency Tagging Probe (HFTP): A Unified Approach to Investigate Syntactic Structure Representations in Large Language Models and the Human Brain

Authors: Jingmin An, Yilong Song, Ruolin Yang, Nai Ding, Lingxi Lu, Yuxuan Wang, Wei Wang, Chu Zhuang, Qian Wang, Fang Fang
Subjects: cs.CL, cs.NE
Abstract URL: https://arxiv.org/abs/2510.13255
Pdf URL: https://arxiv.org/pdf/2510.13255
Copy Paste: [[2510.13255]] Hierarchical Frequency Tagging Probe (HFTP): A Unified Approach to Investigate Syntactic Structure Representations in Large Language Models and the Human Brain(https://arxiv.org/abs/2510.13255)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) demonstrate human-level or even superior language abilities, effectively modeling syntactic structures, yet the specific computational modules responsible remain unclear. A key question is whether LLM behavioral capabilities stem from mechanisms akin to those in the human brain. To address these questions, we introduce the Hierarchical Frequency Tagging Probe (HFTP), a tool that utilizes frequency-domain analysis to identify neuron-wise components of LLMs (e.g., individual Multilayer Perceptron (MLP) neurons) and cortical regions (via intracranial recordings) encoding syntactic structures. Our results show that models such as GPT-2, Gemma, Gemma 2, Llama 2, Llama 3.1, and GLM-4 process syntax in analogous layers, while the human brain relies on distinct cortical regions for different syntactic levels. Representational similarity analysis reveals a stronger alignment between LLM representations and the left hemisphere of the brain (dominant in language processing). Notably, upgraded models exhibit divergent trends: Gemma 2 shows greater brain similarity than Gemma, while Llama 3.1 shows less alignment with the brain compared to Llama 2. These findings offer new insights into the interpretability of LLM behavioral improvements, raising questions about whether these advancements are driven by human-like or non-human-like mechanisms, and establish HFTP as a valuable tool bridging computational linguistics and cognitive neuroscience. This project is available at this https URL.
摘要：大型语言模型（LLM）展示了人类水平甚至卓越的语言能力，可以有效地建模句法结构，但负责的具体计算模块仍不清楚。一个关键问题是 LLM 的行为能力是否源于类似于人脑的机制。为了解决这些问题，我们引入了分层频率标记探针（HFTP），该工具利用频域分析来识别 LLM 的神经元组成部分（例如，单个多层感知器（MLP）神经元）和编码句法结构的皮质区域（通过颅内记录）。我们的结果表明，GPT-2、Gemma、Gemma 2、Llama 2、Llama 3.1 和 GLM-4 等模型在类似层中处理语法，而人脑则依赖于不同的皮质区域来处理不同的语法级别。表征相似性分析揭示了 LLM 表征与大脑左半球（在语言处理中占主导地位）之间更强的一致性。值得注意的是，升级后的模型表现出不同的趋势：Gemma 2 显示出比 Gemma 更大的大脑相似性，而与 Llama 2 相比，Llama 3.1 显示出与大脑的一致性较差。这些发现为 LLM 行为改进的可解释性提供了新的见解，提出了这些进步是否由类人或非类人机制驱动的问题，并将 HFTP 确立为连接计算语言学和认知神经科学的重要工具。该项目可通过此 https URL 获取。

Title: Do You Get the Hint? Benchmarking LLMs on the Board Game Concept

Authors: Ine Gevers, Walter Daelemans
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13271
Pdf URL: https://arxiv.org/pdf/2510.13271
Copy Paste: [[2510.13271]] Do You Get the Hint? Benchmarking LLMs on the Board Game Concept(https://arxiv.org/abs/2510.13271)
Keywords: language model, llm
Abstract: Large language models (LLMs) have achieved striking successes on many benchmarks, yet recent studies continue to expose fundamental weaknesses. In particular, tasks that require abstract reasoning remain challenging, often because they use representations such as grids, symbols, or visual patterns that differ from the natural language data LLMs are trained on. In this paper, we introduce Concept, a simple word-guessing board game, as a benchmark for probing abductive reasoning in a representation that is much closer to LLM pre-training data: natural language. Our results show that this game, easily solved by humans (with a success rate of over 90%), is still very challenging for state-of-the-art LLMs (no model exceeds 40% success rate). Specifically, we observe that LLMs struggle with interpreting other players' strategic intents, and with correcting initial hypotheses given sequential information updates. In addition, we extend the evaluation across multiple languages, and find that the LLM performance drops further in lower-resource languages (Dutch, French, and Spanish) compared to English.
摘要：大型语言模型（LLM）在许多基准测试中取得了惊人的成功，但最近的研究继续暴露出根本性的弱点。特别是，需要抽象推理的任务仍然具有挑战性，通常是因为它们使用的表示形式（例如网格、符号或视觉模式）与法学硕士所训练的自然语言数据不同。在本文中，我们介绍了 Concept，一种简单的猜词棋盘游戏，作为在更接近 LLM 预训练数据：自然语言的表示中探索溯因推理的基准。我们的结果表明，这个游戏很容易被人类解决（成功率超过 90%），但对于最先进的法学硕士来说仍然非常具有挑战性（没有模型超过 40% 的成功率）。具体来说，我们观察到法学硕士很难解释其他参与者的战略意图，也很难根据连续的信息更新来纠正最初的假设。此外，我们将评估扩展到多种语言，发现与英语相比，资源较低的语言（荷兰语、法语和西班牙语）的法学硕士成绩进一步下降。

Title: Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation

Authors: Zhichao Xu, Zongyu Wu, Yun Zhou, Aosong Feng, Kang Zhou, Sangmin Woo, Kiran Ramnath, Yijun Tian, Xuan Qi, Weikang Qiu, Lin Lee Cheong, Haibo Ding
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13272
Pdf URL: https://arxiv.org/pdf/2510.13272
Copy Paste: [[2510.13272]] Beyond Correctness: Rewarding Faithful Reasoning in Retrieval-Augmented Generation(https://arxiv.org/abs/2510.13272)
Keywords: language model, llm, retrieval-augmented generation, chain-of-thought, agent
Abstract: Inspired by the success of reinforcement learning (RL) in Large Language Model (LLM) training for domains like math and code, recent works have begun exploring how to train LLMs to use search engines more effectively as tools for retrieval-augmented generation. Although these methods achieve performance improvement across QA benchmarks, many prioritize final answer correctness while overlooking the quality of intermediate reasoning steps, which may lead to chain-of-thought unfaithfulness. In this paper, we first introduce a comprehensive evaluation framework for evaluating RL-based search agents, covering three distinct faithfulness metrics: information-think faithfulness, think-answer faithfulness, and think-search faithfulness. Our evaluations reveal that a prototypical RL-based search agent, Search-R1, has significant room for improvement in this regard. To foster faithful reasoning, we introduce VERITAS (Verifying Entailed Reasoning through Intermediate Traceability in Agentic Search), a novel framework that integrates fine-grained faithfulness rewards into the reinforcement learning process. Our experiments show that models trained with VERITAS not only significantly improve reasoning faithfulness, but also achieve comparable task performance across seven QA benchmarks.
摘要：受强化学习 (RL) 在数学和代码等领域的大型语言模型 (LLM) 培训中取得成功的启发，最近的工作已经开始探索如何训练 LLM 更有效地使用搜索引擎作为检索增强生成的工具。尽管这些方法在整个 QA 基准测试中实现了性能提升，但许多方法优先考虑最终答案的正确性，而忽视了中间推理步骤的质量，这可能会导致思想链的不忠实。在本文中，我们首先介绍了一个用于评估基于强化学习的搜索代理的综合评估框架，涵盖三个不同的忠实度指标：信息思考忠实度、思考答案忠实度和思考搜索忠实度。我们的评估表明，基于 RL 的原型搜索代理 Search-R1 在这方面有很大的改进空间。为了促进忠实推理，我们引入了 VERITAS（通过代理搜索中的中间可追溯性验证隐含推理），这是一个新颖的框架，它将细粒度的忠实奖励集成到强化学习过程中。我们的实验表明，使用 VERITAS 训练的模型不仅显着提高了推理可信度，而且在七个 QA 基准中实现了可比的任务性能。

Title: In-Distribution Steering: Balancing Control and Coherence in Language Model Generation

Authors: Arthur Vogels, Benjamin Wong, Yann Choho, Annabelle Blangero, Milan Bhan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13285
Pdf URL: https://arxiv.org/pdf/2510.13285
Copy Paste: [[2510.13285]] In-Distribution Steering: Balancing Control and Coherence in Language Model Generation(https://arxiv.org/abs/2510.13285)
Keywords: language model, llm
Abstract: Activation steering methods control large language model (LLM) behavior by modifying internal activations at inference time. However, most existing activation steering methods rely on a fixed steering strength, leading to either insufficient control or unadapted intervention that degrades text plausibility and coherence. We introduce In-Distribution Steering (IDS), a novel method that adapts steering strength based on the input data distribution in representation space. IDS dynamically adjusts interventions according to how far a given input lies within the distribution, enabling adaptive intervention and generation stability during text generation. Experiments demonstrate that IDS achieves strong accuracy on classification tasks while producing coherent text without collapse, making IDS particularly well suited for real-world applications.
摘要：激活引导方法通过在推理时修改内部激活来控制大语言模型 (LLM) 行为。然而，大多数现有的激活引导方法依赖于固定的引导强度，导致控制不足或不适应的干预，从而降低文本的合理性和连贯性。我们引入分布内转向（IDS），这是一种根据表示空间中的输入数据分布来调整转向强度的新颖方法。 IDS 根据给定输入在分布范围内的距离动态调整干预措施，从而在文本生成过程中实现自适应干预和生成稳定性。实验表明，IDS 在分类任务上实现了极高的准确性，同时生成不崩溃的连贯文本，使得 IDS 特别适合实际应用。
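
Note on the In-Distribution Steering entry above: fixed-strength steering adds a scaled direction vector to a hidden activation. The Python sketch below shows that baseline next to a strength that shrinks until the steered activation stays within a radius of a reference point. The backoff rule, the `center` and `radius` parameters, and all names are hypothetical stand-ins used to illustrate input-dependent strength. They are not the IDS algorithm, which the abstract does not specify.

```python
import math


def steer(h, v, alpha):
    """Fixed-strength steering: h + alpha * v, element by element."""
    return [hi + alpha * vi for hi, vi in zip(h, v)]


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def capped_steer(h, v, alpha_max, center, radius, shrink=0.5, max_steps=20):
    """Hypothetical rule: halve alpha until the result lies within
    `radius` of `center`. Returns (steered activation, alpha used)."""
    alpha = alpha_max
    for _ in range(max_steps):
        out = steer(h, v, alpha)
        if distance(out, center) <= radius:
            return out, alpha
        alpha *= shrink
    return list(h), 0.0  # no acceptable strength found: leave h unchanged


h = [0.0, 0.0]   # toy activation
v = [1.0, 0.0]   # toy steering direction
out, alpha = capped_steer(h, v, alpha_max=4.0, center=[0.0, 0.0], radius=1.5)
print(out, alpha)
```

With a fixed strength of 4.0 the toy activation would land far from the reference point. The capped variant settles on a smaller strength for this input, which is the kind of trade-off between control and coherence the abstract describes.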

Title: Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan's Intelligent Interaction Systems

Authors: Xuxin Cheng, Ke Zeng, Zhiquan Cao, Linyi Dai, Wenxuan Gao, Fei Han, Ai Jian, Feng Hong, Wenxing Hu, Zihe Huang, Dejian Kong, Jia Leng, Zhuoyuan Liao, Pei Liu, Jiaye Lin, Xing Ma, Jingqing Ruan, Jiaxing Song, Xiaoyu Tan, Ruixuan Xiao, Wenhui Yu, Wenyu Zhan, Haoxing Zhang, Chao Zhou, Hao Zhou, Shaodong Zheng, Ruinian Chen, Siyuan Chen, Ziyang Chen, Yiwen Dong, Yaoyou Fan, Yangyi Fang, Yang Gan, Shiguang Guo, Qi He, Chaowen Hu, Binghui Li, Dailin Li, Xiangyu Li, Yan Li, Chengjian Liu, Xiangfeng Liu, Jiahui Lv, Qiao Ma, Jiang Pan, Cong Qin, Chenxing Sun, Wen Sun, Zhonghui Wang, Abudukelimu Wuerkaixi, Xin Yang, Fangyi Yuan, Yawen Zhu, Tianyi Zhai, Jie Zhang, Runlai Zhang, Yao Xu, Yiran Zhao, Yifan Wang, Xunliang Cai, Yangen Hu, Cao Liu, Lu Pan, Xiaoli Wang, Bo Xiao, Wenyuan Yao, Qianlin Zhou, Benchang Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13291
Pdf URL: https://arxiv.org/pdf/2510.13291
Copy Paste: [[2510.13291]] Higher Satisfaction, Lower Cost: A Technical Report on How LLMs Revolutionize Meituan's Intelligent Interaction Systems(https://arxiv.org/abs/2510.13291)
Keywords: language model, llm, agent
Abstract: Enhancing customer experience is essential for business success, particularly as service demands grow in scale and complexity. Generative artificial intelligence and Large Language Models (LLMs) have empowered intelligent interaction systems to deliver efficient, personalized, and 24/7 support. In practice, intelligent interaction systems encounter several challenges: (1) Constructing high-quality data for cold-start training is difficult, hindering self-evolution and raising labor costs. (2) Multi-turn dialogue performance remains suboptimal due to inadequate intent understanding, rule compliance, and solution extraction. (3) Frequent evolution of business rules affects system operability and transferability, constraining low-cost expansion and adaptability. (4) Reliance on a single LLM is insufficient in complex scenarios, where the absence of multi-agent frameworks and effective collaboration undermines process completeness and service quality. (5) The open-domain nature of multi-turn dialogues, lacking unified golden answers, hampers quantitative evaluation and continuous optimization. To address these challenges, we introduce WOWService, an intelligent interaction system tailored for industrial applications. With the integration of LLMs and multi-agent architectures, WOWService enables autonomous task management and collaborative problem-solving. Specifically, WOWService focuses on core modules including data construction, general capability enhancement, business scenario adaptation, multi-agent coordination, and automated evaluation. Currently, WOWService is deployed on the Meituan App, achieving significant gains in key metrics, e.g., User Satisfaction Metric 1 (USM 1) -27.53% and User Satisfaction Metric 2 (USM 2) +25.51%, demonstrating its effectiveness in capturing user needs and advancing personalized service.
摘要：增强客户体验对于业务成功至关重要，尤其是在服务需求规模和复杂性不断增长的情况下。生成式人工智能和大型语言模型 (LLM) 使智能交互系统能够提供高效、个性化和 24/7 的支持。在实践中，智能交互系统遇到了几个挑战：（1）构建高质量的冷启动训练数据很困难，阻碍了自我进化并提高了劳动力成本。 (2) 由于意图理解、规则遵从和解决方案提取不足，多轮对话性能仍然不理想。 (3)业务规则的频繁演进影响系统的可操作性和可移植性，制约了低成本的扩展和适应性。（4）在复杂的场景中，仅仅依赖单一的LLM是不够的，缺乏多主体框架和有效的协作会损害流程的完整性和服务质量。（5）多轮对话的开放性，缺乏统一的黄金答案，阻碍了量化评估和持续优化。为了应对这些挑战，我们推出了WOWService，一个专为工业应用量身定制的智能交互系统。通过 LLM 和多代理架构的集成，WOWService 实现了自主任务管理和协作解决问题。具体来说，WOWService重点关注数据构建、通用能力增强、业务场景适配、多Agent协同、自动化评估等核心模块。目前，WOWService已部署在美团App上，在用户满意度指标1（USM 1）-27.53%和用户满意度指标2（USM 2）+25.51%等关键指标上取得了显着的提升，证明了其在捕捉用户需求和推进个性化服务方面的有效性。

Title: Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models

Authors: Yizhou Peng, Yukun Ma, Chong Zhang, Yi-Wen Chao, Chongjia Ni, Bin Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13293
Pdf URL: https://arxiv.org/pdf/2510.13293
Copy Paste: [[2510.13293]] Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models(https://arxiv.org/abs/2510.13293)
Keywords: language model, prompt
Abstract: While Text-to-Speech (TTS) systems can achieve fine-grained control over emotional expression via natural language prompts, a significant challenge emerges when the desired emotion (style prompt) conflicts with the semantic content of the text. This mismatch often results in unnatural-sounding speech, undermining the goal of achieving fine-grained emotional control. Classifier-Free Guidance (CFG) is a key technique for enhancing prompt alignment; however, its application to auto-regressive (AR) TTS models remains underexplored, which can lead to degraded audio quality. This paper directly addresses the challenge of style-content mismatch in AR TTS models by proposing an adaptive CFG scheme that adjusts to different levels of the detected mismatch, as measured using large language models or natural language inference models. This solution is based on a comprehensive analysis of CFG's impact on emotional expressiveness in state-of-the-art AR TTS models. Our results demonstrate that the proposed adaptive CFG scheme improves the emotional expressiveness of the AR TTS model while maintaining audio quality and intelligibility.
摘要：虽然文本转语音 (TTS) 系统可以通过自然语言提示实现对情感表达的细粒度控制，但当所需的情感（风格提示）与文本的语义内容发生冲突时，就会出现重大挑战。这种不匹配通常会导致讲话听起来不自然，从而破坏了实现精细情绪控制的目标。无分类器引导（CFG）是增强提示对齐的关键技术；然而，其在自回归 (AR) TTS 模型中的应用仍未得到充分探索，这可能会导致音频质量下降。本文通过提出一种自适应 CFG 方案，直接解决 AR TTS 模型中风格内容不匹配的挑战，该方案可根据使用大型语言模型或自然语言推理模型测量的不同级别的检测到的不匹配进行调整。该解决方案基于对 CFG 对最先进的 AR TTS 模型中情感表达的影响的全面分析。我们的结果表明，所提出的自适应 CFG 方案提高了 AR TTS 模型的情感表达力，同时保持了音频质量和清晰度。

Title: LLM one-shot style transfer for Authorship Attribution and Verification

Authors: Pablo Miralles-González, Javier Huertas-Tato, Alejandro Martín, David Camacho
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13302
Pdf URL: https://arxiv.org/pdf/2510.13302
Copy Paste: [[2510.13302]] LLM one-shot style transfer for Authorship Attribution and Verification(https://arxiv.org/abs/2510.13302)
Keywords: llm, prompt
Abstract: Computational stylometry analyzes writing style through quantitative patterns in text, supporting applications from forensic tasks such as identity linking and plagiarism detection to literary attribution in the humanities. Supervised and contrastive approaches rely on data with spurious correlations and often confuse style with topic. Despite their natural use in AI-generated text detection, the CLM pre-training of modern LLMs has been scarcely leveraged for general authorship problems. We propose a novel unsupervised approach based on this extensive pre-training and the in-context learning capabilities of LLMs, employing the log-probabilities of an LLM to measure style transferability from one text to another. Our method significantly outperforms LLM prompting approaches of comparable scale and achieves higher accuracy than contrastively trained baselines when controlling for topical correlations. Moreover, performance scales fairly consistently with the size of the base model and, in the case of authorship verification, with an additional mechanism that increases test-time computation, enabling flexible trade-offs between computational cost and accuracy.
摘要：计算文体计量学通过文本中的定量模式来分析写作风格，支持从身份链接和抄袭检测等取证任务到人文学科中的文学归属的应用。监督方法和对比方法依赖于具有虚假相关性的数据，并且经常将风格与主题混淆。尽管它们自然地用于人工智能生成的文本检测，但现代法学硕士的 CLM 预训练几乎没有被用来解决一般的作者问题。我们基于这种广泛的预训练和法学硕士的上下文学习能力，提出了一种新颖的无监督方法，利用法学硕士的对数概率来衡量从一个文本到另一个文本的风格可迁移性。我们的方法显着优于同等规模的法学硕士提示方法，并且在控制主题相关性时比经过对比训练的基线获得更高的准确性。此外，性能的扩展与基本模型的大小相当一致，并且在作者身份验证的情况下，还具有增加测试时间计算的附加机制；实现计算成本和准确性之间的灵活权衡。

Title: ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering

Authors: Simon Lupart, Mohammad Aliannejadi, Evangelos Kanoulas
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.13312
Pdf URL: https://arxiv.org/pdf/2510.13312
Copy Paste: [[2510.13312]] ChatR1: Reinforcement Learning for Conversational Reasoning and Retrieval Augmented Question Answering(https://arxiv.org/abs/2510.13312)
Keywords: llm, chat
Abstract: We present ChatR1, a reasoning framework based on reinforcement learning (RL) for conversational question answering (CQA). Reasoning plays an important role in CQA, where user intent evolves across dialogue turns, and utterances are often underspecified, requiring contextual interpretation, query reformulation, and dynamic coordination between retrieval and generation. Unlike static 'rewrite, retrieve, and generate' pipelines, ChatR1 interleaves search and reasoning across turns, enabling exploratory and adaptive behaviors learned through RL. To address the challenge of sparse and delayed rewards in RL, we propose an intent-aware reward that provides turn-level feedback by aligning retrieval and reasoning with evolving user goals. Our proposed ChatR1 demonstrates strong performance on both 3B and 7B model backbones, outperforming competitive models on five CQA datasets, measured by different metrics (F1, BERTScore, and LLM-as-judge). We include a diverse set of CQA datasets to cover topic shifts, evolving intents, mixed-initiative dialogues, and multi-document grounding, testing ChatR1's performance from various aspects. Ablation studies confirm the effectiveness of the intent-aware reward. Our analyses further reveal diverse reasoning trajectories and effective use of the search tool. ChatR1 also generalizes robustly across domains, demonstrating that RL-based reasoning enables more flexible and context-sensitive behavior than static CQA pipelines.
摘要：我们提出了 ChatR1，一个基于强化学习 (RL) 的会话问答 (CQA) 推理框架。推理在 CQA 中发挥着重要作用，其中用户意图随着对话的轮次而演变，并且话语通常不明确，需要上下文解释、查询重新表述以及检索和生成之间的动态协调。与静态的“重写、检索和生成”管道不同，ChatR1 交替进行搜索和推理，从而实现通过 RL 学习的探索性和适应性行为。为了解决强化学习中奖励稀疏和延迟的挑战，我们提出了一种意图感知奖励，通过将检索和推理与不断变化的用户目标保持一致来提供回合级反馈。我们提出的 ChatR1 在 3B 和 7B 模型主干上都表现出了强大的性能，在五个 CQA 数据集上的表现优于竞争模型，通过不同的指标（F1、BERTScore 和 LLM-as-judge）来衡量。我们包含了一组多样化的 CQA 数据集，涵盖主题转移、不断变化的意图、混合主动对话和多文档基础，从各个方面测试 ChatR1 的性能。消融研究证实了意图感知奖励的有效性。我们的分析进一步揭示了不同的推理轨迹和搜索工具的有效使用。 ChatR1 还可以跨领域稳健地推广，证明基于 RL 的推理比静态 CQA 管道能够实现更灵活和上下文敏感的行为。

Title: Embedding-Based Context-Aware Reranker

Authors: Ye Yuan, Mohammad Amin Shabani, Siqi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13329
Pdf URL: https://arxiv.org/pdf/2510.13329
Copy Paste: [[2510.13329]] Embedding-Based Context-Aware Reranker(https://arxiv.org/abs/2510.13329)
Keywords: language model, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems rely on retrieving relevant evidence from a corpus to support downstream generation. The common practice of splitting a long document into multiple shorter passages enables finer-grained and targeted information retrieval. However, it also introduces challenges when a correct retrieval would require inference across passages, such as resolving coreference, disambiguating entities, and aggregating evidence scattered across multiple sources. Many state-of-the-art (SOTA) reranking methods, despite utilizing powerful large pretrained language models with potentially high inference costs, still neglect the aforementioned challenges. Therefore, we propose Embedding-Based Context-Aware Reranker (EBCAR), a lightweight reranking framework operating directly on embeddings of retrieved passages with enhanced cross-passage understandings through the structural information of the passages and a hybrid attention mechanism, which captures both high-level interactions across documents and low-level relationships within each document. We evaluate EBCAR against SOTA rerankers on the ConTEB benchmark, demonstrating its effectiveness for information retrieval requiring cross-passage inference and its advantages in both accuracy and efficiency.
摘要：检索增强生成（RAG）系统依赖于从语料库中检索相关证据来支持下游生成。将长文档拆分为多个较短段落的常见做法可以实现更细粒度和有针对性的信息检索。然而，当正确的检索需要跨段落进行推理时，它也会带来挑战，例如解决共指、消除实体歧义以及聚合分散在多个来源的证据。许多最先进的（SOTA）重排序方法尽管利用了强大的大型预训练语言模型（具有潜在的高推理成本），但仍然忽略了上述挑战。因此，我们提出了基于嵌入的上下文感知重排序（EBCAR），这是一种轻量级重排序框架，直接对检索到的段落的嵌入进行操作，通过段落的结构信息和混合注意机制增强跨段落理解，捕获跨文档的高级交互和每个文档内的低级关系。我们在 ConTEB 基准上针对 SOTA 重排序器对 EBCAR 进行了评估，证明了其对于需要跨段落推理的信息检索的有效性以及其在准确性和效率方面的优势。

Title: Taming the Fragility of KV Cache Eviction in LLM Inference

Authors: Yuan Feng, Haoyu Guo, JunLin Lv, S. Kevin Zhou, Xike Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13334
Pdf URL: https://arxiv.org/pdf/2510.13334
Copy Paste: [[2510.13334]] Taming the Fragility of KV Cache Eviction in LLM Inference(https://arxiv.org/abs/2510.13334)
Keywords: language model, llm
Abstract: Large language models have revolutionized natural language processing, yet their deployment remains hampered by the substantial memory and runtime overhead of the transformer's Key-Value cache. To mitigate this, recent methods employ a scoring-aggregation framework to evict unimportant cache entries, based on the stability assumption that a fixed subset of entries remains consistently important during generation. However, prior work has largely focused on refining importance indicators for scoring, while defaulting to mean aggregation out of trust in the stability assumption. In this work, we argue that this underlying assumption is inherently fragile, making mean aggregation highly vulnerable in extreme cases. To counter this, we propose a simple yet elegant defensive aggregation strategy: a two-step, linear-time approach that controls worst-case risk, thereby defending against extreme cases with negligible computational overhead. Embodying this strategy, we propose a novel cache eviction method, DefensiveKV, and its extension, Layer-DefensiveKV, which incorporates layer-wise budget allocation. Across seven task domains (18 datasets), our methods reduce generation quality loss by 2.3x and 4.3x respectively, versus the strongest baseline under a 20% cache size. These results set new performance benchmarks and pioneer a promising direction for optimizing cache eviction against underlying fragility through worst-case risk management. Our code is available at this https URL.
摘要：大型语言模型彻底改变了自然语言处理，但其部署仍然受到转换器键值缓存的大量内存和运行时开销的阻碍。为了缓解这一问题，最近的方法基于稳定性假设（即条目的固定子集在生成过程中始终保持重要），采用评分聚合框架来逐出不重要的缓存条目。然而，之前的工作主要集中在完善评分的重要性指标，同时由于对稳定性假设的忠实信任而默认均值聚合。在这项工作中，我们认为这一基本假设本质上是脆弱的，使得均值聚合在极端情况下非常脆弱。为了解决这个问题，我们提出了一种简单而优雅的防御聚合策略：一种两步线性时间方法，可以控制最坏情况的风险，从而以可忽略的计算开销来防御极端情况。为了体现这一策略，我们提出了一种新颖的缓存驱逐方法，DefenseKV 及其扩展，Layer-DefectiveKV，它结合了逐层预算分配。在 7 个任务域（18 个数据集）中，与 20% 缓存大小下的最强基线相比，我们的方法分别将生成质量损失降低了 2.3 倍和 4.3 倍。这些结果设定了新的性能基准，并开创了通过最坏情况风险管理优化缓存驱逐以应对潜在脆弱性的有希望的方向。我们的代码可以在这个 https URL 上找到。
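The scoring-aggregation idea in the abstract above can be illustrated with a toy example. The sketch below is not the paper's DefensiveKV algorithm (the abstract does not spell out its two steps); it only contrasts mean aggregation with a simple max aggregation, used here as a hypothetical stand-in for a worst-case-aware rule. The function names and the toy attention scores are assumptions made for illustration.

```python
def mean_scores(score_rows):
    """Average each cache entry's importance over the observed queries."""
    n = len(score_rows)
    return [sum(col) / n for col in zip(*score_rows)]


def max_scores(score_rows):
    """Hypothetical worst-case-aware variant: keep each entry's peak importance."""
    return [max(col) for col in zip(*score_rows)]


def keep_indices(aggregated, budget):
    """Indices of the `budget` highest-scoring entries, in original order."""
    ranked = sorted(range(len(aggregated)), key=lambda i: aggregated[i], reverse=True)
    return sorted(ranked[:budget])


# Rows are queries, columns are cache entries (toy importance scores).
# Entry 3 is unimportant for the first two queries but critical for the third.
scores = [
    [0.5, 0.3, 0.2, 0.0],
    [0.5, 0.3, 0.2, 0.0],
    [0.2, 0.1, 0.1, 0.6],
]

mean_keep = keep_indices(mean_scores(scores), budget=2)
max_keep = keep_indices(max_scores(scores), budget=2)
print(mean_keep, max_keep)
```

With a budget of two entries, mean aggregation evicts entry 3 even though one query depends on it heavily, while the peak-based rule retains it. This is the kind of extreme case the abstract says mean aggregation is vulnerable to.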

Title: Are Proverbs the New Pythian Oracles? Exploring Sentiment in Greek Sayings

Authors: Katerina Korre, John Pavlopoulos
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13341
Pdf URL: https://arxiv.org/pdf/2510.13341
Copy Paste: [[2510.13341]] Are Proverbs the New Pythian Oracles? Exploring Sentiment in Greek Sayings(https://arxiv.org/abs/2510.13341)
Keywords: llm
Abstract: Proverbs are among the most fascinating linguistic phenomena that transcend cultural and linguistic boundaries. Yet, much of the global landscape of proverbs remains underexplored, as many cultures preserve their traditional wisdom within their own communities due to the oral tradition of the phenomenon. Taking advantage of the current advances in Natural Language Processing (NLP), we focus on Greek proverbs, analyzing their sentiment. Departing from an annotated dataset of Greek proverbs, we expand it to include local dialects, effectively mapping the annotated sentiment. We present (1) a way to exploit LLMs in order to perform sentiment classification of proverbs, (2) a map of Greece that provides an overview of the distribution of sentiment, (3) a combinatory analysis in terms of the geographic position, dialect, and topic of proverbs. Our findings show that LLMs can provide us with an accurate enough picture of the sentiment of proverbs, especially when approached as a non-conventional sentiment polarity task. Moreover, in most areas of Greece negative sentiment is more prevalent.
摘要：谚语是超越文化和语言界限的最迷人的语言现象之一。然而，全球谚语的大部分景观仍未得到充分探索，因为许多文化由于这种现象的口头传统而在自己的社区内保留了其传统智慧。利用当前自然语言处理 (NLP) 的进步，我们专注于希腊谚语，分析它们的情绪。与带注释的希腊谚语数据集不同，我们将其扩展为包含当地方言，有效地映射带注释的情感。我们提出（1）一种利用法学硕士对谚语进行情感分类的方法，（2）提供情感分布概述的希腊地图，（3）根据谚语的地理位置、方言和主题进行组合分析。我们的研究结果表明，法学硕士可以为我们提供足够准确的谚语情感描述，特别是当作为非传统情感极性任务来处理时。而且，希腊大部分地区负面情绪较为普遍。

Title: Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems

Authors: Karthik Avinash, Nikhil Pareek, Rishav Hada
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13351
Pdf URL: https://arxiv.org/pdf/2510.13351
Copy Paste: [[2510.13351]] Protect: Towards Robust Guardrailing Stack for Trustworthy Enterprise LLM Systems(https://arxiv.org/abs/2510.13351)
Keywords: language model, gpt, llm, prompt
Abstract: The increasing deployment of Large Language Models (LLMs) across enterprise and mission-critical domains has underscored the urgent need for robust guardrailing systems that ensure safety, reliability, and compliance. Existing solutions often struggle with real-time oversight, multi-modal data handling, and explainability, limitations that hinder their adoption in regulated environments. Existing guardrails largely operate in isolation and focus on text alone, making them inadequate for multi-modal, production-scale environments. We introduce Protect, a natively multi-modal guardrailing model built for enterprise-grade deployment that operates seamlessly across text, image, and audio inputs. Protect integrates fine-tuned, category-specific adapters trained via Low-Rank Adaptation (LoRA) on an extensive, multi-modal dataset covering four safety dimensions: toxicity, sexism, data privacy, and prompt injection. Our teacher-assisted annotation pipeline leverages reasoning and explanation traces to generate high-fidelity, context-aware labels across modalities. Experimental results demonstrate state-of-the-art performance across all safety dimensions, surpassing existing open and proprietary models such as WildGuard, LlamaGuard-4, and GPT-4.1. Protect establishes a strong foundation for trustworthy, auditable, and production-ready safety systems capable of operating across text, image, and audio modalities.
摘要：大型语言模型 (LLM) 在企业和关键任务领域的部署日益增多，凸显了对确保安全性、可靠性和合规性的强大护栏系统的迫切需求。现有的解决方案常常难以满足实时监督、多模式数据处理和可解释性的要求，这些限制阻碍了它们在监管环境中的采用。现有的护栏在很大程度上是孤立运行的，仅关注文本，这使得它们不足以适应多模式、生产规模的环境。我们推出了 Protect，这是一种原生多模式护栏模型，旨在跨文本、图像和音频输入无缝操作，专为企业级部署而设计。 Protect 集成了通过低秩适应 (LoRA) 在广泛的多模式数据集上训练的微调的特定类别适配器，涵盖四个安全维度：毒性、性别歧视、数据隐私和即时注入。我们的教师辅助注释管道利用推理和解释轨迹来跨模式生成高保真、上下文感知的标签。实验结果证明了在所有安全维度上的最先进的性能，超越了现有的开放和专有模型，例如 WildGuard、LlamaGuard-4 和 GPT-4.1。 Protect 为能够跨文本、图像和音频模式运行的可信赖、可审核和生产就绪的安全系统奠定了坚实的基础。

Title: D-SMART: Enhancing LLM Dialogue Consistency via Dynamic Structured Memory And Reasoning Tree

Authors: Xiang Lei, Qin Li, Min Zhang, Min Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13363
Pdf URL: https://arxiv.org/pdf/2510.13363
Copy Paste: [[2510.13363]] D-SMART: Enhancing LLM Dialogue Consistency via Dynamic Structured Memory And Reasoning Tree(https://arxiv.org/abs/2510.13363)
Keywords: language model, gpt, llm, retrieval-augmented generation, agent
Abstract: Large Language Models (LLMs) often exhibit factual inconsistencies and logical decay in extended, multi-turn dialogues, a challenge stemming from their reliance on static, pre-trained knowledge and an inability to reason adaptively over the dialogue history. Prevailing mitigation strategies, such as Retrieval-Augmented Generation (RAG) and agentic working memories, improve information recall but still engage with fundamentally static knowledge sources and follow a single pre-defined reasoning path. This hinders their ability to preserve the factual and logical consistency of their responses in multi-turn dialogues as the context evolves over time. To address this issue, we propose D-SMART, a model-agnostic framework designed to maintain multi-turn dialogue consistency by enabling LLMs to build and reason over a dynamic, structured representation of the conversational context. This is achieved via two synergistic components: (1) a Dynamic Structured Memory (DSM), which incrementally constructs and maintains an authoritative, OWL-compliant knowledge graph of the conversation; and (2) a Reasoning Tree (RT), which executes inferences as an explicit and traceable multi-step search over the graph. As the widely used quality score (judged by GPT-4) can overlook logical flaws, we introduce new NLI-based metrics to better measure multi-turn dialogue consistency. Comprehensive experiments on the MT-Bench-101 benchmark show that D-SMART significantly outperforms state-of-the-art baselines, elevating the dialogue consistency score by over 48% for both proprietary and open-source models, and notably improves the quality score of the latter by up to 10.1%.
摘要：大型语言模型（LLM）经常在扩展的多轮对话中表现出事实不一致和逻辑衰退，这是由于它们依赖静态的、预先训练的知识以及无法对对话历史进行自适应推理而面临的挑战。流行的缓解策略，例如检索增强生成（RAG）和代理工作记忆，可以改善信息回忆，但仍然涉及基本静态的知识源并遵循预定义的单一推理路径。这阻碍了他们在多轮对话中保持事实和逻辑一致性的能力，同时上下文随着时间的推移而变化。为了解决这个问题，我们提出了 D-SMART，这是一个与模型无关的框架，旨在通过使法学硕士能够构建和推理对话上下文的动态、结构化表示来保持多轮对话的一致性。这是通过两个协同组件实现的：（1）动态结构化内存（DSM），它逐步构建和维护权威的、符合 OWL 的对话知识图； (2) 推理树 (RT)，它将推理作为对图的显式且可追踪的多步搜索来执行。由于流行使用的质量评分（由 GPT-4 判断）可以忽略逻辑缺陷，因此我们引入了新的基于 NLI 的指标来更好地衡量多轮对话的一致性。 MT-Bench-101 基准测试的综合实验表明，D-SMART 显着优于最先进的基线，将专有模型和开源模型的对话一致性得分提高了 48% 以上，并将后者的质量得分显着提高了 10.1%。

Title: Document Intelligence in the Era of Large Language Models: A Survey

Authors: Weishi Wang, Hengchang Hu, Zhijie Zhang, Zhaochen Li, Hongxin Shao, Daniel Dahlmeier
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13366
Pdf URL: https://arxiv.org/pdf/2510.13366
Copy Paste: [[2510.13366]] Document Intelligence in the Era of Large Language Models: A Survey(https://arxiv.org/abs/2510.13366)
Keywords: language model, llm, agent
Abstract: Document AI (DAI) has emerged as a vital application area, and is significantly transformed by the advent of large language models (LLMs). While earlier approaches relied on encoder-decoder architectures, decoder-only LLMs have revolutionized DAI, bringing remarkable advancements in understanding and generation. This survey provides a comprehensive overview of DAI's evolution, highlighting current research attempts and future prospects of LLMs in this field. We explore key advancements and challenges in multimodal, multilingual, and retrieval-augmented DAI, while also suggesting future research directions, including agent-based approaches and document-specific foundation models. This paper aims to provide a structured analysis of the state-of-the-art in DAI and its implications for both academic and practical applications.
摘要：文档人工智能 (DAI) 已成为一个重要的应用领域，并因大型语言模型 (LLM) 的出现而发生了重大转变。虽然早期的方法依赖于编码器-解码器架构，但仅解码器的法学硕士已经彻底改变了 DAI，在理解和生成方面带来了显着的进步。本次调查全面概述了 DAI 的演变，重点介绍了法学硕士在该领域的当前研究尝试和未来前景。我们探讨了多模式、多语言和检索增强 DAI 的关键进展和挑战，同时还提出了未来的研究方向，包括基于代理的方法和特定于文档的基础模型。本文旨在对 DAI 的最新技术及其对学术和实际应用的影响进行结构化分析。

Title: Make an Offer They Can't Refuse: Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment

Authors: Buwei He, Yang Liu, Zhaowei Zhang, Zixia Jia, Huijia Wu, Zhaofeng He, Zilong Zheng, Yipeng Kang
Subjects: cs.CL, cs.GT
Abstract URL: https://arxiv.org/abs/2510.13387
Pdf URL: https://arxiv.org/pdf/2510.13387
Copy Paste: [[2510.13387]] Make an Offer They Can't Refuse: Grounding Bayesian Persuasion in Real-World Dialogues without Pre-Commitment(https://arxiv.org/abs/2510.13387)
Keywords: language model, llm, prompt, agent
Abstract: Persuasion, a fundamental social capability for humans, remains a challenge for AI systems such as large language models (LLMs). Current studies often overlook the strategic use of information asymmetry in message design or rely on strong assumptions regarding pre-commitment. In this work, we explore the application of Bayesian Persuasion (BP) in natural language within single-turn dialogue settings, to enhance the strategic persuasion capabilities of LLMs. Our framework incorporates a commitment-communication mechanism, where the persuader explicitly outlines an information schema by narrating their potential types (e.g., honest or dishonest), thereby guiding the persuadee in performing the intended Bayesian belief update. We evaluate two variants of our approach: Semi-Formal-Natural-Language (SFNL) BP and Fully-Natural-Language (FNL) BP, benchmarking them against both naive and strong non-BP (NBP) baselines within a comprehensive evaluation framework. This framework covers a diverse set of persuadees -- including LLM instances with varying prompts and fine-tuning and human participants -- across tasks ranging from specially designed persuasion scenarios to general everyday situations. Experimental results on LLM-based agents reveal three main findings: (1) LLMs guided by BP strategies consistently achieve higher persuasion success rates than NBP baselines; (2) SFNL exhibits greater credibility and logical coherence, while FNL shows stronger emotional resonance and robustness in naturalistic conversations; (3) with supervised fine-tuning, smaller models can attain BP performance comparable to that of larger models.
摘要：说服力是人类的一项基本社交能力，对于大型语言模型 (LLM) 等人工智能系统来说仍然是一个挑战。目前的研究经常忽视信息设计中信息不对称的战略运用，或者依赖于有关预先承诺的强有力的假设。在这项工作中，我们探索贝叶斯说服（BP）在单轮对话环境中自然语言中的应用，以增强法学硕士的战略说服能力。我们的框架采用了承诺沟通机制，说服者通过叙述其潜在类型（例如，诚实或不诚实）来明确概述信息模式，从而指导被说服者执行预期的贝叶斯信念更新。我们评估了我们方法的两种变体：半正式自然语言（SFNL）BP和完全自然语言（FNL）BP，在综合评估框架内将它们与朴素和强大的非BP（NBP）基线进行基准比较。该框架涵盖了各种说服者，包括具有不同提示和微调的法学硕士实例以及人类参与者，任务范围从专门设计的说服场景到一般的日常情况。基于 LLM 的代理的实验结果揭示了三个主要发现：（1）在 BP 策略指导下的 LLM 始终比 NBP 基线取得更高的说服成功率；（2）SFNL在自然主义对话中表现出更强的可信度和逻辑连贯性，而FNL则表现出更强的情感共鸣和鲁棒性； (3) 通过监督微调，较小的模型可以获得与较大模型相当的 BP 性能。
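The Bayesian belief update that the persuadee is meant to perform can be made concrete with a small worked example. The sketch below uses the textbook prosecutor-style setup rather than anything from the paper: the state names, the prior, and the signalling probabilities are all assumptions chosen for illustration.

```python
def posterior(prior, likelihood, signal):
    """Bayes' rule: P(state | signal) for every state.

    prior:      dict state -> P(state)
    likelihood: dict state -> dict signal -> P(signal | state)
    """
    joint = {s: prior[s] * likelihood[s][signal] for s in prior}
    evidence = sum(joint.values())
    return {s: joint[s] / evidence for s in joint}


# Hypothetical committed information schema: the persuader always sends
# "recommend" when the offer is good, and sometimes when it is bad.
prior = {"good": 0.3, "bad": 0.7}
schema = {
    "good": {"recommend": 1.0, "warn": 0.0},
    "bad": {"recommend": 3 / 7, "warn": 4 / 7},
}

belief = posterior(prior, schema, "recommend")
print(belief)
```

Because the persuadee knows the schema, hearing "recommend" raises their belief that the offer is good from 0.3 to about 0.5, even though the persuader sometimes recommends bad offers. Communicating such a schema in natural language is the role the abstract assigns to the commitment-communication mechanism.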

Title: Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models

Authors: Agnese Lombardi, Alessandro Lenci
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13395
Pdf URL: https://arxiv.org/pdf/2510.13395
Copy Paste: [[2510.13395]] Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models(https://arxiv.org/abs/2510.13395)
Keywords: language model, gpt, llm, agent
Abstract: Language is fundamental to human cooperation, facilitating not only the exchange of information but also the coordination of actions through shared interpretations of situational contexts. This study explores whether the Generative Agent-Based Model (GABM) Concordia can effectively model Theory of Mind (ToM) within simulated real-world environments. Specifically, we assess whether this framework successfully simulates ToM abilities and whether GPT-4 can perform tasks by making genuine inferences from social context, rather than relying on linguistic memorization. Our findings reveal a critical limitation: GPT-4 frequently fails to select actions based on belief attribution, suggesting that apparent ToM-like abilities observed in previous studies may stem from shallow statistical associations rather than true reasoning. Additionally, the model struggles to generate coherent causal effects from agent actions, exposing difficulties in processing complex social interactions. These results challenge current statements about emergent ToM-like capabilities in LLMs and highlight the need for more rigorous, action-based evaluation frameworks.
摘要：语言是人类合作的基础，它不仅促进信息交流，而且通过对情境的共同解释来协调行动。本研究探讨了基于生成代理的模型 (GABM) Concordia 是否可以在模拟的现实环境中有效地模拟心理理论 (ToM)。具体来说，我们评估该框架是否成功模拟 ToM 能力，以及 GPT-4 是否可以通过从社会背景中进行真正的推理来执行任务，而不是依赖于语言记忆。我们的研究结果揭示了一个关键的局限性：GPT-4 经常无法根据信念归因来选择行动，这表明在之前的研究中观察到的明显的类似 ToM 的能力可能源于浅层的统计关联，而不是真正的推理。此外，该模型难以从代理行为中产生连贯的因果效应，暴露出处理复杂社会互动的困难。这些结果挑战了目前关于法学硕士中类似 ToM 能力的说法，并强调需要更严格、基于行动的评估框架。

Title: Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps

Authors: Ahmed Alzubaidi, Shaikha Alsuwaidi, Basma El Amel Boussaha, Leen AlQadi, Omar Alkaabi, Mohammed Alyafeai, Hamza Alobeidli, Hakim Hacid
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13430
Pdf URL: https://arxiv.org/pdf/2510.13430
Copy Paste: [[2510.13430]] Evaluating Arabic Large Language Models: A Survey of Benchmarks, Methods, and Gaps(https://arxiv.org/abs/2510.13430)
Keywords: language model, llm
Abstract: This survey provides the first systematic review of Arabic LLM benchmarks, analyzing 40+ evaluation benchmarks across NLP tasks, knowledge domains, cultural understanding, and specialized capabilities. We propose a taxonomy organizing benchmarks into four categories: Knowledge, NLP Tasks, Culture and Dialects, and Target-Specific evaluations. Our analysis reveals significant progress in benchmark diversity while identifying critical gaps: limited temporal evaluation, insufficient multi-turn dialogue assessment, and cultural misalignment in translated datasets. We examine three primary approaches (native collection, translation, and synthetic generation), discussing their trade-offs regarding authenticity, scale, and cost. This work serves as a comprehensive reference for Arabic NLP researchers, providing insights into benchmark methodologies, reproducibility standards, and evaluation metrics while offering recommendations for future development.
摘要：这项调查首次对阿拉伯法学硕士基准进行了系统回顾，分析了 NLP 任务、知识领域、文化理解和专业能力的 40 多个评估基准。我们提出了一种分类法，将基准分为四类：知识、NLP 任务、文化和方言以及针对特定目标的评估。我们的分析揭示了基准多样性方面的重大进展，同时确定了关键差距：有限的时间评估、不足的多轮对话评估以及翻译数据集中的文化错位。我们研究了三种主要方法：本地收集、翻译和合成生成，讨论它们在真实性、规模和成本方面的权衡。这项工作为阿拉伯自然语言处理研究人员提供了全面的参考，提供了对基准方法、可重复性标准和评估指标的见解，同时为未来的发展提供了建议。

Title: Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation

Authors: Hao Wang, Linlong Xu, Heng Liu, Yangyang Liu, Xiaohu Zhao, Bo Zeng, Liangying Shao, Longyue Wang, Weihua Luo, Kaifu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13434
Pdf URL: https://arxiv.org/pdf/2510.13434
Copy Paste: [[2510.13434]] Beyond Single-Reward: Multi-Pair, Multi-Perspective Preference Optimization for Machine Translation(https://arxiv.org/abs/2510.13434)
Keywords: language model, llm, hallucination
Abstract: Direct Preference Optimization (DPO) is a powerful paradigm for aligning Large Language Models (LLMs) to human preferences in Machine Translation (MT), but current methods are hindered by two fundamental challenges: (1) flawed reward signals from Quality Estimation (QE) models that overlook critical errors like translation hallucination, and (2) inefficient data utilization that discards valuable learning signals by selecting only a single win-loss pair. To address these limitations, we introduce M^2PO: Multi-Pair, Multi-Perspective Preference Optimization. Our framework integrates a multi-perspective reward engine that creates a more robust signal by combining two key viewpoints: a new hallucination penalty for factuality, and an innovative dynamic quality score that adaptively fuses external evaluations with the model's own evolving judgment. This is synergistically paired with a multi-pair construction strategy that systematically creates a comprehensive set of preference pairs from the entire pool of translation candidates. This synergistic approach ensures the model learns from a richer spectrum of quality trade-offs, leading to more robust and faithful translations. On challenging WMT21-22 benchmarks, M^2PO substantially outperforms existing preference optimization methods and demonstrates highly competitive performance against leading proprietary LLMs.
摘要：直接偏好优化 (DPO) 是一种强大的范式，可将大型语言模型 (LLM) 与机器翻译 (MT) 中的人类偏好保持一致，但当前的方法受到两个基本挑战的阻碍：(1) 质量估计 (QE) 模型中存在缺陷的奖励信号，忽略了翻译幻觉等关键错误；(2) 数据利用效率低下，仅选择单个赢输结果，从而丢弃了有价值的学习信号一对。为了解决这些限制，我们引入了 M^2PO：多对、多视角偏好优化。我们的框架集成了多视角奖励引擎，通过结合两个关键观点来创建更强大的信号：对事实的新幻觉惩罚，以及创新的动态质量评分，自适应地将外部评估与模型自身不断发展的判断融合起来。这与多对构建策略协同配合，该策略从整个翻译候选池中系统地创建一组全面的偏好对。这种协同方法确保模型从更丰富的质量权衡中学习，从而产生更稳健和忠实的翻译。在具有挑战性的 WMT21-22 基准测试中，M^2PO 大大优于现有的偏好优化方法，并且与领先的专有法学硕士相比表现出极具竞争力的性能。
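The contrast between single-pair and multi-pair data construction described above can be sketched in a few lines. This is an illustrative assumption about how preference pairs might be enumerated from a scored candidate pool, not the M^2PO implementation; `build_pairs`, the `margin` parameter, and the toy scores are hypothetical.

```python
from itertools import combinations


def build_pairs(candidates, margin=0.0):
    """Return (chosen, rejected) for every pair whose score gap exceeds `margin`.

    candidates: list of (translation, reward_score) tuples.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(candidates, 2):
        if abs(score_a - score_b) <= margin:
            continue  # gap too small to count as a reliable preference
        if score_a > score_b:
            pairs.append((text_a, text_b))
        else:
            pairs.append((text_b, text_a))
    return pairs


# Toy pool of translation candidates with made-up reward scores.
candidates = [("t1", 0.9), ("t2", 0.6), ("t3", 0.55), ("t4", 0.2)]

# Single-pair baseline: only the best and the worst candidate are used.
best = max(candidates, key=lambda c: c[1])[0]
worst = min(candidates, key=lambda c: c[1])[0]
single_pair = (best, worst)

# Multi-pair construction: every sufficiently separated pair is kept.
multi_pairs = build_pairs(candidates, margin=0.1)
print(single_pair, len(multi_pairs))
```

From four candidates, the single-pair baseline yields one training pair, while the exhaustive construction yields five, skipping only the near-tie between `t2` and `t3`.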

Title: LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA

Authors: Tommaso Bonomo, Luca Gioffré, Roberto Navigli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13494
Pdf URL: https://arxiv.org/pdf/2510.13494
Copy Paste: [[2510.13494]] LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA(https://arxiv.org/abs/2510.13494)
Keywords: llm
Abstract: Question Answering (QA) on narrative text poses a unique challenge to current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to clarify how systems should be evaluated on LiteraryQA. This analysis reveals that all n-gram-based metrics have a low system-level correlation to human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at this https URL.
摘要：叙述性文本的问答 (QA) 对当前系统提出了独特的挑战，需要深入理解长而复杂的文档。然而，NarrativeQA（该领域使用最广泛的基准）的可靠性受到噪声文档和有缺陷的 QA 对的阻碍。在这项工作中，我们介绍了 LiteraryQA，这是专注于文学作品的 NarrativeQA 的高质量子集。使用经过人工和法学硕士验证的管道，我们可以识别并纠正低质量的 QA 样本，同时从源文档中删除无关的文本。然后，我们对自动指标进行元评估，以阐明如何在 LiteraryQA 上评估系统。该分析表明，所有基于 n-gram 的指标与人类判断的系统级相关性较低，而法学硕士作为法官的评估，即使使用小型开放权重模型，也可以与人类确定的排名强烈一致。最后，我们在 LiteraryQA 上对一组长上下文法学硕士进行了基准测试。我们在此 https URL 发布我们的代码和数据。
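System-level correlation, as used in the meta-evaluation above, asks whether a metric ranks whole systems in the same order as human judges do. A minimal sketch using Kendall's tau (tau-a, with no tie correction) follows. The choice of tau and all scores below are assumptions for illustration and are not results from the paper.

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a between two equal-length score lists."""
    n = len(xs)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


# One score per system, averaged over the test set (made-up numbers).
human = [4.5, 3.9, 3.1, 2.0]
ngram_metric = [0.21, 0.25, 0.19, 0.22]
llm_judge = [0.88, 0.80, 0.61, 0.45]

tau_ngram = kendall_tau(human, ngram_metric)
tau_judge = kendall_tau(human, llm_judge)
print(tau_ngram, tau_judge)
```

In this toy setup the n-gram metric's ranking is uncorrelated with the human ranking (tau of 0), while the judge-style scores reproduce it exactly (tau of 1).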

Title: ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding

Authors: Xiaozhe Li, TianYi Lyu, Siyi Yang, Yuxi Gong, Yizhao Yang, Jinxuan Huang, Ligao Zhang, Zhuoyi Huang, Qingwen Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13499
Pdf URL: https://arxiv.org/pdf/2510.13499
Copy Paste: [[2510.13499]] ConsintBench: Evaluating Language Models on Real-World Consumer Intent Understanding(https://arxiv.org/abs/2510.13499)
Keywords: language model, llm
Abstract: Understanding human intent is a complex, high-level task for large language models (LLMs), requiring analytical reasoning, contextual interpretation, dynamic information aggregation, and decision-making under uncertainty. Real-world public discussions, such as consumer product discussions, are rarely linear and rarely involve a single user. Instead, they are characterized by interwoven and often conflicting perspectives, divergent concerns, goals, emotional tendencies, as well as implicit assumptions and background knowledge about usage scenarios. To accurately understand such explicit public intent, an LLM must go beyond parsing individual sentences; it must integrate multi-source signals, reason over inconsistencies, and adapt to evolving discourse, similar to how experts in fields like politics, economics, or finance approach complex, uncertain environments. Despite the importance of this capability, no large-scale benchmark currently exists for evaluating LLMs on real-world human intent understanding, primarily due to the challenges of collecting real-world public discussion data and constructing a robust evaluation pipeline. To bridge this gap, we introduce ConsintBench, the first dynamic, live evaluation benchmark specifically designed for intent understanding, particularly in the consumer domain. ConsintBench is the largest and most diverse benchmark of its kind, supporting real-time updates while preventing data contamination through an automated curation pipeline.
摘要：对于大型语言模型 (LLM) 来说，理解人类意图是一项复杂的高级任务，需要分析推理、上下文解释、动态信息聚合以及不确定性下的决策。现实世界的公共讨论（例如消费品讨论）很少是线性的或涉及单个用户。相反，它们的特点是相互交织且经常相互冲突的观点、不同的关注点、目标、情绪倾向，以及有关使用场景的隐含假设和背景知识。为了准确理解这种明确的公共意图，法学硕士必须超越解析单个句子；它必须整合多源信号，对不一致进行推理，并适应不断变化的话语，类似于政治、经济或金融等领域的专家处理复杂、不确定环境的方式。尽管这种能力很重要，但目前还没有大规模的基准来评估法学硕士在现实世界人类意图理解方面的能力，这主要是由于收集现实世界公共讨论数据和构建强大的评估渠道的挑战。为了弥补这一差距，我们推出了 \bench，这是第一个动态的实时评估基准，专门为意图理解而设计，特别是在消费者领域。 \bench 是同类中最大、最多样化的基准，支持实时更新，同时通过自动管理管道防止数据污染。

Title: MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts

Authors: Shujun Xia, Haokun Lin, Yichen Wu, Yinan Zhou, Zixuan Li, Zhongwei Wan, Xingrun Xing, Yefeng Zheng, Xiang Li, Caifeng Shan, Zhenan Sun, Quanzheng Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13500
Pdf URL: https://arxiv.org/pdf/2510.13500
Copy Paste: [[2510.13500]] MedREK: Retrieval-Based Editing for Medical LLMs with Key-Aware Prompts(https://arxiv.org/abs/2510.13500)
Keywords: llm, prompt
Abstract: LLMs hold great promise for healthcare applications, but the rapid evolution of medical knowledge and errors in training data often cause them to generate outdated or inaccurate information, limiting their applicability in high-stakes clinical practice. Model editing has emerged as a potential remedy without full retraining. While parameter-based editing often compromises locality and is thus ill-suited for the medical domain, retrieval-based editing offers a more viable alternative. However, it still faces two critical challenges: (1) representation overlap within the medical knowledge space often causes inaccurate retrieval and reduces editing accuracy; (2) existing methods are restricted to single-sample edits, while batch-editing remains largely unexplored despite its importance for real-world medical applications. To address these challenges, we first construct MedVersa, an enhanced benchmark with broader coverage of medical subjects, designed to evaluate both single and batch edits under strict locality constraints. We then propose MedREK, a retrieval-based editing framework that integrates a shared query-key module for precise matching with an attention-based prompt encoder for informative guidance. Experimental results on various medical benchmarks demonstrate that our MedREK achieves superior performance across different core metrics and provides the first validated solution for batch-editing in medical LLMs. Our code and dataset are available at this https URL.
摘要：法学硕士在医疗保健应用方面前景广阔，但医学知识的快速发展和培训数据中的错误常常导致它们生成过时或不准确的信息，限制了它们在高风险临床实践中的适用性。模型编辑已成为一种无需全面再培训的潜在补救措施。虽然基于参数的编辑通常会损害局部性，因此不适合医学领域，但基于检索的编辑提供了更可行的替代方案。然而，它仍然面临两个关键挑战：（1）医学知识空间内的表示重叠常常导致检索不准确并降低编辑准确性； (2) 现有方法仅限于单样本编辑，而批量编辑尽管对于现实世界的医学应用很重要，但在很大程度上仍未得到探索。为了应对这些挑战，我们首先构建了 MedVersa，\hk{一个增强的基准，具有更广泛的医学主题覆盖范围，旨在在严格的局部性限制下评估单个和批量编辑}。然后，我们提出了 MedREK，这是一种基于检索的编辑框架，它集成了用于精确匹配的共享查询键模块和用于信息指导的基于注意力的提示编码器。各种医学基准的实验结果表明，我们的 MedREK 在不同的核心指标上实现了卓越的性能，并为医学法学硕士的批量编辑提供了第一个经过验证的解决方案。我们的代码和数据集可在此 https URL 获取。

Title: Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

Authors: Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13554
Pdf URL: https://arxiv.org/pdf/2510.13554
Copy Paste: [[2510.13554]] Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization(https://arxiv.org/abs/2510.13554)
Keywords: language model, llm
Abstract: The reasoning pattern of Large language models (LLMs) remains opaque, and Reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish attention heads between locally and globally focused information processing and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable structure-aware process, hoping to offer a potential step toward more transparent and effective optimization of LLM reasoning.
摘要：大型语言模型 (LLM) 的推理模式仍然不透明，强化学习 (RL) 通常在整个一代中应用统一的学分，模糊了关键步骤和常规步骤之间的区别。这项工作将注意力定位为一个特殊的基础，使法学硕士的内部逻辑变得清晰易读，不仅作为计算的副产品，而且作为推理本身的机械蓝图。我们首先区分局部和全局聚焦信息处理之间的注意力头，并揭示局部聚焦头在对角线附近产生锯齿图案，指示短语块，而全局聚焦头暴露对未来令牌施加广泛下游影响的令牌。我们用两个指标来形式化这些指标：1）窗口平均注意力距离，它衡量剪辑窗口内向后注意力的程度； 2) 未来注意力影响力，将代币的全局重要性量化为它从后续代币收到的平均关注度。总而言之，这些信号揭示了一种重复出现的预计划和锚定机制，其中模型首先执行远程上下文引用以生成介绍性标记，该标记紧随其后或与组织后续推理的语义锚定标记一致。利用这些见解，我们引入了三种新颖的 RL 策略，这些策略动态地对关键节点（预计划令牌、锚令牌及其时间耦合）执行有针对性的信用分配，并在各种推理任务中显示出一致的性能增益。通过使优化与模型的内在推理节奏保持一致，我们的目标是将不透明的优化转变为可操作的结构感知过程，希望为 LLM 推理的更透明和有效的优化提供潜在的一步。
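The two attention metrics named in the abstract can be sketched directly from their one-line descriptions. The code below is one plausible reading, not the authors' definition: it assumes a causal attention matrix where `attn[i][j]` is the weight query token `i` places on token `j`, clips the backward distance at `window` for Windowed Average Attention Distance, and averages the column entries from later tokens for Future Attention Influence. The function names and the toy matrix are hypothetical.

```python
def windowed_avg_attention_distance(attn, window):
    """Per query token: attention-weighted backward distance, clipped at `window`."""
    return [
        sum(row[j] * min(i - j, window) for j in range(i + 1))
        for i, row in enumerate(attn)
    ]


def future_attention_influence(attn):
    """Per token: mean attention it receives from all subsequent tokens."""
    n = len(attn)
    result = []
    for j in range(n):
        later = [attn[i][j] for i in range(j + 1, n)]
        result.append(sum(later) / len(later) if later else 0.0)
    return result


# Toy causal attention matrix for four tokens (each row sums to 1).
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.5, 0.0],
    [0.5, 0.0, 0.0, 0.5],
]

waad = windowed_avg_attention_distance(attn, window=2)
fai = future_attention_influence(attn)
print(waad, fai)
```

In this toy matrix the last token looks far back, so it has the largest clipped distance, and token 0 receives the most attention from later tokens, so it has the highest influence score. Under the abstract's framing, the first pattern would mark a candidate preplan token and the second a candidate anchor token.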

Title: Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models

Authors: Daniil Gurgurov, Josef van Genabith, Simon Ostermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13580
Pdf URL: https://arxiv.org/pdf/2510.13580
Copy Paste: [[2510.13580]] Sparse Subnetwork Enhancement for Underrepresented Languages in Large Language Models(https://arxiv.org/abs/2510.13580)
Keywords: language model, llm
Abstract: Large language models exhibit uneven performance across languages, with substantial gaps between high- and low-resource languages. We present a framework for enhancing monolingual capabilities of LLMs in underrepresented languages while preserving their general-purpose performance through targeted fine-tuning of language-specific subnetworks. Our approach identifies language-specific neurons using Language Activation Probability Entropy and fine-tunes only the weights associated with these neurons, a dedicated subnetwork, on target-language data. Experiments on Llama-3.1-8B and Mistral-Nemo-12B across 12 mid- and low-resource languages demonstrate that our method consistently outperforms full fine-tuning, FFN-only fine-tuning, LoRA adaptation, and random subset fine-tuning baselines while efficiently updating only up to 1% of model parameters. Beyond performance improvements, we observe enhanced favorable training dynamics, cross-lingual representational alignment, and systematic weight update changes. To facilitate future research, we release language-specific neuron identifications for over 100 languages as well as our adaptation pipeline, offering a cost-effective pathway for adapting state-of-the-art models to underrepresented languages.
摘要：大型语言模型在不同语言之间表现出不均匀的性能，高资源语言和低资源语言之间存在巨大差距。我们提出了一个框架，用于增强法学硕士在代表性不足的语言中的单语能力，同时通过对特定语言子网络进行有针对性的微调来保持其通用性能。我们的方法使用语言激活概率熵来识别特定于语言的神经元，并仅在目标语言数据上微调与这些神经元（专用子网络）相关的权重。在 12 种中低资源语言的 Llama-3.1-8B 和 Mistral-Nemo-12B 上进行的实验表明，我们的方法始终优于完全微调、仅 FFN 微调、LoRA 自适应和随机子集微调基线，同时仅有效更新最多 1% 的模型参数。除了性能改进之外，我们还观察到有利的训练动态、跨语言表征对齐和系统权重更新变化的增强。为了促进未来的研究，我们发布了 100 多种语言的特定于语言的神经元识别以及我们的适应管道，为将最先进的模型适应代表性不足的语言提供了一条经济高效的途径。
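The neuron-selection step can be illustrated with a simplified version of Language Activation Probability Entropy. The sketch assumes the usual formulation: each neuron's activation probability is measured per language, normalised into a distribution across languages, and scored by its entropy, with low entropy marking a language-specific neuron. The threshold, neuron names, and probabilities are made up, and any additional filtering the paper applies is omitted.

```python
import math


def lape(activation_probs):
    """Entropy of one neuron's activation probabilities, normalised across languages."""
    total = sum(activation_probs)
    if total == 0:
        return float("inf")  # never-active neuron: treated as not language-specific
    dist = [p / total for p in activation_probs]
    return -sum(p * math.log(p) for p in dist if p > 0)


def language_specific_neurons(prob_table, max_entropy):
    """Names of neurons whose entropy is at or below `max_entropy`."""
    return [name for name, probs in prob_table.items() if lape(probs) <= max_entropy]


# Toy activation probabilities for three neurons over three languages.
prob_table = {
    "n0": [0.30, 0.30, 0.30],  # fires equally everywhere: language-agnostic
    "n1": [0.90, 0.01, 0.01],  # fires almost only for the first language
    "n2": [0.00, 0.00, 0.40],  # fires only for the third language
}

selected = language_specific_neurons(prob_table, max_entropy=0.5)
print(selected)
```

Only the weights attached to the selected neurons would then be fine-tuned on target-language data, which is how the approach keeps the updated share of parameters small.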

Title: Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs

Authors: Pasin Buakhaw, Kun Kerdthaisong, Phuree Phenhiran, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13586
Pdf URL: https://arxiv.org/pdf/2510.13586
Copy Paste: [[2510.13586]] Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs(https://arxiv.org/abs/2510.13586)
Keywords: language model, llm, prompt, agent
Abstract: The emergence of large language models (LLMs) has opened new opportunities for creating dynamic non-player characters (NPCs) in gaming environments, enabling both functional task execution and persona-consistent dialogue generation. In this paper, we (Tu_Character_lab) report our participation in the Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2, which evaluates agents across three tracks: task-oriented dialogue, context-aware dialogue, and their integration. Our approach combines two complementary strategies: (i) lightweight prompting techniques in the API track, including a Deflanderization prompting method to suppress excessive role-play and improve task fidelity, and (ii) fine-tuned large models in the GPU track, leveraging Qwen3-14B with supervised fine-tuning (SFT) and Low-Rank Adaptation (LoRA). Our best submissions ranked 2nd on Task 1, 2nd on Task 3 (API track), and 4th on Task 3 (GPU track).
摘要：大语言模型（LLM）的出现为在游戏环境中创建动态非玩家角色（NPC）提供了新的机会，从而实现功能性任务执行和角色一致的对话生成。在本文中，我们（Tu_Character_lab）报告了我们参与 Commonsense Persona-Grounded Dialogue Challenge (CPDC) 2025 Round 2 的情况，该挑战赛从三个方面评估智能体：面向任务的对话、情境感知对话及其集成。我们的方法结合了两种互补策略：(i) API 轨道中的轻量级提示技术，包括 Deflanderization 提示方法，以抑制过多的角色扮演并提高任务保真度，以及 (ii) GPU 轨道中的微调大型模型，利用具有监督微调 (SFT) 和低秩适应 (LoRA) 的 Qwen3-14B。我们的最佳提交在任务 1 上排名第二，在任务 3（API 轨道）上排名第二，在任务 3（GPU 轨道）上排名第四。

Title: FreshTab: Sourcing Fresh Data for Table-to-Text Generation Evaluation

Authors: Kristýna Onderková, Ondřej Plátek, Zdeněk Kasner, Ondřej Dušek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13598
Pdf URL: https://arxiv.org/pdf/2510.13598
Copy Paste: [[2510.13598]] FreshTab: Sourcing Fresh Data for Table-to-Text Generation Evaluation(https://arxiv.org/abs/2510.13598)
Keywords: language model, llm
Abstract: Table-to-text generation (insight generation from tables) is a challenging task that requires precision in analyzing the data. In addition, the evaluation of existing benchmarks is affected by contamination of Large Language Model (LLM) training data as well as domain imbalance. We introduce FreshTab, an on-the-fly table-to-text benchmark generation from Wikipedia, to combat the LLM data contamination problem and enable domain-sensitive evaluation. While non-English table-to-text datasets are limited, FreshTab collects datasets in different languages on demand (we experiment with German, Russian and French in addition to English). We find that insights generated by LLMs from recent tables collected by our method appear clearly worse by automatic metrics, but this does not translate into LLM and human evaluations. Domain effects are visible in all evaluations, showing that a domain-balanced benchmark is more challenging.
摘要：表到文本的生成（从表生成洞察）是一项具有挑战性的任务，需要精确分析数据。此外，现有基准的评估受到大型语言模型（LLM）训练数据的污染以及领域不平衡的影响。我们引入了 FreshTab，这是来自维基百科的即时表到文本基准生成，以解决 LLM 数据污染问题并实现领域敏感评估。虽然非英语的表到文本数据集有限，但 FreshTab 可以按需收集不同语言的数据集（除了英语之外，我们还尝试了德语、俄语和法语）。我们发现，通过自动指标，法学硕士从我们的方法收集的最新表格中生成的见解似乎明显更差，但这并不能转化为法学硕士和人工评估。域效应在所有评估中都是可见的，这表明域平衡基准更具挑战性。

Title: NOSA: Native and Offloadable Sparse Attention

Authors: Yuxiang Huang, Chaojun Xiao, Xu Han, Zhiyuan Liu
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13602
Pdf URL: https://arxiv.org/pdf/2510.13602
Copy Paste: [[2510.13602]] NOSA: Native and Offloadable Sparse Attention(https://arxiv.org/abs/2510.13602)
Keywords: llm
Abstract: Trainable sparse attention has emerged as a promising solution to address the decoding efficiency bottleneck of LLMs in long-context processing, significantly saving memory accesses while minimally impacting task performance. However, existing sparse attention methods leave a crucial limitation unresolved: the size of the key-value (KV) cache remains unreduced, which constrains on-GPU batch sizes and throttles decoding throughput, especially in large-scale batched inference. In this paper, we show that trainable sparse attention naturally exhibits strong locality in token selection across adjacent decoding steps, thereby enabling KV cache offloading without altering the underlying attention computation. However, the inherent locality remains insufficient to achieve efficient offloading, as the transfer of selected KV pairs between the CPU and GPU continues to dominate the overall decoding cost. Building on this insight, we present NOSA, a trainable sparse attention framework designed to natively support KV cache offloading. NOSA introduces explicit locality constraints by decomposing token selection into query-aware and query-agnostic components, thereby reducing KV transfers while preserving the same attention computation as used during training. We pretrain a 1B-parameter model with NOSA and conduct extensive benchmarks, showing that it preserves near-lossless performance while achieving up to a 2.3x improvement in decoding throughput compared with the vanilla trainable sparse attention baseline (InfLLM-V2).
摘要：可训练的稀疏注意力已成为解决 LLM 在长上下文处理中解码效率瓶颈的有前景的解决方案，可显着节省内存访问，同时将对任务性能的影响降至最低。然而，现有的稀疏注意力方法存在一个尚未解决的关键限制：键值（KV）缓存的大小仍然没有减少，这限制了 GPU 上的批量大小并限制了解码吞吐量，特别是在大规模批量推理中。在本文中，我们表明可训练的稀疏注意力自然地在相邻解码步骤的标记选择中表现出很强的局部性，从而在不改变底层注意力计算的情况下实现 KV 缓存卸载。然而，固有的局部性仍然不足以实现有效的卸载，因为 CPU 和 GPU 之间选定的 KV 对的传输仍然主导着整体解码成本。基于这一见解，我们提出了 NOSA，一个可训练的稀疏注意力框架，旨在原生支持 KV 缓存卸载。 NOSA 通过将标记选择分解为查询感知和查询不可知的组件来引入显式局部性约束，从而减少 KV 传输，同时保留与训练期间使用的相同的注意计算。我们使用 NOSA 预训练 1B 参数模型并进行广泛的基准测试，结果表明，与普通的可训练稀疏注意力基线 (InfLLM-V2) 相比，它保留了近乎无损的性能，同时解码吞吐量提高了 2.3 倍。
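
To make the locality argument in this abstract concrete, the sketch below counts how many KV blocks would have to be fetched from CPU memory at each decoding step. It assumes that the blocks selected at the previous step are still resident on the GPU. This is only an illustration of why overlapping selections reduce transfers, not NOSA's actual selection or caching policy; the function name and block ids are hypothetical.

```python
def kv_transfers(selections):
    """selections: one set of selected KV block ids per decoding step.
    Assumes the previous step's blocks stay on the GPU, so only newly
    selected blocks need a CPU-to-GPU transfer."""
    resident = set()
    fetched = []
    for chosen in selections:
        fetched.append(len(chosen - resident))  # blocks not already on GPU
        resident = set(chosen)                  # keep only the current selection
    return fetched

# Adjacent steps share most of their selected blocks (strong locality).
steps = [{0, 3, 7, 9}, {0, 3, 7, 10}, {0, 3, 8, 10}]
transfers = kv_transfers(steps)  # [4, 1, 1]
```

After the first step, each later step fetches only the one block that changed, which is the effect that explicit locality constraints are meant to strengthen.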

Title: MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning

Authors: Xingyu Tan, Xiaoyang Wang, Xiwei Xu, Xin Yuan, Liming Zhu, Wenjie Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13614
Pdf URL: https://arxiv.org/pdf/2510.13614
Copy Paste: [[2510.13614]] MemoTime: Memory-Augmented Temporal Knowledge Graph Enhanced Large Language Model Reasoning(https://arxiv.org/abs/2510.13614)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs) have achieved impressive reasoning abilities, but struggle with temporal understanding, especially when questions involve multiple entities, compound operators, and evolving event sequences. Temporal Knowledge Graphs (TKGs), which capture vast amounts of temporal facts in a structured format, offer a reliable source for temporal reasoning. However, existing TKG-based LLM reasoning methods still struggle with four major challenges: maintaining temporal faithfulness in multi-hop reasoning, achieving multi-entity temporal synchronization, adapting retrieval to diverse temporal operators, and reusing prior reasoning experience for stability and efficiency. To address these issues, we propose MemoTime, a memory-augmented temporal knowledge graph framework that enhances LLM reasoning through structured grounding, recursive reasoning, and continual experience learning. MemoTime decomposes complex temporal questions into a hierarchical Tree of Time, enabling operator-aware reasoning that enforces monotonic timestamps and co-constrains multiple entities under unified temporal bounds. A dynamic evidence retrieval layer adaptively selects operator-specific retrieval strategies, while a self-evolving experience memory stores verified reasoning traces, toolkit decisions, and sub-question embeddings for cross-type reuse. Comprehensive experiments on multiple temporal QA benchmarks show that MemoTime achieves overall state-of-the-art results, outperforming the strong baseline by up to 24.0%. Furthermore, MemoTime enables smaller models (e.g., Qwen3-4B) to achieve reasoning performance comparable to that of GPT-4-Turbo.
摘要：大型语言模型 (LLM) 已经实现了令人印象深刻的推理能力，但在时间理解方面遇到了困难，特别是当问题涉及多个实体、复合运算符和不断演变的事件序列时。时态知识图 (TKG) 以结构化格式捕获大量时态事实，为时态推理提供可靠的来源。然而，现有的基于TKG的LLM推理方法仍然面临四个主要挑战：在多跳推理中保持时间忠实性、实现多实体时间同步、适应不同时间算子的检索以及重用先前的推理经验以实现稳定性和效率。为了解决这些问题，我们提出了 MemoTime，这是一种记忆增强的时间知识图框架，通过结构化基础、递归推理和持续经验学习来增强 LLM 推理。 MemoTime 将复杂的时间问题分解为分层时间树，从而实现操作员感知推理，强制执行单调时间戳并在统一的时间范围内共同约束多个实体。动态证据检索层自适应地选择特定于操作员的检索策略，而自我进化的经验记忆则存储经过验证的推理轨迹、工具包决策和子问题嵌入，以供跨类型重用。对多个时间 QA 基准的综合实验表明，MemoTime 实现了整体最先进的结果，比强基线高出高达 24.0%。此外，MemoTime 使较小的模型（例如 Qwen3-4B）能够实现与 GPT-4-Turbo 相当的推理性能。

Title: Unlocking Public Catalogues: Instruction-Tuning LLMs for ICD Coding of German Tumor Diagnoses

Authors: Stefan Lenz, Lakisha Ortiz Rosario, Georg Vollmar, Arsenij Ustjanzew, Fatma Alickovic, Thomas Kindler, Torsten Panholzer
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13624
Pdf URL: https://arxiv.org/pdf/2510.13624
Copy Paste: [[2510.13624]] Unlocking Public Catalogues: Instruction-Tuning LLMs for ICD Coding of German Tumor Diagnoses(https://arxiv.org/abs/2510.13624)
Keywords: llm
Abstract: Accurate coding of tumor diagnoses with ICD-10-GM and ICD-O-3 is essential for structured cancer documentation in Germany. Smaller open-weight LLMs are appealing for privacy-preserving automation but often struggle with coding accuracy in German-language contexts. This study investigates whether instruction-based fine-tuning on public datasets improves the coding accuracy of open-weight LLMs for German tumor diagnosis texts. The evaluation uses coded diagnoses from the local tumor documentation system as test data. In a systematic data quality assessment, the upper limit for ICD-10 coding performance was estimated at 60-79% for exact and 81-94% for partial (three-character codes only) derivation. As training data, over 500,000 question-answer pairs were created based on the ICD-10-GM, ICD-O-3, and OPS catalogues. Eight open-weight models from the Qwen, Llama, and Mistral families (7-70 B parameters) were fine-tuned. ICD-10-GM accuracy rose from 1.4-24% to 41-58%, and partial accuracy from 31-74% to 73-83%. The accuracy of ICD-O-3 topography coding also improved but started and remained considerably lower with an exact accuracy of 22-40% and a partial accuracy of 56-67% after fine-tuning. Malformed code outputs dropped to 0% for all models. Tumor-diagnosis recognition reached 99%. Accuracy correlated positively with model size, but gaps between small and large models narrowed after fine-tuning. The reasoning mode in Qwen3 generally yielded a lower performance than fine-tuning and was over 100 times slower. Our findings highlight the potential of leveraging public catalogues to build instruction datasets that improve LLMs in medical documentation tasks. The complete training dataset and the best-performing checkpoints of the fine-tuned models are available from this https URL.
摘要：使用 ICD-10-GM 和 ICD-O-3 准确编码肿瘤诊断对于德国的结构化癌症记录至关重要。较小的开放式法学硕士对保护隐私的自动化很有吸引力，但在德语环境中常常难以保证编码的准确性。本研究调查了对公共数据集进行基于指令的微调是否可以提高德国肿瘤诊断文本的开放权重法学硕士的编码准确性。该评估使用本地肿瘤文档系统的编码诊断作为测试数据。在系统数据质量评估中，ICD-10 编码性能的上限估计为精确推导的 60-79%，部分推导（仅限三字符代码）的上限为 81-94%。作为训练数据，基于 ICD-10-GM、ICD-O-3 和 OPS 目录创建了超过 500,000 个问答对。对来自 Qwen、Llama 和 Mistral 系列的八个开放重量模型（7-70 B 参数）进行了微调。 ICD-10-GM 准确率从 1.4-24% 上升到 41-58%，部分准确率从 31-74% 上升到 73-83%。 ICD-O-3 地形编码的准确性也有所提高，但在微调后开始并保持相当低的准确度，准确度为 22-40%，部分准确度为 56-67%。所有模型的格式错误代码输出均降至 0%。肿瘤诊断识别率达到99%。准确性与模型大小呈正相关，但小模型和大模型之间的差距在微调后缩小。 Qwen3 中的推理模式通常产生的性能低于微调，并且速度慢 100 倍以上。我们的研究结果强调了利用公共目录构建指导数据集的潜力，以改善法学硕士在医学文档任务中的表现。可以从此 https URL 获取完整的训练数据集和微调模型的最佳性能检查点。
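
The two accuracy figures reported in this abstract can be read as follows: exact accuracy requires the full code to match, while partial accuracy compares only the three-character category. The sketch below assumes that "partial" means a match on the first three characters of the code string; the function name and the sample predictions are illustrative and do not come from the study's data.

```python
def icd_accuracy(predicted, gold):
    """Return (exact, partial) accuracy for two equal-length lists of ICD-10 codes.
    Partial accuracy compares only the three-character category, e.g. 'C34'."""
    if not gold or len(predicted) != len(gold):
        raise ValueError("need two non-empty lists of equal length")
    n = len(gold)
    exact = sum(p == g for p, g in zip(predicted, gold))
    partial = sum(p[:3] == g[:3] for p, g in zip(predicted, gold))
    return exact / n, partial / n

# Illustrative codes only.
gold = ["C50.4", "C34.1", "C18.7", "C61"]
pred = ["C50.4", "C34.9", "C20", "C61"]
exact, partial = icd_accuracy(pred, gold)  # (0.5, 0.75)
```

Here `C34.9` counts as a partial hit for `C34.1` because both fall under the category `C34`, whereas `C20` misses `C18.7` on both measures.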

Title: Closing the Gap Between Text and Speech Understanding in LLMs

Authors: Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly, Zakaria Aldeneh
Subjects: cs.CL, cs.AI, eess.AS
Abstract URL: https://arxiv.org/abs/2510.13632
Pdf URL: https://arxiv.org/pdf/2510.13632
Copy Paste: [[2510.13632]] Closing the Gap Between Text and Speech Understanding in LLMs(https://arxiv.org/abs/2510.13632)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation--which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
摘要：大型语言模型 (LLM) 可以进行调整，将其文本功能扩展到语音输入。然而，这些适应语音的法学硕士在语言理解任务上始终表现不佳，甚至低于基于文本的法学硕士，甚至级联管道。我们将这种缺陷称为文本-语音理解差距：相对于基于原始文本的 LLM 处理等效文本，当适应语音的 LLM 处理语音输入时观察到的性能下降。最近缩小这一差距的方法要么依赖于文本语料库的大规模语音合成，这种合成成本高昂并且严重依赖于合成数据，要么依赖于不可再现的大规模专有语音数据集。因此，仍然需要更高效的数据替代方案来缩小文本-语音理解差距。在这项工作中，我们分析了由两个因素驱动的差距：（i）适应过程中忘记文本功能，以及（ii）语音和文本之间的跨模式错位。基于此分析，我们引入了 SALAD（通过主动选择和跨模态蒸馏进行学习的样本高效对齐），它将跨模态蒸馏与目标合成数据相结合，以改善对齐，同时减少遗忘。应用于 3B 和 7B LLM 时，SALAD 通过跨知识、语言理解和推理等广泛领域基准的强大开放权重模型实现了具有竞争力的表现，同时使用公共语料库中少一个数量级的语音数据进行训练。

Title: How Sampling Affects the Detectability of Machine-written texts: A Comprehensive Study

Authors: Matthieu Dubois, François Yvon, Pablo Piantanida
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13681
Pdf URL: https://arxiv.org/pdf/2510.13681
Copy Paste: [[2510.13681]] How Sampling Affects the Detectability of Machine-written texts: A Comprehensive Study(https://arxiv.org/abs/2510.13681)
Keywords: language model, llm
Abstract: As texts generated by Large Language Models (LLMs) are ever more common and often indistinguishable from human-written content, research on automatic text detection has attracted growing attention. Many recent detectors report near-perfect accuracy, often boasting AUROC scores above 99%. However, these claims typically assume fixed generation settings, leaving open the question of how robust such systems are to changes in decoding strategies. In this work, we systematically examine how sampling-based decoding impacts detectability, with a focus on how subtle variations in a model's (sub)word-level distribution affect detection performance. We find that even minor adjustments to decoding parameters - such as temperature, top-p, or nucleus sampling - can severely impair detector accuracy, with AUROC dropping from near-perfect levels to 1% in some settings. Our findings expose critical blind spots in current detection methods and emphasize the need for more comprehensive evaluation protocols. To facilitate future research, we release a large-scale dataset encompassing 37 decoding configurations, along with our code and evaluation framework this https URL
摘要：随着大型语言模型 (LLM) 生成的文本变得越来越常见，并且通常与人类编写的内容难以区分，自动文本检测的研究引起了越来越多的关注。许多最新的探测器都报告了近乎完美的准确度，通常夸耀 AUROC 分数高于 99%。然而，这些说法通常假设固定的生成设置，从而留下了这样的系统对解码策略的变化有多鲁棒的问题。在这项工作中，我们系统地研究基于采样的解码如何影响可检测性，重点关注模型（子）字级分布的细微变化如何影响检测性能。我们发现，即使对解码参数（例如温度、top-p 或核采样）进行微小调整，也会严重损害探测器的精度，在某些设置中，AUROC 从近乎完美的水平下降到 1%。我们的研究结果暴露了当前检测方法中的关键盲点，并强调需要更全面的评估方案。为了促进未来的研究，我们发布了一个包含 37 种解码配置的大规模数据集，以及我们的代码和评估框架（https URL）
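
The decoding parameters named in this abstract reshape the token distribution a model samples from. The sketch below shows the standard definitions of temperature scaling and top-p (nucleus) truncation on a toy logit vector. It is a generic illustration in plain Python, not the authors' released code, and the function names and logits are hypothetical.

```python
import math

def apply_temperature(logits, temperature):
    """Softmax over logits / temperature (numerically stabilized)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches top_p, then renormalize; other tokens get 0."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

logits = [2.0, 1.0, 0.0, -1.0]              # toy vocabulary of four tokens
sharp = apply_temperature(logits, 0.5)      # low temperature: peaked distribution
flat = apply_temperature(logits, 2.0)       # high temperature: flatter distribution
nucleus = top_p_filter(apply_temperature(logits, 1.0), 0.9)
```

Small changes to `temperature` or `top_p` shift probability mass between likely and unlikely tokens, which is the kind of (sub)word-level variation the study examines.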

Title: GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians

Authors: Xiuyuan Chen, Tao Sun, Dexin Su, Ailing Yu, Junwei Liu, Zhe Chen, Gangzeng Jin, Xin Wang, Jingnan Liu, Hansong Xiao, Hualei Zhou, Dongjie Tao, Chunxiao Guo, Minghui Yang, Yuan Xia, Jing Zhao, Qianrui Fan, Yanyun Wang, Shuai Zhen, Kezhong Chen, Jun Wang, Zewen Sun, Heng Zhao, Tian Guan, Shaodong Wang, Geyun Chang, Jiaming Deng, Hongchengcheng Chen, Kexin Feng, Ruzhen Li, Jiayi Geng, Changtai Zhao, Jun Wang, Guihu Lin, Peihao Li, Liqi Liu, Peng Wei, Jian Wang, Jinjie Gu, Ping Wang, Fan Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13734
Pdf URL: https://arxiv.org/pdf/2510.13734
Copy Paste: [[2510.13734]] GAPS: A Clinically Grounded, Automated Benchmark for Evaluating AI Clinicians(https://arxiv.org/abs/2510.13734)
Keywords: language model, llm, agent
Abstract: Current benchmarks for AI clinician systems, often based on multiple-choice exams or manual rubrics, fail to capture the depth, robustness, and safety required for real-world clinical practice. To address this, we introduce the GAPS framework, a multidimensional paradigm for evaluating Grounding (cognitive depth), Adequacy (answer completeness), Perturbation (robustness), and Safety. Critically, we developed a fully automated, guideline-anchored pipeline to construct a GAPS-aligned benchmark end-to-end, overcoming the scalability and subjectivity limitations of prior work. Our pipeline assembles an evidence neighborhood, creates dual graph and tree representations, and automatically generates questions across G-levels. Rubrics are synthesized by a DeepResearch agent that mimics GRADE-consistent, PICO-driven evidence review in a ReAct loop. Scoring is performed by an ensemble of large language model (LLM) judges. Validation confirmed our automated questions are high-quality and align with clinician judgment. Evaluating state-of-the-art models on the benchmark revealed key failure modes: performance degrades sharply with increased reasoning depth (G-axis), models struggle with answer completeness (A-axis), and they are highly vulnerable to adversarial perturbations (P-axis) as well as certain safety issues (S-axis). This automated, clinically-grounded approach provides a reproducible and scalable method for rigorously evaluating AI clinician systems and guiding their development toward safer, more reliable clinical practice.
摘要：当前人工智能临床医生系统的基准通常基于多项选择题考试或手动评分标准，无法捕捉现实临床实践所需的深度、稳健性和安全性。为了解决这个问题，我们引入了 GAPS 框架，这是一种用于评估 Grounding（认知深度）、Adequacy（答案完整性）、Perturbation（鲁棒性）和 Safety 的多维范式。至关重要的是，我们开发了一个完全自动化、以指南为基础的管道，以构建符合 GAPS 的端到端基准测试，克服了先前工作的可扩展性和主观性限制。我们的管道组装了一个证据邻域，创建对偶图和树表示，并自动生成跨 G 级别的问题。评分标准由 DeepResearch 代理合成，该代理在 ReAct 循环中模拟 GRADE 一致、PICO 驱动的证据审查。评分由大型语言模型 (LLM) 评委团队执行。验证证实我们的自动问题是高质量的并且符合临床医生的判断。在基准上评估最先进的模型揭示了关键的故障模式：随着推理深度（G 轴）的增加，性能急剧下降，模型难以回答完整性（A 轴），并且它们非常容易受到对抗性扰动（P 轴）以及某些安全问题（S 轴）的影响。这种基于临床的自动化方法提供了一种可重复且可扩展的方法，用于严格评估人工智能临床医生系统并指导其向更安全、更可靠的临床实践发展。

Title: Assessing Web Search Credibility and Response Groundedness in Chat Assistants

Authors: Ivan Vykopal, Matúš Pikuliak, Simon Ostermann, Marián Šimko
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13749
Pdf URL: https://arxiv.org/pdf/2510.13749
Copy Paste: [[2510.13749]] Assessing Web Search Credibility and Response Groundedness in Chat Assistants(https://arxiv.org/abs/2510.13749)
Keywords: gpt, chat
Abstract: Chat assistants increasingly integrate web search functionality, enabling them to retrieve and cite external sources. While this promises more reliable answers, it also raises the risk of amplifying misinformation from low-credibility sources. In this paper, we introduce a novel methodology for evaluating assistants' web search behavior, focusing on source credibility and the groundedness of responses with respect to cited sources. Using 100 claims across five misinformation-prone topics, we assess GPT-4o, GPT-5, Perplexity, and Qwen Chat. Our findings reveal differences between the assistants, with Perplexity achieving the highest source credibility, whereas GPT-4o exhibits elevated citation of non-credible sources on sensitive topics. This work provides the first systematic comparison of commonly used chat assistants for fact-checking behavior, offering a foundation for evaluating AI systems in high-stakes information environments.
摘要：聊天助理越来越多地集成网络搜索功能，使他们能够检索和引用外部资源。虽然这有望提供更可靠的答案，但也增加了来自低可信度来源的错误信息被放大的风险。在本文中，我们介绍了一种评估助理网络搜索行为的新颖方法，重点关注来源可信度以及对引用来源的回应的基础性。我们使用 5 个容易产生错误信息的主题的 100 个声明来评估 GPT-4o、GPT-5、Perplexity 和 Qwen Chat。我们的研究结果揭示了助手之间的差异，Perplexity 实现了最高的来源可信度，而 GPT-4o 在敏感主题上对不可信来源的引用有所增加。这项工作首次对常用聊天助手的事实检查行为进行了系统比较，为在高风险信息环境中评估人工智能系统奠定了基础。

Title: Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation

Authors: Zhiqi Huang, Vivek Datla, Chenyang Zhu, Alfy Samuel, Daben Liu, Anoop Kumar, Ritesh Soni
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13750
Pdf URL: https://arxiv.org/pdf/2510.13750
Copy Paste: [[2510.13750]] Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation(https://arxiv.org/abs/2510.13750)
Keywords: language model, llm, retrieval-augmented generation
Abstract: We propose a method for confidence estimation in retrieval-augmented generation (RAG) systems that aligns closely with the correctness of large language model (LLM) outputs. Confidence estimation is especially critical in high-stakes domains such as finance and healthcare, where the cost of an incorrect answer outweighs that of not answering the question. Our approach extends prior uncertainty quantification methods by leveraging raw feed-forward network (FFN) activations as auto-regressive signals, avoiding the information loss inherent in token logits and probabilities after projection and softmax normalization. We model confidence prediction as a sequence classification task, and regularize training with a Huber loss term to improve robustness against noisy supervision. Applied in a real-world financial industry customer-support setting with complex knowledge bases, our method outperforms strong baselines and maintains high accuracy under strict latency constraints. Experiments on the Llama 3.1 8B model show that using activations from only the 16th layer preserves accuracy while reducing response latency. Our results demonstrate that activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment.
摘要：我们提出了一种在检索增强生成（RAG）系统中进行置信度估计的方法，该方法与大语言模型（LLM）输出的正确性密切相关。置信度估计在金融和医疗保健等高风险领域尤其重要，在这些领域，错误答案的成本超过了不回答问题的成本。我们的方法通过利用原始前馈网络（FFN）激活作为自回归信号来扩展先前的不确定性量化方法，避免了投影和 softmax 归一化后 token logits 和概率中固有的信息丢失。我们将置信度预测建模为序列分类任务，并使用 Huber 损失项对训练进行正则化，以提高针对噪声监督的鲁棒性。我们的方法应用于具有复杂知识库的现实金融行业客户支持环境中，其性能优于强大的基线，并在严格的延迟限制下保持高精度。 Llama 3.1 8B 模型上的实验表明，仅使用第 16 层的激活可以保持准确性，同时减少响应延迟。我们的结果表明，基于激活的置信度建模为实现值得信赖的 RAG 部署提供了一条可扩展的、架构感知的路径。
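
The Huber loss mentioned in this abstract is quadratic for small errors and linear for large ones, which is why it is less sensitive to noisy labels than a squared error. The sketch below gives the standard scalar definition. How the authors weight or combine this term with their classification objective is not stated in the abstract, so the function name and the default `delta` are assumptions for illustration.

```python
def huber_loss(pred, target, delta=1.0):
    """Standard Huber loss for a single prediction.
    Quadratic when |pred - target| <= delta, linear beyond it."""
    residual = abs(pred - target)
    if residual <= delta:
        return 0.5 * residual * residual
    return delta * (residual - 0.5 * delta)

small_error = huber_loss(0.9, 1.0)   # quadratic regime
large_error = huber_loss(3.0, 0.0)   # linear regime: grows slower than squared error
```

With `delta=1.0`, a residual of 3.0 costs 2.5 instead of the 4.5 that half the squared error would give, so a single mislabeled example pulls the training signal less.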

Title: The Mechanistic Emergence of Symbol Grounding in Language Models

Authors: Shuyu Wu, Ziqiao Ma, Xiaoxi Luo, Yidong Huang, Josue Torres-Fonseca, Freda Shi, Joyce Chai
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.13796
Pdf URL: https://arxiv.org/pdf/2510.13796
Copy Paste: [[2510.13796]] The Mechanistic Emergence of Symbol Grounding in Language Models(https://arxiv.org/abs/2510.13796)
Keywords: language model
Abstract: Symbol grounding (Harnad, 1990) describes how symbols such as words acquire their meanings by connecting to real-world sensorimotor experiences. Recent work has shown preliminary evidence that grounding may emerge in (vision-)language models trained at scale without using explicit grounding objectives. Yet, the specific loci of this emergence and the mechanisms that drive it remain largely unexplored. To address this problem, we introduce a controlled evaluation framework that systematically traces how symbol grounding arises within the internal computations through mechanistic and causal analysis. Our findings show that grounding concentrates in middle-layer computations and is implemented through the aggregate mechanism, where attention heads aggregate the environmental ground to support the prediction of linguistic forms. This phenomenon replicates in multimodal dialogue and across architectures (Transformers and state-space models), but not in unidirectional LSTMs. Our results provide behavioral and mechanistic evidence that symbol grounding can emerge in language models, with practical implications for predicting and potentially controlling the reliability of generation.
摘要：符号接地（Harnad，1990）描述了诸如单词之类的符号如何通过与现实世界的感觉运动体验相联系来获取其含义。最近的研究表明，初步证据表明，在不使用明确的基础目标的情况下，大规模训练的（视觉）语言模型中可能会出现基础。然而，这种出现的具体轨迹和驱动它的机制在很大程度上仍未被探索。为了解决这个问题，我们引入了一个受控评估框架，该框架通过机械和因果分析系统地跟踪符号接地在内部计算中是如何出现的。我们的研究结果表明，基础集中在中间层计算，并通过聚合机制实现，其中注意力头聚合环境基础以支持语言形式的预测。这种现象会在多模式对话和跨架构（Transformers 和状态空间模型）中重复出现，但不会在单向 LSTM 中重复出现。我们的结果提供了行为和机制证据，表明符号基础可以出现在语言模型中，对预测和潜在控制生成的可靠性具有实际意义。

Title: Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons

Authors: Giovanni Monea, Yair Feldman, Shankar Padmanabhan, Kianté Brantley, Yoav Artzi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13797
Pdf URL: https://arxiv.org/pdf/2510.13797
Copy Paste: [[2510.13797]] Breadcrumbs Reasoning: Memory-Efficient Reasoning with Compression Beacons(https://arxiv.org/abs/2510.13797)
Keywords: language model
Abstract: The scalability of large language models for long-context reasoning is severely constrained by the linear growth of their Transformer key-value cache, which incurs significant memory and computational costs. We posit that as a model generates reasoning tokens, the informational value of past generated tokens diminishes, creating an opportunity for compression. In this work, we propose to periodically compress the generation KV cache with a learned, special-purpose token and evict compressed entries. We train the model to perform this compression via a modified joint distillation and reinforcement learning (RL) framework. Our training method minimizes overhead over the conventional RL process, as it leverages RL outputs for distillation. Empirically, our method achieves a superior memory-accuracy Pareto frontier compared to both the model without cache compression and training-free compression techniques.
摘要：用于长上下文推理的大型语言模型的可扩展性受到其 Transformer 键值缓存的线性增长的严重限制，这会产生大量的内存和计算成本。我们假设，当模型生成推理标记时，过去生成的标记的信息价值就会减少，从而创造了压缩的机会。在这项工作中，我们建议使用学习到的专用令牌定期压缩生成 KV 缓存并逐出压缩条目。我们训练模型通过改进的联合蒸馏和强化学习（RL）框架来执行这种压缩。我们的训练方法最大限度地减少了传统 RL 过程的开销，因为它利用 RL 输出进行蒸馏。根据经验，与没有缓存压缩的模型和免训练压缩技术相比，我们的方法实现了卓越的内存准确性帕累托前沿。

Title: BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning

Authors: Jia-Chen Gu, Junyi Zhang, Di Wu, Yuankai Li, Kai-Wei Chang, Nanyun Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13799
Pdf URL: https://arxiv.org/pdf/2510.13799
Copy Paste: [[2510.13799]] BRIEF-Pro: Universal Context Compression with Short-to-Long Synthesis for Fast and Accurate Multi-Hop Reasoning(https://arxiv.org/abs/2510.13799)
Keywords: language model, llm, retrieval-augmented generation
Abstract: As retrieval-augmented generation (RAG) tackles complex tasks, increasingly expanded contexts offer richer information, but at the cost of higher latency and increased cognitive load on the model. To mitigate this bottleneck, especially for intricate multi-hop questions, we introduce BRIEF-Pro. It is a universal, lightweight compressor that distills relevant evidence for a given query from retrieved documents into a concise summary for seamless integration into in-context RAG. Using seed data consisting of relatively short contexts (fewer than 1k words), BRIEF-Pro is trained to perform abstractive compression of extended contexts exceeding 10k words across a wide range of scenarios. Furthermore, BRIEF-Pro offers flexible user control over summary length by allowing users to specify the desired number of sentences. Experiments on four open-domain multi-hop question-answering datasets show that BRIEF-Pro generates more concise and relevant summaries, enhancing performance across small, large, and proprietary language models. With the 70B reader model, 32x compression by BRIEF-Pro improves QA performance by 4.67% on average over LongLLMLingua's 9x, while requiring only 23% of its computational overhead.
摘要：随着检索增强生成 (RAG) 处理复杂的任务，不断扩展的上下文提供了更丰富的信息，但代价是更高的延迟和增加的模型认知负载。为了缓解这一瓶颈，特别是对于复杂的多跳问题，我们引入了 Brief-Pro。它是一种通用的轻量级压缩器，可将给定查询的相关证据从检索到的文档中提取为简洁的摘要，以便无缝集成到上下文中的 RAG 中。使用由相对较短的上下文（少于 1k 个单词）组成的种子数据，Brief-Pro 经过训练，可以在各种场景中对超过 10k 个单词的扩展上下文执行抽象压缩。此外，Brief-Pro 允许用户指定所需的句子数量，从而提供对摘要长度的灵活用户控制。对四个开放域多跳问答数据集的实验表明，BRIEF-Pro 生成更简洁且相关的摘要，从而增强了小型、大型和专有语言模型的性能。对于 70B 阅读器模型，Brief-Pro 的 32 倍压缩比 LongLLMLingua 的 9 倍平均提高了 QA 性能 4.67%，同时只需要 23% 的计算开销。