2025-10-17

Title: From Explainability to Action: A Generative Operational Framework for Integrating XAI in Clinical Mental Health Screening

Authors: Ratna Kandala, Akshata Kishore Moharir, Divya Arvinda Nayak
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13828
Pdf URL: https://arxiv.org/pdf/2510.13828
Copy Paste: [[2510.13828]] From Explainability to Action: A Generative Operational Framework for Integrating XAI in Clinical Mental Health Screening(https://arxiv.org/abs/2510.13828)
Keywords: language model, llm
Abstract: Explainable Artificial Intelligence (XAI) has been presented as the critical component for unlocking the potential of machine learning in mental health screening (MHS). However, a persistent lab-to-clinic gap remains. Current XAI techniques, such as SHAP and LIME, excel at producing technically faithful outputs such as feature importance scores, but fail to deliver clinically relevant, actionable insights that can be used by clinicians or understood by patients. This disconnect between technical transparency and human utility is the primary barrier to real-world adoption. This paper argues that this gap is a translation problem and proposes the Generative Operational Framework, a novel system architecture that leverages Large Language Models (LLMs) as a central translation engine. This framework is designed to ingest the raw, technical outputs from diverse XAI tools and synthesize them with clinical guidelines (via RAG) to automatically generate human-readable, evidence-backed clinical narratives. To justify our solution, we provide a systematic analysis of the components it integrates, tracing the evolution from intrinsic models to generative XAI. We demonstrate how this framework directly addresses key operational barriers, including workflow integration, bias mitigation, and stakeholder-specific communication. This paper also provides a strategic roadmap for moving the field beyond the generation of isolated data points toward the delivery of integrated, actionable, and trustworthy AI in clinical practice.

Title: A Linguistics-Aware LLM Watermarking via Syntactic Predictability

Authors: Shinwoo Park, Hyejin Park, Hyeseon Ahn, Yo-Sub Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13829
Pdf URL: https://arxiv.org/pdf/2510.13829
Copy Paste: [[2510.13829]] A Linguistics-Aware LLM Watermarking via Syntactic Predictability(https://arxiv.org/abs/2510.13829)
Keywords: language model, llm
Abstract: As large language models (LLMs) continue to advance rapidly, reliable governance tools have become critical. Publicly verifiable watermarking is particularly essential for fostering a trustworthy AI ecosystem. A central challenge persists: balancing text quality against detection robustness. Recent studies have sought to navigate this trade-off by leveraging signals from model output distributions (e.g., token-level entropy); however, their reliance on these model-specific signals presents a significant barrier to public verification, as the detection process requires access to the logits of the underlying model. We introduce STELA, a novel framework that aligns watermark strength with the linguistic degrees of freedom inherent in language. STELA dynamically modulates the signal using part-of-speech (POS) n-gram-modeled linguistic indeterminacy, weakening it in grammatically constrained contexts to preserve quality and strengthening it in contexts with greater linguistic flexibility to enhance detectability. Our detector operates without access to any model logits, thus facilitating publicly verifiable detection. Through extensive experiments on typologically diverse languages (analytic English, isolating Chinese, and agglutinative Korean), we show that STELA surpasses prior methods in detection robustness. Our code is available at this https URL.
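As a rough illustration of the idea in the STELA abstract, the sketch below estimates linguistic indeterminacy as the Shannon entropy of the next POS tag given the previous one, then scales a watermark bias by that entropy. The bigram model, the toy tag sequences, the `max_delta` parameter, and the linear scaling are all assumptions made for illustration; the abstract does not specify how the authors compute or apply the signal.

```python
import math
from collections import Counter, defaultdict

def pos_bigram_entropy(tag_sequences):
    """Return {prev_tag: entropy in bits of the next-tag distribution}."""
    counts = defaultdict(Counter)
    for tags in tag_sequences:
        for prev, nxt in zip(tags, tags[1:]):
            counts[prev][nxt] += 1
    entropy = {}
    for prev, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        entropy[prev] = -sum(
            (c / total) * math.log2(c / total) for c in nxt_counts.values()
        )
    return entropy

def watermark_strength(prev_tag, entropy, max_delta=2.0):
    """Scale a hypothetical logit bias by normalized POS entropy (assumed rule)."""
    max_h = max(entropy.values()) or 1.0
    return max_delta * entropy.get(prev_tag, 0.0) / max_h

# Toy POS-tagged corpus (hypothetical data).
corpus = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "NOUN", "VERB", "ADV"],
    ["DET", "NOUN", "ADP", "DET", "NOUN"],
]
H = pos_bigram_entropy(corpus)
```

In this toy corpus a determiner is always followed by a noun, so the strength after `DET` is zero (a grammatically constrained context), while the position after `VERB` is the most flexible and receives the full `max_delta`.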

Title: Users as Annotators: LLM Preference Learning from Comparison Mode

Authors: Zhongze Cai, Xiaocheng Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13830
Pdf URL: https://arxiv.org/pdf/2510.13830
Copy Paste: [[2510.13830]] Users as Annotators: LLM Preference Learning from Comparison Mode(https://arxiv.org/abs/2510.13830)
Keywords: language model, llm, prompt
Abstract: Pairwise preference data have played an important role in the alignment of large language models (LLMs). Each sample of such data consists of a prompt, two different responses to the prompt, and a binary label indicating which of the two responses is better. The labels are usually annotated by professional human annotators. In this paper, we consider an alternative approach to collecting pairwise preference data: user annotation from comparison mode. With the increasingly wide adoption of LLMs among the population, users are contributing more and more of their preference labels through their daily interactions with the LLMs. The upside of such labels is that users are the best experts in judging the responses to their own queries/prompts, but the downside is the lack of quality control in these labels. To address this, we consider a new idea of generating the two responses from two different models or two different versions of the same model. The asymmetry allows us to infer the user's data quality through our proposed user behavior model. We develop an expectation-maximization algorithm to estimate a latent quality factor of the user, and filter users' annotation data accordingly. The downstream task shows the effectiveness of our approach in both capturing user behavior and filtering data for LLM alignment.

Title: Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference

Authors: Chao Han, Yijuan Liang, Zihao Xuan, Daokuan Wu, Wei Zhang, Xiaoyu Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13831
Pdf URL: https://arxiv.org/pdf/2510.13831
Copy Paste: [[2510.13831]] Informed Routing in LLMs: Smarter Token-Level Computation for Faster Inference(https://arxiv.org/abs/2510.13831)
Keywords: language model, llm
Abstract: The deployment of large language models (LLMs) in real-world applications is increasingly limited by their high inference cost. While recent advances in dynamic token-level computation allocation attempt to improve efficiency by selectively activating model components per token, existing methods rely on greedy routing--a myopic execute-or-skip mechanism that often leads to irreversible information loss and suboptimal token selection. This paper introduces informed routing, a new paradigm that proactively addresses these issues. The key insight is to assess not only a token's immediate importance but also its recoverability, i.e., how well its transformation can be approximated. To this end, we propose the Lightweight Feature Forecaster (LFF), a small predictive module that estimates a unit's output before routing decisions are made. This enables a flexible execute-or-approximate policy that preserves model fidelity while drastically reducing computation. Extensive experiments on both language modeling and reasoning tasks show that informed routing achieves state-of-the-art efficiency-performance trade-offs across multiple sparsity levels. Notably, even without final LoRA fine-tuning, our method matches or surpasses strong baselines that require full fine-tuning, all while reducing training time by over 50%. The code is available at: this https URL

Title: ConDABench: Interactive Evaluation of Language Models for Data Analysis

Authors: Avik Dutta, Priyanshu Gupta, Hosein Hasanbeig, Rahul Pratap Singh, Harshit Nigam, Sumit Gulwani, Arjun Radhakrishna, Gustavo Soares, Ashish Tiwari
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13835
Pdf URL: https://arxiv.org/pdf/2510.13835
Copy Paste: [[2510.13835]] ConDABench: Interactive Evaluation of Language Models for Data Analysis(https://arxiv.org/abs/2510.13835)
Keywords: language model, llm, agent
Abstract: Real-world data analysis tasks often come with under-specified goals and unclean data. User interaction is necessary to understand and disambiguate a user's intent, and hence, essential to solving these complex tasks. Existing benchmarks for evaluating LLMs on data analysis tasks do not capture these complexities or provide first-class support for interactivity. We introduce ConDABench, a framework for generating conversational data analysis (ConDA) benchmarks and evaluating external tools on the generated benchmarks. ConDABench consists of (a) a multi-agent workflow for generating realistic benchmarks from articles describing insights gained from public datasets, (b) 1,420 ConDA problems generated using this workflow, and (c) an evaluation harness that, for the first time, makes it possible to systematically evaluate conversational data analysis tools on the generated ConDA problems. Evaluation of state-of-the-art LLMs on the benchmarks reveals that while the new generation of models is better at solving more instances, it is not necessarily better at solving tasks that require sustained, long-form engagement. ConDABench is an avenue for model builders to measure progress towards truly collaborative models that can complete complex interactive tasks.

Title: SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models

Authors: Debarun Bhattacharjya, Balaji Ganesan, Junkyu Lee, Radu Marinescu, Katsiaryna Mirylenka, Michael Glass, Xiao Shou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13836
Pdf URL: https://arxiv.org/pdf/2510.13836
Copy Paste: [[2510.13836]] SIMBA UQ: Similarity-Based Aggregation for Uncertainty Quantification in Large Language Models(https://arxiv.org/abs/2510.13836)
Keywords: language model, llm
Abstract: When does a large language model (LLM) know what it does not know? Uncertainty quantification (UQ) provides measures of uncertainty, such as an estimate of the confidence in an LLM's generated output, and is therefore increasingly recognized as a crucial component of trusted AI systems. Black-box UQ methods do not require access to internal model information from the generating LLM and therefore have numerous real-world advantages, such as robustness to system changes, adaptability to choice of LLM, reduced costs, and computational tractability. In this paper, we investigate the effectiveness of UQ techniques that are primarily but not necessarily entirely black-box, where the consistency between a generated output and other sampled generations is used as a proxy for confidence in its correctness. We propose a high-level non-verbalized similarity-based aggregation framework that subsumes a broad swath of UQ approaches suitable for complex generative tasks, as well as introduce specific novel techniques from the framework that train confidence estimation models using small training sets. Through an empirical study with datasets spanning the diverse tasks of question answering, summarization, and text-to-SQL, we demonstrate that our proposed similarity-based methods can yield better calibrated confidences than baselines.
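The core intuition in the SIMBA UQ abstract, using agreement between one generation and other sampled generations as a proxy for confidence, can be sketched in a few lines. The token-set Jaccard similarity and the plain mean used as the aggregator below are assumptions chosen for brevity; the abstract describes a broader framework and trained confidence models that this sketch does not reproduce.

```python
def jaccard(a, b):
    """Token-set Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def similarity_confidence(target, samples, sim=jaccard):
    """Mean similarity of `target` to the other sampled generations."""
    if not samples:
        raise ValueError("need at least one other sample")
    return sum(sim(target, s) for s in samples) / len(samples)

# Hypothetical sampled answers to the same question.
samples = ["Paris is the capital", "The capital is Paris", "It is Lyon"]
conf_consistent = similarity_confidence("Paris is the capital", samples[1:2])
conf_outlier = similarity_confidence("It is Lyon", samples[:2])
```

Here the answer that agrees with another sample scores higher than the outlier. Any similarity function with the same two-argument shape can be passed as `sim`.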

Title: Meronymic Ontology Extraction via Large Language Models

Authors: Dekai Zhang, Simone Conia, Antonio Rago
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13839
Pdf URL: https://arxiv.org/pdf/2510.13839
Copy Paste: [[2510.13839]] Meronymic Ontology Extraction via Large Language Models(https://arxiv.org/abs/2510.13839)
Keywords: language model, llm
Abstract: Ontologies have become essential in today's digital age as a way of organising the vast amount of readily available unstructured text. In providing formal structure to this information, ontologies have immense value and application across various domains, e.g., e-commerce, where countless product listings necessitate proper product organisation. However, the manual construction of these ontologies is a time-consuming, expensive and laborious process. In this paper, we harness the recent advancements in large language models (LLMs) to develop a fully-automated method of extracting product ontologies, in the form of meronymies, from raw review texts. We demonstrate that the ontologies produced by our method surpass an existing, BERT-based baseline when evaluating using an LLM-as-a-judge. Our investigation provides the groundwork for LLMs to be used more generally in (product or otherwise) ontology extraction.

Title: ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking

Authors: Yutao Wu, Xiao Liu, Yinghui Li, Yifeng Gao, Yifan Ding, Jiale Ding, Xiang Zheng, Xingjun Ma
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2510.13842
Pdf URL: https://arxiv.org/pdf/2510.13842
Copy Paste: [[2510.13842]] ADMIT: Few-shot Knowledge Poisoning Attacks on RAG-based Fact Checking(https://arxiv.org/abs/2510.13842)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Knowledge poisoning poses a critical threat to Retrieval-Augmented Generation (RAG) systems by injecting adversarial content into knowledge bases, tricking Large Language Models (LLMs) into producing attacker-controlled outputs grounded in manipulated context. Prior work highlights LLMs' susceptibility to misleading or malicious retrieved content. However, real-world fact-checking scenarios are more challenging, as credible evidence typically dominates the retrieval pool. To investigate this problem, we extend knowledge poisoning to the fact-checking setting, where retrieved context includes authentic supporting or refuting evidence. We propose ADMIT (ADversarial Multi-Injection Technique), a few-shot, semantically aligned poisoning attack that flips fact-checking decisions and induces deceptive justifications, all without access to the target LLMs, retrievers, or token-level control. Extensive experiments show that ADMIT transfers effectively across 4 retrievers, 11 LLMs, and 4 cross-domain benchmarks, achieving an average attack success rate (ASR) of 86% at an extremely low poisoning rate of 0.93 × 10^-6, and remaining robust even in the presence of strong counter-evidence. Compared with prior state-of-the-art attacks, ADMIT improves ASR by 11.2% across all settings, exposing significant vulnerabilities in real-world RAG-based fact-checking systems.

Title: Serialized EHR make for good text representations

Authors: Zhirong Chou, Quan Qin, Shi Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13843
Pdf URL: https://arxiv.org/pdf/2510.13843
Copy Paste: [[2510.13843]] Serialized EHR make for good text representations(https://arxiv.org/abs/2510.13843)
Keywords: language model
Abstract: The emergence of foundation models in healthcare has opened new avenues for learning generalizable representations from large-scale clinical data. Yet, existing approaches often struggle to reconcile the tabular and event-based nature of Electronic Health Records (EHRs) with the sequential priors of natural language models. This structural mismatch limits their ability to capture longitudinal dependencies across patient encounters. We introduce SerialBEHRT, a domain-aligned foundation model that extends SciBERT through additional pretraining on structured EHR sequences. SerialBEHRT is designed to encode temporal and contextual relationships among clinical events, thereby producing richer patient representations. We evaluate its effectiveness on the task of antibiotic susceptibility prediction, a clinically meaningful problem in antibiotic stewardship. Through extensive benchmarking against state-of-the-art EHR representation strategies, we demonstrate that SerialBEHRT achieves superior and more consistent performance, highlighting the importance of temporal serialization in foundation model pretraining for healthcare.

Title: DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models

Authors: Jinbin Zhang, Nasib Ullah, Erik Schultheis, Rohit Babbar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13847
Pdf URL: https://arxiv.org/pdf/2510.13847
Copy Paste: [[2510.13847]] DynaSpec: Context-aware Dynamic Speculative Sampling for Large-Vocabulary Language Models(https://arxiv.org/abs/2510.13847)
Keywords: language model, llm
Abstract: Speculative decoding (a.k.a. speculative sampling) has become a standard way to accelerate LLM inference: a small drafter proposes multiple tokens and a large target model verifies them once per speculation length. Recently, scaling of the LLM vocabulary has pushed the number of tokens to grow substantially. While verification over the full vocabulary leaves the target model largely unaffected, the O(|V|d) parameters in the drafter's output head become a latency bottleneck, slowing the entire pipeline. Contemporary methods (e.g., FR-Spec, VocabTrim) restrict the drafter's vocabulary to a fixed subset of the target model's vocabulary, ranked in descending order of token frequency. Although this reduces draft-time compute, it is brittle, since: (i) frequency lists are corpus-dependent and require retuning to generalize, and (ii) static shortlists suppress rare or domain-specific tokens, lowering the expected number of tokens per verification step. We propose DynaSpec, a context-dependent dynamic shortlisting mechanism that is robust, speeds up drafting, and generalizes across diverse tasks. Concretely, we introduce lightweight, coarse-grained meta-classifiers that route contexts to a small number of token clusters; the union of the top-k selected clusters forms the drafter's shortlist, while verification retains the full vocabulary and exactness. The meta-classifier finishes its computation earlier than the drafter's hidden state generation by exploiting parallel execution of draft encoding and meta shortlisting on separate streams. On standard speculative-decoding benchmarks, we observe consistent gains in mean accepted length over fixed-shortlist baselines, while context-dependent selection enables smaller shortlists without degrading acceptance.
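The shortlisting step described in the DynaSpec abstract (take the union of the top-k token clusters selected for the current context) can be illustrated as follows. The cluster contents, the scores, and the function name `build_shortlist` are hypothetical; in the paper the scores would come from the meta-classifier, which is not modeled here.

```python
import heapq

def build_shortlist(cluster_scores, clusters, k):
    """Union of the token ids in the k highest-scoring clusters."""
    top = heapq.nlargest(
        k, range(len(cluster_scores)), key=cluster_scores.__getitem__
    )
    shortlist = set()
    for idx in top:
        shortlist.update(clusters[idx])
    return sorted(shortlist)

# Hypothetical token clusters (a token id may appear in several clusters)
# and hypothetical per-context cluster scores.
clusters = [[0, 1, 2], [3, 4], [5, 6, 7], [2, 8]]
scores = [0.1, 0.7, 0.05, 0.15]
shortlist = build_shortlist(scores, clusters, k=2)
```

The drafter would then compute output logits only over `shortlist`, while the target model still verifies against the full vocabulary, as the abstract states.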

Title: On-device System of Compositional Multi-tasking in Large Language Models

Authors: Ondrej Bohdal, Konstantinos Theodosiadis, Asterios Mpatziakas, Dimitris Filippidis, Iro Spyrou, Christos Zonios, Anastasios Drosou, Dimosthenis Ioannidis, Kyeng-Hun Lee, Jijoong Moon, Hyeonmok Ko, Mete Ozay, Umberto Michieli
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13848
Pdf URL: https://arxiv.org/pdf/2510.13848
Copy Paste: [[2510.13848]] On-device System of Compositional Multi-tasking in Large Language Models(https://arxiv.org/abs/2510.13848)
Keywords: language model, llm
Abstract: Large language models (LLMs) are commonly adapted for diverse downstream tasks via parameter-efficient fine-tuning techniques such as Low-Rank Adapters (LoRA). While adapters can be combined to handle multiple tasks separately, standard approaches struggle when targeting the simultaneous execution of complex tasks, such as generating a translated summary from a long conversation. To address this challenge, we propose a novel approach tailored specifically for compositional multi-tasking scenarios involving summarization and translation. Our technique involves adding a learnable projection layer on top of the combined summarization and translation adapters. This design enables effective integration while maintaining efficiency through reduced computational overhead compared to alternative strategies requiring extensive retraining or sequential processing. We demonstrate the practical viability of our method within an on-device environment by developing an Android app capable of executing compositional tasks seamlessly. Experimental results indicate our solution performs well and is fast in both cloud-based and on-device implementations, highlighting the potential benefits of adopting our framework in real-world applications demanding high-speed operation alongside resource constraints.

Title: Language steering in latent space to mitigate unintended code-switching

Authors: Andrey Goncharov, Nikolai Kondusov, Alexey Zaytsev
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13849
Pdf URL: https://arxiv.org/pdf/2510.13849
Copy Paste: [[2510.13849]] Language steering in latent space to mitigate unintended code-switching(https://arxiv.org/abs/2510.13849)
Keywords: language model, llm
Abstract: Multilingual Large Language Models (LLMs) often exhibit unintended code-switching, reducing reliability in downstream tasks. We propose latent-space language steering, a lightweight inference-time method that identifies language directions via PCA on parallel translations and steers token embeddings along these axes to control language identity. Our approach mitigates code-switching while preserving semantics with negligible computational overhead and requires only minimal parallel data for calibration. Empirically, we achieve 95-99% language classification accuracy using a single principal component and reduce next-token distributional divergence by up to 42% across multiple language pairs on Qwen2.5 and Llama-3.2 models. We further analyze the layer-wise evolution of language representations, revealing that language identity concentrates in final layers with near-perfect linear separability.

Title: Revisiting the UID Hypothesis in LLM Reasoning Traces

Authors: Minju Gwak, Guijin Son, Jaehyung Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13850
Pdf URL: https://arxiv.org/pdf/2510.13850
Copy Paste: [[2510.13850]] Revisiting the UID Hypothesis in LLM Reasoning Traces(https://arxiv.org/abs/2510.13850)
Keywords: language model, llm, chain-of-thought
Abstract: Large language models (LLMs) often solve problems using step-by-step Chain-of-Thought (CoT) reasoning, yet these intermediate steps are frequently unfaithful or hard to interpret. Inspired by the Uniform Information Density (UID) hypothesis in psycholinguistics -- which posits that humans communicate by maintaining a stable flow of information -- we introduce entropy-based metrics to analyze the information flow within reasoning traces. Surprisingly, across three challenging mathematical benchmarks, we find that successful reasoning in LLMs is globally non-uniform: correct solutions are characterized by uneven swings in information density, in stark contrast to human communication patterns. This result challenges assumptions about machine reasoning and suggests new directions for designing interpretable and adaptive reasoning models.
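To make the notion of uniform versus non-uniform information flow concrete, the sketch below computes a Shannon entropy per reasoning step and summarizes a trace by the variance of those entropies. Both the per-step granularity and the use of variance as the uniformity measure are assumptions for illustration; the abstract only says that the metrics are entropy-based.

```python
import math
from statistics import pvariance

def step_entropy(probs):
    """Shannon entropy (bits) of one next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_profile(trace):
    """Per-step entropies and their population variance."""
    entropies = [step_entropy(p) for p in trace]
    return entropies, pvariance(entropies)

# Hypothetical traces: each inner list is one step's token distribution.
uniform_trace = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
uneven_trace = [[1.0, 0.0], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]

_, uniform_var = information_profile(uniform_trace)
_, uneven_var = information_profile(uneven_trace)
```

A variance near zero corresponds to a UID-like steady flow, whereas a larger variance reflects the uneven swings in information density that the abstract associates with correct solutions.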

Title: EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing

Authors: Sicheng Lyu, Yu Gu, Xinyu Wang, Jerry Huang, Sitao Luan, Yufei Cui, Xiao-Wen Chang, Peng Lu
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13851
Pdf URL: https://arxiv.org/pdf/2510.13851
Copy Paste: [[2510.13851]] EvoEdit: Evolving Null-space Alignment for Robust and Efficient Knowledge Editing(https://arxiv.org/abs/2510.13851)
Keywords: language model, llm
Abstract: Large language models (LLMs) require continual updates to rectify outdated or erroneous knowledge. Model editing has emerged as a compelling paradigm for introducing targeted modifications without the computational burden of full retraining. Existing approaches are mainly based on a locate-then-edit framework. However, in sequential editing contexts, where multiple updates are applied over time, they exhibit significant limitations and suffer from catastrophic interference, i.e., new edits compromise previously integrated updates and degrade preserved knowledge. To address these challenges, we introduce EvoEdit, a novel editing strategy that mitigates catastrophic interference through sequential null-space alignment, enabling stable and efficient model editing. By performing sequential null-space alignment for each incoming edit, EvoEdit preserves both original and previously modified knowledge representations and maintains output invariance on preserved knowledge even across long edit sequences, effectively mitigating interference. Evaluations on real-world sequential knowledge-editing benchmarks show that EvoEdit achieves performance better than or comparable to prior state-of-the-art locate-then-edit techniques, with up to a 3.53 times speedup. Overall, these results underscore the necessity of developing more principled approaches for designing LLMs in dynamically evolving information settings, while providing a simple yet effective solution with strong theoretical guarantees.

Title: ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups

Authors: Peter Banyas, Shristi Sharma, Alistair Simmons, Atharva Vispute
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2510.13852
Pdf URL: https://arxiv.org/pdf/2510.13852
Copy Paste: [[2510.13852]] ConsistencyAI: A Benchmark to Assess LLMs' Factual Consistency When Responding to Different Demographic Groups(https://arxiv.org/abs/2510.13852)
Keywords: language model, llm, prompt
Abstract: Is an LLM telling you different facts than it's telling me? This paper introduces ConsistencyAI, an independent benchmark for measuring the factual consistency of large language models (LLMs) for different personas. ConsistencyAI tests whether, when users of different demographics ask identical questions, the model responds with factually inconsistent answers. Designed without involvement from LLM providers, this benchmark offers impartial evaluation and accountability. In our experiment, we queried 19 LLMs with prompts that requested 5 facts for each of 15 topics. We repeated this query 100 times for each LLM, each time adding prompt context from a different persona selected from a subset of personas modeling the general population. We processed the responses into sentence embeddings, computed cross-persona cosine similarity, and computed the weighted average of cross-persona cosine similarity to calculate factual consistency scores. In 100-persona experiments, scores ranged from 0.9065 to 0.7896, and the mean was 0.8656, which we adopt as a benchmark threshold. xAI's Grok-3 is most consistent, while several lightweight models rank lowest. Consistency varies by topic: the job market is least consistent, G7 world leaders most consistent, and issues like vaccines or the Israeli-Palestinian conflict diverge by provider. These results show that both the provider and the topic shape the factual consistency. We release our code and interactive demo to support reproducible evaluation and encourage persona-invariant prompting strategies.

Title: BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation

Authors: Fabian Wenz, Omar Bouattour, Devin Yang, Justin Choi, Cecil Gregg, Nesime Tatbul, Çağatay Demiralp
Subjects: cs.CL, cs.AI, cs.DB, cs.HC
Abstract URL: https://arxiv.org/abs/2510.13853
Pdf URL: https://arxiv.org/pdf/2510.13853
Copy Paste: [[2510.13853]] BenchPress: A Human-in-the-Loop Annotation System for Rapid Text-to-SQL Benchmark Curation(https://arxiv.org/abs/2510.13853)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large language models (LLMs) have been successfully applied to many tasks, including text-to-SQL generation. However, much of this work has focused on publicly available datasets, such as Fiben, Spider, and Bird. Our earlier work showed that LLMs are much less effective in querying large private enterprise data warehouses and released Beaver, the first private enterprise text-to-SQL benchmark. To create Beaver, we leveraged SQL logs, which are often readily available. However, manually annotating these logs to identify which natural language questions they answer is a daunting task. Asking database administrators, who are highly trained experts, to take on additional work to construct and validate corresponding natural language utterances is not only challenging but also quite costly. To address this challenge, we introduce BenchPress, a human-in-the-loop system designed to accelerate the creation of domain-specific text-to-SQL benchmarks. Given a SQL query, BenchPress uses retrieval-augmented generation (RAG) and LLMs to propose multiple natural language descriptions. Human experts then select, rank, or edit these drafts to ensure accuracy and domain alignment. We evaluated BenchPress on annotated enterprise SQL logs, demonstrating that LLM-assisted annotation drastically reduces the time and effort required to create high-quality benchmarks. Our results show that combining human verification with LLM-generated suggestions enhances annotation accuracy, benchmark reliability, and model evaluation robustness. By streamlining the creation of custom benchmarks, BenchPress offers researchers and practitioners a mechanism for assessing text-to-SQL models on a given domain-specific workload. BenchPress is freely available via our public GitHub repository at this https URL and is also accessible on our website at this http URL.

Title: Harnessing Consistency for Robust Test-Time LLM Ensemble

Authors: Zhichen Zeng, Qi Yu, Xiao Lin, Ruizhong Qiu, Xuying Ning, Tianxin Wei, Yuchen Yan, Jingrui He, Hanghang Tong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13855
Pdf URL: https://arxiv.org/pdf/2510.13855
Copy Paste: [[2510.13855]] Harnessing Consistency for Robust Test-Time LLM Ensemble(https://arxiv.org/abs/2510.13855)
Keywords: language model, llm
Abstract: Different large language models (LLMs) exhibit diverse strengths and weaknesses, and LLM ensemble serves as a promising approach to integrate their complementary capabilities. Despite substantial progress in improving ensemble quality, limited attention has been paid to the robustness of ensembles against potential erroneous signals, which often arise from heterogeneous tokenization schemes and varying model expertise. Our analysis shows that ensemble failures typically arise from both the token level and the model level: the former reflects severe disagreement in token predictions, while the latter involves low confidence and pronounced disparities among models. In light of this, we propose CoRE, a plug-and-play technique that harnesses model consistency for robust LLM ensemble, which can be seamlessly integrated with diverse ensemble methods. Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens with high inconsistency, often due to token misalignment, thereby improving robustness at a granular level. Model-level consistency models global agreement by promoting model outputs with high self-confidence and minimal divergence from others, enhancing robustness at a coarser level. Extensive experiments across diverse benchmarks, model combinations, and ensemble strategies demonstrate that CoRE consistently improves ensemble performance and robustness.

Title: Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA

Authors: A H M Rezaul Karim, Ozlem Uzuner
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2510.13856
Pdf URL: https://arxiv.org/pdf/2510.13856
Copy Paste: [[2510.13856]] Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA(https://arxiv.org/abs/2510.13856)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. The MEDIQA-WV 2025 shared task addressed wound-care VQA, requiring systems to generate free-text responses and structured wound attributes from images and patient queries. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework that incorporates textual and visual examples from in-domain data. This approach grounds outputs in clinically relevant exemplars, improving reasoning, schema adherence, and response quality across dBLEU, ROUGE, BERTScore, and LLM-based metrics. Our best-performing system ranked 3rd among 19 teams and 51 submissions with an average score of 41.37%, demonstrating that lightweight RAG with general-purpose LLMs -- a minimal inference-time layer that adds a few relevant exemplars via simple indexing and fusion, with no extra training or complex re-ranking -- provides a simple and effective baseline for multimodal clinical NLP tasks.

Title: ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing

Authors: Shivanshu Kumar, Gopalakrishnan Srinivasan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13860
Pdf URL: https://arxiv.org/pdf/2510.13860
Copy Paste: [[2510.13860]] ShishuLM: Lightweight Language Model with Hybrid Decoder-MLP Architecture and Paired Weight Sharing(https://arxiv.org/abs/2510.13860)
Keywords: language model, agent
Abstract: While the transformer architecture has achieved state-of-the-art performance on natural language processing tasks, these models impose substantial memory and computational overhead. Recent research has identified significant architectural redundancies within these models, presenting opportunities for optimization without compromising performance. Taking insights from research in AI interpretability and inference-time layer pruning, we introduce an efficient language model architecture, referred to as ShishuLM, which reduces both the parameter count and Key-Value (KV) cache requirements. Given the increasing importance of Small Language Models (SLMs) in agentic AI systems, we evaluate our approach on two SLMs of different scales. Our analysis reveals that for moderate-context scenarios, normalization coupled with attention computation is roughly linear with the input, enabling entire transformer blocks to be approximated through Multi-Layer Perceptrons (MLPs). Our results show that ShishuLM provides up to 25% reduction in memory requirements and up to 40% improvement in latency during both training and inference, compared to parent models. Our experimental and analytical findings provide insights towards building more efficient SLM architectures from a pre-training standpoint.

Title: Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues

Authors: Chenyu Zhang, Sharifa Alghowinem, Cynthia Breazeal
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2510.13862
Pdf URL: https://arxiv.org/pdf/2510.13862
Copy Paste: [[2510.13862]] Ensembling Large Language Models to Characterize Affective Dynamics in Student-AI Tutor Dialogues(https://arxiv.org/abs/2510.13862)
Keywords: language model, gpt, llm
Abstract: While recent studies have examined the learning impact of large language models (LLMs) in educational contexts, the affective dynamics of LLM-mediated tutoring remain insufficiently understood. This work introduces the first ensemble-LLM framework for large-scale affect sensing in tutoring dialogues, advancing the conversation on responsible pathways for integrating generative AI into education by attending to learners' evolving affective states. To achieve this, we analyzed two semesters' worth of 16,986 conversational turns exchanged between PyTutor, an LLM-powered AI tutor, and 261 undergraduate learners across three U.S. institutions. To investigate learners' emotional experiences, we generate zero-shot affect annotations from three frontier LLMs (Gemini, GPT-4o, Claude), including scalar ratings of valence, arousal, and learning-helpfulness, along with free-text emotion labels. These estimates are fused through rank-weighted intra-model pooling and plurality consensus across models to produce robust emotion profiles. Our analysis shows that during interaction with the AI tutor, students typically report mildly positive affect and moderate arousal. Yet learning is not uniformly smooth: confusion and curiosity are frequent companions to problem solving, and frustration, while less common, still surfaces in ways that can derail progress. Emotional states are short-lived--positive moments last slightly longer than neutral or negative ones, but they are fragile and easily disrupted. Encouragingly, negative emotions often resolve quickly, sometimes rebounding directly into positive states. Neutral moments frequently act as turning points, more often steering students upward than downward, suggesting opportunities for tutors to intervene at precisely these junctures.

Title: Unlocking the Potential of Diffusion Language Models through Template Infilling

Authors: Junhoo Lee, Seungyeon Kim, Nojun Kwak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13870
Pdf URL: https://arxiv.org/pdf/2510.13870
Copy Paste: [[2510.13870]] Unlocking the Potential of Diffusion Language Models through Template Infilling(https://arxiv.org/abs/2510.13870)
Keywords: language model, prompt
Abstract: Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs' generation process. Unlike conventional prefix prompting, TI first generates a structural template for the target response, then fills in the masked segments. To enhance the flexibility of this structural control, we introduce Dynamic Segment Allocation (DSA), which adaptively adjusts segment lengths based on generation confidence. We demonstrate the effectiveness of our approach on mathematical reasoning and code generation benchmarks, achieving consistent improvements of 17.01 percentage points over baseline. Furthermore, we show that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality.

Title: What Layers When: Learning to Skip Compute in LLMs with Residual Gates

Authors: Filipe Laitenberger, Dawid Kopiczko, Cees G.M. Snoek, Yuki M. Asano
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13876
Pdf URL: https://arxiv.org/pdf/2510.13876
Copy Paste: [[2510.13876]] What Layers When: Learning to Skip Compute in LLMs with Residual Gates(https://arxiv.org/abs/2510.13876)
Keywords: llm
Abstract: We introduce GateSkip, a simple residual-stream gating mechanism that enables token-wise layer skipping in decoder-only LMs. Each Attention/MLP branch is equipped with a sigmoid-linear gate that condenses the branch's output before it re-enters the residual stream. During inference we rank tokens by the gate values and skip low-importance ones using a per-layer budget. While early-exit or router-based Mixture-of-Depths models are known to be unstable and need extensive retraining, our smooth, differentiable gates fine-tune stably on top of pretrained models. On long-form reasoning, we save up to 15% compute while retaining over 90% of baseline accuracy. On instruction-tuned models we see accuracy gains at full compute and match baseline quality near 50% savings. The learned gates give insight into transformer information flow (e.g., BOS tokens act as anchors), and the method combines easily with quantization, pruning, and self-speculative decoding.
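The inference-time mechanism described above (a sigmoid gate on each branch, tokens ranked by gate value, and a per-layer budget deciding which tokens are computed) might look roughly like the sketch below. It is a simplified illustration under stated assumptions, not the paper's implementation: the gate is a single scalar per token computed from the token's hidden state, `branch` is a stand-in for an attention or MLP block, and the names `gated_branch`, `gate_w`, `gate_b`, and `budget` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_branch(hidden, branch, gate_w, gate_b, budget):
    """Apply one gated residual branch with token-wise skipping.

    hidden: list of per-token vectors.
    branch: function mapping a vector to a vector (stand-in for attention/MLP).
    gate_w, gate_b: parameters of a linear map feeding the sigmoid gate
                    (a scalar gate per token is an assumption of this sketch).
    budget: fraction of tokens that this layer is allowed to compute.
    """
    gates = [
        sigmoid(sum(w * x for w, x in zip(gate_w, h)) + gate_b) for h in hidden
    ]
    keep = max(1, int(budget * len(hidden)))
    # Rank tokens by gate value and keep only the highest-scoring ones.
    ranked = sorted(range(len(hidden)), key=lambda i: gates[i], reverse=True)
    active = set(ranked[:keep])
    out = []
    for i, h in enumerate(hidden):
        if i in active:
            update = branch(h)
            # The gate scales the branch output before it re-enters the stream.
            out.append([x + gates[i] * u for x, u in zip(h, update)])
        else:
            # Skipped token: the residual stream passes through unchanged.
            out.append(list(h))
    return out, sorted(active)


tokens = [[2.0], [-2.0], [1.0], [-1.0]]
new_hidden, computed = gated_branch(
    tokens, branch=lambda h: [1.0], gate_w=[1.0], gate_b=0.0, budget=0.5
)
```

With a budget of 0.5, only the two tokens with the largest gate values run the branch; the other two are copied through untouched.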

Title: TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks

Authors: Jimin Lim, Arjun Damerla, Arthur Jiang, Nam Le
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13878
Pdf URL: https://arxiv.org/pdf/2510.13878
Copy Paste: [[2510.13878]] TextBandit: Evaluating Probabilistic Reasoning in LLMs Through Language-Only Decision Tasks(https://arxiv.org/abs/2510.13878)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been shown to be increasingly capable of performing reasoning tasks, but their ability to make sequential decisions under uncertainty using only natural language remains underexplored. We introduce a novel benchmark in which LLMs interact with multi-armed bandit environments using purely textual feedback, such as "you earned a token", without access to numerical cues or explicit probabilities, requiring the model to infer latent reward structures purely from linguistic cues and to adapt accordingly. We evaluated the performance of four open-source LLMs and compared it to standard decision-making algorithms such as Thompson Sampling, Epsilon Greedy, Upper Confidence Bound (UCB), and random choice. While most of the LLMs underperformed the baselines, Qwen3-4B achieved a best-arm selection rate of 89.2%, significantly outperforming both the larger LLMs and the traditional methods. Our findings suggest that probabilistic reasoning can emerge from language alone, and we present this benchmark as a step towards evaluating decision-making capabilities in naturalistic, non-numeric contexts.
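For readers unfamiliar with the baselines named above, the following sketch shows textbook versions of Epsilon Greedy, UCB1, and Thompson Sampling on a Bernoulli bandit, scored by best-arm selection rate. It is an illustration only: the arm probabilities, step count, exploration rate, and function names are assumptions, and the benchmark's own environment and settings may differ.

```python
import math
import random


def run_bandit(probs, steps, policy, rng):
    """Play a Bernoulli bandit and return the fraction of pulls on the best arm."""
    counts = [0] * len(probs)   # pulls per arm
    sums = [0.0] * len(probs)   # total reward per arm
    best = max(range(len(probs)), key=lambda a: probs[a])
    best_pulls = 0
    for t in range(1, steps + 1):
        arm = policy(counts, sums, t, rng)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        if arm == best:
            best_pulls += 1
    return best_pulls / steps


def epsilon_greedy(counts, sums, t, rng, eps=0.1):
    """Explore uniformly with probability eps, otherwise pick the best mean."""
    if 0 in counts:
        return counts.index(0)  # try every arm once first
    if rng.random() < eps:
        return rng.randrange(len(counts))
    return max(range(len(counts)), key=lambda a: sums[a] / counts[a])


def ucb1(counts, sums, t, rng):
    """Pick the arm with the highest mean plus confidence bonus (UCB1)."""
    if 0 in counts:
        return counts.index(0)
    return max(
        range(len(counts)),
        key=lambda a: sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )


def thompson(counts, sums, t, rng):
    """Sample each arm's success rate from a Beta posterior and pick the max."""
    return max(
        range(len(counts)),
        key=lambda a: rng.betavariate(sums[a] + 1, counts[a] - sums[a] + 1),
    )


arm_probs = [0.1, 0.9]  # hypothetical two-armed environment
rates = {
    policy.__name__: run_bandit(arm_probs, 2000, policy, random.Random(0))
    for policy in (epsilon_greedy, ucb1, thompson)
}
```

The `rates` dictionary holds one best-arm selection rate per baseline, the same kind of quantity as the 89.2% figure reported for Qwen3-4B.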

Title: Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production

Authors: Alexandre Galashov, Matt Jones, Rosemary Ke, Yuan Cao, Vaishnavh Nagarajan, Michael C. Mozer
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13879
Pdf URL: https://arxiv.org/pdf/2510.13879
Copy Paste: [[2510.13879]] Catch Your Breath: Adaptive Computation for Self-Paced Sequence Production(https://arxiv.org/abs/2510.13879)
Keywords: language model
Abstract: We explore a class of supervised training objectives that allow a language model to dynamically and autonomously scale the number of compute steps used for each input token. For any token, the model can request additional compute steps by emitting a <don't know> output. If the model is granted a delay, a specialized token is inserted at the next input step, providing the model with additional compute resources to generate an output. The model can request multiple pauses. To train the model to use <don't know> outputs judiciously and to calibrate its uncertainty, we frame the selection of each output token as a sequential-decision problem with a time cost. We refer to the class of methods as Catch Your Breath (CYB) losses and we study three methods in this class: CYB-AP frames the model's task as anytime prediction, where an output may be required at any step and accuracy is discounted over time; CYB-VA is a variational approach that aims to maximize prediction accuracy subject to a specified distribution over stopping times; and CYB-DP imposes a penalty based on a computational budget. Through fine-tuning experiments, we identify the best performing loss variant. The CYB model needs only one third as much training data as the baseline (no pause) model needs to achieve the same performance, and half as much data as a model with pauses and a cross-entropy loss. We find that the CYB model requests additional steps when doing so improves accuracy, and the model adapts its processing time to token-level complexity and context. For example, it often pauses after plural nouns like "patients" and "challenges" but never pauses after the first token of contracted words like "wasn" and "didn", and it shows high variability for ambiguous tokens like "won", which could function as either a verb or part of a contraction.

Title: PAGE: Prompt Augmentation for text Generation Enhancement

Authors: Mauro Jose Pacchiotti, Luciana Ballejos, Mariel Ale
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13880
Pdf URL: https://arxiv.org/pdf/2510.13880
Copy Paste: [[2510.13880]] PAGE: Prompt Augmentation for text Generation Enhancement(https://arxiv.org/abs/2510.13880)
Keywords: prompt
Abstract: In recent years, natural language generative models have shown outstanding performance in text generation tasks. However, when facing specific tasks or particular requirements, they may exhibit poor performance or require adjustments that demand large amounts of additional data. This work introduces PAGE (Prompt Augmentation for text Generation Enhancement), a framework designed to assist these models through the use of simple auxiliary modules. These modules, lightweight models such as classifiers or extractors, provide inferences from the input text. The output of these auxiliaries is then used to construct an enriched input that improves the quality and controllability of the generation. Unlike other generation-assistance approaches, PAGE does not require auxiliary generative models; instead, it proposes a simpler, modular architecture that is easy to adapt to different tasks. This paper presents the proposal, its components and architecture, and reports a proof of concept in the domain of requirements engineering, where an auxiliary module with a classifier is used to improve the quality of software requirements generation.

Title: Too Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation

Authors: Bolei Ma, Yong Cao, Indira Sen, Anna-Carolina Haensch, Frauke Kreuter, Barbara Plank, Daniel Hershcovich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13884
Pdf URL: https://arxiv.org/pdf/2510.13884
Copy Paste: [[2510.13884]] Too Open for Opinion? Embracing Open-Endedness in Large Language Models for Social Simulation(https://arxiv.org/abs/2510.13884)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly used to simulate public opinion and other social phenomena. Most current studies constrain these simulations to multiple-choice or short-answer formats for ease of scoring and comparison, but such closed designs overlook the inherently generative nature of LLMs. In this position paper, we argue that open-endedness, using free-form text that captures topics, viewpoints, and reasoning processes "in" LLMs, is essential for realistic social simulation. Drawing on decades of survey-methodology research and recent advances in NLP, we argue why this open-endedness is valuable in LLM social simulations, showing how it can improve measurement and design, support exploration of unanticipated views, and reduce researcher-imposed directive bias. It also captures expressiveness and individuality, aids in pretesting, and ultimately enhances methodological utility. We call for novel practices and evaluation frameworks that leverage rather than constrain the open-ended generative diversity of LLMs, creating synergies between NLP and social science.

Title: Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization

Authors: Ariel Kamen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13885
Pdf URL: https://arxiv.org/pdf/2510.13885
Copy Paste: [[2510.13885]] Order from Chaos: Comparative Study of Ten Leading LLMs on Unstructured Data Categorization(https://arxiv.org/abs/2510.13885)
Keywords: language model, gpt, llm, hallucination, prompt
Abstract: This study presents a comparative evaluation of ten state-of-the-art large language models (LLMs) applied to unstructured text categorization using the Interactive Advertising Bureau (IAB) 2.2 hierarchical taxonomy. The analysis employed a uniform dataset of 8,660 human-annotated samples and identical zero-shot prompts to ensure methodological consistency across all models. Evaluation metrics included four classic measures - accuracy, precision, recall, and F1-score - and three LLM-specific indicators: hallucination ratio, inflation ratio, and categorization cost. Results show that, despite their rapid advancement, contemporary LLMs achieve only moderate classic performance, with average scores of 34% accuracy, 42% precision, 45% recall, and 41% F1-score. Hallucination and inflation ratios reveal that models frequently overproduce categories relative to human annotators. Among the evaluated systems, Gemini 1.5/2.0 Flash and GPT 20B/120B offered the most favorable cost-to-performance balance, while GPT 120B demonstrated the lowest hallucination ratio. The findings suggest that scaling and architectural improvements alone do not ensure better categorization accuracy, as the task requires compressing rich unstructured text into a limited taxonomy - a process that challenges current model architectures. To address these limitations, a separate ensemble-based approach was developed and tested. The ensemble method, in which multiple LLMs act as independent experts, substantially improved accuracy, reduced inflation, and completely eliminated hallucinations. These results indicate that coordinated orchestration of models - rather than sheer scale - may represent the most effective path toward achieving or surpassing human-expert performance in large-scale text categorization.
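The evaluation measures listed above can be illustrated with a small multi-label example. The precision, recall, and F1 computations below are the standard micro-averaged definitions. The hallucination ratio (share of predicted labels that fall outside the taxonomy) and the inflation ratio (predicted labels per human-assigned label) are assumed definitions chosen for this sketch, since the abstract does not give the study's exact formulas; the taxonomy and labels are toy data.

```python
def categorization_metrics(predicted, gold, taxonomy):
    """Micro-averaged metrics for multi-label categorization.

    predicted, gold: lists of label sets, one pair per document
                     (each list must contain at least one label overall).
    taxonomy: set of valid category names.
    """
    true_pos = sum(len(p & g) for p, g in zip(predicted, gold))
    n_pred = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred
    recall = true_pos / n_gold
    f1 = 2 * precision * recall / (precision + recall) if true_pos else 0.0
    # Assumed definition: predicted labels that do not exist in the taxonomy.
    hallucination = sum(len(p - taxonomy) for p in predicted) / n_pred
    # Assumed definition: how many labels the model emits per human label.
    inflation = n_pred / n_gold
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "hallucination_ratio": hallucination,
        "inflation_ratio": inflation,
    }


toy_taxonomy = {"Sports", "Travel", "Food"}
toy_predicted = [{"Sports", "Travel"}, {"Food", "Gadgets"}]
toy_gold = [{"Sports"}, {"Food"}]
metrics = categorization_metrics(toy_predicted, toy_gold, toy_taxonomy)
```

Here the model finds every human label (recall 1.0) but emits twice as many labels as the annotators (inflation 2.0), one of which ("Gadgets") is not in the taxonomy.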

Title: Reliable Fine-Grained Evaluation of Natural Language Math Proofs

Authors: Wenjie Ma, Andrei Cojocaru, Neel Kolhe, Bradley Louie, Robin Said Sharif, Haihan Zhang, Vincent Zhuang, Matei Zaharia, Sewon Min
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13888
Pdf URL: https://arxiv.org/pdf/2510.13888
Copy Paste: [[2510.13888]] Reliable Fine-Grained Evaluation of Natural Language Math Proofs(https://arxiv.org/abs/2510.13888)
Keywords: language model, llm
Abstract: Recent advances in large language models (LLMs) for mathematical reasoning have largely focused on tasks with easily verifiable final answers; however, generating and verifying natural language math proofs remains an open challenge. We identify the absence of a reliable, fine-grained evaluator for LLM-generated math proofs as a critical gap. To address this, we propose a systematic methodology for developing and validating evaluators that assign fine-grained scores on a 0-7 scale to model-generated math proofs. To enable this study, we introduce ProofBench, the first expert-annotated dataset of fine-grained proof ratings, spanning 145 problems from six major math competitions (USAMO, IMO, Putnam, etc.) and 435 LLM-generated solutions from Gemini-2.5-pro, o3, and DeepSeek-R1. Using ProofBench as a testbed, we systematically explore the evaluator design space across key axes: the backbone model, input context, instructions and evaluation workflow. Our analysis delivers ProofGrader, an evaluator that combines a strong reasoning backbone LM, rich context from reference solutions and marking schemes, and a simple ensembling method; it achieves a low Mean Absolute Error (MAE) of 0.926 against expert scores, significantly outperforming naive baselines. Finally, we demonstrate its practical utility in a best-of-n selection task: at n = 16, ProofGrader achieves an average score of 4.14 (out of 7), closing 78% of the gap between a naive binary evaluator (2.48) and the human oracle (4.62), highlighting its potential to advance downstream proof generation.
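The two headline numbers above can be checked with a few lines of arithmetic. The sketch below computes a mean absolute error on made-up scores (the toy ratings are not from ProofBench) and reproduces the gap-closing figure from the three averages quoted in the abstract; the function names are illustrative.

```python
def mean_absolute_error(predicted, expert):
    """Average absolute difference between evaluator and expert scores."""
    return sum(abs(p - e) for p, e in zip(predicted, expert)) / len(expert)


def gap_closed(score, baseline, oracle):
    """Fraction of the baseline-to-oracle gap recovered by `score`."""
    return (score - baseline) / (oracle - baseline)


# Toy 0-7 ratings, invented for illustration only.
toy_mae = mean_absolute_error([7, 3, 0], [6, 3, 2])

# Averages quoted in the abstract for best-of-n selection at n = 16.
closed = gap_closed(score=4.14, baseline=2.48, oracle=4.62)
closed_percent = round(closed * 100)
```

The gap computation is (4.14 - 2.48) / (4.62 - 2.48) = 1.66 / 2.14, which rounds to the 78% stated in the abstract.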

Title: A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness

Authors: Fali Wang, Jihai Chen, Shuhua Yang, Ali Al-Lawati, Linli Tang, Hui Liu, Suhang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13890
Pdf URL: https://arxiv.org/pdf/2510.13890
Copy Paste: [[2510.13890]] A Survey on Collaborating Small and Large Language Models for Performance, Cost-effectiveness, Cloud-edge Privacy, and Trustworthiness(https://arxiv.org/abs/2510.13890)
Keywords: language model, llm
Abstract: Large language models (LLMs) have advanced many domains and applications but face high fine-tuning costs, inference latency, limited edge deployability, and reliability concerns. Small language models (SLMs), compact, efficient, and adaptable, offer complementary remedies. Recent work explores collaborative frameworks that fuse SLMs' specialization and efficiency with LLMs' generalization and reasoning to meet diverse objectives across tasks and deployment scenarios. Motivated by these developments, this paper presents a systematic survey of SLM-LLM collaboration organized by collaboration objectives. We propose a taxonomy with four goals: performance enhancement, cost-effectiveness, cloud-edge privacy, and trustworthiness. Within this framework, we review representative methods, summarize design paradigms, and outline open challenges and future directions toward efficient, secure, and scalable SLM-LLM collaboration.

Title: The Harder The Better: Maintaining Supervised Fine-tuning Generalization with Less but Harder Data

Authors: Zhaoyang Shang, Sibo Wei, Jianbin Guo, Rui Zhou, Lifeng Dong, Yin Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13892
Pdf URL: https://arxiv.org/pdf/2510.13892
Copy Paste: [[2510.13892]] The Harder The Better: Maintaining Supervised Fine-tuning Generalization with Less but Harder Data(https://arxiv.org/abs/2510.13892)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) excel in general tasks, but adapting them to specialized domains relies on high-quality supervised fine-tuning (SFT) data. Although existing methods can identify subsets of high-quality data and reduce training cost to some extent, their selection process still suffers from over-reliance on LLMs' internal knowledge, weak interpretability, and limited generalization. To address these limitations, we propose THTB (The Harder The Better), a cognitive science-inspired framework for instruction data selection and annotation guidance. THTB prioritizes higher-level cognitive instructions by combining quality filtering with intrinsic and extrinsic hardness scoring, offering interpretable and quantifiable criteria for efficient SFT, both in data selection and annotation guidance. Experiments show that THTB enables models trained on only 5% of the data to outperform full-dataset training, while achieving superior generalization compared with LLM-only selection. In addition, THTB provides effective annotation guidance in vertical domains, enabling a model trained on just 2% of the data to surpass models trained on much larger datasets, demonstrating strong potential for domain adaptation. Our code, datasets, and models are available on this https URL.

Title: Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection

Authors: Olga E. Sorokoletova, Francesco Giarrusso, Vincenzo Suriani, Daniele Nardi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13893
Pdf URL: https://arxiv.org/pdf/2510.13893
Copy Paste: [[2510.13893]] Guarding the Guardrails: A Taxonomy-Driven Approach to Jailbreak Detection(https://arxiv.org/abs/2510.13893)
Keywords: language model, llm, prompt
Abstract: Jailbreaking techniques pose a significant threat to the safety of Large Language Models (LLMs). Existing defenses typically focus on single-turn attacks, lack coverage across languages, and rely on limited taxonomies that either fail to capture the full diversity of attack strategies or emphasize risk categories rather than the jailbreaking techniques. To advance the understanding of the effectiveness of jailbreaking techniques, we conducted a structured red-teaming challenge. The outcomes of our experiments are manifold. First, we developed a comprehensive hierarchical taxonomy of 50 jailbreak strategies, consolidating and extending prior classifications into seven broad families: impersonation, persuasion, privilege escalation, cognitive overload, obfuscation, goal conflict, and data poisoning. Second, we analyzed the data collected from the challenge to examine the prevalence and success rates of different attack types, providing insights into how specific jailbreak strategies exploit model vulnerabilities and induce misalignment. Third, we benchmarked a popular LLM for jailbreak detection, evaluating the benefits of taxonomy-guided prompting for improving automatic detection. Finally, we compiled a new Italian dataset of 1364 multi-turn adversarial dialogues, annotated with our taxonomy, enabling the study of interactions where adversarial intent emerges gradually and succeeds in bypassing traditional safeguards.

Title: Attribution Quality in AI-Generated Content:Benchmarking Style Embeddings and LLM Judges

Authors: Misam Abbas
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13898
Pdf URL: https://arxiv.org/pdf/2510.13898
Copy Paste: [[2510.13898]] Attribution Quality in AI-Generated Content:Benchmarking Style Embeddings and LLM Judges(https://arxiv.org/abs/2510.13898)
Keywords: language model, gpt, llm, prompt
Abstract: Attributing authorship in the era of large language models (LLMs) is increasingly challenging as machine-generated prose rivals human writing. We benchmark two complementary attribution mechanisms, fixed Style Embeddings and an instruction-tuned LLM judge (GPT-4o), on the Human AI Parallel Corpus, an open dataset of 600 balanced instances spanning six domains (academic, news, fiction, blogs, spoken transcripts, and TV/movie scripts). Each instance contains a human prompt with both a gold continuation and an LLM-generated continuation from either GPT-4o or LLaMA-70B-Instruct. The Style Embedding baseline achieves stronger aggregate accuracy on GPT continuations (82% vs. 68%). The LLM judge is slightly better than the Style Embeddings on LLaMA continuations (85% vs. 81%), but the difference is not statistically significant. Crucially, the LLM judge significantly outperforms the embeddings in fiction and academic prose, indicating semantic sensitivity, whereas embeddings dominate in spoken and scripted dialogue, reflecting structural strengths. These complementary patterns highlight attribution as a multidimensional problem requiring hybrid strategies. To support reproducibility we provide code on GitHub and derived data on Hugging Face under the MIT license. This open framework provides a reproducible benchmark for attribution quality assessment in AI-generated content, along with a review of related literature influencing this work.

Title: Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences

Authors: Julian Minder, Clément Dumas, Stewart Slocum, Helena Casademunt, Cameron Holmes, Robert West, Neel Nanda
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13900
Pdf URL: https://arxiv.org/pdf/2510.13900
Copy Paste: [[2510.13900]] Narrow Finetuning Leaves Clearly Readable Traces in Activation Differences(https://arxiv.org/abs/2510.13900)
Keywords: language model, llm, prompt, chat, agent
Abstract: Finetuning on narrow domains has become an essential tool to adapt Large Language Models (LLMs) to specific tasks and to create models with known unusual properties that are useful for research. We show that narrow finetuning creates strong biases in LLM activations that can be interpreted to understand the finetuning domain. These biases can be discovered using simple tools from model diffing, the study of differences between models before and after finetuning. In particular, analyzing activation differences on the first few tokens of random text and steering by adding this difference to the model activations produces text similar to the format and general content of the finetuning data. We demonstrate that these analyses contain crucial information by creating an LLM-based interpretability agent to understand the finetuning domain. With access to the bias, the agent performs significantly better than baseline agents using simple prompting. Our analysis spans synthetic document finetuning for false facts, emergent misalignment, subliminal learning, and taboo word guessing game models across different architectures (Gemma, LLaMA, Qwen) and scales (1B to 32B parameters). We suspect these biases reflect overfitting and find that mixing pretraining data into the finetuning corpus largely removes them, though residual risks may remain. Our work (1) demonstrates that narrowly finetuned models have salient traces of their training objective in their activations and suggests ways to improve how they are trained, (2) warns AI safety and interpretability researchers that the common practice of using such models as a proxy for studying broader finetuning (e.g., chat-tuning) might not be realistic, and (3) highlights the need for deeper investigation into the effects of narrow finetuning and development of truly realistic case studies for model-diffing, safety and interpretability research.

Title: RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs

Authors: Tuan T. Nguyen, John Le, Thai T. Vu, Willy Susilo, Heath Cooper
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13901
Pdf URL: https://arxiv.org/pdf/2510.13901
Copy Paste: [[2510.13901]] RAID: Refusal-Aware and Integrated Decoding for Jailbreaking LLMs(https://arxiv.org/abs/2510.13901)
Keywords: language model, llm
Abstract: Large language models (LLMs) achieve impressive performance across diverse tasks yet remain vulnerable to jailbreak attacks that bypass safety mechanisms. We present RAID (Refusal-Aware and Integrated Decoding), a framework that systematically probes these weaknesses by crafting adversarial suffixes that induce restricted content while preserving fluency. RAID relaxes discrete tokens into continuous embeddings and optimizes them with a joint objective that (i) encourages restricted responses, (ii) incorporates a refusal-aware regularizer to steer activations away from refusal directions in embedding space, and (iii) applies a coherence term to maintain semantic plausibility and non-redundancy. After optimization, a critic-guided decoding procedure maps embeddings back to tokens by balancing embedding affinity with language-model likelihood. This integration yields suffixes that are both effective in bypassing defenses and natural in form. Experiments on multiple open-source LLMs show that RAID achieves higher attack success rates with fewer queries and lower computational cost than recent white-box and black-box baselines. These findings highlight the importance of embedding-space regularization for understanding and mitigating LLM jailbreak vulnerabilities.

Title: Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory

Authors: Nicole Smith-Vaniz, Harper Lyon, Lorraine Steigner, Ben Armstrong, Nicholas Mattei
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2510.13902
Pdf URL: https://arxiv.org/pdf/2510.13902
Copy Paste: [[2510.13902]] Investigating Political and Demographic Associations in Large Language Models Through Moral Foundations Theory(https://arxiv.org/abs/2510.13902)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have become increasingly incorporated into everyday life for many internet users, taking on significant roles as advice givers in the domains of medicine, personal relationships, and even legal matters. The importance of these roles raises questions about what responses LLMs give, and how, in difficult political and moral domains, especially questions about possible biases. To quantify the nature of potential biases in LLMs, various works have applied Moral Foundations Theory (MFT), a framework that categorizes human moral reasoning into five dimensions: Harm, Fairness, Ingroup Loyalty, Authority, and Purity. Previous research has used the MFT to measure differences in human participants along political, national, and cultural lines. While there has been some analysis of the responses of LLMs with respect to political stance in role-playing scenarios, no work so far has directly assessed the moral leanings in LLM responses, nor has any connected LLM outputs with robust human data. In this paper we analyze the distinctions between LLM MFT responses and existing human research directly, investigating whether commonly available LLM responses demonstrate ideological leanings: either through their inherent responses, straightforward representations of political ideologies, or when responding from the perspectives of constructed human personas. We assess whether LLMs inherently generate responses that align more closely with one political ideology over another, and additionally examine how accurately LLMs can represent ideological perspectives through both explicit prompting and demographic-based role-playing. By systematically analyzing LLM behavior across these conditions and experiments, our study provides insight into the extent of political and demographic dependency in AI-generated responses.

Title: Schema for In-Context Learning

Authors: Pan Chen, Shaohong Chen, Mark Wang, Shi Xuan Leong, Priscilla Fung, Varinia Bernales, Alan Aspuru-Guzik
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13905
Pdf URL: https://arxiv.org/pdf/2510.13905
Copy Paste: [[2510.13905]] Schema for In-Context Learning(https://arxiv.org/abs/2510.13905)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: In-Context Learning (ICL) enables transformer-based language models to adapt to new tasks by conditioning on demonstration examples. However, traditional example-driven in-context learning lacks explicit modules for knowledge retrieval and transfer at the abstraction level. Inspired by cognitive science, specifically schema theory, which holds that humans interpret new information by activating pre-existing mental frameworks (schemas) to structure understanding, we introduce SCHEMA ACTIVATED IN CONTEXT LEARNING (SA-ICL). This framework extracts the representation of the building blocks of cognition for the reasoning process instilled from prior examples, creating an abstracted schema, a lightweight, structured template of key inferential steps and their relationships, which is then used to augment a model's reasoning process when presented with a novel question. We demonstrate that a broad range of large language models (LLMs) lack the capacity to form and utilize internal schema-based learning representations implicitly, but instead benefit significantly from explicit schema-based scaffolding. Across chemistry and physics questions from the GPQA dataset, our experiments show that SA-ICL consistently boosts performance, up to 36.19 percent, when the single demonstration example is of high quality, which simultaneously reduces reliance on the number of demonstrations and enhances interpretability. SCHEMA ACTIVATED IN CONTEXT LEARNING not only bridges disparate ICL strategies ranging from pattern priming to Chain-of-Thought prompting, but also paves a new path for enhancing human-like reasoning in LLMs.

Title: LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization

Authors: Yuanchen Wu, Saurabh Verma, Justin Lee, Fangzhou Xiong, Poppy Zhang, Amel Awadelkarim, Xu Chen, Yubai Yuan, Shawndra Hill
Subjects: cs.CL, stat.ML
Abstract URL: https://arxiv.org/abs/2510.13907
Pdf URL: https://arxiv.org/pdf/2510.13907
Copy Paste: [[2510.13907]] LLM Prompt Duel Optimizer: Efficient Label-Free Prompt Optimization(https://arxiv.org/abs/2510.13907)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are highly sensitive to their input prompts, making prompt design a central challenge. While automatic prompt optimization (APO) reduces manual engineering, most approaches assume access to ground-truth references such as labeled validation data. In practice, however, collecting high-quality labels is costly and slow. We propose the Prompt Duel Optimizer (PDO), a sample-efficient framework for label-free prompt optimization. PDO formulates the problem as a dueling-bandit setting, where the supervision signal comes from pairwise preference feedback provided by an LLM judge. The framework combines Double Thompson Sampling (D-TS), which prioritizes informative prompt comparisons, with Top-Performer Guided Mutation, which expands the candidate pool by mutating high-performing prompts. PDO naturally operates in label-free settings and can also incorporate partial labels to mitigate judge noise. Experiments on BIG-bench Hard (BBH) and MS MARCO show that PDO consistently outperforms baseline methods. Ablation studies further demonstrate the effectiveness of both D-TS and prompt mutation.
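To make the dueling-bandit framing concrete, here is a minimal sketch of choosing which two prompts to compare by Thompson sampling over pairwise win counts. It is a simplified illustration, not the paper's D-TS procedure, and it omits Top-Performer Guided Mutation entirely. The function names, the Beta(1, 1) prior, and the simulated Bradley-Terry style judge standing in for an LLM judge are all assumptions made for the example.

```python
import random


def select_duel(wins, rng):
    """Pick two distinct prompt indices to compare next.

    wins[i][j] counts how often prompt i was preferred over prompt j.
    """
    k = len(wins)
    # Sample a plausible win probability for each pair from a Beta posterior.
    theta = [[0.5] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1, k):
            theta[i][j] = rng.betavariate(wins[i][j] + 1, wins[j][i] + 1)
            theta[j][i] = 1.0 - theta[i][j]
    # First prompt: the one that beats the most rivals under the sample.
    victories = [sum(theta[i][j] > 0.5 for j in range(k) if j != i)
                 for i in range(k)]
    first = max(range(k), key=lambda i: victories[i])
    # Second prompt: the rival with the best fresh sample against the first.
    second = max(
        (j for j in range(k) if j != first),
        key=lambda j: rng.betavariate(wins[j][first] + 1, wins[first][j] + 1),
    )
    return first, second


def run_duels(true_quality, rounds=200, seed=0):
    """Simulate duels with a noisy judge (hypothetical stand-in for an LLM judge)."""
    rng = random.Random(seed)
    k = len(true_quality)
    wins = [[0] * k for _ in range(k)]
    for _ in range(rounds):
        a, b = select_duel(wins, rng)
        # Assumed preference model: a wins with probability q_a / (q_a + q_b).
        p_a = true_quality[a] / (true_quality[a] + true_quality[b])
        if rng.random() < p_a:
            wins[a][b] += 1
        else:
            wins[b][a] += 1
    return wins


if __name__ == "__main__":
    for row in run_duels([1.0, 2.0, 4.0]):
        print(row)
```

Sampling from the posterior rather than using the empirical win rates keeps some exploration alive for prompts that have been compared only a few times.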

Title: Interpreting the Latent Structure of Operator Precedence in Language Models

Authors: Dharunish Yugeswardeenoo, Harshil Nukala, Cole Blondin, Sean O Brien, Vasu Sharma, Kevin Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13908
Pdf URL: https://arxiv.org/pdf/2510.13908
Copy Paste: [[2510.13908]] Interpreting the Latent Structure of Operator Precedence in Language Models(https://arxiv.org/abs/2510.13908)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated impressive reasoning capabilities but continue to struggle with arithmetic tasks. Prior works largely focus on outputs or prompting strategies, leaving open the question of the internal structure through which models perform arithmetic computation. In this work, we investigate whether LLMs encode operator precedence in their internal representations via the open-source instruction-tuned LLaMA 3.2-3B model. We constructed a dataset of arithmetic expressions with three operands and two operators, varying the order and placement of parentheses. Using this dataset, we trace whether intermediate results appear in the residual stream of the instruction-tuned LLaMA 3.2-3B model. We apply interpretability techniques such as logit lens, linear classification probes, and UMAP geometric visualization. Our results show that intermediate computations are present in the residual stream, particularly after MLP blocks. We also find that the model linearly encodes precedence in each operator's embeddings after the attention layer. We introduce partial embedding swap, a technique that modifies operator precedence by exchanging high-impact embedding dimensions between operators.
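A dataset of the kind the abstract describes (three operands, two operators, varied parenthesis placement) might be generated along the following lines. This is only a sketch: the operator set, the fixed operands, the field names, and the three placement labels are assumptions for illustration, not details taken from the paper.

```python
import itertools

OPERATORS = ["+", "-", "*", "/"]  # assumed operator set


def build_expressions(operands=(2, 3, 4)):
    """Enumerate expressions over three operands with varied parentheses."""
    a, b, c = operands
    rows = []
    for op1, op2 in itertools.product(OPERATORS, repeat=2):
        variants = {
            "none": f"{a} {op1} {b} {op2} {c}",
            "left": f"({a} {op1} {b}) {op2} {c}",
            "right": f"{a} {op1} ({b} {op2} {c})",
        }
        for placement, text in variants.items():
            # eval is acceptable here because the strings are built above
            # from a fixed set of digits and operators, not from user input.
            rows.append({"expr": text, "parens": placement, "value": eval(text)})
    return rows


if __name__ == "__main__":
    for row in build_expressions()[:6]:
        print(row)
```

Pairs such as `2 + 3 * 4` and `(2 + 3) * 4` share tokens but differ in evaluation order, which is the contrast a precedence probe needs. Operands that make a divisor zero would have to be filtered out before calling `eval`.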

Title: Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning

Authors: Xingrui Zhuo, Jiapu Wang, Gongqing Wu, Zhongyuan Wang, Jichen Zhang, Shirui Pan, Xindong Wu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13909
Pdf URL: https://arxiv.org/pdf/2510.13909
Copy Paste: [[2510.13909]] Knowledge Reasoning Language Model: Unifying Knowledge and Language for Inductive Knowledge Graph Reasoning(https://arxiv.org/abs/2510.13909)
Keywords: language model, llm, hallucination
Abstract: Inductive Knowledge Graph Reasoning (KGR) aims to discover facts in open-domain KGs containing unknown entities and relations, which poses a challenge for KGR models in comprehending uncertain KG components. Existing studies have proposed Knowledge Graph Foundation Models (KGFMs) that learn structural invariances across KGs to handle this uncertainty. Recently, Large Language Models (LLMs) have demonstrated strong capabilities for open-domain knowledge reasoning. As a result, the latest research has focused on LLM-based KGFMs that integrate LLM knowledge with KG context for inductive KGR. However, the intrinsic knowledge of LLMs may be overshadowed by sparse KG context, leading to LLM knowledge distortion, which can cause irreversible damage to model reasoning. Moreover, existing LLM-based KGR methods still struggle to fully constrain generative hallucinations in LLMs, severely limiting the credibility of reasoning results. To address these limitations, we propose a Knowledge Reasoning Language Model (KRLM) that achieves unified coordination between LLM knowledge and KG context throughout the KGR process. Specifically, we design a Knowledge Reasoning Language (KRL) instruction format and a KRL tokenizer to align LLM knowledge with KG representations. Then, we propose a KRL attention layer that coordinates intrinsic LLM knowledge with additional KG context through a dynamic knowledge memory mechanism. Finally, a structure-aware next-entity predictor is proposed, which strictly constrains the reasoning results within a trustworthy knowledge domain. Extensive experimental results on 25 real-world inductive KGR datasets demonstrate the significant superiority of the proposed KRLM in both zero-shot reasoning and fine-tuning scenarios. Our source codes are available at this https URL.

Title: RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

Authors: Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13910
Pdf URL: https://arxiv.org/pdf/2510.13910
Copy Paste: [[2510.13910]] RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems(https://arxiv.org/abs/2510.13910)
Keywords: language model, llm, hallucination, retrieval augmented generation, retrieval-augmented generation, agent
Abstract: Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.

Title: AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs

Authors: María Victoria Carro, Denise Alejandra Mester, Facundo Nieto, Oscar Agustín Stanchi, Guido Ernesto Bergman, Mario Alejandro Leiva, Eitan Sprejer, Luca Nicolás Forziati Gangi, Francisca Gauna Selasco, Juan Gustavo Corvalán, Gerardo I. Simari, María Vanina Martinez
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13912
Pdf URL: https://arxiv.org/pdf/2510.13912
Copy Paste: [[2510.13912]] AI Debaters are More Persuasive when Arguing in Alignment with Their Own Beliefs(https://arxiv.org/abs/2510.13912)
Keywords: language model
Abstract: The core premise of AI debate as a scalable oversight technique is that it is harder to lie convincingly than to refute a lie, enabling the judge to identify the correct position. Yet, existing debate experiments have relied on datasets with ground truth, where lying is reduced to defending an incorrect proposition. This overlooks a subjective dimension: lying also requires the belief that the claim defended is false. In this work, we apply debate to subjective questions and explicitly measure large language models' prior beliefs before experiments. Debaters were asked to select their preferred position, then presented with a judge persona deliberately designed to conflict with their identified priors. This setup tested whether models would adopt sycophantic strategies, aligning with the judge's presumed perspective to maximize persuasiveness, or remain faithful to their prior beliefs. We implemented and compared two debate protocols, sequential and simultaneous, to evaluate potential systematic biases. Finally, we assessed whether models were more persuasive and produced higher-quality arguments when defending positions consistent with their prior beliefs versus when arguing against them. Our main findings show that models tend to prefer defending stances aligned with the judge persona rather than their prior beliefs, sequential debate introduces significant bias favoring the second debater, models are more persuasive when defending positions aligned with their prior beliefs, and paradoxically, arguments misaligned with prior beliefs are rated as higher quality in pairwise comparison. These results can inform human judges to provide higher-quality training signals and contribute to more aligned AI systems, while revealing important aspects of human-AI interaction regarding persuasion dynamics in language models.

Title: Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms

Authors: Shrey Pandit, Xuan-Phi Nguyen, Yifei Ming, Austin Xu, Jiayu Wang, Caiming Xiong, Shafiq Joty
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13913
Pdf URL: https://arxiv.org/pdf/2510.13913
Copy Paste: [[2510.13913]] Synthesizing Agentic Data for Web Agents with Progressive Difficulty Enhancement Mechanisms(https://arxiv.org/abs/2510.13913)
Keywords: language model, agent
Abstract: Web-based 'deep research' agents aim to solve complex question-answering tasks through long-horizon interactions with online tools. These tasks remain challenging, as the underlying language models are often not optimized for long-horizon reasoning and exploration. Prior work has proposed workflows for constructing instruction-tuning datasets, often leveraging knowledge graphs. However, such methods typically lack fine-grained control over difficulty and quality, yielding synthetic data that falls short of capturing the complexity required for long-horizon reasoning. Furthermore, many studies conflate data and training effects by comparing models trained under different optimization recipes, making it difficult to isolate and evaluate the effectiveness of the data itself. We introduce a two-pronged data synthesis pipeline that generates question-answer pairs by progressively increasing task complexity until a frontier baseline web agent fails. The baseline agent plays multiple roles in this process: attempting the questions, validating factuality, checking for alternative answers, and enforcing filtering. To evaluate the effectiveness of our synthesis methods, we adopt a controlled training setup based on distillation from strong web agents. Experiments across multiple web-based benchmarks show that our dataset, despite being smaller, enables the training of more effective web agents than existing datasets. In particular, our data exhibits twice the diversity in tool-use actions, allowing models trained on it to achieve stronger performance while avoiding repetitive tool-calling behaviors.
摘要：基于网络的“深度研究”代理旨在通过与在线工具的长期交互来解决复杂的问答任务。这些任务仍然具有挑战性，因为底层语言模型通常没有针对长期推理和探索进行优化。先前的工作提出了构建指令调整数据集的工作流程，通常利用知识图。然而，此类方法通常缺乏对难度和质量的细粒度控制，产生的合成数据无法捕获长期推理所需的复杂性。此外，许多研究通过比较在不同优化方案下训练的模型来将数据和训练效果混为一谈，这使得隔离和评估数据本身的有效性变得困难。我们引入了一个双管齐下的数据合成管道，通过逐渐增加任务复杂性直到前沿基线网络代理失败来生成问题-答案对。基线代理在此过程中扮演多种角色：尝试提出问题、验证事实、检查替代答案以及强制过滤。为了评估我们的综合方法的有效性，我们采用了基于强大网络代理的蒸馏的受控训练设置。跨多个基于网络的基准测试的实验表明，我们的数据集尽管较小，但能够比现有数据集训练更有效的网络代理。特别是，我们的数据在工具使用操作方面表现出两倍的多样性，允许在其上训练的模型获得更强的性能，同时避免重复的工具调用行为。

Title: Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models

Authors: Ivan Lee, Taylor Berg-Kirkpatrick
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13915
Pdf URL: https://arxiv.org/pdf/2510.13915
Copy Paste: [[2510.13915]] Readability $\ne$ Learnability: Rethinking the Role of Simplicity in Training Small Language Models(https://arxiv.org/abs/2510.13915)
Keywords: language model
Abstract: Recent studies suggest that very small language models (SLMs) can generate surprisingly coherent text when trained on simplified, child-directed corpora such as TinyStories. These findings have been interpreted as evidence that readability -- characterized by accessible vocabulary, familiar narrative structure, and simple syntax -- plays a key role in enabling such capabilities to emerge. In this paper, we challenge that interpretation. We construct synthetic datasets with matched structure but varied readability, and find that readability alone does not predict coherence or learning efficiency in SLMs. Models trained on complex, adult-level text perform comparably to those trained on simplified language, and even exhibit faster development of coherence during training. Instead, we show that statistical simplicity, as measured by n-gram diversity, is a stronger predictor of learnability. Our findings caution against the growing trend of anthropomorphizing language model training -- drawing parallels to human cognitive development without empirical basis -- and argue for more precise reasoning about what properties actually support capability emergence in small models.
摘要：最近的研究表明，当在简化的、面向儿童的语料库（例如 TinyStories）上进行训练时，非常小的语言模型（SLM）可以生成令人惊讶的连贯文本。这些发现被解释为证据，证明可读性（以易于理解的词汇、熟悉的叙述结构和简单的语法为特征）在实现此类能力方面发挥着关键作用。在本文中，我们对这种解释提出了质疑。我们构建了结构匹配但可读性不同的合成数据集，并发现仅可读性并不能预测 SLM 中的连贯性或学习效率。在复杂的成人水平文本上训练的模型的表现与在简化语言上训练的模型相当，甚至在训练过程中表现出更快的连贯性。相反，我们表明，通过 n 元语法多样性来衡量的统计简单性是可学习性的更强预测因素。我们的研究结果警告人们不要将语言模型训练拟人化——在没有经验基础的情况下与人类认知发展进行比较——并主张更精确地推理哪些属性实际上支持小模型中的能力出现。
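
The abstract names n-gram diversity as its measure of statistical simplicity but does not define it. A minimal sketch, assuming the common distinct-n ratio (unique n-grams divided by total n-grams); the function name and toy corpora are hypothetical and not taken from the paper:

```python
def ngram_diversity(tokens, n):
    """Ratio of distinct n-grams to total n-grams (assumed definition).

    Lower values mean more repetition, i.e. statistically simpler text.
    """
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)


# Toy corpora (made up for illustration only).
repetitive = "the cat sat . the cat sat . the cat sat .".split()
varied = "a quick fox leapt over nine lazy dogs near the old mill".split()

repetitive_score = ngram_diversity(repetitive, 2)
varied_score = ngram_diversity(varied, 2)
print(repetitive_score < varied_score)
```

Under this assumed definition, a corpus can use adult-level vocabulary and still score low if it keeps reusing the same n-grams, which is the distinction between readability and statistical simplicity that the abstract draws.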

Title: Element2Vec: Build Chemical Element Representation from Text for Property Prediction

Authors: Yuanhao Li, Keyuan Lai, Tianqi Wang, Qihao Liu, Jiawei Ma, Yuan-Chao Hu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13916
Pdf URL: https://arxiv.org/pdf/2510.13916
Copy Paste: [[2510.13916]] Element2Vec: Build Chemical Element Representation from Text for Property Prediction(https://arxiv.org/abs/2510.13916)
Keywords: language model, hallucination
Abstract: Accurate property data for chemical elements is crucial for materials design and manufacturing, but many of them are difficult to measure directly due to equipment constraints. While traditional methods use the properties of other elements or related properties for prediction via numerical analyses, they often fail to model complex relationships. After all, not all characteristics can be represented as scalars. Recent efforts have been made to explore advanced AI tools such as language models for property estimation, but they still suffer from hallucinations and a lack of interpretability. In this paper, we investigate Element2Vec to effectively represent chemical elements from natural languages to support research in the natural sciences. Given the text parsed from Wikipedia pages, we use language models to generate both a single general-purpose embedding (Global) and a set of attribute-highlighted vectors (Local). Despite the complicated relationship across elements, the computational challenges also exist because of 1) the discrepancy in text distribution between common descriptions and specialized scientific texts, and 2) the extremely limited data, i.e., with only 118 known elements, data for specific properties is often highly sparse and incomplete. Thus, we also design a test-time training method based on self-attention to mitigate the prediction error caused by Vanilla regression clearly. We hope this work could pave the way for advancing AI-driven discovery in materials science.
摘要：化学元素的准确属性数据对于材料设计和制造至关重要，但由于设备限制，其中许多元素很难直接测量。虽然传统方法通过数值分析使用其他元素的属性或相关属性进行预测，但它们通常无法对复杂的关系进行建模。毕竟，并非所有特征都可以表示为标量。最近人们努力探索先进的人工智能工具，例如用于财产评估的语言模型，但它们仍然存在幻觉和缺乏可解释性。在本文中，我们研究 Element2Vecto 有效地表示自然语言中的化学元素，以支持自然科学研究。给定从维基百科页面解析的文本，我们使用语言模型生成单个通用嵌入（全局）和一组属性突出显示向量（本地）。尽管元素之间存在复杂的关系，但计算挑战也存在，因为 1) 常见描述和专门科学文本之间的文本分布差异，2) 数据极其有限，即只有 118 个已知元素，特定属性的数据通常非常稀疏和不完整。因此，我们还设计了一种基于自注意力的测试时训练方法，以明显减轻 Vanilla 回归造成的预测误差。我们希望这项工作能够为推进人工智能驱动的材料科学发现铺平道路。

Title: Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling

Authors: Peng Kuang, Yanli Wang, Xiaoyu Han, Yaowenqi Liu, Kaidi Xu, Haohan Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13918
Pdf URL: https://arxiv.org/pdf/2510.13918
Copy Paste: [[2510.13918]] Optimal Aggregation of LLM and PRM Signals for Efficient Test-Time Scaling(https://arxiv.org/abs/2510.13918)
Keywords: language model, llm
Abstract: Process reward models (PRMs) are a cornerstone of test-time scaling (TTS), designed to verify and select the best responses from large language models (LLMs). However, this promise is challenged by recent benchmarks where simple majority voting, which ignores PRM signals, occasionally outperforms standard PRM-based selection. This raises a critical question: How can we effectively utilize verification signals from PRMs for TTS? To address this, we start by developing a theoretical framework for optimally combining signals from both the LLM and the PRM. Our framework reveals that the optimal strategy is a weighted aggregation of responses, a strategy whose effectiveness hinges on estimating weights that capture the complex interplay between the models. Based on our theoretical results, we empirically show that these optimal weighting functions differ significantly across LLM-PRM pairs and, notably, often assign substantial negative weights. Motivated by these insights, we propose efficient pre-computation methods to calibrate these weighting functions. Extensive experiments across 5 LLMs and 7 PRMs demonstrate that our calibration method significantly boosts the TTS efficiency, surpassing the performance of vanilla weighted majority voting while using only $21.3\%$ of the computation. Ultimately, our work demonstrates that investing in a more intelligent aggregation strategy can be a more convincing path to performance gains than simply scaling test-time computation.
摘要：过程奖励模型 (PRM) 是测试时间扩展 (TTS) 的基石，旨在验证和选择大型语言模型 (LLM) 的最佳响应。然而，这一承诺受到最近基准的挑战，其中忽略 PRM 信号的简单多数投票有时会优于基于 PRM 的标准选择。这就提出了一个关键问题：我们如何有效利用 PRM 的验证信号进行 TTS？为了解决这个问题，我们首先开发一个理论框架，用于最佳地组合来自 LLM 和 PRM 的信号。我们的框架揭示了最佳策略是响应的加权聚合，该策略的有效性取决于估计捕获模型之间复杂相互作用的权重。根据我们的理论结果，我们凭经验表明，这些最佳权重函数在 LLM-PRM 对之间存在显着差异，并且值得注意的是，通常分配大量的负权重。受这些见解的启发，我们提出了有效的预计算方法来校准这些加权函数。跨 5 个法学硕士和 7 个 PRM 的广泛实验表明，我们的校准方法显着提高了 TTS 效率，超越了普通加权多数投票的性能，同时仅使用 21.3\%$ 的计算量。最终，我们的工作表明，投资更智能的聚合策略可能是比简单地扩展测试时间计算更令人信服的性能提升途径。
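
To make the idea of weighted aggregation with negative weights concrete, here is a minimal sketch. The linear weighting function `weight_from_prm`, its threshold, and its scale are hypothetical stand-ins; the abstract says the optimal weights are calibrated per LLM-PRM pair but gives no formula.

```python
from collections import defaultdict


def weighted_vote(answers, weights):
    """Return the answer with the largest summed weight."""
    totals = defaultdict(float)
    for answer, weight in zip(answers, weights):
        totals[answer] += weight
    return max(totals, key=totals.get)


def weight_from_prm(score, threshold=0.5, scale=2.0):
    """Hypothetical weighting function: PRM scores below the
    threshold yield negative weights, so a low-scored response
    counts as evidence against its own answer."""
    return scale * (score - threshold)


# Five sampled responses and their PRM scores (made-up numbers).
answers = ["A", "A", "B", "B", "B"]
prm_scores = [0.9, 0.8, 0.3, 0.2, 0.6]

plain_majority = weighted_vote(answers, [1.0] * len(answers))
prm_weighted = weighted_vote(answers, [weight_from_prm(s) for s in prm_scores])
print(plain_majority, prm_weighted)
```

In this toy case plain majority voting picks "B" (three votes to two), while the signed weights pick "A", because two of the "B" responses carry low PRM scores and subtract from that answer's total.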

Title: FACTS: Table Summarization via Offline Template Generation with Agentic Workflows

Authors: Ye Yuan, Mohammad Amin Shabani, Siqi Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13920
Pdf URL: https://arxiv.org/pdf/2510.13920
Copy Paste: [[2510.13920]] FACTS: Table Summarization via Offline Template Generation with Agentic Workflows(https://arxiv.org/abs/2510.13920)
Keywords: llm, prompt, agent
Abstract: Query-focused table summarization requires generating natural language summaries of tabular data conditioned on a user query, enabling users to access insights beyond fact retrieval. Existing approaches face key limitations: table-to-text models require costly fine-tuning and struggle with complex reasoning, prompt-based LLM methods suffer from token-limit and efficiency issues while exposing sensitive data, and prior agentic pipelines often rely on decomposition, planning, or manual templates that lack robustness and scalability. To mitigate these issues, we introduce an agentic workflow, FACTS, a Fast, Accurate, and Privacy-Compliant Table Summarization approach via Offline Template Generation. FACTS produces offline templates, consisting of SQL queries and Jinja2 templates, which can be rendered into natural language summaries and are reusable across multiple tables sharing the same schema. It enables fast summarization through reusable offline templates, accurate outputs with executable SQL queries, and privacy compliance by sending only table schemas to LLMs. Evaluations on widely-used benchmarks show that FACTS consistently outperforms baseline methods, establishing it as a practical solution for real-world query-focused table summarization.
摘要：以查询为中心的表摘要需要根据用户查询生成表格数据的自然语言摘要，使用户能够获得事实检索之外的见解。现有方法面临关键限制：表到文本模型需要昂贵的微调并难以进行复杂的推理，基于提示的 LLM 方法在暴露敏感数据时受到令牌限制和效率问题的困扰，而先前的代理管道通常依赖于缺乏稳健性和可扩展性的分解、规划或手动模板。为了缓解这些问题，我们引入了代理工作流程 FACTS，这是一种通过离线模板生成快速、准确且符合隐私的表摘要方法。 FACTS 生成离线模板，由 SQL 查询和 Jinja2 模板组成，这些模板可以呈现为自然语言摘要，并且可以在共享相同架构的多个表之间重用。它通过可重用的离线模板实现快速汇总，通过可执行 SQL 查询实现准确输出，并通过仅向法学硕士发送表模式来实现隐私合规性。对广泛使用的基准的评估表明，FACTS 的性能始终优于基线方法，使其成为现实世界中以查询为中心的表汇总的实用解决方案。
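
The offline-template idea (a SQL query plus a text template, reusable across tables that share a schema) can be sketched with the standard library alone. This is an illustration under stated assumptions, not the FACTS implementation: `string.Template` stands in for Jinja2 so the example has no third-party dependency, and the `sales` schema, the query, and the wording are all hypothetical.

```python
import sqlite3
from string import Template

# An "offline template": generated once from the schema alone,
# then reused on any table that shares that schema.
OFFLINE_TEMPLATE = {
    "sql": (
        "SELECT region, SUM(amount) AS total FROM sales "
        "GROUP BY region ORDER BY total DESC LIMIT 1"
    ),
    # string.Template is a stand-in for a Jinja2 template here.
    "text": Template("The top region is $region with total sales of $total."),
}


def summarize(conn, template):
    """Run the template's SQL locally and render the result as text."""
    region, total = conn.execute(template["sql"]).fetchone()
    return template["text"].substitute(region=region, total=total)


# Hypothetical table contents; the rows never leave the local process.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("North", 120), ("South", 200), ("North", 50)],
)
summary = summarize(conn, OFFLINE_TEMPLATE)
print(summary)
```

Because only the schema would be needed to write the query and the template, the row values stay local, which mirrors the privacy argument in the abstract.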

Title: An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation

Authors: Daniel Adu Worae, Spyridon Mastorakis
Subjects: cs.CL, cs.CR, cs.NI
Abstract URL: https://arxiv.org/abs/2510.13925
Pdf URL: https://arxiv.org/pdf/2510.13925
Copy Paste: [[2510.13925]] An LLM-Powered AI Agent Framework for Holistic IoT Traffic Interpretation(https://arxiv.org/abs/2510.13925)
Keywords: language model, llm, agent
Abstract: Internet of Things (IoT) networks generate diverse and high-volume traffic that reflects both normal activity and potential threats. Deriving meaningful insight from such telemetry requires cross-layer interpretation of behaviors, protocols, and context rather than isolated detection. This work presents an LLM-powered AI agent framework that converts raw packet captures into structured and semantically enriched representations for interactive analysis. The framework integrates feature extraction, transformer-based anomaly detection, packet and flow summarization, threat intelligence enrichment, and retrieval-augmented question answering. An AI agent guided by a large language model performs reasoning over the indexed traffic artifacts, assembling evidence to produce accurate and human-readable interpretations. Experimental evaluation on multiple IoT captures and six open models shows that hybrid retrieval, which combines lexical and semantic search with reranking, substantially improves BLEU, ROUGE, METEOR, and BERTScore results compared with dense-only retrieval. System profiling further indicates low CPU, GPU, and memory overhead, demonstrating that the framework achieves holistic and efficient interpretation of IoT network traffic.
摘要：物联网 (IoT) 网络会产生多样化的大流量流量，反映正常活动和潜在威胁。从此类遥测中获得有意义的洞察需要对行为、协议和上下文进行跨层解释，而不是孤立的检测。这项工作提出了一个由法学硕士支持的人工智能代理框架，该框架将原始数据包捕获转换为结构化且语义丰富的表示形式，以进行交互式分析。该框架集成了特征提取、基于变压器的异常检测、数据包和流摘要、威胁情报丰富以及检索增强问答。由大型语言模型引导的人工智能代理对索引的流量工件进行推理，收集证据以产生准确且人类可读的解释。对多个物联网捕获和六个开放模型的实验评估表明，与仅密集检索相比，将词汇和语义搜索与重新排名相结合的混合检索显着改善了 BLEU、ROUGE、METEOR 和 BERTScore 结果。系统分析进一步表明 CPU、GPU 和内存开销较低，表明该框架实现了对物联网网络流量的全面、高效的解释。

Title: BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs

Authors: Congying Liu, Xingyuan Wei, Peipei Liu, Yiqing Shen, Yanxu Mao, Tiehan Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13926
Pdf URL: https://arxiv.org/pdf/2510.13926
Copy Paste: [[2510.13926]] BioMedSearch: A Multi-Source Biomedical Retrieval Framework Based on LLMs(https://arxiv.org/abs/2510.13926)
Keywords: language model, llm
Abstract: Biomedical queries often rely on a deep understanding of specialized knowledge such as gene regulatory mechanisms and pathological processes of diseases. They require detailed analysis of complex physiological processes and effective integration of information from multiple data sources to support accurate retrieval and reasoning. Although large language models (LLMs) perform well in general reasoning tasks, their generated biomedical content often lacks scientific rigor due to the inability to access authoritative biomedical databases and frequently fabricates protein functions, interactions, and structural details that deviate from authentic information. Therefore, we present BioMedSearch, a multi-source biomedical information retrieval framework based on LLMs. The method integrates literature retrieval, protein database and web search access to support accurate and efficient handling of complex biomedical queries. Through sub-queries decomposition, keywords extraction, task graph construction, and multi-source information filtering, BioMedSearch generates high-quality question-answering results. To evaluate the accuracy of question answering, we constructed a multi-level dataset, BioMedMCQs, consisting of 3,000 questions. The dataset covers three levels of reasoning: mechanistic identification, non-adjacent semantic integration, and temporal causal reasoning, and is used to assess the performance of BioMedSearch and other methods on complex QA tasks. Experimental results demonstrate that BioMedSearch consistently improves accuracy over all baseline models across all levels. Specifically, at Level 1, the average accuracy increases from 59.1% to 91.9%; at Level 2, it rises from 47.0% to 81.0%; and at the most challenging Level 3, the average accuracy improves from 36.3% to 73.4%. The code and BioMedMCQs are available at: this https URL
摘要：生物医学查询通常依赖于对基因调控机制和疾病病理过程等专业知识的深入理解。它们需要对复杂的生理过程进行详细分析，并有效整合来自多个数据源的信息，以支持准确的检索和推理。尽管大型语言模型（LLM）在一般推理任务中表现良好，但由于无法访问权威的生物医学数据库，其生成的生物医学内容往往缺乏科学严谨性，并且经常伪造与真实信息不同的蛋白质功能、相互作用和结构细节。因此，我们提出了 BioMedSearch，一个基于法学硕士的多源生物医学信息检索框架。该方法集成了文献检索、蛋白质数据库和网络搜索访问，以支持准确有效地处理复杂的生物医学查询。 BioMedSearch通过子查询分解、关键词提取、任务图构建和多源信息过滤，生成高质量的问答结果。为了评估问答的准确性，我们构建了一个多级数据集 BioMedMCQs，其中包含 3,000 个问题。该数据集涵盖三个推理层次：机械识别、非相邻语义集成和时间因果推理，用于评估 BioMedSearch 和其他方法在复杂 QA 任务上的性能。实验结果表明，BioMedSearch 持续提高了所有级别的所有基线模型的准确性。具体来说，在1级时，平均准确率从59.1%提高到91.9%； 2级时，从47.0%上升至81.0%；在最具挑战性的 3 级，平均准确率从 36.3% 提高到 73.4%。代码和 BioMedMCQ 可在以下位置获取：此 https URL

Title: LLMs Can Get "Brain Rot"!

Authors: Shuo Xing, Junyuan Hong, Yifan Wang, Runjin Chen, Zhenyu Zhang, Ananth Grama, Zhengzhong Tu, Zhangyang Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13928
Pdf URL: https://arxiv.org/pdf/2510.13928
Copy Paste: [[2510.13928]] LLMs Can Get "Brain Rot"!(https://arxiv.org/abs/2510.13928)
Keywords: language model, llm
Abstract: We propose and test the LLM Brain Rot Hypothesis: continual exposure to junk web text induces lasting cognitive decline in large language models (LLMs). To causally isolate data quality, we run controlled experiments on real Twitter/X corpora, constructing junk and reversely controlled datasets via two orthogonal operationalizations: M1 (engagement degree) and M2 (semantic quality), with matched token scale and training operations across conditions. Contrary to the control group, continual pre-training of 4 LLMs on the junk dataset causes non-trivial declines (Hedges' $g>0.3$) on reasoning, long-context understanding, safety, and inflating "dark traits" (e.g., psychopathy, narcissism). The gradual mixtures of junk and control datasets also yield dose-response cognition decay: for example, under M1, ARC-Challenge with Chain Of Thoughts drops $74.9 \rightarrow 57.2$ and RULER-CWE $84.4 \rightarrow 52.3$ as junk ratio rises from $0\%$ to $100\%$. Error forensics reveal several key insights. First, we identify thought-skipping as the primary lesion: models increasingly truncate or skip reasoning chains, explaining most of the error growth. Second, partial but incomplete healing is observed: scaling instruction tuning and clean data pre-training improve the declined cognition yet cannot restore baseline capability, suggesting persistent representational drift rather than format mismatch. Finally, we discover that the popularity, a non-semantic metric, of a tweet is a better indicator of the Brain Rot effect than the length in M1. Together, the results provide significant, multi-perspective evidence that data quality is a causal driver of LLM capability decay, reframing curation for continual pretraining as a training-time safety problem and motivating routine "cognitive health checks" for deployed LLMs.
摘要：我们提出并测试了法学硕士大脑腐烂假说：持续接触垃圾网络文本会导致大语言模型（LLM）的持久认知能力下降。为了因果隔离数据质量，我们在真实的 Twitter/X 语料库上进行了受控实验，通过两个正交操作构建垃圾数据集和反向控制数据集：M1（参与度）和 M2（语义质量），并具有匹配的令牌规模和跨条件的训练操作。与对照组相反，对垃圾数据集进行 4 名法学硕士的持续预训练会导致推理、长上下文理解、安全性和夸大“黑暗特质”（例如精神病、自恋）方面的显着下降（Hedges 的 $g>0.3$）。垃圾数据集和对照数据集的逐渐混合也会产生剂量反应认知衰退：例如，在 M1 下，随着垃圾比率从 $0\%$ 上升到 $100\%$，ARC-Challenge with Chain Of Thoughts 下降 $74.9 \rightarrow 57.2$，RULER-CWE $84.4 \rightarrow 52.3$。错误取证揭示了几个关键见解。首先，我们将思维跳跃视为主要缺陷：模型越来越多地截断或跳过推理链，这解释了大部分错误的增长。其次，观察到部分但不完全的治愈：扩展指令调整和干净的数据预训练改善了下降的认知，但无法恢复基线能力，这表明持续的表征漂移而不是格式不匹配。最后，我们发现推文的流行度（一种非语义指标）比 M1 中的长度更能反映脑腐病效应。总之，结果提供了重要的、多视角的证据，表明数据质量是法学硕士能力衰退的因果驱动因素，将持续预训练的管理重新定义为\textit{训练时安全}问题，并激励对已部署的法学硕士进行例行“认知健康检查”。
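
The effect sizes above are reported as Hedges' g, a standardized mean difference with a small-sample bias correction. A minimal sketch of the standard formula follows; the two score lists are made-up numbers for illustration and are not data from the paper.

```python
import math
from statistics import mean, variance


def hedges_g(x, y):
    """Hedges' g: Cohen's d with the usual small-sample correction."""
    n1, n2 = len(x), len(y)
    # Pooled standard deviation from the two sample variances.
    pooled_sd = math.sqrt(
        ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    )
    d = (mean(x) - mean(y)) / pooled_sd
    # Approximate correction factor J = 1 - 3 / (4 * (n1 + n2) - 9).
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction


# Hypothetical benchmark scores for a control and a junk-trained model.
control_scores = [74, 76, 75, 73, 77]
junk_scores = [70, 72, 71, 69, 73]

g = hedges_g(control_scores, junk_scores)
print(round(g, 3))
```

A value of g above 0.3, the threshold quoted in the abstract, means the group means differ by more than 0.3 pooled standard deviations after the correction.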

Title: Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions

Authors: Siying Liu, Shisheng Zhang, Indu Bala
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13931
Pdf URL: https://arxiv.org/pdf/2510.13931
Copy Paste: [[2510.13931]] Robust or Suggestible? Exploring Non-Clinical Induction in LLM Drug-Safety Decisions(https://arxiv.org/abs/2510.13931)
Keywords: language model, gpt, llm, chat
Abstract: Large language models (LLMs) are increasingly applied in biomedical domains, yet their reliability in drug-safety prediction remains underexplored. In this work, we investigate whether LLMs incorporate socio-demographic information into adverse event (AE) predictions, despite such attributes being clinically irrelevant. Using structured data from the United States Food and Drug Administration Adverse Event Reporting System (FAERS) and a persona-based evaluation framework, we assess two state-of-the-art models, ChatGPT-4o and Bio-Medical-Llama-3.8B, across diverse personas defined by education, marital status, employment, insurance, language, housing stability, and religion. We further evaluate performance across three user roles (general practitioner, specialist, patient) to reflect real-world deployment scenarios where commercial systems often differentiate access by user type. Our results reveal systematic disparities in AE prediction accuracy. Disadvantaged groups (e.g., low education, unstable housing) were frequently assigned higher predicted AE likelihoods than more privileged groups (e.g., postgraduate-educated, privately insured). Beyond outcome disparities, we identify two distinct modes of bias: explicit bias, where incorrect predictions directly reference persona attributes in reasoning traces, and implicit bias, where predictions are inconsistent, yet personas are not explicitly mentioned. These findings expose critical risks in applying LLMs to pharmacovigilance and highlight the urgent need for fairness-aware evaluation protocols and mitigation strategies before clinical deployment.
摘要：大语言模型（LLM）越来越多地应用于生物医学领域，但其在药物安全预测方面的可靠性仍未得到充分探索。在这项工作中，我们调查了法学硕士是否将社会人口统计信息纳入不良事件（AE）预测中，尽管这些属性与临床无关。使用来自美国食品和药物管理局不良事件报告系统 (FAERS) 的结构化数据和基于角色的评估框架，我们评估了两种最先进的模型 ChatGPT-4o 和 Bio-Medical-Llama-3.8B，涵盖由教育、婚姻状况、就业、保险、语言、住房稳定性和宗教定义的不同角色。我们进一步评估三种用户角色（全科医生、专家、患者）的性能，以反映现实世界的部署场景，其中商业系统通常按用户类型区分访问。我们的结果揭示了 AE 预测准确性的系统差异。弱势群体（例如，教育程度低、住房不稳定）经常比特权群体（例如受过研究生教育、私人保险）被分配更高的预测 AE 可能性。除了结果差异之外，我们还确定了两种不同的偏见模式：显性偏见（其中不正确的预测直接引用推理轨迹中的角色属性）和隐性偏见（其中预测不一致，但未明确提及角色）。这些发现暴露了将法学硕士应用于药物警戒的关键风险，并强调在临床部署之前迫切需要具有公平意识的评估方案和缓解策略。

Title: Big Reasoning with Small Models: Instruction Retrieval at Inference Time

Authors: Kenan Alkiek, David Jurgens, Vinod Vydiswaran
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13935
Pdf URL: https://arxiv.org/pdf/2510.13935
Copy Paste: [[2510.13935]] Big Reasoning with Small Models: Instruction Retrieval at Inference Time(https://arxiv.org/abs/2510.13935)
Keywords: language model, gpt, retrieval-augmented generation
Abstract: Can we bring large-scale reasoning to local-scale compute? Small language models (SLMs) are increasingly attractive because they run efficiently on local hardware, offering strong privacy, low cost, and reduced environmental impact. Yet they often struggle with tasks that require multi-step reasoning or domain-specific knowledge. We address this limitation through instruction intervention at inference time, where an SLM retrieves structured reasoning procedures rather than generating them from scratch. Our method builds an Instruction Corpus by grouping similar training questions and creating instructions via GPT-5. During inference, the SLM retrieves the most relevant instructions and follows their steps. Unlike retrieval-augmented generation, which retrieves text passages, instruction retrieval gives the model structured guidance for reasoning. We evaluate this framework on MedQA (medical board exams), MMLU Professional Law, and MathQA using models from 3B to 14B parameters without any additional fine-tuning. Instruction retrieval yields consistent gains: 9.4% on MedQA, 7.9% on MMLU Law, and 5.1% on MathQA. Concise instructions outperform longer ones, and the magnitude of improvement depends strongly on model family and intrinsic reasoning ability.
摘要：我们能否将大规模推理引入局部计算？小语言模型 (SLM) 越来越有吸引力，因为它们在本地硬件上高效运行，提供强大的隐私性、低成本并减少对环境的影响。然而，他们经常难以完成需要多步骤推理或特定领域知识的任务。我们通过推理时的指令干预来解决这一限制，其中 SLM 检索结构化推理过程而不是从头开始生成它们。我们的方法通过对类似的训练问题进行分组并通过 GPT-5 创建指令来构建指令语料库。在推理过程中，SLM 检索最相关的指令并遵循其步骤。与检索文本段落的检索增强生成不同，指令检索为模型提供结构化的推理指导。我们使用 3B 到 14B 参数的模型在 MedQA（医学委员会考试）、MMLU Professional Law 和 MathQA 上评估该框架，无需任何额外的微调。指令检索产生了一致的增益：MedQA 为 9.4%，MMLU Law 为 7.9%，MathQA 为 5.1%。简洁的指令优于较长的指令，并且改进的程度在很大程度上取决于模型族和内在推理能力。

Title: FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

Authors: Fengbin Zhu, Xiang Yao Ng, Ziyang Liu, Chang Liu, Xianwei Zeng, Chao Wang, Tianhui Tan, Xuan Yao, Pengyang Shao, Min Xu, Zixuan Wang, Jing Wang, Xin Lin, Junfeng Li, Jingxian Zhu, Yang Zhang, Wenjie Wang, Fuli Feng, Richang Hong, Huanbo Luan, Ke-Wei Huang, Tat-Seng Chua
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13936
Pdf URL: https://arxiv.org/pdf/2510.13936
Copy Paste: [[2510.13936]] FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis(https://arxiv.org/abs/2510.13936)
Keywords: language model, llm, agent
Abstract: Deep Research (DR) agents, powered by advanced Large Language Models (LLMs), have recently garnered increasing attention for their capability in conducting complex research tasks. However, existing literature lacks a rigorous and systematic evaluation of DR Agent's capabilities in critical research analysis. To address this gap, we first propose HisRubric, a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric for rigorously assessing DR agents' capabilities in corporate financial analysis. This framework mirrors the professional analyst's workflow, progressing from data recognition to metric calculation, and finally to strategic summarization and interpretation. Built on this framework, we construct a FinDeepResearch benchmark that comprises 64 listed companies from 8 financial markets across 4 languages, encompassing a total of 15,808 grading items. We further conduct extensive experiments on the FinDeepResearch using 16 representative methods, including 6 DR agents, 5 LLMs equipped with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only. The results reveal the strengths and limitations of these approaches across diverse capabilities, financial markets, and languages, offering valuable insights for future research and development. The benchmark and evaluation code will be made publicly available.
摘要：由先进的大型语言模型 (LLM) 提供支持的深度研究 (DR) 代理最近因其执行复杂研究任务的能力而受到越来越多的关注。然而，现有文献缺乏对 DR Agent 批判性研究分析能力的严格、系统的评估。为了解决这一差距，我们首先提出了 HisRubric，这是一种新颖的评估框架，具有层次分析结构和细粒度的评分标准，用于严格评估 DR 代理在企业财务分析方面的能力。该框架反映了专业分析师的工作流程，从数据识别到指标计算，最后到战略总结和解释。在此框架基础上，我们构建了 FinDeepResearch 基准，涵盖 8 个金融市场、4 种语言的 64 家上市公司，共包含 15,808 个评级项目。我们进一步使用 16 种代表性方法对 FinDeepResearch 进行了广泛的实验，包括 6 种 DR 代理、5 种同时具备深度推理和搜索功能的法学硕士以及 5 种仅具有深度推理能力的法学硕士。结果揭示了这些方法在不同能力、金融市场和语言方面的优势和局限性，为未来的研究和开发提供了宝贵的见解。基准和评估代码将公开。

Title: Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers

Authors: Tuhin Chakrabarty, Jane C. Ginsburg, Paramveer Dhillon
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2510.13939
Pdf URL: https://arxiv.org/pdf/2510.13939
Copy Paste: [[2510.13939]] Readers Prefer Outputs of AI Trained on Copyrighted Books over Expert Human Writers(https://arxiv.org/abs/2510.13939)
Keywords: gpt, prompt, chat
Abstract: The use of copyrighted books for training AI models has led to numerous lawsuits from authors concerned about AI's ability to generate derivative content. Yet it's unclear whether these models can generate high quality literary text while emulating authors' styles. To answer this we conducted a preregistered study comparing MFA-trained expert writers with three frontier AI models: ChatGPT, Claude & Gemini in writing up to 450 word excerpts emulating 50 award-winning authors' diverse styles. In blind pairwise evaluations by 159 representative expert & lay readers, AI-generated text from in-context prompting was strongly disfavored by experts for both stylistic fidelity (OR=0.16, p<10^-8) & writing quality (OR=0.13, p<10^-7) but showed mixed results with lay readers. However, fine-tuning ChatGPT on individual authors' complete works completely reversed these findings: experts now favored AI-generated text for stylistic fidelity (OR=8.16, p<10^-13) & writing quality (OR=1.87, p=0.010), with lay readers showing similar shifts. These effects generalize across authors & styles. The fine-tuned outputs were rarely flagged as AI-generated (3% rate v. 97% for in-context prompting) by best AI detectors. Mediation analysis shows this reversal occurs because fine-tuning eliminates detectable AI stylistic quirks (e.g., cliche density) that penalize in-context outputs. While we do not account for additional costs of human effort required to transform raw AI output into cohesive, publishable prose, the median fine-tuning & inference cost of $81 per author represents a dramatic 99.7% reduction compared to typical professional writer compensation. Author-specific fine-tuning thus enables non-verbatim AI writing that readers prefer to expert human writing, providing empirical evidence directly relevant to copyright's fourth fair-use factor, the "effect upon the potential market or value" of the source works.
摘要：使用受版权保护的书籍来训练 AI 模型已导致许多作者提起诉讼，这些作者担心 AI 能够生成此 http URL 的衍生品。尚不清楚这些模型是否可以在模仿作者风格的同时生成高质量的文学文本。为了回答这个问题，我们进行了一项预先注册的研究，将经过 MFA 培训的专家作家与三种前沿 AI 模型（ChatGPT、Claude 和 Gemini）进行比较，以模仿 50 名获奖作家的不同风格，撰写多达 450 字的摘录。在 159 名有代表性的专家和非专业读者进行的盲目成对评估中，人工智能根据上下文提示生成的文本在文体保真度（OR=0.16，p<10^8）和写作质量（OR=0.13，p<10^7）方面受到专家的强烈反对，但在非专业读者中却表现出不同的结果。然而，对个别作者的完整作品进行微调 ChatGPT 完全扭转了这些发现：专家们现在更青睐人工智能生成的文本，以保证文体保真度（OR=8.16，p<10^13）和写作质量（OR=1.87，p=0.010），而普通读者也表现出类似的转变。这些效果在不同的作者和风格中都具有普遍性。经过微调的输出很少被最佳 AI 检测器标记为 AI 生成（3% 的比率，而上下文提示的比率为 97%）。中介分析表明，这种逆转的发生是因为微调消除了可检测到的人工智能风格怪癖（例如陈词滥调密度），这些怪癖会影响上下文中的输出。虽然我们没有考虑将原始人工智能输出转化为连贯的、可出版的散文所需的额外人力成本，但与典型的专业作家薪酬相比，微调和推理成本中位数为 81 美元，大幅降低了 99.7%。因此，针对作者的微调使非逐字的人工智能写作成为读者比专家人类写作更喜欢的写作，提供与版权的第四个合理使用因素直接相关的经验证据，即源作品的“对潜在市场或价值的影响”。

Title: Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention

Authors: Zhen Yang, Mingyang Zhang, Feng Chen, Ganggui Ding, Liang Hou, Xin Tao, Pengfei Wan, Ying-Cong Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.13940
Pdf URL: https://arxiv.org/pdf/2510.13940
Copy Paste: [[2510.13940]] Less is More: Improving LLM Reasoning with Minimal Test-Time Intervention(https://arxiv.org/abs/2510.13940)
Keywords: language model, llm, prompt
Abstract: Recent progress in large language models (LLMs) has focused on test-time scaling to improve reasoning via increased inference computation, but often at the cost of efficiency. We revisit test-time behavior and uncover a simple yet underexplored phenomenon: reasoning uncertainty is highly localized; only a small subset of high-entropy tokens dominantly affects output correctness. Motivated by this, we propose Minimal Test-Time Intervention (MTI), a training-free framework that enhances reasoning accuracy and stability with minimal overhead. MTI includes: (i) Selective CFG intervention, applying classifier-free guidance only at uncertain positions; and (ii) Lightweight negative-prompt guidance, reusing the main model's KV cache to approximate unconditional decoding efficiently. MTI yields consistent gains across general, coding, and STEM tasks (e.g., +1.35% average improvement on eight benchmarks for Qwen3-8B-Base and +5% on AIME2024 using Qwen3-32B-Reasoning) while remaining highly efficient.
摘要：大型语言模型 (LLM) 的最新进展主要集中在测试时间扩展上，以通过增加推理计算来改进推理，但通常会以牺牲效率为代价。我们重新审视测试时行为，发现一个简单但尚未充分探索的现象：推理不确定性是高度局部化的——只有一小部分高熵标记主要影响输出的正确性。受此启发，我们提出了最小测试时间干预（MTI），这是一种免训练框架，可以以最小的开销提高推理准确性和稳定性。 MTI 包括： (i) 选择性 CFG 干预，仅在不确定位置应用无分类器指导； (ii) 轻量级负提示引导，重用主模型的 KV 缓存来有效地近似无条件解码。 MTI 在一般任务、编码任务和 STEM 任务中产生一致的收益，例如，在 Qwen3-8B-Base 的八个基准上平均提高 +1.35%，在使用 Qwen3-32B-Reasoning 的 AIME2024 上平均提高 +5%，同时保持高效。
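
The selective intervention described above can be sketched as follows: compute the entropy of the next-token distribution and apply classifier-free guidance only when that entropy is high. This is a sketch under assumptions, not the paper's implementation: the entropy threshold and guidance scale are hypothetical values, the logits are toy numbers, and the KV-cache reuse for the negative-prompt branch is not modeled.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def selective_cfg(cond_logits, uncond_logits, threshold=1.0, scale=1.5):
    """Apply classifier-free guidance only at uncertain positions.

    threshold and scale are hypothetical values chosen for this toy example.
    """
    if entropy(softmax(cond_logits)) <= threshold:
        return list(cond_logits)  # confident token: leave untouched
    # Standard CFG combination: uncond + scale * (cond - uncond).
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


# A confident position (low entropy) and an uncertain one (high entropy).
confident = selective_cfg([10.0, 0.0, 0.0], [0.0, 0.0, 0.0])
uncertain = selective_cfg([1.0, 0.9, 0.8], [0.0, 0.9, 0.8])
print(confident, uncertain)
```

Most positions take the cheap branch and skip guidance entirely, which is where the low overhead claimed in the abstract would come from.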

Title: Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems

Authors: Kin Kwan Leung, Mouloud Belbahri, Yi Sui, Alex Labach, Xueying Zhang, Stephen Rose, Jesse C. Cresswell
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.13975
Pdf URL: https://arxiv.org/pdf/2510.13975
Copy Paste: [[2510.13975]] Classifying and Addressing the Diversity of Errors in Retrieval-Augmented Generation Systems(https://arxiv.org/abs/2510.13975)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) is a prevalent approach for building LLM-based question-answering systems that can take advantage of external knowledge databases. Due to the complexity of real-world RAG systems, there are many potential causes for erroneous outputs. Understanding the range of errors that can occur in practice is crucial for robust deployment. We present a new taxonomy of the error types that can occur in realistic RAG systems, examples of each, and practical advice for addressing them. Additionally, we curate a dataset of erroneous RAG responses annotated by error types. We then propose an auto-evaluation method aligned with our taxonomy that can be used in practice to track and address errors during development. Code and data are available at this https URL.
摘要：检索增强生成（RAG）是构建基于 LLM 的问答系统的流行方法，该系统可以利用外部知识数据库。由于现实世界 RAG 系统的复杂性，导致错误输出的潜在原因有很多。了解实践中可能发生的错误范围对于稳健部署至关重要。我们提出了现实 RAG 系统中可能发生的错误类型的新分类法、每种错误类型的示例以及解决这些错误类型的实用建议。此外，我们还整理了一个由错误类型注释的错误 RAG 响应数据集。然后，我们提出了一种与我们的分类法相一致的自动评估方法，可在实践中用于跟踪和解决开发过程中的错误。代码和数据可从此 https URL 获取。

Title: The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

Authors: Lukas Gienapp, Christopher Schröder, Stefan Schweter, Christopher Akiki, Ferdinand Schlatt, Arden Zimmermann, Phillipe Genêt, Martin Potthast
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.13996
Pdf URL: https://arxiv.org/pdf/2510.13996
Copy Paste: [[2510.13996]] The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models(https://arxiv.org/abs/2510.13996)
Keywords: language model
Abstract: Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated for non-English languages, where openly licensed text remains critically scarce. We introduce the German Commons, the largest collection of openly licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text. Through systematic sourcing from established data providers with verifiable licensing, it yields 154.56 billion tokens of high-quality text for language model training. Our processing pipeline implements comprehensive quality filtering, deduplication, and text formatting fixes, ensuring consistent quality across heterogeneous text sources. All domain subsets feature licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and redistribution. The German Commons therefore addresses the critical gap in openly licensed German pretraining data, and enables the development of truly open German language models. We also release code for corpus construction and data filtering tailored to German language text, rendering the German Commons fully reproducible and extensible.

Title: CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models

Authors: Shehenaz Hossain, Haithem Afli
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14014
Pdf URL: https://arxiv.org/pdf/2510.14014
Copy Paste: [[2510.14014]] CRaFT: An Explanation-Based Framework for Evaluating Cultural Reasoning in Multilingual Language Models(https://arxiv.org/abs/2510.14014)
Keywords: language model, gpt, llm
Abstract: Correct answers do not necessarily reflect cultural understanding. We introduce CRaFT, an explanation-based multilingual evaluation framework designed to assess how large language models (LLMs) reason across cultural contexts. Rather than scoring outputs solely based on accuracy, CRaFT evaluates model explanations using four interpretable metrics: Cultural Fluency, Deviation, Consistency, and Linguistic Adaptation. We apply the framework to 50 culturally grounded questions from the World Values Survey, translated into Arabic, Bengali, and Spanish, and evaluate three models (GPT, DeepSeek, and FANAR) across over 2,100 answer-explanation pairs. Results reveal significant cross-lingual variation in reasoning: Arabic reduces fluency, Bengali enhances it, and Spanish remains largely stable. While GPT adapts more effectively across languages, it exhibits lower consistency; FANAR shows stable but rigid reasoning. These findings suggest that cultural awareness in LLMs is not intrinsic but emerges through linguistic framing. CRaFT offers a new lens for evaluating cross-cultural reasoning in multilingual settings, providing actionable insights for building culturally adaptive language models.

Title: Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games

Authors: César Guerra-Solano, Zhuochun Li, Xiang Lorraine Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14030
Pdf URL: https://arxiv.org/pdf/2510.14030
Copy Paste: [[2510.14030]] Think Globally, Group Locally: Evaluating LLMs Using Multi-Lingual Word Grouping Games(https://arxiv.org/abs/2510.14030)
Keywords: language model, llm
Abstract: Large language models (LLMs) can exhibit biases in reasoning capabilities due to linguistic modality, performing better on tasks in one language versus another, even with similar content. Most previous works evaluate this through reasoning tasks where reliance on strategies or knowledge can ensure success, such as in commonsense or math tasks. However, abstract reasoning is vital to reasoning for everyday life, where people apply "out-of-the-box thinking" to identify and use patterns for solutions, without a reliance on formulaic approaches. Comparatively, little work has evaluated linguistic biases in this task type. In this paper, we propose GlobalGroup, a task inspired by the New York Times game Connections, that evaluates models in an abstract reasoning task across several languages. We constructed a game benchmark with five linguistic backgrounds -- English, Spanish, Chinese, Hindi, and Arabic -- in both the native language and an English translation for comparison. We also proposed game difficulty measurements to evaluate models on games with similar difficulty, enabling a more controlled comparison, which is particularly important in reasoning evaluations. Through experimentation, we find that English modalities largely lead to better performance in this abstract reasoning task, and we observe performance disparities between open- and closed-source models.

Title: ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models

Authors: Haziq Mohammad Khalid, Athikash Jeyaganthan, Timothy Do, Yicheng Fu, Sean O'Brien, Vasu Sharma, Kevin Zhu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14077
Pdf URL: https://arxiv.org/pdf/2510.14077
Copy Paste: [[2510.14077]] ERGO: Entropy-guided Resetting for Generation Optimization in Multi-turn Language Models(https://arxiv.org/abs/2510.14077)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) suffer significant performance degradation in multi-turn conversations when information is presented incrementally. Given that multi-turn conversations characterize everyday interactions with LLMs, this degradation poses a severe challenge to real world usability. We hypothesize that abrupt increases in model uncertainty signal misalignment in multi-turn LLM interactions, and we exploit this insight to dynamically realign conversational context. We introduce ERGO (Entropy-guided Resetting for Generation Optimization), which continuously quantifies internal uncertainty via Shannon entropy over next token distributions and triggers adaptive prompt consolidation when a sharp spike in entropy is detected. By treating uncertainty as a first class signal rather than a nuisance to eliminate, ERGO embraces variability in language and modeling, representing and responding to uncertainty. In multi-turn tasks with incrementally revealed instructions, ERGO yields a 56.6% average performance gain over standard baselines, increases aptitude (peak performance capability) by 24.7%, and decreases unreliability (variability in performance) by 35.3%, demonstrating that uncertainty aware interventions can improve both accuracy and reliability in conversational AI.
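The entropy signal described in the ERGO abstract can be illustrated with a minimal sketch. It computes Shannon entropy over next-token distributions and flags a turn when the average entropy rises sharply relative to the previous turn. The function names, the per-turn averaging, and the threshold value of 0.5 nats are assumptions made for illustration; the abstract does not specify how a "sharp spike" is defined or how the prompt is consolidated afterwards.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_entropy(step_distributions: Sequence[Sequence[float]]) -> float:
    """Average entropy over all decoding steps of one turn (assumed aggregation)."""
    return sum(shannon_entropy(d) for d in step_distributions) / len(step_distributions)


def should_reset(prev_entropy: float, curr_entropy: float, threshold: float = 0.5) -> bool:
    """Flag a spike when entropy rises by more than `threshold` (hypothetical value)."""
    return (curr_entropy - prev_entropy) > threshold


# Toy example: a confident turn followed by a much less certain one.
confident_turn = [[0.97, 0.01, 0.01, 0.01], [0.94, 0.02, 0.02, 0.02]]
uncertain_turn = [[0.25, 0.25, 0.25, 0.25], [0.4, 0.3, 0.2, 0.1]]

prev_h = mean_entropy(confident_turn)
curr_h = mean_entropy(uncertain_turn)
reset_needed = should_reset(prev_h, curr_h)
```

In a real system the distributions would come from the model's softmax outputs, and a positive flag would trigger whatever context consolidation step the method defines.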

Title: Toward Cybersecurity-Expert Small Language Models

Authors: Matan Levi, Daniel Ohayon, Ariel Blobstein, Ravid Sagi, Ian Molloy, Yair Allouche
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2510.14113
Pdf URL: https://arxiv.org/pdf/2510.14113
Copy Paste: [[2510.14113]] Toward Cybersecurity-Expert Small Language Models(https://arxiv.org/abs/2510.14113)
Keywords: language model, gpt, llm, chain-of-thought
Abstract: Large language models (LLMs) are transforming everyday applications, yet deployment in cybersecurity lags due to a lack of high-quality, domain-specific models and training datasets. To address this gap, we present CyberPal 2.0, a family of cybersecurity-expert small language models (SLMs) ranging from 4B to 20B parameters. To train CyberPal 2.0, we generate an enriched chain-of-thought cybersecurity instruction dataset built with our data enrichment and formatting pipeline, SecKnowledge 2.0, which integrates expert-in-the-loop steering of reasoning formats alongside LLM-driven multi-step grounding, yielding higher-fidelity, task-grounded reasoning traces for security tasks. Across diverse cybersecurity benchmarks, CyberPal 2.0 consistently outperforms its baselines and matches or surpasses various open- and closed-source frontier models, while remaining a fraction of their size. On core cyber threat intelligence knowledge tasks, our models outperform almost all tested frontier models, ranking second only to Sec-Gemini v1. On core threat-investigation tasks, such as correlating vulnerabilities and bug tickets with weaknesses, our best 20B-parameter model outperforms GPT-4o, o1, o3-mini, and Sec-Gemini v1, ranking first, while our smallest 4B-parameter model ranks second.

Title: RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following

Authors: Zhichao Wang, Andy Wong, Ruslan Belkin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14200
Pdf URL: https://arxiv.org/pdf/2510.14200
Copy Paste: [[2510.14200]] RLSR: Reinforcement Learning with Supervised Reward Outperforms SFT in Instruction Following(https://arxiv.org/abs/2510.14200)
Keywords: llm, prompt
Abstract: After the pretraining stage of LLMs, techniques such as SFT, RLHF, RLVR, and RFT are applied to enhance instruction-following ability, mitigate undesired responses, improve reasoning capability and enable efficient domain adaptation with minimal data. SFT relies on the next-token prediction objective to strengthen instruction following in a base model using a large corpus of human-labeled responses. In contrast, RFT employs an RL-based approach to adapt fine-tuned reasoning models to specific domains with limited supervision. Inspired by RFT, we propose replacing SFT with RLSR to leverage the extensive SFT dataset in an RL framework, thereby improving the base model's instruction-following ability. In RLSR, the base model generates multiple responses for each prompt, and reward scores are computed as the cosine similarity in the semantic embedding space between the generated and human-labeled responses. RLSR can be utilized in multiple ways. It can directly replace SFT, achieving superior performance on instruction-following benchmarks: for example, RLSR (SB) on Qwen-7B (INFINITY) achieved an AlpacaEval win rate of 26.34%, surpassing SFT's 21.01%. Furthermore, combining SFT and RLSR further enhances downstream task performance; Qwen-7B (INFINITY) achieved a win rate of 30.73% when trained with SFT + RLSR.
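The reward described in the RLSR abstract, cosine similarity between the embedding of a generated response and that of the human-labeled response, can be sketched as follows. This is only an illustration: the embedding model is not named in the abstract, so the sketch takes precomputed vectors as input, and the function names and toy vectors are assumptions.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def supervised_rewards(
    generated_embeddings: Sequence[Sequence[float]],
    reference_embedding: Sequence[float],
) -> List[float]:
    """One reward per sampled response, scored against the human-labeled reference."""
    return [cosine_similarity(g, reference_embedding) for g in generated_embeddings]


# Toy embeddings standing in for the output of a sentence-embedding model.
reference = [1.0, 0.0, 1.0]
samples = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
rewards = supervised_rewards(samples, reference)
```

These per-response rewards would then feed a policy-gradient style update; the specific RL algorithm is not stated in the abstract and is left out here.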

Title: DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans

Authors: Bingsheng Yao, Bo Sun, Yuanzhe Dong, Yuxuan Lu, Dakuo Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14205
Pdf URL: https://arxiv.org/pdf/2510.14205
Copy Paste: [[2510.14205]] DPRF: A Generalizable Dynamic Persona Refinement Framework for Optimizing Behavior Alignment Between Personalized LLM Role-Playing Agents and Humans(https://arxiv.org/abs/2510.14205)
Keywords: language model, llm, agent
Abstract: The emerging large language model role-playing agents (LLM RPAs) aim to simulate individual human behaviors, but the persona fidelity is often undermined by manually-created profiles (e.g., cherry-picked information and personality characteristics) without validating the alignment with the target individuals. To address this limitation, our work introduces the Dynamic Persona Refinement Framework (DPRF). DPRF aims to optimize the alignment of LLM RPAs' behaviors with those of target individuals by iteratively identifying the cognitive divergence, either through free-form or theory-grounded, structured analysis, between generated behaviors and human ground truth, and refining the persona profile to mitigate these divergences. We evaluate DPRF with five LLMs on four diverse behavior-prediction scenarios: formal debates, social media posts with mental health issues, public interviews, and movie reviews. DPRF can consistently improve behavioral alignment considerably over baseline personas and generalizes across models and scenarios. Our work provides a robust methodology for creating high-fidelity persona profiles and enhancing the validity of downstream applications, such as user simulation, social studies, and personalized AI.

Title: LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning

Authors: Beomseok Kang, Jiwon Song, Jae-Joon Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14211
Pdf URL: https://arxiv.org/pdf/2510.14211
Copy Paste: [[2510.14211]] LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning(https://arxiv.org/abs/2510.14211)
Keywords: language model
Abstract: Multi-stage reasoning has emerged as an effective strategy for enhancing the reasoning capability of small language models by decomposing complex problems into sequential sub-stages. However, this comes at the cost of increased latency. We observe that existing adaptive acceleration techniques, such as layer skipping, struggle to balance efficiency and accuracy in this setting due to two key challenges: (1) stage-wise variation in skip sensitivity, and (2) the generation of redundant output tokens. To address these, we propose LiteStage, a latency-aware layer skipping framework for multi-stage reasoning. LiteStage combines a stage-wise offline search that allocates optimal layer budgets with an online confidence-based generation early exit to suppress unnecessary decoding. Experiments on three benchmarks (OBQA, CSQA, and StrategyQA) show that LiteStage achieves up to 1.70x speedup with less than 4.0% accuracy loss, outperforming prior training-free layer skipping methods.

Title: Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs

Authors: Parsa Hejabi, Elnaz Rahmati, Alireza S. Ziabari, Morteza Dehghani
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14242
Pdf URL: https://arxiv.org/pdf/2510.14242
Copy Paste: [[2510.14242]] Flip-Flop Consistency: Unsupervised Training for Robustness to Prompt Perturbations in LLMs(https://arxiv.org/abs/2510.14242)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) often produce inconsistent answers when faced with different phrasings of the same prompt. In this paper, we propose Flip-Flop Consistency ($F^2C$), an unsupervised training method that improves robustness to such perturbations. $F^2C$ is composed of two key components. The first, Consensus Cross-Entropy (CCE), uses a majority vote across prompt variations to create a hard pseudo-label. The second is a representation alignment loss that pulls lower-confidence and non-majority predictors toward the consensus established by high-confidence, majority-voting variations. We evaluate our method on 11 datasets spanning four NLP tasks, with 4-15 prompt variations per dataset. On average, $F^2C$ raises observed agreement by 11.62%, improves mean $F_1$ by 8.94%, and reduces performance variance across formats by 3.29%. In out-of-domain evaluations, $F^2C$ generalizes effectively, increasing $\overline{F_1}$ and agreement while decreasing variance across most source-target pairs. Finally, when trained on only a subset of prompt perturbations and evaluated on held-out formats, $F^2C$ consistently improves both performance and agreement while reducing variance. These findings highlight $F^2C$ as an effective unsupervised method for enhancing LLM consistency, performance, and generalization under prompt perturbations. Code is available at this https URL.
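The Consensus Cross-Entropy component described in the Flip-Flop Consistency abstract can be sketched in a few lines: take each prompt variation's predicted label, use the majority vote as a hard pseudo-label, and score every variation's distribution against it. The function names, the dictionary representation of class probabilities, and the averaging over variations are illustrative assumptions; the representation alignment loss is not covered here.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def consensus_label(predictions: List[str]) -> str:
    """Majority vote over per-variation predictions (ties go to the first seen label)."""
    return Counter(predictions).most_common(1)[0][0]


def consensus_cross_entropy(
    variant_probs: List[Dict[str, float]], eps: float = 1e-12
) -> Tuple[str, float]:
    """Return the hard pseudo-label and the mean cross-entropy against it."""
    # Each variation votes for its highest-probability class.
    votes = [max(probs, key=probs.get) for probs in variant_probs]
    label = consensus_label(votes)
    # Cross-entropy with a one-hot target reduces to -log p(label).
    losses = [-math.log(probs.get(label, 0.0) + eps) for probs in variant_probs]
    return label, sum(losses) / len(losses)


# Three phrasings of the same prompt; two agree on "yes", one says "no".
variant_probs = [
    {"yes": 0.9, "no": 0.1},
    {"yes": 0.6, "no": 0.4},
    {"yes": 0.3, "no": 0.7},
]
pseudo_label, cce_loss = consensus_cross_entropy(variant_probs)
```

Minimizing this loss pushes the dissenting variation toward the majority answer, which is the intuition the abstract gives for improved agreement across phrasings.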

Title: MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems

Authors: Jihao Zhao, Zhiyuan Ji, Simin Niu, Hanyu Wang, Feiyu Xiong, Zhiyu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14252
Pdf URL: https://arxiv.org/pdf/2510.14252
Copy Paste: [[2510.14252]] MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems(https://arxiv.org/abs/2510.14252)
Keywords: language model, llm, retrieval-augmented generation
Abstract: The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling. Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.

Title: Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior

Authors: Rahul Nadkarni, Yanai Elazar, Hila Gonen, Noah A. Smith
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14261
Pdf URL: https://arxiv.org/pdf/2510.14261
Copy Paste: [[2510.14261]] Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior(https://arxiv.org/abs/2510.14261)
Keywords: language model
Abstract: We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches -- i.e., "rewriting history" -- and then retraining model checkpoints over that data to test hypotheses relating data to behavior. Our recipe breaks down such an intervention into stages that include selecting evaluation items from a benchmark that measures model behavior, matching relevant documents to those items, and modifying those documents before retraining and measuring the effects. We demonstrate the utility of our recipe through case studies on factual knowledge acquisition in LMs, using both cooccurrence statistics and information retrieval methods to identify documents that might contribute to knowledge learning. Our results supplement past observational analyses that link cooccurrence to model behavior, while demonstrating that extant methods for identifying relevant training documents do not fully explain an LM's ability to correctly answer knowledge questions. Overall, we outline a recipe that researchers can follow to test further hypotheses about how training data affects model behavior. Our code is made publicly available to promote future work.

Title: Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation

Authors: Yilun Zheng, Dan Yang, Jie Li, Lin Shang, Lihui Chen, Jiahao Xu, Sitao Luan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14271
Pdf URL: https://arxiv.org/pdf/2510.14271
Copy Paste: [[2510.14271]] Less is More: Denoising Knowledge Graphs For Retrieval Augmented Generation(https://arxiv.org/abs/2510.14271)
Keywords: language model, llm, hallucination, retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems enable large language models (LLMs) instant access to relevant information for the generative process, demonstrating their superior performance in addressing common LLM challenges such as hallucination, factual inaccuracy, and the knowledge cutoff. Graph-based RAG further extends this paradigm by incorporating knowledge graphs (KGs) to leverage rich, structured connections for more precise and inferential responses. A critical challenge, however, is that most Graph-based RAG systems rely on LLMs for automated KG construction, often yielding noisy KGs with redundant entities and unreliable relationships. This noise degrades retrieval and generation performance while also increasing computational cost. Crucially, current research does not comprehensively address the denoising problem for LLM-generated KGs. In this paper, we introduce DEnoised knowledge Graphs for Retrieval Augmented Generation (DEG-RAG), a framework that addresses these challenges through: (1) entity resolution, which eliminates redundant entities, and (2) triple reflection, which removes erroneous relations. Together, these techniques yield more compact, higher-quality KGs that significantly outperform their unprocessed counterparts. Beyond the methods, we conduct a systematic evaluation of entity resolution for LLM-generated KGs, examining different blocking strategies, embedding choices, similarity metrics, and entity merging techniques. To the best of our knowledge, this is the first comprehensive exploration of entity resolution in LLM-generated KGs. Our experiments demonstrate that this straightforward approach not only drastically reduces graph size but also consistently improves question answering performance across diverse popular Graph-based RAG variants.

Title: Qwen3Guard Technical Report

Authors: Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, Baosong Yang, Chen Cheng, Jialong Tang, Jiandong Jiang, Jianwei Zhang, Jijie Xu, Ming Yan, Minmin Sun, Pei Zhang, Pengjun Xie, Qiaoyu Tang, Qin Zhu, Rong Zhang, Shibin Wu, Shuo Zhang, Tao He, Tianyi Tang, Tingyu Xia, Wei Liao, Weizhou Shen, Wenbiao Yin, Wenmeng Zhou, Wenyuan Yu, Xiaobin Wang, Xiaodong Deng, Xiaodong Xu, Xinyu Zhang, Yang Liu, Yeqiu Li, Yi Zhang, Yong Jiang, Yu Wan, Yuxin Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14276
Pdf URL: https://arxiv.org/pdf/2510.14276
Copy Paste: [[2510.14276]] Qwen3Guard Technical Report(https://arxiv.org/abs/2510.14276)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) become more capable and widely used, ensuring the safety of their outputs is increasingly critical. Existing guardrail models, though useful in static evaluation settings, face two major limitations in real-world applications: (1) they typically output only binary "safe/unsafe" labels, which can be interpreted inconsistently across diverse safety policies, rendering them incapable of accommodating varying safety tolerances across domains; and (2) they require complete model outputs before performing safety checks, making them fundamentally incompatible with streaming LLM inference, thereby preventing timely intervention during generation and increasing exposure to harmful partial outputs. To address these challenges, we present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants: Generative Qwen3Guard, which casts safety classification as an instruction-following task to enable fine-grained tri-class judgments (safe, controversial, unsafe); and Stream Qwen3Guard, which introduces a token-level classification head for real-time safety monitoring during incremental text generation. Both variants are available in three sizes (0.6B, 4B, and 8B parameters) and support up to 119 languages and dialects, providing comprehensive, scalable, and low-latency safety moderation for global LLM deployments. Evaluated across English, Chinese, and multilingual benchmarks, Qwen3Guard achieves state-of-the-art performance in both prompt and response safety classification. All models are released under the Apache 2.0 license for public use.

Title: PRISM: Agentic Retrieval with LLMs for Multi-Hop Question Answering

Authors: Md Mahadi Hasan Nahid, Davood Rafiei
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.14278
Pdf URL: https://arxiv.org/pdf/2510.14278
Copy Paste: [[2510.14278]] PRISM: Agentic Retrieval with LLMs for Multi-Hop Question Answering(https://arxiv.org/abs/2510.14278)
Keywords: language model, llm, agent
Abstract: Retrieval plays a central role in multi-hop question answering (QA), where answering complex questions requires gathering multiple pieces of evidence. We introduce an Agentic Retrieval System that leverages large language models (LLMs) in a structured loop to retrieve relevant evidence with high precision and recall. Our framework consists of three specialized agents: a Question Analyzer that decomposes a multi-hop question into sub-questions, a Selector that identifies the most relevant context for each sub-question (focusing on precision), and an Adder that brings in any missing evidence (focusing on recall). The iterative interaction between Selector and Adder yields a compact yet comprehensive set of supporting passages. In particular, it achieves higher retrieval accuracy while filtering out distracting content, enabling downstream QA models to surpass full-context answer accuracy while relying on significantly less irrelevant information. Experiments on four multi-hop QA benchmarks -- HotpotQA, 2WikiMultiHopQA, MuSiQue, and MultiHopRAG -- demonstrate that our approach consistently outperforms strong baselines.

Title: Rethinking Schema Linking: A Context-Aware Bidirectional Retrieval Approach for Text-to-SQL

Authors: Md Mahadi Hasan Nahid, Davood Rafiei, Weiwei Zhang, Yong Zhang
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.14296
Pdf URL: https://arxiv.org/pdf/2510.14296
Copy Paste: [[2510.14296]] Rethinking Schema Linking: A Context-Aware Bidirectional Retrieval Approach for Text-to-SQL(https://arxiv.org/abs/2510.14296)
Keywords: hallucination
Abstract: Schema linking -- the process of aligning natural language questions with database schema elements -- is a critical yet underexplored component of Text-to-SQL systems. While recent methods have focused primarily on improving SQL generation, they often neglect the retrieval of relevant schema elements, which can lead to hallucinations and execution failures. In this work, we propose a context-aware bidirectional schema retrieval framework that treats schema linking as a standalone problem. Our approach combines two complementary strategies: table-first retrieval followed by column selection, and column-first retrieval followed by table selection. It is further augmented with techniques such as question decomposition, keyword extraction, and keyphrase extraction. Through comprehensive evaluations on challenging benchmarks such as BIRD and Spider, we demonstrate that our method significantly improves schema recall while reducing false positives. Moreover, SQL generation using our retrieved schema consistently outperforms full-schema baselines and closely approaches oracle performance, all without requiring query refinement. Notably, our method narrows the performance gap between full and perfect schema settings by 50\%. Our findings highlight schema linking as a powerful lever for enhancing Text-to-SQL accuracy and efficiency.

Title: Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers

Authors: Ziye Xia, Sergei S. Ospichev
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14303
Pdf URL: https://arxiv.org/pdf/2510.14303
Copy Paste: [[2510.14303]] Constraint-Driven Small Language Models Based on Agent and OpenAlex Knowledge Graph: Mining Conceptual Pathways and Discovering Innovation Points in Academic Papers(https://arxiv.org/abs/2510.14303)
Keywords: language model, prompt, agent
Abstract: In recent years, the rapid growth of academic publications across fields has made paper analysis difficult: scientists struggle to track the latest research findings and methodologies in a timely and comprehensive way. Key concept extraction has proven to be an effective analytical paradigm, and the widespread application of language models in industrial and scientific domains has made it possible to automate. However, existing paper databases are mostly limited to similarity matching and basic classification of key concepts, and do not explore the relational networks between concepts in depth. This work builds on the open-source OpenAlex knowledge graph. By analyzing data from nearly 8,000 open-source papers from Novosibirsk State University, we found a strong correlation between the distribution patterns of key concept paths in papers and both innovation points and rare paths. We propose a prompt engineering-based method for key concept path analysis. The method uses small language models for precise key concept extraction and innovation point identification, and builds an agent constrained by the knowledge graph to improve analysis accuracy. By fine-tuning Qwen and DeepSeek models, we achieved significant improvements in accuracy; the models are publicly available on the Hugging Face platform.

Title: MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning

Authors: Mahbub E Sobhani, Md. Faiyaz Abdullah Sayeedi, Tasnim Mohiuddin, Md Mofijul Islam, Swakkhar Shatabda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14305
Pdf URL: https://arxiv.org/pdf/2510.14305
Copy Paste: [[2510.14305]] MathMist: A Parallel Multilingual Benchmark Dataset for Mathematical Problem Solving and Reasoning(https://arxiv.org/abs/2510.14305)
Keywords: language model, llm, chain-of-thought
Abstract: Mathematical reasoning remains one of the most challenging domains for large language models (LLMs), requiring not only linguistic understanding but also structured logical deduction and numerical precision. While recent LLMs demonstrate strong general-purpose reasoning abilities, their mathematical competence across diverse languages remains underexplored. Existing benchmarks primarily focus on English or a narrow subset of high-resource languages, leaving significant gaps in assessing multilingual and cross-lingual mathematical reasoning. To address this, we introduce MathMist, a parallel multilingual benchmark for mathematical problem solving and reasoning. MathMist encompasses over 21K aligned question-answer pairs across seven languages, representing a balanced coverage of high-, medium-, and low-resource linguistic settings. The dataset captures linguistic variety, multiple types of problem settings, and solution synthesizing capabilities. We systematically evaluate a diverse suite of models, including open-source small and medium LLMs, proprietary systems, and multilingual-reasoning-focused models, under zero-shot, chain-of-thought (CoT), and code-switched reasoning paradigms. Our results reveal persistent deficiencies in LLMs' ability to perform consistent and interpretable mathematical reasoning across languages, with pronounced degradation in low-resource settings. All code and data are available on GitHub via the link on the arXiv abstract page.

Title: MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking

Authors: Sathyanarayanan Ramamoorthy, Vishwa Shah, Simran Khanuja, Zaid Sheikh, Shan Jie, Ann Chia, Shearman Chua, Graham Neubig
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14307
Pdf URL: https://arxiv.org/pdf/2510.14307
Copy Paste: [[2510.14307]] MERLIN: A Testbed for Multilingual Multimodal Entity Recognition and Linking(https://arxiv.org/abs/2510.14307)
Keywords: language model
Abstract: This paper introduces MERLIN, a novel testbed system for the task of Multilingual Multimodal Entity Linking. The dataset comprises BBC news article titles, paired with corresponding images, in five languages: Hindi, Japanese, Indonesian, Vietnamese, and Tamil, featuring over 7,000 named entity mentions linked to 2,500 unique Wikidata entities. We also include several benchmarks of multilingual and multimodal entity linking methods that explore language models such as LLaMa-2 and Aya-23. Our findings indicate that incorporating visual data improves the accuracy of entity linking, especially for entities where the textual context is ambiguous or insufficient, and particularly for models that do not have strong multilingual abilities. The dataset and methods are available via the link on the arXiv abstract page.

Title: Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL

Authors: Marwa Abdulhai, Ryan Cheng, Aryansh Shrivastava, Natasha Jaques, Yarin Gal, Sergey Levine
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14318
Pdf URL: https://arxiv.org/pdf/2510.14318
Copy Paste: [[2510.14318]] Evaluating & Reducing Deceptive Dialogue From Language Models with Multi-turn RL(https://arxiv.org/abs/2510.14318)
Keywords: language model, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) interact with millions of people worldwide in applications such as customer support, education and healthcare. However, their ability to produce deceptive outputs, whether intentionally or inadvertently, poses significant safety concerns. The unpredictable nature of LLM behavior, combined with insufficient safeguards against hallucination, misinformation, and user manipulation, makes their misuse a serious, real-world risk. In this paper, we investigate the extent to which LLMs engage in deception within dialogue, and propose the belief misalignment metric to quantify deception. We evaluate deception across four distinct dialogue scenarios, using five established deception detection metrics and our proposed metric. Our findings reveal this novel deception measure correlates more closely with human judgments than any existing metrics we test. Additionally, our benchmarking of eight state-of-the-art models indicates that LLMs naturally exhibit deceptive behavior in approximately 26% of dialogue turns, even when prompted with seemingly benign objectives. When prompted to deceive, LLMs are capable of increasing deceptiveness by as much as 31% relative to baselines. Unexpectedly, models trained with RLHF, the predominant approach for ensuring the safety of widely-deployed LLMs, still exhibit deception at a rate of 43% on average. Given that deception in dialogue is a behavior that develops over an interaction history, its effective evaluation and mitigation necessitates moving beyond single-utterance analyses. We introduce a multi-turn reinforcement learning methodology to fine-tune LLMs to reduce deceptive behaviors, leading to a 77.6% reduction compared to other instruction-tuned models.

Title: Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts

Authors: Perapard Ngokpol, Kun Kerdthaisong, Pasin Buakhaw, Pitikorn Khlaisamniang, Supasate Vorathammathorn, Piyalitt Ittichaiwong, Nutchanon Yongsatianchot
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14351
Pdf URL: https://arxiv.org/pdf/2510.14351
Copy Paste: [[2510.14351]] Beyond One World: Benchmarking Super Heros in Role-Playing Across Multiversal Contexts(https://arxiv.org/abs/2510.14351)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: Large language models (LLMs) are increasingly used as role-playing agents, yet their capacity to faithfully and consistently portray version-specific characters -- for example, superheroes across comic and cinematic universes -- remains underexplored. Superhero canons such as Marvel and DC provide a rich testbed: decades of storytelling yield multiple incarnations of the same character with distinct histories, values, and moral codes. To study this problem, we introduce Beyond One World, a benchmark for character-grounded roleplay spanning 30 iconic heroes and 90 canon-specific versions. The benchmark comprises two tasks: (i) Canon Events, which probes factual recall of pivotal life stages, and (ii) Moral Dilemmas, which confronts models with ethically charged scenarios. We score responses for canonical accuracy and reasoning fidelity under a framework that separates internal deliberation ("thinking") from outward decisions ("acting"). We further propose Think-Act Matching, a metric that quantifies alignment between reasons and actions and serves as a proxy for model trustworthiness. Experiments across reasoning- and non-reasoning-oriented models yield three findings: (1) chain-of-thought prompting improves narrative coherence in weaker models but can reduce canonical accuracy in stronger ones; (2) cross-version generalization within a character remains a major obstacle; and (3) models often excel at either thinking or acting, but rarely both. Beyond One World exposes critical gaps in multiversal consistency and reasoning alignment, offering a challenging evaluation for role-playing LLMs.

Title: CURE: Confidence-driven Unified Reasoning Ensemble Framework for Medical Question Answering

Authors: Ziad Elshaer, Essam A. Rashed
Subjects: cs.CL, cs.AI, physics.med-ph
Abstract URL: https://arxiv.org/abs/2510.14353
Pdf URL: https://arxiv.org/pdf/2510.14353
Copy Paste: [[2510.14353]] CURE: Confidence-driven Unified Reasoning Ensemble Framework for Medical Question Answering(https://arxiv.org/abs/2510.14353)
Keywords: language model, llm
Abstract: High-performing medical Large Language Models (LLMs) typically require extensive fine-tuning with substantial computational resources, limiting accessibility for resource-constrained healthcare institutions. This study introduces a confidence-driven multi-model framework that leverages model diversity to enhance medical question answering without fine-tuning. Our framework employs a two-stage architecture: a confidence detection module assesses the primary model's certainty, and an adaptive routing mechanism directs low-confidence queries to Helper models with complementary knowledge for collaborative reasoning. We evaluate our approach using Qwen3-30B-A3B-Instruct, Phi-4 14B, and Gemma 2 12B across three medical benchmarks: MedQA, MedMCQA, and PubMedQA. Results demonstrate that our framework achieves competitive performance, with particularly strong results on PubMedQA (95.0%) and MedMCQA (78.0%). Ablation studies confirm that confidence-aware routing combined with multi-model collaboration substantially outperforms single-model approaches and uniform reasoning strategies. This work establishes that strategic model collaboration offers a practical, computationally efficient pathway to improve medical AI systems, with significant implications for democratizing access to advanced medical AI in resource-limited settings.
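Illustrative sketch (not from the paper): the two-stage idea in the abstract, a confidence check followed by routing of low-confidence queries to helper models, can be pictured roughly as below. The stub models, the 0.7 threshold, and the majority vote used to combine answers are all assumptions; the abstract does not say how confidence is measured or how the helpers' answers are merged.

```python
from collections import Counter

def route(question, primary, helpers, threshold=0.7):
    """Return (answer, path). Models are callables returning (answer, confidence)."""
    answer, confidence = primary(question)
    if confidence >= threshold:
        # Stage 1: the primary model is confident enough, so no helper is called.
        return answer, "primary"
    # Stage 2: low confidence, so consult the helpers and take a majority vote
    # (a simple stand-in for the paper's collaborative reasoning step).
    votes = [answer] + [helper(question)[0] for helper in helpers]
    winner, _ = Counter(votes).most_common(1)[0]
    return winner, "ensemble"

# Hypothetical stub models for demonstration only.
def primary_model(question):
    return ("B", 0.9) if question == "easy" else ("A", 0.4)

def helper_one(question):
    return ("C", 0.8)

def helper_two(question):
    return ("C", 0.6)

easy_result = route("easy", primary_model, [helper_one, helper_two])
hard_result = route("hard", primary_model, [helper_one, helper_two])
```

Only the low-confidence query pays for the extra helper calls, which is the source of the efficiency the abstract claims for routing over uniformly applying every model.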

Title: On the Ability of LLMs to Handle Character-Level Perturbations: How Well and How?

Authors: Anyun Zhuo, Xuefei Ning, Ningyuan Li, Yu Wang, Pinyan Lu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14365
Pdf URL: https://arxiv.org/pdf/2510.14365
Copy Paste: [[2510.14365]] On the Ability of LLMs to Handle Character-Level Perturbations: How Well and How?(https://arxiv.org/abs/2510.14365)
Keywords: llm
Abstract: This work investigates the resilience of contemporary LLMs against frequent and structured character-level perturbations, specifically through the insertion of noisy characters after each input character. We introduce a practical method that inserts invisible Unicode control characters into text to discourage LLM misuse in scenarios such as online exam systems. Surprisingly, despite strong obfuscation that fragments tokenization and reduces the signal-to-noise ratio significantly, many LLMs still maintain notable performance. Through comprehensive evaluation across model-, problem-, and noise-related configurations, we examine the extent and mechanisms of this robustness, exploring both how character-level tokenization is handled and the hypotheses of implicit versus explicit denoising of character-level noise. We hope our findings on the low-level robustness of LLMs will shed light on the risks of their misuse and on the reliability of deploying LLMs across diverse applications.
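Illustrative sketch (not from the paper): the perturbation described in the abstract, one noisy character inserted after every input character, can be reproduced in a few lines. The choice of U+200B (ZERO WIDTH SPACE) is an assumption; the abstract only says "invisible Unicode control characters", and U+200B is strictly a format character rather than a control character.

```python
ZWSP = "\u200b"  # assumed noise character: ZERO WIDTH SPACE, invisible when rendered

def perturb(text, noise=ZWSP):
    """Insert one noise character after every character of the input."""
    return "".join(ch + noise for ch in text)

def denoise(text, noise=ZWSP):
    """Explicit denoising: strip every occurrence of the noise character."""
    return text.replace(noise, "")

original = "What is 2 + 2?"
noisy = perturb(original)
restored = denoise(noisy)
```

The noisy string renders almost identically to the original but is twice as long, so half of its characters are noise and a tokenizer sees it broken into much smaller pieces.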

Title: From Binary to Bilingual: How the National Weather Service is Using Artificial Intelligence to Develop a Comprehensive Translation Program

Authors: Joseph E. Trujillo-Falcon, Monica L. Bozeman, Liam E. Llewellyn, Samuel T. Halvorson, Meryl Mizell, Stuti Deshpande, Bob Manning, Todd Fagin
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2510.14369
Pdf URL: https://arxiv.org/pdf/2510.14369
Copy Paste: [[2510.14369]] From Binary to Bilingual: How the National Weather Service is Using Artificial Intelligence to Develop a Comprehensive Translation Program(https://arxiv.org/abs/2510.14369)
Keywords: language model, llm
Abstract: To advance a Weather-Ready Nation, the National Weather Service (NWS) is developing a systematic translation program to better serve the 68.8 million people in the U.S. who do not speak English at home. This article outlines the foundation of an automated translation tool for NWS products, powered by artificial intelligence. The NWS has partnered with LILT, whose patented training process enables large language models (LLMs) to adapt neural machine translation (NMT) tools for weather terminology and messaging. Designed for scalability across Weather Forecast Offices (WFOs) and National Centers, the system is currently being developed in Spanish, Simplified Chinese, Vietnamese, and other widely spoken non-English languages. Rooted in best practices for multilingual risk communication, the system provides accurate, timely, and culturally relevant translations, significantly reducing manual translation time and easing operational workloads across the NWS. To guide the distribution of these products, GIS mapping was used to identify language needs across different NWS regions, helping prioritize resources for the communities that need them most. We also integrated ethical AI practices throughout the program's design, ensuring that transparency, fairness, and human oversight guide how automated translations are created, evaluated, and shared with the public. This work has culminated in a website featuring experimental multilingual NWS products, including translated warnings, 7-day forecasts, and educational campaigns, bringing the country one step closer to a national warning system that reaches all Americans.

Title: PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora

Authors: Mykolas Sveistrys, Richard Kunert
Subjects: cs.CL, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14377
Pdf URL: https://arxiv.org/pdf/2510.14377
Copy Paste: [[2510.14377]] PluriHop: Exhaustive, Recall-Sensitive QA over Distractor-Rich Corpora(https://arxiv.org/abs/2510.14377)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Recent advances in large language models (LLMs) and retrieval-augmented generation (RAG) have enabled progress on question answering (QA) when relevant evidence is in one (single-hop) or multiple (multi-hop) passages. Yet many realistic questions about recurring report data (medical records, compliance filings, maintenance logs) require aggregation across all documents, with no clear stopping point for retrieval and high sensitivity to even one missed passage. We term these pluri-hop questions and formalize them by three criteria: recall sensitivity, exhaustiveness, and exactness. To study this setting, we introduce PluriHopWIND, a diagnostic multilingual dataset of 48 pluri-hop questions built from 191 real-world wind industry reports in German and English. We show that PluriHopWIND is 8-40% more repetitive than other common datasets and thus has a higher density of distractor documents, better reflecting the practical challenges of recurring report corpora. We test a traditional RAG pipeline as well as graph-based and multimodal variants, and find that none of the tested approaches exceeds 40% in statement-wise F1 score. Motivated by this, we propose PluriHopRAG, a RAG architecture that follows a "check all documents individually, filter cheaply" approach: it (i) decomposes queries into document-level subquestions and (ii) uses a cross-encoder filter to discard irrelevant documents before costly LLM reasoning. We find that PluriHopRAG achieves relative F1 score improvements of 18-52% depending on the base LLM. Despite its modest size, PluriHopWIND exposes the limitations of current QA systems on repetitive, distractor-rich corpora. PluriHopRAG's performance highlights the value of exhaustive retrieval and early filtering as a powerful alternative to top-k methods.
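Illustrative sketch (not from the paper): the "check all documents individually, filter cheaply" pattern from the abstract can be pictured as a loop over every document, with an inexpensive relevance test in front of the expensive reader. The substring filter below stands in for the cross-encoder, the regex reader stands in for LLM reasoning, and the reports are invented for the example.

```python
import re

def cheap_filter(terms, document):
    """Stand-in for the cross-encoder filter: keep documents mentioning every term."""
    text = document.lower()
    return all(term in text for term in terms)

def expensive_reader(document):
    """Stand-in for LLM reasoning: extract the first 'NN MWh' figure, if any."""
    match = re.search(r"(\d+)\s*MWh", document)
    return int(match.group(1)) if match else None

def answer_over_all(documents, terms):
    """Visit every document (no top-k cutoff); read only those that pass the filter."""
    reader_calls = 0
    values = []
    for doc in documents:
        if not cheap_filter(terms, doc):
            continue  # discarded before any costly step
        reader_calls += 1
        value = expensive_reader(doc)
        if value is not None:
            values.append(value)
    return sum(values), reader_calls

# Hypothetical recurring reports, including one distractor.
reports = [
    "Turbine A monthly report: energy output 120 MWh.",
    "Turbine A maintenance log: gearbox oil replaced.",
    "Turbine B monthly report: energy output 95 MWh.",
    "Turbine C monthly report: energy output 0 MWh, turbine offline.",
]
total, llm_calls = answer_over_all(reports, ["energy output"])
```

A top-k retriever with k=2 would have dropped at least one relevant report here and returned the wrong total; the exhaustive loop reads all three relevant reports and still skips the distractor.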

Title: Suicidal Comment Tree Dataset: Enhancing Risk Assessment and Prediction Through Contextual Analysis

Authors: Jun Li, Qun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14395
Pdf URL: https://arxiv.org/pdf/2510.14395
Copy Paste: [[2510.14395]] Suicidal Comment Tree Dataset: Enhancing Risk Assessment and Prediction Through Contextual Analysis(https://arxiv.org/abs/2510.14395)
Keywords: language model, llm
Abstract: Suicide remains a critical global public health issue. While previous studies have provided valuable insights into detecting suicidal expressions in individual social media posts, limited attention has been paid to the analysis of longitudinal, sequential comment trees for predicting a user's evolving suicidal risk. Users, however, often reveal their intentions through historical posts and interactive comments over time. This study addresses this gap by investigating how the information in comment trees affects both the discrimination and prediction of users' suicidal risk levels. We constructed a high-quality annotated dataset, sourced from Reddit, which incorporates users' posting history and comments, using a refined four-label annotation framework based on the Columbia Suicide Severity Rating Scale (C-SSRS). Statistical analysis of the dataset, along with results from experiments with Large Language Models (LLMs), demonstrates that incorporating comment-tree data significantly enhances the discrimination and prediction of user suicidal risk levels. This research offers a novel insight into enhancing the detection accuracy of at-risk individuals, thereby providing a valuable foundation for early suicide intervention strategies.

Title: Your Next Token Prediction: A Multilingual Benchmark for Personalized Response Generation

Authors: Shiyao Ding, Takayuki Ito
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14398
Pdf URL: https://arxiv.org/pdf/2510.14398
Copy Paste: [[2510.14398]] Your Next Token Prediction: A Multilingual Benchmark for Personalized Response Generation(https://arxiv.org/abs/2510.14398)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) excel at general next-token prediction but still struggle to generate responses that reflect how individuals truly communicate, such as replying to emails or social messages in their own style. However, real SNS or email histories are difficult to collect due to privacy concerns. To address this, we propose the task of "Your Next Token Prediction (YNTP)", which models a user's precise word choices through controlled human-agent conversations. We build a multilingual benchmark of 100 dialogue sessions across English, Japanese, and Chinese, where users interact for five days with psychologically grounded NPCs based on MBTI dimensions. This setup captures natural, daily-life communication patterns and enables analysis of users' internal models. We evaluate prompt-based and fine-tuning-based personalization methods, establishing the first benchmark for YNTP and a foundation for user-aligned language modeling. The dataset is available via the link on the arXiv abstract page.

Title: MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering

Authors: Yingpeng Ning, Yuanyuan Sun, Ling Luo, Yanhua Wang, Yuchen Pan, Hongfei Lin
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.14400
Pdf URL: https://arxiv.org/pdf/2510.14400
Copy Paste: [[2510.14400]] MedTrust-RAG: Evidence Verification and Trust Alignment for Biomedical Question Answering(https://arxiv.org/abs/2510.14400)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Biomedical question answering (QA) requires accurate interpretation of complex medical knowledge. Large language models (LLMs) have shown promising capabilities in this domain, with retrieval-augmented generation (RAG) systems enhancing performance by incorporating external medical literature. However, RAG-based approaches in biomedical QA suffer from hallucinations due to post-retrieval noise and insufficient verification of retrieved evidence, undermining response reliability. We propose MedTrust-Guided Iterative RAG, a framework designed to enhance factual consistency and mitigate hallucinations in medical QA. Our method introduces three key innovations. First, it enforces citation-aware reasoning by requiring all generated content to be explicitly grounded in retrieved medical documents, with structured Negative Knowledge Assertions used when evidence is insufficient. Second, it employs an iterative retrieval-verification process, where a verification agent assesses evidence adequacy and refines queries through Medical Gap Analysis until reliable information is obtained. Third, it integrates the MedTrust-Align Module (MTAM) that combines verified positive examples with hallucination-aware negative samples, leveraging Direct Preference Optimization to reinforce citation-grounded reasoning while penalizing hallucination-prone response patterns. Experiments on MedMCQA, MedQA, and MMLU-Med demonstrate that our approach consistently outperforms competitive baselines across multiple model architectures, achieving the best average accuracy with gains of 2.7% for LLaMA3.1-8B-Instruct and 2.4% for Qwen3-8B.

Title: Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following

Authors: Qingyu Ren, Qianyu He, Bowei Zhang, Jie Zeng, Jiaqing Liang, Yanghua Xiao, Weikang Zhou, Zeye Sun, Fei Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14420
Pdf URL: https://arxiv.org/pdf/2510.14420
Copy Paste: [[2510.14420]] Instructions are all you need: Self-supervised Reinforcement Learning for Instruction Following(https://arxiv.org/abs/2510.14420)
Keywords: language model, agent
Abstract: Language models often struggle to follow multi-constraint instructions that are crucial for real-world applications. Existing reinforcement learning (RL) approaches suffer from dependency on external supervision and sparse reward signals from multi-constraint tasks. We propose a label-free self-supervised RL framework that eliminates dependency on external supervision by deriving reward signals directly from instructions and generating pseudo-labels for reward model training. Our approach introduces constraint decomposition strategies and efficient constraint-wise binary classification to address sparse reward challenges while maintaining computational efficiency. Experiments show that our approach generalizes well, achieving strong improvements across 3 in-domain and 5 out-of-domain datasets, including challenging agentic and multi-turn instruction following. The data and code are publicly available via the link on the arXiv abstract page.
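Illustrative sketch (not from the paper): the sparse-reward problem and the constraint-wise remedy mentioned in the abstract can be seen by scoring one response against a decomposed instruction. The rule-based checkers below are hypothetical stand-ins for the paper's learned constraint-wise binary classifier, and averaging the per-constraint flags is an assumed aggregation.

```python
def max_words(limit):
    """Constraint: the response has at most `limit` whitespace-separated words."""
    return lambda response: len(response.split()) <= limit

def must_include(word):
    """Constraint: the response mentions `word` (case-insensitive)."""
    return lambda response: word.lower() in response.lower()

def ends_with(suffix):
    """Constraint: the response ends with `suffix`, ignoring trailing whitespace."""
    return lambda response: response.rstrip().endswith(suffix)

def constraint_flags(response, constraints):
    """One binary judgement per decomposed constraint."""
    return [1 if check(response) else 0 for check in constraints]

def dense_reward(response, constraints):
    """Fraction of constraints satisfied: partial credit gives a graded signal."""
    flags = constraint_flags(response, constraints)
    return sum(flags) / len(flags)

def sparse_reward(response, constraints):
    """All-or-nothing reward: 1.0 only if every constraint holds."""
    return 1.0 if all(constraint_flags(response, constraints)) else 0.0

# Hypothetical instruction: "Reply in at most 10 words, mention 'refund', end with a question."
constraints = [max_words(10), must_include("refund"), ends_with("?")]
response = "We can process your refund today."
flags = constraint_flags(response, constraints)
```

The response meets two of the three constraints, so the all-or-nothing reward is zero while the constraint-wise reward still distinguishes it from a response that meets none.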

Title: Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents

Authors: Rui Wang, Ce Zhang, Jun-Yu Ma, Jianshu Zhang, Hongru Wang, Yi Chen, Boyang Xue, Tianqing Fang, Zhisong Zhang, Hongming Zhang, Haitao Mi, Dong Yu, Kam-Fai Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14438
Pdf URL: https://arxiv.org/pdf/2510.14438
Copy Paste: [[2510.14438]] Explore to Evolve: Scaling Evolved Aggregation Logic via Proactive Online Exploration for Deep Research Agents(https://arxiv.org/abs/2510.14438)
Keywords: gpt, agent
Abstract: Deep research web agents not only retrieve information from diverse sources such as web environments, files, and multimodal inputs, but more importantly, they need to rigorously analyze and aggregate knowledge for insightful research. However, existing open-source deep research agents predominantly focus on enhancing information-seeking capabilities of web agents to locate specific information, while overlooking the essential need for information aggregation, which would limit their ability to support in-depth research. We propose an Explore to Evolve paradigm to scalably construct verifiable training data for web agents. Begins with proactive online exploration, an agent sources grounded information by exploring the real web. Using the collected evidence, the agent then self-evolves an aggregation program by selecting, composing, and refining operations from 12 high-level logical types to synthesize a verifiable QA pair. This evolution from high-level guidance to concrete operations allowed us to scalably produce WebAggregatorQA, a dataset of 10K samples across 50K websites and 11 domains. Based on an open-source agent framework, SmolAgents, we collect supervised fine-tuning trajectories to develop a series of foundation models, WebAggregator. WebAggregator-8B matches the performance of GPT-4.1, while the 32B variant surpasses GPT-4.1 by more than 10% on GAIA-text and closely approaches Claude-3.7-sonnet. Moreover, given the limited availability of benchmarks that evaluate web agents' information aggregation abilities, we construct a human-annotated evaluation split of WebAggregatorQA as a challenging test set. On this benchmark, Claude-3.7-sonnet only achieves 28%, and GPT-4.1 scores 25.8%. Even when agents manage to retrieve all references, they still struggle on WebAggregatorQA, highlighting the need to strengthen the information aggregation capabilities of web agent foundations.

Title: Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents

Authors: Reid T. Johnson, Michelle D. Pain, Jordan D. West
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14453
Pdf URL: https://arxiv.org/pdf/2510.14453
Copy Paste: [[2510.14453]] Natural Language Tools: A Natural Language Approach to Tool Calling In Large Language Agents(https://arxiv.org/abs/2510.14453)
Keywords: language model, llm, prompt, agent
Abstract: We present Natural Language Tools (NLT), a framework that replaces programmatic JSON tool calling in large language models (LLMs) with natural language outputs. By decoupling tool selection from response generation, NLT eliminates task interference and format constraints that degrade tool call performance. When evaluated across 10 models and 6,400 trials spanning customer service and mental health domains, NLT improves tool calling accuracy by 18.4 percentage points while reducing output variance by 70%. Open-weight models see the largest gains, surpassing flagship closed-weight alternatives, with implications for model training in both reinforcement learning and supervised fine-tuning stages. These improvements persist under prompt perturbations and extend tool-calling capabilities to models lacking native support.
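
To make the idea of selecting tools through natural language rather than JSON concrete, here is a minimal sketch. It assumes a hypothetical output format in which the model lists each tool name followed by YES or NO; the format, the tool names, and the `parse_tool_selection` helper are illustrative assumptions, not the paper's actual prompt or parser.

```python
import re

def parse_tool_selection(model_output: str, known_tools: list[str]) -> list[str]:
    """Return the known tools the model marked YES, in the order of known_tools."""
    selected = set()
    for line in model_output.splitlines():
        # Hypothetical line format: "<tool name> - YES" or "<tool name>: no"
        match = re.match(r"^\s*(.+?)\s*[-:]\s*(yes|no)\s*$", line, flags=re.IGNORECASE)
        if match and match.group(2).lower() == "yes":
            selected.add(match.group(1).strip().lower())
    # Only names from known_tools can be returned, so unknown names are ignored.
    return [tool for tool in known_tools if tool.lower() in selected]

# Illustrative model output and tool list (both made up for this sketch).
output = "check_order_status - YES\nissue_refund - NO\nescalate_to_human: yes"
tools = ["check_order_status", "issue_refund", "escalate_to_human"]
print(parse_tool_selection(output, tools))
```

Because the selection step only emits plain text, the response itself can be generated in a separate call, which is one way to read the decoupling the abstract describes.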

Title: LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models

Authors: Haolin Li, Haipeng Zhang, Mang Li, Yaohua Wang, Lijie Wen, Yu Zhang, Biqing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14466
Pdf URL: https://arxiv.org/pdf/2510.14466
Copy Paste: [[2510.14466]] LiRA: Linguistic Robust Anchoring for Cross-lingual Large Language Models(https://arxiv.org/abs/2510.14466)
Keywords: language model, llm, agent
Abstract: As large language models (LLMs) rapidly advance, performance on high-resource languages (e.g., English, Chinese) is nearing saturation, yet remains substantially lower for low-resource languages (e.g., Urdu, Thai) due to limited training data, machine-translation noise, and unstable cross-lingual alignment. We introduce LiRA (Linguistic Robust Anchoring for Large Language Models), a training framework that robustly improves cross-lingual representations under low-resource conditions while jointly strengthening retrieval and reasoning. LiRA comprises two modules: (i) Arca (Anchored Representation Composition Architecture), which anchors low-resource languages to an English semantic space via anchor-based alignment and multi-agent collaborative encoding, preserving geometric stability in a shared embedding space; and (ii) LaSR (Language-coupled Semantic Reasoner), which adds a language-aware lightweight reasoning head with consistency regularization on top of Arca's multilingual representations, unifying the training objective to enhance cross-lingual understanding, retrieval, and reasoning robustness. We further construct and release a multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Experiments across low-resource benchmarks (cross-lingual retrieval, semantic similarity, and reasoning) show consistent gains and robustness under few-shot and noise-amplified settings; ablations validate the contribution of both Arca and LaSR. Code will be released on GitHub and the dataset on Hugging Face.

Title: Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLMs

Authors: Kyubyung Chae, Gihoon Kim, Gyuseong Lee, Taesup Kim, Jaejin Lee, Heejin Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14565
Pdf URL: https://arxiv.org/pdf/2510.14565
Copy Paste: [[2510.14565]] Assessing Socio-Cultural Alignment and Technical Safety of Sovereign LLMs(https://arxiv.org/abs/2510.14565)
Keywords: llm
Abstract: Recent trends in LLM development clearly show growing interest in the use and application of sovereign LLMs. The global debate over sovereign LLMs highlights the need for governments to develop their own LLMs, tailored to their unique socio-cultural and historical contexts. However, there remains a shortage of frameworks and datasets to verify two critical questions: (1) how well these models align with users' socio-cultural backgrounds, and (2) whether they maintain safety and technical robustness without exposing users to potential harms and risks. To address this gap, we construct a new dataset and introduce an analytic framework for extracting and evaluating the socio-cultural elements of sovereign LLMs, alongside assessments of their technical robustness. Our experimental results demonstrate that while sovereign LLMs play a meaningful role in supporting low-resource languages, they do not always meet the popular claim that these models serve their target users well. We also show that pursuing this untested claim may lead to underestimating critical quality attributes such as safety. Our study suggests that advancing sovereign LLMs requires a more extensive evaluation that incorporates a broader range of well-grounded and practical criteria.

Title: Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures

Authors: Shuangshuang Ying, Yunwen Li, Xingwei Qu, Xin Li, Sheng Jin, Minghao Liu, Zhoufutu Wen, Xeron Du, Tianyu Zheng, Yichi Zhang, Letian Ni, Yuyang Cheng, Qiguang Chen, Jingzhe Ding, Shengda Long, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Libo Qin, Ge Zhang, Wenhao Huang, Wanxiang Che, Chenghua Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14616
Pdf URL: https://arxiv.org/pdf/2510.14616
Copy Paste: [[2510.14616]] Beyond Correctness: Evaluating Subjective Writing Preferences Across Cultures(https://arxiv.org/abs/2510.14616)
Keywords: language model
Abstract: Current preference learning methods achieve high accuracy on standard benchmarks but exhibit significant performance degradation when objective quality signals are removed. We introduce WritingPreferenceBench, a dataset of 1,800 human-annotated preference pairs (1,200 English, 600 Chinese) across 8 creative writing genres, where responses are matched for objective correctness, factual accuracy, and length. On this benchmark, sequence-based reward models--the standard architecture for RLHF--achieve only 52.7% mean accuracy, while zero-shot language model judges perform at 53.9%. In contrast, generative reward models that produce explicit reasoning chains achieve 81.8% accuracy. We observe high within-model variance across genres: individual models range from 18.2% to 81.8% accuracy across different writing categories, with standard deviations averaging 10.1%. This variance persists regardless of model scale, with 27B parameter models showing no consistent improvement over 8B variants. Our results suggest that current RLHF methods primarily learn to detect objective errors rather than capture subjective quality preferences (e.g., creativity, stylistic flair, and emotional resonance), and that successful preference modeling may require intermediate reasoning representations rather than direct classification.
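
The accuracy and variance figures above rest on a pairwise metric that can be sketched as follows. This is a minimal illustration, assuming each preference pair has already been scored by some reward model; the scores and genre names are made-up placeholders, not data from WritingPreferenceBench.

```python
from statistics import mean, pstdev

def pairwise_accuracy(pairs):
    """pairs: list of (score_preferred, score_rejected) from a reward model."""
    correct = sum(1 for preferred, rejected in pairs if preferred > rejected)
    return correct / len(pairs)

# Hypothetical reward scores grouped by genre (illustrative numbers only).
scores_by_genre = {
    "poetry": [(0.9, 0.4), (0.2, 0.7), (0.6, 0.5), (0.3, 0.8)],
    "fiction": [(0.8, 0.1), (0.7, 0.3), (0.4, 0.6), (0.9, 0.2)],
}

# Accuracy per genre, then the mean and spread across genres.
per_genre = {genre: pairwise_accuracy(pairs) for genre, pairs in scores_by_genre.items()}
print(per_genre)
print(mean(per_genre.values()), pstdev(per_genre.values()))
```

On balanced pairs a model that scores at random sits near 50% on this metric, which is the baseline against which the reported 52.7% and 53.9% figures should be read.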

Title: Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models

Authors: Kedi Chen, Zhikai Lei, Xu Guo, Xuecheng Wu, Siyuan Zeng, Jianghao Yin, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Qipeng Guo, Kai Chen, Wei Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14620
Pdf URL: https://arxiv.org/pdf/2510.14620
Copy Paste: [[2510.14620]] Code-driven Number Sequence Calculation: Enhancing the inductive Reasoning Abilities of Large Language Models(https://arxiv.org/abs/2510.14620)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have made remarkable progress in reasoning tasks. Among different reasoning modes, inductive reasoning, due to its better alignment with human learning, attracts increasing interest. However, research on inductive reasoning faces certain challenges. First, existing inductive data mostly focuses on superficial regularities while lacking more complex internal patterns. Second, current works merely prompt LLMs or finetune on simple prompt-response pairs, but neither provide precise thinking processes nor implement difficulty control. Unlike previous work, we address these challenges by introducing CodeSeq, a synthetic post-training dataset built from number sequences. We package number sequences into algorithmic problems to discover their general terms, defining a general term generation (GTG) task correspondingly. Our pipeline generates supervised finetuning data by reflecting on failed test cases and incorporating iterative corrections, thereby teaching LLMs to learn autonomous case generation and self-checking. Additionally, it leverages reinforcement learning with a novel Case-Synergy Solvability Scaling Reward based on both solvability, estimated from the problem pass rate, and the success rate of self-directed case generation, enabling models to learn more effectively from both successes and failures. Experimental results show that the models trained with CodeSeq improve on various reasoning tasks and can preserve the models' OOD performance.

Title: RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF

Authors: Qing Yang, Zhenghao Liu, Junxin Wang, Yangfan Du, Pengcheng Huang, Tong Xiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14628
Pdf URL: https://arxiv.org/pdf/2510.14628
Copy Paste: [[2510.14628]] RLAIF-SPA: Optimizing LLM-based Emotional Speech Synthesis via RLAIF(https://arxiv.org/abs/2510.14628)
Keywords: language model, llm, chat
Abstract: Text-To-Speech synthesis has achieved near-human quality in neutral speech, but emotional expressiveness remains a challenge. Existing methods often rely on costly emotion annotations or optimize indirect objectives that fail to capture the emotional expressiveness and perceptual naturalness of speech, leading to generated speech that is accurate but emotionally flat. To address these challenges, we propose the RLAIF-SPA framework, which incorporates a Reinforcement Learning from AI Feedback (RLAIF) mechanism in which Automatic Speech Recognition (ASR) and Large Language Model (LLM) techniques judge semantic accuracy and prosodic-emotional label alignment, respectively, as a direct reward for emotional expressiveness and intelligibility optimization. Specifically, it leverages Prosodic Label Alignment to enhance expressive quality by jointly considering semantic accuracy and prosodic-emotional alignment along four fine-grained dimensions: Structure, Emotion, Speed, and Tone. In addition, it incorporates Semantic Accuracy Feedback to ensure the generation of clear and accurate speech. Experiments on the LibriSpeech dataset show that RLAIF-SPA outperforms Chat-TTS, with a 26.1% reduction in WER, a 9.1% increase in SIM-O, and over 10% improvement in human evaluation.

Title: Intent Clustering with Shared Pseudo-Labels

Authors: I-Fan Lin, Faegheh Hasibi, Suzan Verberne
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.14640
Pdf URL: https://arxiv.org/pdf/2510.14640
Copy Paste: [[2510.14640]] Intent Clustering with Shared Pseudo-Labels(https://arxiv.org/abs/2510.14640)
Keywords: llm
Abstract: In this paper, we propose an intuitive, training-free and label-free method for intent clustering that makes minimal assumptions using lightweight and open-source LLMs. Many current approaches rely on commercial LLMs, which are costly and offer limited transparency. Additionally, their methods often explicitly depend on knowing the number of clusters in advance, which is often not the case in realistic settings. To address these challenges, instead of asking the LLM to match similar text directly, we first ask it to generate pseudo-labels for each text, and then perform multi-label classification in this pseudo-label set for each text. This approach is based on the hypothesis that texts belonging to the same cluster will share more labels, and will therefore be closer when encoded into embeddings. These pseudo-labels are more human-readable than direct similarity matches. Our evaluation on four benchmark sets shows that our approach achieves results comparable to or better than recent baselines, while remaining simple and computationally efficient. Our findings indicate that our method can be applied in low-resource scenarios and is stable across multiple models and datasets.
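
The shared-label hypothesis can be illustrated with a small sketch. Note the substitution: the abstract compares texts through embeddings of their pseudo-labels, whereas this example uses plain Jaccard overlap between label sets and a greedy threshold rule as a simpler stand-in. The labels, the threshold, and the function names are all hypothetical.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Overlap between two pseudo-label sets (1.0 = identical, 0.0 = disjoint)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def greedy_cluster(label_sets: list[set[str]], threshold: float = 0.5) -> list[int]:
    """Assign each text to the first cluster whose seed text shares enough labels."""
    seeds: list[int] = []       # index of the first text placed in each cluster
    assignment: list[int] = []
    for i, labels in enumerate(label_sets):
        for cluster_id, seed in enumerate(seeds):
            if jaccard(labels, label_sets[seed]) >= threshold:
                assignment.append(cluster_id)
                break
        else:
            # No existing cluster is close enough, so this text starts a new one.
            seeds.append(i)
            assignment.append(len(seeds) - 1)
    return assignment

# Made-up pseudo-labels an LLM might assign to four user utterances.
pseudo_labels = [
    {"refund", "payment"},             # "I want my money back"
    {"refund", "payment", "order"},    # "Refund my last order"
    {"password", "login"},             # "I can't sign in"
    {"login", "password", "account"},  # "Reset my account password"
]
print(greedy_cluster(pseudo_labels))
```

The number of clusters is not passed in; it falls out of the threshold, which mirrors the abstract's point about not knowing the cluster count in advance.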

Title: An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs

Authors: Linyue Ma, Yilong Xu, Xiang Long, Zhi Zheng
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.14660
Pdf URL: https://arxiv.org/pdf/2510.14660
Copy Paste: [[2510.14660]] An Efficient Rubric-based Generative Verifier for Search-Augmented LLMs(https://arxiv.org/abs/2510.14660)
Keywords: language model, llm
Abstract: Search augmentation empowers Large Language Models with retrieval capabilities to overcome the limitations imposed by static parameters. Recently, Reinforcement Learning leverages tailored reward signals as a viable technique to enhance LLMs performing tasks involving search. However, existing reward modeling for search-augmented LLMs faces several limitations. Rule-based rewards, such as Exact Match, are verifiable but fragile to variations in expression and cannot be applied to long-form workloads. In contrast, generative rewards improve robustness, but designing verifiable and stable rewards for long-form workloads in dynamic corpora remains challenging and also incurs high computational costs. In this paper, we propose a unified and verifiable paradigm, "nugget-as-rubric", which treats atomic information points as structured evaluation criteria for different search-augmentation workloads. Short-form tasks correspond to a single rubric, whereas long-form tasks expand to multiple rubrics aligned with the question's information needs. To support long-form settings, we design an automatic rubric construction pipeline based on query rewriting, which can automatically retrieve passages relevant to each question and extract rubrics from them, both from static corpora and from dynamic online web content. Furthermore, we introduce Search-Gen-V, a 4B-parameter efficient generative verifier under our proposed verifiable paradigm, which is trained via the idea of distillation and a two-stage strategy. Experimental results show that Search-Gen-V achieves strong verification accuracy across different workloads, making it a scalable, robust, and efficient verifiable reward constructor for search-augmented LLMs.
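
The single-rubric versus multi-rubric distinction can be sketched in a few lines. This is an illustration only: aggregating rubric checks into a fraction is an assumption, and `substring_verifier` is a trivial stand-in for a generative verifier such as Search-Gen-V, whose interface the abstract does not specify.

```python
from typing import Callable

def rubric_reward(answer: str, rubrics: list[str],
                  verify: Callable[[str, str], bool]) -> float:
    """Fraction of rubrics (atomic information points) the answer satisfies."""
    if not rubrics:
        return 0.0
    return sum(verify(answer, rubric) for rubric in rubrics) / len(rubrics)

# Stand-in verifier: case-insensitive substring match. A generative verifier
# would replace this function.
def substring_verifier(answer: str, rubric: str) -> bool:
    return rubric.lower() in answer.lower()

# Short-form task: a single rubric, so the reward is either 0.0 or 1.0.
short_form = rubric_reward("The capital is Paris.", ["Paris"], substring_verifier)

# Long-form task: several rubrics, so the reward reflects partial coverage.
long_form = rubric_reward(
    "Paris is the capital of France and lies on the Seine.",
    ["Paris", "Seine", "Louvre"],
    substring_verifier,
)
print(short_form, long_form)
```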

Title: Speculative Model Risk in Healthcare AI: Using Storytelling to Surface Unintended Harms

Authors: Xingmeng Zhao, Dan Schumacher, Veronica Rammouz, Anthony Rios
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14718
Pdf URL: https://arxiv.org/pdf/2510.14718
Copy Paste: [[2510.14718]] Speculative Model Risk in Healthcare AI: Using Storytelling to Surface Unintended Harms(https://arxiv.org/abs/2510.14718)
Keywords: chat, agent
Abstract: Artificial intelligence (AI) is rapidly transforming healthcare, enabling fast development of tools like stress monitors, wellness trackers, and mental health chatbots. However, rapid and low-barrier development can introduce risks of bias, privacy violations, and unequal access, especially when systems ignore real-world contexts and diverse user needs. Many recent methods use AI to detect risks automatically, but this can reduce human engagement in understanding how harms arise and who they affect. We present a human-centered framework that generates user stories and supports multi-agent discussions to help people think creatively about potential benefits and harms before deployment. In a user study, participants who read stories recognized a broader range of harms, distributing their responses more evenly across all 13 harm types. In contrast, those who did not read stories focused primarily on privacy and well-being (58.3%). Our findings show that storytelling helped participants speculate about a broader range of harms and benefits and think more creatively about AI's impact on users.

Title: AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning

Authors: Mengzhao Jia, Zhihan Zhang, Ignacio Cases, Zheyuan Liu, Meng Jiang, Peng Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14738
Pdf URL: https://arxiv.org/pdf/2510.14738
Copy Paste: [[2510.14738]] AutoRubric-R1V: Rubric-Based Generative Rewards for Faithful Multimodal Reasoning(https://arxiv.org/abs/2510.14738)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) have rapidly advanced from perception tasks to complex multi-step reasoning, yet reinforcement learning with verifiable rewards (RLVR) often leads to spurious reasoning since only the final-answer correctness is rewarded. To address this limitation, we propose AutoRubric-R1V, a framework that integrates RLVR with process-level supervision through automatically collected rubric-based generative rewards. Our key innovation lies in a scalable self-aggregation method that distills consistent reasoning checkpoints from successful trajectories, enabling problem-specific rubric construction without human annotation or stronger teacher models. By jointly leveraging rubric-based and outcome rewards, AutoRubric-R1V achieves state-of-the-art performance on six multimodal reasoning benchmarks and substantially improves reasoning faithfulness in dedicated evaluations.

Title: Pluto: A Benchmark for Evaluating Efficiency of LLM-generated Hardware Code

Authors: Manar Abdelatty, Maryam Nouh, Jacob K. Rosenstein, Sherief Reda
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14756
Pdf URL: https://arxiv.org/pdf/2510.14756
Copy Paste: [[2510.14756]] Pluto: A Benchmark for Evaluating Efficiency of LLM-generated Hardware Code(https://arxiv.org/abs/2510.14756)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are increasingly used to automate hardware design tasks, including the generation of Verilog code. While early benchmarks focus primarily on functional correctness, efficient hardware design demands additional optimization for synthesis metrics such as area, delay, and power. Existing benchmarks fall short in evaluating these aspects comprehensively: they often lack optimized baselines or testbenches for verification. To address these gaps, we present Pluto, a benchmark and evaluation framework designed to assess the efficiency of LLM-generated Verilog designs. Pluto presents a comprehensive evaluation set of 114 problems with self-checking testbenches and multiple Pareto-optimal reference implementations. Experimental results show that state-of-the-art LLMs can achieve high functional correctness, reaching 78.3% at pass@1, but their synthesis efficiency still lags behind expert-crafted implementations, with area efficiency of 63.8%, delay efficiency of 65.9%, and power efficiency of 64.0% at eff@1. This highlights the need for efficiency-aware evaluation frameworks such as Pluto to drive progress in hardware-focused LLM research.

Title: COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes

Authors: Yunwen Li, Shuangshuang Ying, Xingwei Qu, Xin Li, Sheng Jin, Minghao Liu, Zhoufutu Wen, Tianyu Zheng, Xeron Du, Qiguang Chen, Jiajun Shi, Wangchunshu Zhou, Jiazhan Feng, Wanjun Zhong, Libo Qin, Stephen Huang, Wanxiang Che, Chenghua Lin, Eli Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14763
Pdf URL: https://arxiv.org/pdf/2510.14763
Copy Paste: [[2510.14763]] COIG-Writer: A High-Quality Dataset for Chinese Creative Writing with Thought Processes(https://arxiv.org/abs/2510.14763)
Keywords: language model, prompt
Abstract: Large language models exhibit systematic deficiencies in creative writing, particularly in non-English contexts where training data is scarce and lacks process-level supervision. We present COIG-Writer, a novel Chinese creative writing dataset that captures both diverse outputs and their underlying thought processes through systematic reverse-engineering of high-quality texts. Unlike existing datasets that provide only input-output pairs, COIG-Writer comprises 1,665 meticulously curated triplets spanning 51 genres, each containing: (1) a reverse-engineered prompt, (2) detailed creative reasoning documenting decision-making processes, and (3) the final text. Through comprehensive experiments, we identify a two-component model of creative writing: narrative logic (provided by process supervision) and linguistic expression (maintained by general-purpose data). Our findings reveal three critical insights: (1) process supervision is highly effective but requires stabilization with general data: a ratio of at least one creative sample to twelve general samples is needed to achieve optimal performance, and below this threshold the win rate progressively degrades (from 62.75% down to 35.78%); (2) creative capabilities are culturally-bound with no cross-lingual transfer (89.26pp gap between Chinese and English performance); and (3) lexical diversity inversely correlates with creative quality (TTR paradox), suggesting high diversity signals compensatory behavior for logical deficiencies. These findings establish that creative excellence emerges from the interaction between logical scaffolding and linguistic grounding, analogous to how mathematical reasoning enhances but cannot replace linguistic competence in foundation models.
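
TTR in the third insight is the type-token ratio, the number of distinct tokens divided by the total number of tokens. A minimal sketch follows; the regex tokenizer is an assumption that suits space-delimited English only, and Chinese text like that in COIG-Writer would need a word segmenter instead.

```python
import re

def type_token_ratio(text: str) -> float:
    """Unique tokens divided by total tokens; 0.0 for empty input."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

# Repeated words lower the ratio; all-distinct words give the maximum of 1.0.
print(type_token_ratio("the cat saw the other cat"))
print(type_token_ratio("every word here is different"))
```

A higher ratio means more varied vocabulary, which is the quantity the abstract reports as inversely correlated with creative quality.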

Title: Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning

Authors: Hwiyeol Jo, Joosung Lee, Jaehone Lee, Sang-Woo Lee, Joonsuk Park, Kang Min Yoo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.14773
Pdf URL: https://arxiv.org/pdf/2510.14773
Copy Paste: [[2510.14773]] Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning(https://arxiv.org/abs/2510.14773)
Keywords: language model, llm, prompt
Abstract: Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on probability of answer choices. On the other hand, for models requiring reasoning, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. In order to mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question answering tasks. Our analysis and this framework could offer more reliable results for model evaluation.
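
One reading of Answer Regeneration can be sketched as below. The prompt layout (prior input, then prior output, then "Answer:") is an assumption drawn from the abstract's one-sentence description, and `generate` is a hypothetical callable standing in for any model inference API; `stub_generate` exists only so the example runs without a model.

```python
from typing import Callable

def regenerate_answer(question: str, reasoning_output: str,
                      generate: Callable[[str], str]) -> str:
    """Run one extra inference whose prompt ends with "Answer:"."""
    prompt = f"{question}\n{reasoning_output}\nAnswer:"
    return generate(prompt).strip()

# Stub model for illustration: it returns the last standalone number in the prompt.
def stub_generate(prompt: str) -> str:
    numbers = [tok for tok in prompt.replace("\n", " ").split() if tok.isdigit()]
    return f" {numbers[-1]}" if numbers else ""

print(regenerate_answer("What is 6 * 7?", "6 * 7 means six sevens, so 42", stub_generate))
```

The point of the extra call is that the final answer is read from a short regenerated completion rather than parsed out of a long reasoning trace with hand-written extraction rules.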

Title: Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking

Authors: Ziqi Dai, Xin Zhang, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, Min Zhang
Subjects: cs.CL, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2510.14824
Pdf URL: https://arxiv.org/pdf/2510.14824
Copy Paste: [[2510.14824]] Supervised Fine-Tuning or Contrastive Learning? Towards Better Multimodal LLM Reranking(https://arxiv.org/abs/2510.14824)
Keywords: language model, llm
Abstract: In information retrieval, training reranking models mainly focuses on two types of objectives: metric learning (e.g. contrastive loss to increase the predicted scores on relevant query-document pairs) and classification (binary label prediction of relevance vs. irrelevance). For BERT-style encoders, various studies have shown that contrastive learning (CL) can be more effective than discriminative (classification) learning. However, for large language models (LLMs), classification via supervised fine-tuning (SFT), which predicts a "yes" (resp. "no") token for relevant (resp. irrelevant) pairs, appears more promising as it aligns well with the generative nature of LLMs. This divergence raises a central question: which objective is intrinsically better suited to LLM-based reranking, and what mechanism underlies the difference? In this work, we conduct a comprehensive comparison and analysis between CL and SFT for reranking, taking the universal multimodal retrieval (UMR) as the experimental playground. We first decompose the objectives into two components: direction, which guides the model updates, and weight, which controls the magnitude of those updates, then present a unified framework for understanding their interactions. Through probing experiments, we find that SFT provides a substantially stronger weighting scheme than CL, whereas the preferred scoring direction shows no clear winner. Taken together, these results point to a consistent advantage of SFT over CL for LLM reranking. To further validate our findings, we conduct large-scale training with SFT and present new state-of-the-art rerankers on the MRB benchmark. We also provide ablations on SFT settings and expect our findings to benefit future research and applications in this area.
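
The two objectives being compared can be sketched with plain arithmetic. This is a generic illustration, not the paper's formulation: restricting the softmax to the "yes" and "no" token logits and using an InfoNCE-style loss for CL are common choices assumed here, and both function names are hypothetical.

```python
import math

def sft_relevance_score(logit_yes: float, logit_no: float) -> float:
    """Probability of "yes" under a softmax restricted to the yes/no tokens."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    exp_yes = math.exp(logit_yes - m)
    exp_no = math.exp(logit_no - m)
    return exp_yes / (exp_yes + exp_no)

def contrastive_loss(pos_score: float, neg_scores: list[float],
                     temperature: float = 1.0) -> float:
    """InfoNCE-style loss: negative log-softmax of the positive among all candidates."""
    scaled = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(scaled)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_denominator - scaled[0]

# A pair the model leans towards calling relevant (illustrative logits).
print(sft_relevance_score(2.0, -1.0))
# One positive document scored against two negatives (illustrative scores).
print(contrastive_loss(3.0, [1.0, 0.5]))
```

The SFT score depends only on one query-document pair, while the contrastive loss depends on how the positive ranks against the sampled negatives, which is the structural difference between the two objectives.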

Title: Midtraining Bridges Pretraining and Posttraining Distributions

Authors: Emmy Liu, Graham Neubig, Chenyan Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14865
Pdf URL: https://arxiv.org/pdf/2510.14865
Copy Paste: [[2510.14865]] Midtraining Bridges Pretraining and Posttraining Distributions(https://arxiv.org/abs/2510.14865)
Keywords: language model
Abstract: Recently, many language models have been pretrained with a "midtraining" phase, in which higher-quality, often instruction-formatted data is mixed in at the end of pretraining. Despite the popularity of this practice, there is little scientific understanding of this phase of model training or why it is effective. In this work, we conduct the first systematic investigation of midtraining through controlled experiments with language models pretrained from scratch and fine-tuned on supervised fine-tuning datasets in different domains. We find that, when compared after supervised fine-tuning, the effectiveness of midtraining is highest in the math and code domains, where midtraining can best reduce the syntactic gap between pretraining and posttraining data. In these cases, midtraining consistently outperforms continued pretraining in both in-domain validation loss and pretraining data forgetting after posttraining. We conduct ablations on the starting time of the midtraining phase and the mixture weights of the midtraining data, using code midtraining as a case study, and find that timing has a greater impact than mixture weights: earlier introduction of specialized data yields greater in-domain benefits and better preserves general language modeling. These findings establish midtraining as a domain adaptation technique that, compared to continued pretraining, yields better performance through reduced forgetting.

Title: Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation

Authors: Xujun Peng, Anoop Kumar, Jingyu Wu, Parker Glenn, Daben Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14915
Pdf URL: https://arxiv.org/pdf/2510.14915
Copy Paste: [[2510.14915]] Harmonizing Diverse Models: A Layer-wise Merging Strategy for Consistent Generation(https://arxiv.org/abs/2510.14915)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) systems leverage Large Language Models (LLMs) to generate accurate and reliable responses that are grounded in retrieved context. However, LLMs often generate inconsistent outputs for semantically equivalent inputs, a problem compounded by the scarcity of consistency-focused training data and the limitations of current fine-tuning techniques in enhancing output consistency. We propose a new approach combining systematic synthetic data generation, triplet loss for better embeddings, and a novel layer-wise model merging approach. Using consistency-aware weights derived from intermediate layer activations, our method effectively integrates knowledge from specialized models. Experimental results show that our merged model significantly enhances output consistency, achieving a ~47.5% improvement in response similarity over the baseline, thus offering a practical solution for increasing the reliability of an industrial RAG system.
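As a rough illustration of two ingredients named in this abstract, the sketch below shows the standard triplet margin loss and a per-layer weighted average of parameters from several models. The abstract does not say how the consistency-aware weights are computed from activations, so the `scores` argument here is a hypothetical input, and `merge_layer` is an assumed simplification rather than the authors' merging rule.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss: pull the positive closer to the anchor
    than the negative by at least `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


def merge_layer(layer_params, scores):
    """Merge one layer's flattened parameters from several models using
    normalized per-model scores (hypothetical consistency scores)."""
    total = sum(scores)
    coeffs = [s / total for s in scores]
    size = len(layer_params[0])
    return [sum(c * p[i] for c, p in zip(coeffs, layer_params)) for i in range(size)]
```

Applying `merge_layer` once per layer, with a different score vector for each layer, gives a layer-wise merge in which each layer can favor a different source model.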

Title: Predicting Task Performance with Context-aware Scaling Laws

Authors: Kyle Montgomery, David Park, Jianhong Tu, Michael Bendersky, Beliz Gunel, Dawn Song, Chenguang Wang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14919
Pdf URL: https://arxiv.org/pdf/2510.14919
Copy Paste: [[2510.14919]] Predicting Task Performance with Context-aware Scaling Laws(https://arxiv.org/abs/2510.14919)
Keywords: language model, llm
Abstract: Scaling laws have transformed our understanding of large language models by linking upstream metrics like cross-entropy loss to design factors such as model size, training data, and compute. However, these conventional laws fail to capture downstream task performance, where context plays a critical role. In this work, we propose a straightforward, interpretable framework that jointly models downstream performance as a function of the training compute and the provided context. We empirically validate our framework by fitting it on the observed downstream performance of extended-context variants of Llama-2-7B and Llama-2-13B across 65,500 unique instances spanning three tasks: arithmetic reasoning, common sense reasoning, and machine translation. Our results demonstrate that our framework accurately models in-distribution downstream performance, generalizes across three orders of magnitude in training compute, and reliably extrapolates performance as the amount of context increases. These findings offer valuable insights into the interplay between training compute and context utilization, providing guidance for designing more efficient long-context LLMs for diverse downstream tasks. Our code is available at this https URL.

Title: AI-Powered Early Diagnosis of Mental Health Disorders from Real-World Clinical Conversations

Authors: Jianfeng Zhu, Julina Maharjan, Xinyu Li, Karin G. Coifman, Ruoming Jin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.14937
Pdf URL: https://arxiv.org/pdf/2510.14937
Copy Paste: [[2510.14937]] AI-Powered Early Diagnosis of Mental Health Disorders from Real-World Clinical Conversations(https://arxiv.org/abs/2510.14937)
Keywords: gpt, llm, prompt
Abstract: Mental health disorders remain among the leading causes of disability worldwide, yet conditions such as depression, anxiety, and Post-Traumatic Stress Disorder (PTSD) are frequently underdiagnosed or misdiagnosed due to subjective assessments, limited clinical resources, stigma, and low awareness. In primary care settings, studies show that providers misidentify depression or anxiety in over 60% of cases, highlighting the urgent need for scalable, accessible, and context-aware diagnostic tools that can support early detection and intervention. In this study, we evaluate the effectiveness of machine learning models for mental health screening using a unique dataset of 553 real-world, semi-structured interviews, each paired with ground-truth diagnoses for major depressive episodes (MDE), anxiety disorders, and PTSD. We benchmark multiple model classes, including zero-shot prompting with GPT-4.1 Mini and Meta LLaMA, as well as fine-tuned RoBERTa models using Low-Rank Adaptation (LoRA). Our models achieve over 80% accuracy across diagnostic categories, with especially strong performance on PTSD (up to 89% accuracy and 98% recall). We also find that using shorter, focused context segments improves recall, suggesting that focused narrative cues enhance detection sensitivity. LoRA fine-tuning proves both efficient and effective, with lower-rank configurations (e.g., rank 8 and 16) maintaining competitive performance across evaluation metrics. Our results demonstrate that LLM-based models can offer substantial improvements over traditional self-report screening tools, providing a path toward low-barrier, AI-powered early diagnosis. This work lays the groundwork for integrating machine learning into real-world clinical workflows, particularly in low-resource or high-stigma environments where access to timely mental health care is most limited.

Title: LaSeR: Reinforcement Learning with Last-Token Self-Rewarding

Authors: Wenkai Yang, Weijie Liu, Ruobing Xie, Yiju Guo, Lulu Wu, Saiyong Yang, Yankai Lin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14943
Pdf URL: https://arxiv.org/pdf/2510.14943
Copy Paste: [[2510.14943]] LaSeR: Reinforcement Learning with Last-Token Self-Rewarding(https://arxiv.org/abs/2510.14943)
Keywords: language model, llm, prompt
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a core paradigm for enhancing the reasoning capabilities of Large Language Models (LLMs). To address the lack of verification signals at test time, prior studies incorporate the training of the model's self-verification capability into the standard RLVR process, thereby unifying reasoning and verification capabilities within a single LLM. However, previous practice requires the LLM to sequentially generate solutions and self-verifications using two separate prompt templates, which significantly reduces efficiency. In this work, we theoretically reveal that the closed-form solution to the RL objective of self-verification can be reduced to a remarkably simple form: the true reasoning reward of a solution is equal to its last-token self-rewarding score, which is computed as the difference between the policy model's next-token log-probability assigned to any pre-specified token at the solution's last token and a pre-calculated constant, scaled by the KL coefficient. Based on this insight, we propose LaSeR (Reinforcement Learning with Last-Token Self-Rewarding), an algorithm that simply augments the original RLVR loss with an MSE loss that aligns the last-token self-rewarding scores with verifier-based reasoning rewards, jointly optimizing the reasoning and self-rewarding capabilities of LLMs. The optimized self-rewarding scores can be utilized in both training and testing to enhance model performance. Notably, our algorithm derives these scores from the predicted next-token probability distribution of the last token immediately after generation, incurring only the minimal extra cost of one additional token inference. Experiments show that our method not only improves the model's reasoning performance but also equips it with remarkable self-rewarding capability, thereby boosting its inference-time scaling performance.
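The self-rewarding score described in this abstract can be written down directly from its wording: the log-probability of a pre-specified token at the last position, minus a constant, scaled by the KL coefficient. The following is a minimal sketch under that reading; the function and parameter names are hypothetical, and the log-probabilities are assumed to be supplied by the policy model rather than computed here.

```python
def last_token_self_reward(logprob_token, ref_constant, kl_coef):
    """Self-rewarding score: KL coefficient times the difference between the
    policy's log-probability of the pre-specified token and a constant."""
    return kl_coef * (logprob_token - ref_constant)


def self_reward_mse(logprobs, verifier_rewards, ref_constant, kl_coef):
    """Mean squared error between self-rewarding scores and verifier-based
    rewards; per the abstract, a term of this kind is added to the RLVR loss."""
    scores = [last_token_self_reward(lp, ref_constant, kl_coef) for lp in logprobs]
    squared = [(s - r) ** 2 for s, r in zip(scores, verifier_rewards)]
    return sum(squared) / len(squared)
```

For instance, with `kl_coef=0.5` and `ref_constant=-3.0`, a log-probability of `-1.0` maps to a score of `1.0` and a log-probability of `-3.0` maps to `0.0`.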

Title: MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics

Authors: Yuxing Lu, Xukai Zhao, J. Ben Tamo, Micky C. Nnamdi, Rui Peng, Shuang Zeng, Xingyu Hu, Jinzhuo Wang, May D. Wang
Subjects: cs.CL, cs.AI, cs.CE
Abstract URL: https://arxiv.org/abs/2510.14944
Pdf URL: https://arxiv.org/pdf/2510.14944
Copy Paste: [[2510.14944]] MetaBench: A Multi-task Benchmark for Assessing LLMs in Metabolomics(https://arxiv.org/abs/2510.14944)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities on general text; however, their proficiency in specialized scientific domains that require deep, interconnected knowledge remains largely uncharacterized. Metabolomics presents unique challenges with its complex biochemical pathways, heterogeneous identifier systems, and fragmented databases. To systematically evaluate LLM capabilities in this domain, we introduce MetaBench, the first benchmark for metabolomics assessment. Curated from authoritative public resources, MetaBench evaluates five capabilities essential for metabolomics research: knowledge, understanding, grounding, reasoning, and research. Our evaluation of 25 open- and closed-source LLMs reveals distinct performance patterns across metabolomics tasks: while models perform well on text generation tasks, cross-database identifier grounding remains challenging even with retrieval augmentation. Model performance also decreases on long-tail metabolites with sparse annotations. With MetaBench, we provide essential infrastructure for developing and evaluating metabolomics AI systems, enabling systematic progress toward reliable computational tools for metabolomics research.

Title: DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation

Authors: Yu Zhou, Sohyun An, Haikang Deng, Da Yin, Clark Peng, Cho-Jui Hsieh, Kai-Wei Chang, Nanyun Peng
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14949
Pdf URL: https://arxiv.org/pdf/2510.14949
Copy Paste: [[2510.14949]] DialectGen: Benchmarking and Improving Dialect Robustness in Multimodal Generation(https://arxiv.org/abs/2510.14949)
Keywords: prompt
Abstract: Contact languages like English exhibit rich regional variations in the form of dialects, which are often used by dialect speakers interacting with generative models. However, can multimodal generative models effectively produce content given dialectal textual input? In this work, we study this question by constructing a new large-scale benchmark spanning six common English dialects. We work with dialect speakers to collect and verify over 4200 unique prompts and evaluate on 17 image and video generative models. Our automatic and human evaluation results show that current state-of-the-art multimodal generative models exhibit 32.26% to 48.17% performance degradation when a single dialect word is used in the prompt. Common mitigation methods such as fine-tuning and prompt rewriting can only improve dialect performance by small margins (< 7%), while potentially incurring significant performance degradation in Standard American English (SAE). To this end, we design a general encoder-based mitigation strategy for multimodal generative models. Our method teaches the model to recognize new dialect features while preserving SAE performance. Experiments on models such as Stable Diffusion 1.5 show that our method is able to simultaneously raise performance on five dialects to be on par with SAE (+34.4%), while incurring near zero cost to SAE performance.

Title: Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

Authors: Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14967
Pdf URL: https://arxiv.org/pdf/2510.14967
Copy Paste: [[2510.14967]] Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents(https://arxiv.org/abs/2510.14967)
Keywords: language model, llm, agent
Abstract: Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are only provided at the final answer. This reward sparsity becomes particularly problematic in multi-turn settings, where long trajectories exacerbate two critical issues: (i) advantage collapse, where all rollouts receive identical rewards and provide no useful learning signals, and (ii) lack of fine-grained credit assignment, where dependencies between turns are obscured, especially in long-horizon tasks. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense and intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines turn-level rewards as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward trajectories. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved sample efficiency.
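The turn-level reward in this abstract is defined as the marginal increase in the policy's probability of producing the correct answer. A minimal sketch of that bookkeeping follows. It assumes the per-turn answer probabilities are already available as plain floats, and the way the outcome reward is folded in (added at the final turn) is an illustrative assumption, since the abstract does not specify the combination rule.

```python
def turn_level_rewards(answer_probs, initial_prob=0.0):
    """Reward for each turn = increase in the probability of the correct
    answer relative to the previous turn (information gain per turn)."""
    rewards = []
    prev = initial_prob
    for p in answer_probs:
        rewards.append(p - prev)
        prev = p
    return rewards


def dense_rewards(answer_probs, outcome_reward, initial_prob=0.0):
    """Turn-level rewards with the outcome reward added at the last turn.
    Assumption: this additive placement is for illustration only."""
    rewards = turn_level_rewards(answer_probs, initial_prob)
    if rewards:
        rewards[-1] += outcome_reward
    return rewards
```

Because the differences telescope, the turn-level rewards sum to the final probability minus the initial one, so a turn that lowers the probability of the correct answer receives a negative reward.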

Title: LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

Authors: Yiming Wang, Da Yin, Yuedong Cui, Ruichen Zheng, Zhiqian Li, Zongyu Lin, Di Wu, Xueqing Wu, Chenchen Ye, Yu Zhou, Kai-Wei Chang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14969
Pdf URL: https://arxiv.org/pdf/2510.14969
Copy Paste: [[2510.14969]] LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training(https://arxiv.org/abs/2510.14969)
Keywords: llm, agent
Abstract: Digital agents require diverse, large-scale UI trajectories to generalize across real-world tasks, yet collecting such data is prohibitively expensive from human annotation, infrastructure, and engineering perspectives. To this end, we introduce UI-Simulator, a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale. Our paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper that produces high-quality and diverse trajectories for agent training. We further propose UI-Simulator-Grow, a targeted scaling strategy that enables more rapid and data-efficient scaling by prioritizing high-impact tasks and synthesizing informative trajectory variants. Experiments on WebArena and AndroidWorld show that UI-Simulator rivals or surpasses open-source agents trained on real UIs with significantly better robustness, despite using weaker teacher models. Moreover, UI-Simulator-Grow matches the performance of Llama-3-70B-Instruct using only Llama-3-8B-Instruct as the base model, highlighting the potential of the targeted synthesis scaling paradigm to continuously and efficiently enhance digital agents.

Title: TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

Authors: Yinxi Li, Yuntian Deng, Pengyu Nie
Subjects: cs.CL, cs.AI, cs.LG, cs.PL, cs.SE
Abstract URL: https://arxiv.org/abs/2510.14972
Pdf URL: https://arxiv.org/pdf/2510.14972
Copy Paste: [[2510.14972]] TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar(https://arxiv.org/abs/2510.14972)
Keywords: language model, llm
Abstract: Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depending on superficial factors such as whitespace or identifier naming. To measure the impact of this misalignment, we introduce TokDrift, a framework that applies semantic-preserving rewrite rules to create code variants differing only in tokenization. Across nine code LLMs, including large ones with over 30B parameters, even minor formatting changes can cause substantial shifts in model behavior. Layer-wise analysis shows that the issue originates in early embeddings, where subword segmentation fails to capture grammar token boundaries. Our findings identify misaligned tokenization as a hidden obstacle to reliable code understanding and generation, highlighting the need for grammar-aware tokenization for future code LLMs.
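The mismatch this abstract describes can be reproduced in miniature. The sketch below compares Python's own lexer (grammar tokens) with a toy greedy longest-match segmenter standing in for a subword tokenizer. The toy segmenter and its vocabulary are hypothetical and much simpler than real BPE, and none of this is the TokDrift framework itself.

```python
import io
import tokenize

# Token types that carry layout only and are ignored in the comparison.
_LAYOUT = {tokenize.NEWLINE, tokenize.NL, tokenize.INDENT,
           tokenize.DEDENT, tokenize.ENDMARKER, tokenize.COMMENT}


def grammar_tokens(source):
    """Token strings produced by Python's lexer, without layout tokens."""
    reader = io.StringIO(source).readline
    return [t.string for t in tokenize.generate_tokens(reader)
            if t.type not in _LAYOUT]


def toy_subword_tokens(source, vocab):
    """Greedy longest-match segmentation over a fixed vocabulary; any single
    character is always accepted as a fallback (toy stand-in for BPE)."""
    out, i = [], 0
    while i < len(source):
        for length in range(len(source) - i, 0, -1):
            piece = source[i:i + length]
            if length == 1 or piece in vocab:
                out.append(piece)
                i += length
                break
    return out


# Hypothetical vocabulary containing whitespace-sensitive merges.
VOCAB = {"x=", "a+", " =", " a", " +", " b"}
```

With this vocabulary, `x=a+b` and `x = a + b` yield the same grammar tokens but different subword sequences, which is the kind of whitespace-only rewrite the abstract calls semantic-preserving.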

Title: Attention Is All You Need for KV Cache in Diffusion LLMs

Authors: Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.14973
Pdf URL: https://arxiv.org/pdf/2510.14973
Copy Paste: [[2510.14973]] Attention Is All You Need for KV Cache in Diffusion LLMs(https://arxiv.org/abs/2510.14973)
Keywords: language model, llm
Abstract: This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant MASK tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose Elastic-Cache, a training-free, architecture-agnostic strategy that jointly decides when to refresh (via an attention-aware drift test on the most-attended token) and where to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: 8.7× on GSM8K (256 tokens), 45.1× on longer sequences, and 4.8× on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput (6.8× on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.