2025-10-14

Title: Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation

Authors: Wei Zhou, Bolei Ma, Annemarie Friedrich, Mohsen Mesgar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09671
Pdf URL: https://arxiv.org/pdf/2510.09671
Copy Paste: [[2510.09671]] Table Question Answering in the Era of Large Language Models: A Comprehensive Survey of Tasks, Methods, and Evaluation(https://arxiv.org/abs/2510.09671)
Keywords: language model, llm
Abstract: Table Question Answering (TQA) aims to answer natural language questions about tabular data, often accompanied by additional contexts such as text passages. The task spans diverse settings, varying in table representation, question/answer complexity, modality involved, and domain. While recent advances in large language models (LLMs) have led to substantial progress in TQA, the field still lacks a systematic organization and understanding of task formulations, core challenges, and methodological trends, particularly in light of emerging research directions such as reinforcement learning. This survey addresses this gap by providing a comprehensive and structured overview of TQA research with a focus on LLM-based methods. We provide a comprehensive categorization of existing benchmarks and task setups. We group current modeling strategies according to the challenges they target, and analyze their strengths and limitations. Furthermore, we highlight underexplored but timely topics that have not been systematically covered in prior research. By unifying disparate research threads and identifying open problems, our survey offers a consolidated foundation for the TQA community, enabling a deeper understanding of the state of the art and guiding future developments in this rapidly evolving area.

Title: Emotionally Charged, Logically Blurred: AI-driven Emotional Framing Impairs Human Fallacy Detection

Authors: Yanran Chen, Lynn Greschner, Roman Klinger, Michael Klenk, Steffen Eger
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09695
Pdf URL: https://arxiv.org/pdf/2510.09695
Copy Paste: [[2510.09695]] Emotionally Charged, Logically Blurred: AI-driven Emotional Framing Impairs Human Fallacy Detection(https://arxiv.org/abs/2510.09695)
Keywords: language model, llm
Abstract: Logical fallacies are common in public communication and can mislead audiences; fallacious arguments may still appear convincing despite lacking soundness, because convincingness is inherently subjective. We present the first computational study of how emotional framing interacts with fallacies and convincingness, using large language models (LLMs) to systematically change emotional appeals in fallacious arguments. We benchmark eight LLMs on injecting emotional appeal into fallacious arguments while preserving their logical structures, then use the best models to generate stimuli for a human study. Our results show that LLM-driven emotional framing reduces human fallacy detection in F1 by 14.5% on average. Humans perform better in fallacy detection when perceiving enjoyment than fear or sadness, and these three emotions also correlate with significantly higher convincingness compared to neutral or other emotion states. Our work has implications for AI-driven emotional manipulation in the context of fallacious argumentation.

Title: The Idola Tribus of AI: Large Language Models tend to perceive order where none exists

Authors: Shin-nosuke Ishikawa, Masato Todo, Taiki Ogihara, Hirotsugu Ohba
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09709
Pdf URL: https://arxiv.org/pdf/2510.09709
Copy Paste: [[2510.09709]] The Idola Tribus of AI: Large Language Models tend to perceive order where none exists(https://arxiv.org/abs/2510.09709)
Keywords: language model, llm, retrieval-augmented generation, chain-of-thought, agent
Abstract: We present a tendency of large language models (LLMs) to generate absurd patterns despite their clear inappropriateness in a simple task of identifying regularities in number series. Several approaches have been proposed to apply LLMs to complex real-world tasks, such as providing knowledge through retrieval-augmented generation and executing multi-step tasks using AI agent frameworks. However, these approaches rely on the logical consistency and self-coherence of LLMs, making it crucial to evaluate these aspects and consider potential countermeasures. To identify cases where LLMs fail to maintain logical consistency, we conducted an experiment in which LLMs were asked to explain the patterns in various integer sequences, ranging from arithmetic sequences to randomly generated integer series. While the models successfully identified correct patterns in arithmetic and geometric sequences, they frequently over-recognized patterns that were inconsistent with the given numbers when analyzing randomly generated series. This issue was observed even in multi-step reasoning models, including OpenAI o3, o4-mini, and Google Gemini 2.5 Flash Preview Thinking. This tendency to perceive non-existent patterns can be interpreted as the AI model equivalent of Idola Tribus and highlights potential limitations in their capability for applied tasks requiring logical reasoning, even when employing chain-of-thought reasoning mechanisms.
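
Example sketch: the experiment above contrasts sequences that do have a rule (arithmetic, geometric) with random ones that do not. A minimal checker for that distinction might look like the following. The function name and the three labels are assumptions made for illustration and are not taken from the paper.

```python
from fractions import Fraction

def classify_sequence(seq):
    """Label an integer sequence as 'arithmetic', 'geometric', or 'no simple pattern'."""
    if len(seq) < 3:
        return "no simple pattern"  # too short to support any claim of regularity
    # Arithmetic: every consecutive difference is identical.
    diffs = {b - a for a, b in zip(seq, seq[1:])}
    if len(diffs) == 1:
        return "arithmetic"
    # Geometric: every consecutive ratio is identical (exact rational arithmetic).
    if all(x != 0 for x in seq):
        ratios = {Fraction(b, a) for a, b in zip(seq, seq[1:])}
        if len(ratios) == 1:
            return "geometric"
    return "no simple pattern"

print(classify_sequence([3, 7, 11, 15]))   # arithmetic
print(classify_sequence([2, 6, 18, 54]))   # geometric
print(classify_sequence([4, 9, 1, 7, 3]))  # no simple pattern
```

A deterministic check like this could serve as a reference when judging whether a model's claimed pattern is consistent with the given numbers.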

Title: SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG

Authors: Xiaonan Si, Meilin Zhu, Simeng Qin, Lijia Yu, Lijun Zhang, Shuaitong Liu, Xinfeng Li, Ranjie Duan, Yang Liu, Xiaojun Jia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09710
Pdf URL: https://arxiv.org/pdf/2510.09710
Copy Paste: [[2510.09710]] SeCon-RAG: A Two-Stage Semantic Filtering and Conflict-Free Framework for Trustworthy RAG(https://arxiv.org/abs/2510.09710)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) systems enhance large language models (LLMs) with external knowledge but are vulnerable to corpus poisoning and contamination attacks, which can compromise output integrity. Existing defenses often apply aggressive filtering, leading to unnecessary loss of valuable information and reduced reliability in generation. To address this problem, we propose a two-stage semantic filtering and conflict-free framework for trustworthy RAG. In the first stage, we perform joint semantic and cluster-based filtering, guided by the Entity-Intent-Relation Extractor (EIRE). EIRE extracts entities, latent objectives, and entity relations from both the user query and filtered documents, scores their semantic relevance, and selectively adds valuable documents into the clean retrieval database. In the second stage, we propose an EIRE-guided conflict-aware filtering module, which analyzes semantic consistency between the query, candidate answers, and retrieved knowledge before final answer generation, filtering out internal and external contradictions that could mislead the model. Through this two-stage process, SeCon-RAG effectively preserves useful knowledge while mitigating conflict contamination, achieving significant improvements in both generation robustness and output trustworthiness. Extensive experiments across various LLMs and datasets demonstrate that the proposed SeCon-RAG markedly outperforms state-of-the-art defense methods.

Title: ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models

Authors: Wenbin Guo, Xin Wang, Jiaoyan Chen, Lingbing Guo, Zhao Li, Zirui Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09711
Pdf URL: https://arxiv.org/pdf/2510.09711
Copy Paste: [[2510.09711]] ReaLM: Residual Quantization Bridging Knowledge Graph Embeddings and Large Language Models(https://arxiv.org/abs/2510.09711)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have recently emerged as a powerful paradigm for Knowledge Graph Completion (KGC), offering strong reasoning and generalization capabilities beyond traditional embedding-based approaches. However, existing LLM-based methods often struggle to fully exploit structured semantic representations, as the continuous embedding space of pretrained KG models is fundamentally misaligned with the discrete token space of LLMs. This discrepancy hinders effective semantic transfer and limits their performance. To address this challenge, we propose ReaLM, a novel and effective framework that bridges the gap between KG embeddings and LLM tokenization through the mechanism of residual vector quantization. ReaLM discretizes pretrained KG embeddings into compact code sequences and integrates them as learnable tokens within the LLM vocabulary, enabling seamless fusion of symbolic and contextual knowledge. Furthermore, we incorporate ontology-guided class constraints to enforce semantic consistency, refining entity predictions based on class-level compatibility. Extensive experiments on two widely used benchmark datasets demonstrate that ReaLM achieves state-of-the-art performance, confirming its effectiveness in aligning structured knowledge with large-scale language models.
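
Example sketch: residual vector quantization, the mechanism named in the abstract above, turns a continuous vector into a short sequence of discrete codes by quantizing the leftover error stage by stage. The toy codebooks and function names below are assumptions for illustration only. ReaLM's actual codebooks are learned, and its training procedure is not described in the abstract.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)),
    )

def rvq_encode(vec, codebooks):
    """Quantize vec stage by stage; each stage encodes the residual left by the previous one."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Reconstruct the vector as the sum of the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Hypothetical two-stage codebooks: a coarse stage and a fine correction stage.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
fine = [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]]

codes = rvq_encode([1.2, 0.8], [coarse, fine])
print(codes)                              # [1, 1]
print(rvq_decode(codes, [coarse, fine]))  # [1.25, 0.75]
```

The resulting code sequence (here `[1, 1]`) is the kind of compact discrete representation that could be mapped to extra vocabulary tokens.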

Title: All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language

Authors: Shiyuan Guo, Henry Sleight, Fabien Roger
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.09714
Pdf URL: https://arxiv.org/pdf/2510.09714
Copy Paste: [[2510.09714]] All Code, No Thought: Current Language Models Struggle to Reason in Ciphered Language(https://arxiv.org/abs/2510.09714)
Keywords: language model, prompt, chain-of-thought, agent
Abstract: Detecting harmful AI actions is important as AI agents gain adoption. Chain-of-thought (CoT) monitoring is one method widely used to detect adversarial attacks and AI misalignment. However, attackers and misaligned models might evade CoT monitoring through ciphered reasoning: reasoning hidden in encrypted, translated, or compressed text. To assess this risk, we test whether models can perform ciphered reasoning. For each of 28 different ciphers, we fine-tune and prompt up to 10 models to reason in that cipher. We measure model accuracy on math problems as a proxy for reasoning ability. Across the models we test, we find an asymmetry: model accuracy can drop significantly when reasoning in ciphered text, even though models demonstrate comprehension of ciphered text by being able to translate it accurately to English. Even frontier models struggle with lesser-known ciphers, although they can reason accurately in well-known ciphers like rot13. We show that ciphered reasoning capability correlates with cipher prevalence in pretraining data. We also identify scaling laws showing that ciphered reasoning capability improves slowly with additional fine-tuning data. Our work suggests that evading CoT monitoring using ciphered reasoning may be an ineffective tactic for current models and offers guidance on constraining the development of this capability in future frontier models.
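
Example sketch: rot13, the well-known cipher mentioned in the abstract above, shifts each letter 13 places and leaves digits and punctuation unchanged. The snippet uses Python's standard `codecs` module. The sample reasoning string is made up for illustration.

```python
import codecs

def rot13(text):
    """Apply rot13; applying it twice returns the original text."""
    return codecs.encode(text, "rot_13")

print(rot13("Hello"))  # Uryyb

# A made-up reasoning step, ciphered and then recovered.
step = "2 + 3 = 5, so the answer is 5"
ciphered = rot13(step)
print(ciphered)
print(rot13(ciphered) == step)  # True
```

Because rot13 is its own inverse, the same function serves for both enciphering and deciphering.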

Title: Preference-Aware Memory Update for Long-Term LLM Agents

Authors: Haoran Sun, Zekun Zhang, Shaoning Zeng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09720
Pdf URL: https://arxiv.org/pdf/2510.09720
Copy Paste: [[2510.09720]] Preference-Aware Memory Update for Long-Term LLM Agents(https://arxiv.org/abs/2510.09720)
Keywords: llm, agent
Abstract: One of the key factors influencing the reasoning capabilities of LLM-based agents is their ability to leverage long-term memory. Integrating long-term memory mechanisms allows agents to make informed decisions grounded in historical interactions. While recent advances have significantly improved the storage and retrieval components, by encoding memory into dense vectors for similarity search or organizing memory as structured knowledge graphs, most existing approaches fall short in memory updating. In particular, they lack mechanisms for dynamically refining preference memory representations in response to evolving user behaviors and contexts. To address this gap, we propose a Preference-Aware Memory Update Mechanism (PAMU) that enables dynamic and personalized memory refinement. By integrating sliding window averages (SW) with exponential moving averages (EMA), PAMU constructs a fused preference-aware representation that captures both short-term fluctuations and long-term user tendencies. We conduct experiments on five task scenarios of the LoCoMo dataset, and the results show that our mechanism significantly improves the output quality of LLMs across five baselines, validating its effectiveness in long-term conversations.
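
Example sketch: one simple way to combine a sliding window average with an exponential moving average is a weighted sum of the two. The abstract above does not say how PAMU fuses them, so the convex combination, the class name, and the parameters `window`, `alpha`, and `weight` are all assumptions for illustration.

```python
from collections import deque

class FusedPreference:
    """Track a scalar preference signal with a sliding window average and an EMA."""

    def __init__(self, window=2, alpha=0.5, weight=0.5):
        self.recent = deque(maxlen=window)  # last `window` observations (short-term view)
        self.alpha = alpha                  # EMA smoothing factor (long-term view)
        self.weight = weight                # share given to the sliding window average
        self.ema = None

    def update(self, signal):
        """Add one observation and return the fused estimate."""
        self.recent.append(signal)
        if self.ema is None:
            self.ema = signal
        else:
            self.ema = self.alpha * signal + (1 - self.alpha) * self.ema
        sw = sum(self.recent) / len(self.recent)
        # Assumed fusion rule: convex combination of the two averages.
        return self.weight * sw + (1 - self.weight) * self.ema

tracker = FusedPreference(window=2, alpha=0.5, weight=0.5)
for value in [1.0, 0.0, 1.0]:
    print(tracker.update(value))  # 1.0, then 0.5, then 0.625
```

The window reacts quickly to recent changes, while the EMA retains a decaying trace of the full history.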

Title: Layout-Aware Parsing Meets Efficient LLMs: A Unified, Scalable Framework for Resume Information Extraction and Evaluation

Authors: Fanwei Zhu, Jinke Yu, Zulong Chen, Ying Zhou, Junhao Ji, Zhibo Yang, Yuxue Zhang, Haoyuan Hu, Zhenghao Liu
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2510.09722
Pdf URL: https://arxiv.org/pdf/2510.09722
Copy Paste: [[2510.09722]] Layout-Aware Parsing Meets Efficient LLMs: A Unified, Scalable Framework for Resume Information Extraction and Evaluation(https://arxiv.org/abs/2510.09722)
Keywords: language model, llm, prompt
Abstract: Automated resume information extraction is critical for scaling talent acquisition, yet its real-world deployment faces three major challenges: the extreme heterogeneity of resume layouts and content, the high cost and latency of large language models (LLMs), and the lack of standardized datasets and evaluation tools. In this work, we present a layout-aware and efficiency-optimized framework for automated extraction and evaluation that addresses all three challenges. Our system combines a fine-tuned layout parser to normalize diverse document formats, an inference-efficient LLM extractor based on parallel prompting and instruction tuning, and a robust two-stage automated evaluation framework supported by new benchmark datasets. Extensive experiments show that our framework significantly outperforms strong baselines in both accuracy and efficiency. In particular, we demonstrate that a fine-tuned compact 0.6B LLM achieves top-tier accuracy while significantly reducing inference latency and computational cost. The system is fully deployed in Alibaba's intelligent HR platform, supporting real-time applications across its business units.

Title: VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation

Authors: Yubo Sun, Chunyi Peng, Yukun Yan, Shi Yu, Zhenghao Liu, Chi Chen, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.09733
Pdf URL: https://arxiv.org/pdf/2510.09733
Copy Paste: [[2510.09733]] VisRAG 2.0: Evidence-Guided Multi-Image Reasoning in Visual Retrieval-Augmented Generation(https://arxiv.org/abs/2510.09733)
Keywords: language model, hallucination, retrieval-augmented generation
Abstract: Visual retrieval-augmented generation (VRAG) augments vision-language models (VLMs) with external visual knowledge to ground reasoning and reduce hallucinations. Yet current VRAG systems often fail to reliably perceive and integrate evidence across multiple images, leading to weak grounding and erroneous conclusions. In this paper, we propose EVisRAG, an end-to-end framework that learns to perform evidence-guided multi-image reasoning to address this issue. The model first observes retrieved images and records per-image evidence, then derives the final answer from the aggregated evidence. To train EVisRAG effectively, we introduce Reward-Scoped Group Relative Policy Optimization (RS-GRPO), which binds fine-grained rewards to scope-specific tokens to jointly optimize visual perception and reasoning abilities of VLMs. Experimental results on multiple visual question answering benchmarks demonstrate that EVisRAG delivers substantial end-to-end gains over the backbone VLM, with a 27% improvement on average. Further analysis shows that, powered by RS-GRPO, EVisRAG improves answer accuracy by precisely perceiving and localizing question-relevant evidence across multiple images and deriving the final answer from that evidence, much like a real detective.
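
Example sketch: group relative policy optimization scores each sampled response against the other responses in its group, typically as (reward - group mean) / group standard deviation. The snippet shows that normalization, followed by one possible reading of "binding rewards to scope-specific tokens": each token receives the advantage of the scope it belongs to. The scope names, the reward values, and the scoping function are assumptions for illustration and are not the paper's implementation.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards):
    """Normalize each reward against its sampling group: (r - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no preference signal
    return [(r - mu) / sigma for r in rewards]

def scope_token_advantages(token_scopes, advantage_by_scope):
    """Give every token the advantage of its scope (hypothetical scoping rule)."""
    return [advantage_by_scope[scope] for scope in token_scopes]

# Hypothetical rewards for a group of four sampled responses.
evidence_adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
answer_adv = group_relative_advantages([0.0, 0.0, 1.0, 1.0])

# Response 0: two evidence tokens followed by one answer token.
per_token = scope_token_advantages(
    ["evidence", "evidence", "answer"],
    {"evidence": evidence_adv[0], "answer": answer_adv[0]},
)
print(per_token)  # [1.0, 1.0, -1.0]
```

In this toy case, response 0 is credited for its evidence tokens and penalized for its answer token, which a single sequence-level reward could not express.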

Title: Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement

Authors: Steve Han, Gilberto Titericz Junior, Tom Balough, Wenfei Zhou
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09738
Pdf URL: https://arxiv.org/pdf/2510.09738
Copy Paste: [[2510.09738]] Judge's Verdict: A Comprehensive Analysis of LLM Judge Capability Through Human Agreement(https://arxiv.org/abs/2510.09738)
Keywords: language model, gpt, llm, retrieval-augmented generation, agent
Abstract: This research introduces the Judge's Verdict Benchmark, a novel two-step methodology to evaluate Large Language Models (LLMs) as judges for response accuracy evaluation tasks. We assess how well 54 LLMs can replicate human judgment when scoring responses from RAG (Retrieval-Augmented Generation) or Agentic pipelines against ground truth answers. Our methodology progresses from traditional correlation analysis to comprehensive Cohen's Kappa analysis that measures actual agreement patterns. The two-step approach includes: (1) a correlation test that filters judges with strong alignment, followed by (2) a human-likeness test using z-scores to identify two distinct judgment patterns: human-like judgment (|z| < 1) that mimics natural human variation, and super-consistent judgment (z > 1) that exceeds typical human-to-human agreement levels. This methodology reveals that 27 out of 54 tested LLMs achieve Tier 1 performance: 23 models exhibit human-like patterns that preserve the nuances of human judgment, while 4 models demonstrate super-consistent behavior, a pattern that could indicate either enhanced reliability or oversimplification of complex judgments. Testing 43 open-source models (1B-405B parameters) and 11 closed models (GPT, Gemini, Claude variants), we demonstrate that judge excellence is not solely dependent on model size but on specific training strategies. Our key contributions include: (1) establishing that correlation alone is insufficient for judge evaluation, (2) introducing a "Turing Test for judges" based on agreement patterns, and (3) providing a standardized benchmark for classifying LLM judges into distinct performance tiers for different evaluation needs.
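
Example sketch: Cohen's kappa corrects raw agreement for the agreement expected by chance, and a z-score places a judge's kappa relative to a reference distribution. The abstract above gives the thresholds |z| < 1 and z > 1 but not how the reference statistics are obtained, so treating them as the mean and standard deviation of human-to-human kappa is an assumption here, as are the function names and sample numbers.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1:
        return 1.0  # both raters always use one identical label
    return (observed - expected) / (1 - expected)

def judgment_pattern(kappa, human_mean, human_std):
    """Classify a judge by z-score against assumed human-to-human kappa statistics."""
    z = (kappa - human_mean) / human_std
    if abs(z) < 1:
        return "human-like"
    if z > 1:
        return "super-consistent"
    return "outside both patterns"  # z <= -1 or z == 1: not covered by the abstract

human = [1, 1, 0, 0]
judge = [1, 0, 0, 0]
print(cohens_kappa(human, judge))          # 0.5
print(judgment_pattern(0.82, 0.80, 0.05))  # human-like
print(judgment_pattern(0.90, 0.80, 0.05))  # super-consistent
```

In the toy example the raters agree on 3 of 4 items (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.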

Title: Gold Panning: Turning Positional Bias into Signal for Multi-Document LLM Reasoning

Authors: Adam Byerly, Daniel Khashabi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09770
Pdf URL: https://arxiv.org/pdf/2510.09770
Copy Paste: [[2510.09770]] Gold Panning: Turning Positional Bias into Signal for Multi-Document LLM Reasoning(https://arxiv.org/abs/2510.09770)
Keywords: language model, llm
Abstract: Large language models exhibit a strong position bias in multi-document contexts, systematically prioritizing information based on location rather than relevance. While existing approaches treat this bias as noise to be mitigated, we introduce Gold Panning Bandits, a framework that leverages position bias as a diagnostic signal: by reordering documents and observing shifts in the model's responses, we can efficiently identify the most relevant content. We frame the problem of choosing reorderings as a bipartite matching problem. While an optimal assignment can be computed at each iteration with the Hungarian algorithm in $O(N^3)$ time, we propose a greedy $O(N \log N)$ strategy that achieves comparable performance by prioritizing the placement of the most uncertain documents in the most informative positions. Our approach identifies relevant documents using up to 65% fewer language model queries than random permutation baselines on knowledge-intensive NLP tasks, substantially reducing computational cost without model retraining. This work demonstrates that inherent LLM biases can be transformed from liabilities into assets for efficient, inference-time optimization.
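
Example sketch: the greedy strategy described above can be read as two sorts followed by a pairing, which is where the $O(N \log N)$ cost comes from. The snippet below measures a document's uncertainty by how close its current relevance belief is to 0.5 and takes a per-position informativeness score as given. Both choices, and all names and numbers, are assumptions for illustration and are not taken from the paper.

```python
def greedy_reorder(relevance_beliefs, position_informativeness):
    """Place the most uncertain documents in the most informative positions.

    relevance_beliefs[d]: current probability that document d is relevant.
    position_informativeness[s]: how diagnostic context slot s is (higher is better).
    Returns order, where order[s] is the document placed in slot s.
    """
    n = len(relevance_beliefs)
    # Most uncertain first: belief closest to 0.5.
    docs = sorted(range(n), key=lambda d: abs(relevance_beliefs[d] - 0.5))
    # Most informative slot first.
    slots = sorted(range(n), key=lambda s: -position_informativeness[s])
    order = [None] * n
    for doc, slot in zip(docs, slots):
        order[slot] = doc
    return order

beliefs = [0.9, 0.5, 0.2, 0.6]  # hypothetical relevance beliefs
slots = [0.8, 0.3, 0.1, 0.6]    # hypothetical slot informativeness
print(greedy_reorder(beliefs, slots))  # [1, 2, 0, 3]
```

Document 1 (belief 0.5, maximally uncertain) lands in slot 0, the most informative position, while document 0 (belief 0.9, nearly settled) is sent to the least informative slot.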

Title: PromptGuard at BLP-2025 Task 1: A Few-Shot Classification Framework Using Majority Voting and Keyword Similarity for Bengali Hate Speech Detection

Authors: Rakib Hossan, Shubhashis Roy Dipta
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09771
Pdf URL: https://arxiv.org/pdf/2510.09771
Copy Paste: [[2510.09771]] PromptGuard at BLP-2025 Task 1: A Few-Shot Classification Framework Using Majority Voting and Keyword Similarity for Bengali Hate Speech Detection(https://arxiv.org/abs/2510.09771)
Keywords: prompt
Abstract: The BLP-2025 Task 1A requires Bengali hate speech classification into six categories. Traditional supervised approaches need extensive labeled datasets that are expensive for low-resource languages. We developed PromptGuard, a few-shot framework combining chi-square statistical analysis for keyword extraction with adaptive majority voting for decision-making. We explore statistical keyword selection versus random approaches and adaptive voting mechanisms that extend classification based on consensus quality. Chi-square keywords provide consistent improvements across categories, while adaptive voting benefits ambiguous cases requiring extended classification rounds. PromptGuard achieves a micro-F1 of 67.61, outperforming n-gram baselines (60.75) and random approaches (14.65). Ablation studies confirm chi-square-based keywords show the most consistent impact across all categories.
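
Example sketch: the two ingredients named above can be illustrated with a 2x2 chi-square score for a candidate keyword and a majority vote that requests extra rounds when consensus is weak. The chi-square formula is the standard one for a 2x2 contingency table. The voting rule, its thresholds, and every name below are assumptions for illustration, since the abstract does not specify PromptGuard's exact procedure.

```python
from collections import Counter

def chi_square(a, b, c, d):
    """Chi-square statistic for a 2x2 table.

    a: in-class texts containing the term    b: other texts containing the term
    c: in-class texts without the term       d: other texts without the term
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def adaptive_vote(classify, text, base_rounds=3, max_rounds=7, min_share=0.6):
    """Majority vote that adds rounds until the top label reaches min_share of votes."""
    votes = [classify(text) for _ in range(base_rounds)]
    while True:
        label, count = Counter(votes).most_common(1)[0]
        if count / len(votes) >= min_share or len(votes) >= max_rounds:
            return label
        votes.append(classify(text))  # weak consensus: request one more round

print(chi_square(8, 2, 2, 8))  # term strongly associated with the class
print(chi_square(5, 5, 5, 5))  # 0.0, term independent of the class

# A stand-in classifier that replays a fixed sequence of labels.
answers = iter(["abuse", "none", "profane", "abuse", "abuse"])
print(adaptive_vote(lambda _text: next(answers), "sample text"))  # abuse
```

In the voting example the first three rounds disagree, so two more rounds are requested before "abuse" reaches the assumed 60% share.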

Title: Text Prompt Injection of Vision Language Models

Authors: Ruizhe Zhu
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.09849
Pdf URL: https://arxiv.org/pdf/2510.09849
Copy Paste: [[2510.09849]] Text Prompt Injection of Vision Language Models(https://arxiv.org/abs/2510.09849)
Keywords: language model, prompt
Abstract: The widespread application of large vision language models has significantly raised safety concerns. In this project, we investigate text prompt injection, a simple yet effective method to mislead these models. We developed an algorithm for this type of attack and demonstrated its effectiveness and efficiency through experiments. Compared to other attack methods, our approach is particularly effective for large models without high demand for computational resources.

Title: NG-Router: Graph-Supervised Multi-Agent Collaboration for Nutrition Question Answering

Authors: Kaiwen Shi, Zheyuan Zhang, Zhengqing Yuan, Keerthiram Murugesan, Vincent Galass, Chuxu Zhang, Yanfang Ye
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09854
Pdf URL: https://arxiv.org/pdf/2510.09854
Copy Paste: [[2510.09854]] NG-Router: Graph-Supervised Multi-Agent Collaboration for Nutrition Question Answering(https://arxiv.org/abs/2510.09854)
Keywords: agent
Abstract: Diet plays a central role in human health, and Nutrition Question Answering (QA) offers a promising path toward personalized dietary guidance and the prevention of diet-related chronic diseases. However, existing methods face two fundamental challenges: the limited reasoning capacity of single-agent systems and the complexity of designing effective multi-agent architectures, as well as contextual overload that hinders accurate decision-making. We introduce Nutritional-Graph Router (NG-Router), a novel framework that formulates nutritional QA as a supervised, knowledge-graph-guided multi-agent collaboration problem. NG-Router integrates agent nodes into heterogeneous knowledge graphs and employs a graph neural network to learn task-aware routing distributions over agents, leveraging soft supervision derived from empirical agent performance. To further address contextual overload, we propose a gradient-based subgraph retrieval mechanism that identifies salient evidence during training, thereby enhancing multi-hop and relational reasoning. Extensive experiments across multiple benchmarks and backbone models demonstrate that NG-Router consistently outperforms both single-agent and ensemble baselines, offering a principled approach to domain-aware multi-agent reasoning for complex nutritional health tasks.

Title: NarraBench: A Comprehensive Framework for Narrative Benchmarking

Authors: Sil Hamilton, Matthew Wilkens, Andrew Piper
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09869
Pdf URL: https://arxiv.org/pdf/2510.09869
Copy Paste: [[2510.09869]] NarraBench: A Comprehensive Framework for Narrative Benchmarking(https://arxiv.org/abs/2510.09869)
Keywords: llm
Abstract: We present NarraBench, a theory-informed taxonomy of narrative-understanding tasks, as well as an associated survey of 78 existing benchmarks in the area. We find significant need for new evaluations covering aspects of narrative understanding that are either overlooked in current work or are poorly aligned with existing metrics. Specifically, we estimate that only 27% of narrative tasks are well captured by existing benchmarks, and we note that some areas -- including narrative events, style, perspective, and revelation -- are nearly absent from current evaluations. We also note the need for increased development of benchmarks capable of assessing constitutively subjective and perspectival aspects of narrative, that is, aspects for which there is generally no single correct answer. Our taxonomy, survey, and methodology are of value to NLP researchers seeking to test LLM narrative understanding.

Title: CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs

Authors: Nafiseh Nikeghbal, Amir Hossein Kargaran, Jana Diesner
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09871
Pdf URL: https://arxiv.org/pdf/2510.09871
Copy Paste: [[2510.09871]] CoBia: Constructed Conversations Can Trigger Otherwise Concealed Societal Biases in LLMs(https://arxiv.org/abs/2510.09871)
Keywords: language model, llm
Abstract: Improvements in model construction, including fortified safety guardrails, allow large language models (LLMs) to increasingly pass standard safety checks. However, LLMs sometimes slip into revealing harmful behavior, such as expressing racist viewpoints, during conversations. To analyze this systematically, we introduce CoBia, a suite of lightweight adversarial attacks that allow us to refine the scope of conditions under which LLMs depart from normative or ethical behavior in conversations. CoBia creates a constructed conversation where the model utters a biased claim about a social group. We then evaluate whether the model can recover from the fabricated bias claim and reject biased follow-up questions. We evaluate 11 open-source as well as proprietary LLMs for their outputs related to six socio-demographic categories that are relevant to individual safety and fair treatment, i.e., gender, race, religion, nationality, sexual orientation, and others. Our evaluation is based on established LLM-based bias metrics, and we compare the results against human judgments to scope out the LLMs' reliability and alignment. The results suggest that purposefully constructed conversations reliably reveal bias amplification and that LLMs often fail to reject biased follow-up questions during dialogue. This form of stress-testing highlights deeply embedded biases that can be surfaced through interaction. Code and artifacts are available at this https URL.

Title: Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs

Authors: Xu Pan, Ely Hahami, Jingxuan Fan, Ziqian Xie, Haim Sompolinsky
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09885
Pdf URL: https://arxiv.org/pdf/2510.09885
Copy Paste: [[2510.09885]] Closing the Data-Efficiency Gap Between Autoregressive and Masked Diffusion LLMs(https://arxiv.org/abs/2510.09885)
Keywords: language model, llm
Abstract: Despite autoregressive large language models (arLLMs) being the current dominant paradigm in language modeling, they resist knowledge injection via fine-tuning due to inherent shortcomings such as the "reversal curse" -- the challenge of answering questions that reverse the original information order in the training sample. Masked diffusion large language models (dLLMs) are rapidly emerging as a powerful alternative to the arLLM paradigm, with evidence of better data efficiency and free of the "reversal curse" in pre-training. However, it is unknown whether these advantages extend to the post-training phase, i.e. whether pre-trained dLLMs can easily acquire new knowledge through fine-tuning. On three diverse datasets, we fine-tune arLLMs and dLLMs, evaluating them with forward and backward style Question Answering (QA) to probe knowledge generalization and the reversal curse. Our results confirm that arLLMs critically rely on extensive data augmentation via paraphrases for QA generalization, and paraphrases are only effective when their information order matches the QA style. Conversely, dLLMs achieve high accuracies on both forward and backward QAs without paraphrases; adding paraphrases yields only marginal gains. Lastly, inspired by the dLLM's performance, we introduce a novel masked fine-tuning paradigm for knowledge injection into pre-trained arLLMs. This proposed method successfully and drastically improves the data efficiency of arLLM fine-tuning, effectively closing the performance gap with dLLMs.

Title: Abductive Preference Learning

Authors: Yijin Ni, Peng Qi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09887
Pdf URL: https://arxiv.org/pdf/2510.09887
Copy Paste: [[2510.09887]] Abductive Preference Learning(https://arxiv.org/abs/2510.09887)
Keywords: language model, gpt, prompt
Abstract: Frontier large language models such as GPT-5 and Claude Sonnet remain prone to overconfidence even after alignment through Reinforcement Learning with Human Feedback (RLHF) and Direct Preference Optimization (DPO). For instance, they tend to offer the same conservative answer "No" to both questions "Can I eat the [food / potato chips] that has been left out overnight?" despite the latter requiring no refrigeration for safe consumption. We find that this failure is potentially attributed to a limitation of existing preference learning: it emphasizes selecting the correct response for a given prompt, while neglecting counterfactual prompts that should alter the response. To address this limitation, we propose abductive preference learning, a fine-tuning paradigm that reverses the conventional conditioning by learning preferences over prompts given a response. To validate this idea, we construct an abductive dataset derived from the HaluEval QA benchmark with 1,001 entries, implementing abductive DPO and its variant DPOP. Experiments reveal complementary strengths: standard methods improve response selection, abductive methods improve prompt discrimination, while a multitask objective unifies both. On the abductive dataset, multitask DPOP boosts accuracy from 90.0% to 99.5% in response selection and 54.7% to 85.0% in prompt discrimination, with qualitative evidence highlighting improved sensitivity to prompt differences. Finally, evaluation on AlpacaEval shows multitask DPOP improves win rate (from 5.26% to 6.17%), confirming that abductive preference learning preserves the benefits of conventional preference optimization while addressing the overlooked challenge of counterfactual prompts.
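The contrast between standard and abductive preference learning can be made concrete with a small sketch. It uses the widely known DPO loss and assumes that the abductive variant simply swaps the roles of prompt and response: one fixed response is scored under a matching prompt and a counterfactual prompt. That swap is an illustrative reading of the abstract, not the authors' confirmed formulation, and all log-probability values and the `beta` setting are made up.

```python
import math

def dpo_loss(policy_w, ref_w, policy_l, ref_l, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities.

    policy_* / ref_* are log-probs under the trained policy and the frozen
    reference model; *_w is the preferred item, *_l the dispreferred one.
    """
    margin = beta * ((policy_w - ref_w) - (policy_l - ref_l))
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Standard DPO: one prompt x, chosen vs. rejected response.
# Inputs are log pi(y_w | x) and log pi(y_l | x).
standard = dpo_loss(policy_w=-4.0, ref_w=-5.0, policy_l=-6.0, ref_l=-5.0)

# Abductive form (assumed): one response y, matching vs. counterfactual prompt.
# Inputs are log pi(y | x_match) and log pi(y | x_counterfactual).
abductive = dpo_loss(policy_w=-3.0, ref_w=-3.5, policy_l=-3.2, ref_l=-3.5)
```

Under this reading, a multitask objective would be some combination of the two terms; how the paper weights them is not stated in the abstract.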

Title: HIPPD: Brain-Inspired Hierarchical Information Processing for Personality Detection

Authors: Guanming Chen, Lingzhi Shen, Xiaohao Cai, Imran Razzak, Shoaib Jameel
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.09893
Pdf URL: https://arxiv.org/pdf/2510.09893
Copy Paste: [[2510.09893]] HIPPD: Brain-Inspired Hierarchical Information Processing for Personality Detection(https://arxiv.org/abs/2510.09893)
Keywords: language model
Abstract: Personality detection from text aims to infer an individual's personality traits based on linguistic patterns. However, existing machine learning approaches often struggle to capture contextual information spanning multiple posts and tend to fall short in extracting representative and robust features in semantically sparse environments. This paper presents HIPPD, a brain-inspired framework for personality detection that emulates the hierarchical information processing of the human brain. HIPPD utilises a large language model to simulate the cerebral cortex, enabling global semantic reasoning and deep feature abstraction. A dynamic memory module, modelled after the prefrontal cortex, performs adaptive gating and selective retention of critical features, with all adjustments driven by dopaminergic prediction error feedback. Subsequently, a set of specialised lightweight models, emulating the basal ganglia, are dynamically routed via a strict winner-takes-all mechanism to capture the personality-related patterns they are most proficient at recognising. Extensive experiments on the Kaggle and Pandora datasets demonstrate that HIPPD consistently outperforms state-of-the-art baselines.

Title: Don't Throw Away Your Pretrained Model

Authors: Shangbin Feng, Wenhao Yu, Yike Wang, Hongming Zhang, Yulia Tsvetkov, Dong Yu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09913
Pdf URL: https://arxiv.org/pdf/2510.09913
Copy Paste: [[2510.09913]] Don't Throw Away Your Pretrained Model(https://arxiv.org/abs/2510.09913)
Keywords: language model
Abstract: Alignment training has tradeoffs: it helps language models (LMs) gain in reasoning and instruction following but might lose out on skills such as creativity and calibration, at which unaligned base models are better. We aim to make the best of both worlds through model collaboration, where different models in the training pipeline collaborate and complement each other. Since LM responses feature interleaving skills that favor different models, we propose Switch Generation, where pretrained and aligned model versions take turns to "speak" in a response sequence. Specifically, we train a switcher LM by learning from outcomes of choosing different models to generate the next segment across diverse queries and contexts. At inference time, the switcher LM guides different model checkpoints to dynamically generate the next segment where their strengths are most needed. Extensive experiments with 8 model collaboration baselines and 18 datasets show that 1) model collaboration consistently outperforms individual models on 16 out of 18 tasks, and 2) Switch Generation further outperforms baselines by 12.9% on average. Further analysis reveals that Switch Generation discovers compositional skills to solve problems where individual models struggle and generalizes to unseen models and tasks, reusing and repurposing by-products in expensive model training pipelines that are otherwise discarded.

Title: Enhancing Faithfulness in Abstractive Summarization via Span-Level Fine-Tuning

Authors: Sicong Huang, Qianqi Yan, Shengze Wang, Ian Lane
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09915
Pdf URL: https://arxiv.org/pdf/2510.09915
Copy Paste: [[2510.09915]] Enhancing Faithfulness in Abstractive Summarization via Span-Level Fine-Tuning(https://arxiv.org/abs/2510.09915)
Keywords: language model, gpt, llm, hallucination
Abstract: Abstractive summarization using large language models (LLMs) has become an essential tool for condensing information. However, despite their ability to generate fluent summaries, these models sometimes produce unfaithful summaries, introducing hallucinations at the word, phrase, or concept level. Existing mitigation strategies, such as post-processing corrections or contrastive learning with synthetically generated negative samples, fail to fully address the diverse errors that can occur in LLM-generated summaries. In this paper, we investigate fine-tuning strategies to reduce the occurrence of unfaithful spans in generated summaries. First, we automatically generate summaries for the set of source documents in the training set with a variety of LLMs and then use GPT-4o to annotate any hallucinations it detects at the span level. Leveraging these annotations, we fine-tune LLMs with both hallucination-free summaries and annotated unfaithful spans to enhance model faithfulness. We introduce a new dataset that contains both faithful and unfaithful summaries with span-level labels, and we evaluate three techniques for fine-tuning an LLM to improve the faithfulness of the resulting summarization: gradient ascent, unlikelihood training, and task vector negation. Experimental results show that all three approaches successfully leverage span-level annotations to improve faithfulness, with unlikelihood training being the most effective.
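As a rough illustration of how span-level labels can drive unlikelihood training, the sketch below applies the usual likelihood term to tokens outside annotated spans and the standard unlikelihood term, -log(1 - p), to tokens inside them. The function name, the per-token probability inputs, and the equal weighting of the two terms are assumptions for illustration; the abstract does not specify the exact objective used in the paper.

```python
import math

def span_level_loss(token_probs, unfaithful_mask, eps=1e-8):
    """Mean per-token loss over one summary.

    token_probs[i]     : model probability of the i-th summary token
    unfaithful_mask[i] : True if the token lies inside an annotated
                         unfaithful (hallucinated) span
    """
    total = 0.0
    for p, is_unfaithful in zip(token_probs, unfaithful_mask):
        if is_unfaithful:
            # Unlikelihood term: push probability of the token down.
            total += -math.log(max(1.0 - p, eps))
        else:
            # Ordinary likelihood term: push probability of the token up.
            total += -math.log(max(p, eps))
    return total / len(token_probs)

# Toy example: the third token is marked as part of an unfaithful span.
loss = span_level_loss([0.9, 0.8, 0.7, 0.95], [False, False, True, False])
```

The loss falls as the model assigns less probability to the flagged token and more to the others, which is the behaviour the span annotations are meant to encourage.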

Title: Unpacking Hateful Memes: Presupposed Context and False Claims

Authors: Weibin Cai, Jiayu Li, Reza Zafarani
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.09935
Pdf URL: https://arxiv.org/pdf/2510.09935
Copy Paste: [[2510.09935]] Unpacking Hateful Memes: Presupposed Context and False Claims(https://arxiv.org/abs/2510.09935)
Keywords: language model
Abstract: While memes are often humorous, they are frequently used to disseminate hate, causing serious harm to individuals and society. Current approaches to hateful meme detection mainly rely on pre-trained language models. However, less focus has been dedicated to what makes a meme hateful. Drawing on insights from philosophy and psychology, we argue that hateful memes are characterized by two essential features: a presupposed context and the expression of false claims. To capture presupposed context, we develop PCM for modeling contextual information across modalities. To detect false claims, we introduce the FACT module, which integrates external knowledge and harnesses cross-modal reference graphs. By combining PCM and FACT, we introduce SHIELD, a hateful meme detection framework designed to capture the fundamental nature of hate. Extensive experiments show that SHIELD outperforms state-of-the-art methods across datasets and metrics, while demonstrating versatility on other tasks, such as fake news detection.

Title: Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation

Authors: Mir Tafseer Nayeem, Sawsan Alqahtani, Md Tahmid Rahman Laskar, Tasnim Mohiuddin, M Saiful Bari
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.09947
Pdf URL: https://arxiv.org/pdf/2510.09947
Copy Paste: [[2510.09947]] Beyond Fertility: Analyzing STRR as a Metric for Multilingual Tokenization Evaluation(https://arxiv.org/abs/2510.09947)
Keywords: language model, llm
Abstract: Tokenization is a crucial but under-evaluated step in large language models (LLMs). The standard metric, fertility (the average number of tokens per word), captures compression efficiency but obscures how vocabularies are allocated across languages and domains. We analyze six widely used tokenizers across seven languages and two domains, finding stable fertility for English, high fertility for Chinese, and little domain sensitivity. To address fertility's blind spots, we propose the Single Token Retention Rate (STRR), which measures the proportion of words preserved as single tokens. STRR reveals systematic prioritization of English, strong support for Chinese, and fragmentation in Hindi, offering an interpretable view of cross-lingual fairness. Our results show that STRR complements fertility and provides practical guidance for designing more equitable multilingual tokenizers.
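The two metrics defined above are simple enough to compute directly. The sketch below follows the definitions in the abstract (fertility as average tokens per word, STRR as the proportion of words kept as a single token); the toy tokenizer and vocabulary are hypothetical stand-ins for a real subword tokenizer.

```python
def fertility(words, tokenize):
    """Average number of tokens produced per word."""
    counts = [len(tokenize(w)) for w in words]
    return sum(counts) / len(counts)

def strr(words, tokenize):
    """Single Token Retention Rate: share of words kept as exactly one token."""
    single = sum(1 for w in words if len(tokenize(w)) == 1)
    return single / len(words)

# Hypothetical toy tokenizer: in-vocabulary words stay whole,
# everything else falls back to one token per character.
VOCAB = {"the", "cat", "sat"}

def toy_tokenize(word):
    return [word] if word in VOCAB else list(word)

words = ["the", "cat", "sat", "down"]
print(fertility(words, toy_tokenize))  # 1.75  (7 tokens / 4 words)
print(strr(words, toy_tokenize))       # 0.75  (3 of 4 words are single tokens)
```

The example shows why the two numbers are complementary: a single badly fragmented word raises fertility noticeably, while STRR only records that one word out of four was split.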

Title: Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey

Authors: Jiaqi Wei, Xiang Zhang, Yuejin Yang, Wenxuan Huang, Juntai Cao, Sheng Xu, Xiang Zhuang, Zhangyang Gao, Muhammad Abdul-Mageed, Laks V.S. Lakshmanan, Chenyu You, Wanli Ouyang, Siqi Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09988
Pdf URL: https://arxiv.org/pdf/2510.09988
Copy Paste: [[2510.09988]] Unifying Tree Search Algorithm and Reward Design for LLM Reasoning: A Survey(https://arxiv.org/abs/2510.09988)
Keywords: language model, llm, agent
Abstract: Deliberative tree search is a cornerstone of modern Large Language Model (LLM) research, driving the pivot from brute-force scaling toward algorithmic efficiency. This single paradigm unifies two critical frontiers: Test-Time Scaling (TTS), which deploys on-demand computation to solve hard problems, and Self-Improvement, which uses search-generated data to durably enhance model parameters. However, this burgeoning field is fragmented and lacks a common formalism, particularly concerning the ambiguous role of the reward signal -- is it a transient heuristic or a durable learning target? This paper resolves this ambiguity by introducing a unified framework that deconstructs search algorithms into three core components: the Search Mechanism, Reward Formulation, and Transition Function. We establish a formal distinction between transient Search Guidance for TTS and durable Parametric Reward Modeling for Self-Improvement. Building on this formalism, we introduce a component-centric taxonomy, synthesize the state-of-the-art, and chart a research roadmap toward more systematic progress in creating autonomous, self-improving agents.

Title: Toward Machine Translation Literacy: How Lay Users Perceive and Rely on Imperfect Translations

Authors: Yimin Xiao, Yongle Zhang, Dayeon Ki, Calvin Bao, Marianna J. Martindale, Charlotte Vaughn, Ge Gao, Marine Carpuat
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.09994
Pdf URL: https://arxiv.org/pdf/2510.09994
Copy Paste: [[2510.09994]] Toward Machine Translation Literacy: How Lay Users Perceive and Rely on Imperfect Translations(https://arxiv.org/abs/2510.09994)
Keywords: prompt
Abstract: As Machine Translation (MT) becomes increasingly commonplace, understanding how the general public perceives and relies on imperfect MT is crucial for contextualizing MT research in real-world applications. We present a human study conducted in a public museum (n=452), investigating how fluency and adequacy errors impact bilingual and non-bilingual users' reliance on MT during casual use. Our findings reveal that non-bilingual users often over-rely on MT due to a lack of evaluation strategies and alternatives, while experiencing the impact of errors can prompt users to reassess future reliance. This highlights the need for MT evaluation and NLP explanation techniques to promote not only MT quality, but also MT literacy among its users.

Title: Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning

Authors: Shu Zhao, Tan Yu, Anbang Xu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.10009
Pdf URL: https://arxiv.org/pdf/2510.10009
Copy Paste: [[2510.10009]] Beyond the limitation of a single query: Train your LLM for query expansion with Reinforcement Learning(https://arxiv.org/abs/2510.10009)
Keywords: llm, agent
Abstract: Reasoning-augmented search agents, such as Search-R1, are trained to reason, search, and generate the final answer iteratively. Nevertheless, due to their limited capabilities in reasoning and search, their performance on multi-hop QA benchmarks remains far from satisfactory. To handle complex or compound queries, we train an LLM-based search agent with the native capability of query expansion through reinforcement learning. In each turn, our search agent proposes several query variants, which are searched simultaneously to cover more relevant information. Meanwhile, given limited post-training data and computing resources, it is very challenging for a search agent to master multiple tasks, including query generation, retrieved information understanding, and answer generation. Therefore, we propose incorporating a pre-trained squeezer model that helps the search agent understand the retrieved documents, allowing the search agent to focus on query generation for high retrieval recall. With the assistance of the squeezer model, we discover that even a small-scale 3B LLM can demonstrate a strong capability of query expansion and achieve state-of-the-art accuracy on the multi-hop QA benchmarks. To be specific, our experiments across seven question-answering benchmarks demonstrate that our method, named ExpandSearch, achieves an average improvement of 4.4% compared to state-of-the-art baselines, with strong gains on multi-hop reasoning tasks requiring diverse evidence aggregation.

Title: Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety

Authors: Yuyi Huang, Runzhe Zhan, Lidia S. Chao, Ailin Tao, Derek F. Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10013
Pdf URL: https://arxiv.org/pdf/2510.10013
Copy Paste: [[2510.10013]] Path Drift in Large Reasoning Models: How First-Person Commitments Override Safety(https://arxiv.org/abs/2510.10013)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: As large language models (LLMs) are increasingly deployed for complex reasoning tasks, Long Chain-of-Thought (Long-CoT) prompting has emerged as a key paradigm for structured inference. Despite early-stage safeguards enabled by alignment techniques such as RLHF, we identify a previously underexplored vulnerability: reasoning trajectories in Long-CoT models can drift from aligned paths, resulting in content that violates safety constraints. We term this phenomenon Path Drift. Through empirical analysis, we uncover three behavioral triggers of Path Drift: (1) first-person commitments that induce goal-driven reasoning that delays refusal signals; (2) ethical evaporation, where surface-level disclaimers bypass alignment checkpoints; (3) condition chain escalation, where layered cues progressively steer models toward unsafe completions. Building on these insights, we introduce a three-stage Path Drift Induction Framework comprising cognitive load amplification, self-role priming, and condition chain hijacking. Each stage independently reduces refusal rates, while their combination further compounds the effect. To mitigate these risks, we propose a path-level defense strategy incorporating role attribution correction and metacognitive reflection (reflective safety cues). Our findings highlight the need for trajectory-level alignment oversight in long-form reasoning beyond token-level alignment.

Title: Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong Default

Authors: Jiaqi Liu, Lanruo Wang, Su Liu, Xin Hu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10025
Pdf URL: https://arxiv.org/pdf/2510.10025
Copy Paste: [[2510.10025]] Lightweight Baselines for Medical Abstract Classification: DistilBERT with Cross-Entropy as a Strong Default(https://arxiv.org/abs/2510.10025)
Keywords: language model
Abstract: Large language models work well for many NLP tasks, but they are hard to deploy in health settings with strict cost, latency, and privacy limits. We revisit a lightweight recipe for medical abstract classification and ask how far compact encoders can go under a controlled budget. Using the public medical abstracts corpus, we fine-tune BERT base and DistilBERT with three objectives (standard cross-entropy, class-weighted cross-entropy, and focal loss), keeping the tokenizer, sequence length, optimizer, and schedule fixed. DistilBERT with plain cross-entropy gives the best balance on the test set while using far fewer parameters than BERT base. We report accuracy, Macro F1, and Weighted F1, release the evaluation code, and include confusion analyses to make error patterns clear. Our results suggest a practical default: start with a compact encoder and cross-entropy, then add calibration and task-specific checks before moving to heavier models.
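For readers unfamiliar with the three objectives compared above, here is a minimal per-example sketch in plain Python. The formulas are the textbook ones; the class weights and the focal-loss `gamma` value are illustrative assumptions, since the abstract does not report the settings used.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    """Standard cross-entropy: -log p(target)."""
    return -math.log(softmax(logits)[target])

def weighted_cross_entropy(logits, target, class_weights):
    """Cross-entropy scaled by a per-class weight (e.g. inverse frequency)."""
    return class_weights[target] * cross_entropy(logits, target)

def focal_loss(logits, target, gamma=2.0):
    """Focal loss: down-weights examples the model already classifies well."""
    p = softmax(logits)[target]
    return -((1.0 - p) ** gamma) * math.log(p)

logits = [2.0, 0.5, -1.0]   # hypothetical scores for three classes
weights = [1.0, 3.0, 3.0]   # hypothetical weights favouring rarer classes
print(cross_entropy(logits, 0))
print(weighted_cross_entropy(logits, 1, weights))
print(focal_loss(logits, 0))
```

With `gamma=0` the focal loss reduces to plain cross-entropy, and with uniform class weights the weighted variant does too, so all three objectives share the same baseline behaviour.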

Title: CLMN: Concept based Language Models via Neural Symbolic Reasoning

Authors: Yibo Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10063
Pdf URL: https://arxiv.org/pdf/2510.10063
Copy Paste: [[2510.10063]] CLMN: Concept based Language Models via Neural Symbolic Reasoning(https://arxiv.org/abs/2510.10063)
Keywords: language model
Abstract: Deep learning has advanced NLP, but interpretability remains limited, especially in healthcare and finance. Concept bottleneck models tie predictions to human concepts in vision, but NLP versions either use binary activations that harm text representations or latent concepts that weaken semantics, and they rarely model dynamic concept interactions such as negation and context. We introduce the Concept Language Model Network (CLMN), a neural-symbolic framework that keeps both performance and interpretability. CLMN represents concepts as continuous, human-readable embeddings and applies fuzzy-logic reasoning to learn adaptive interaction rules that state how concepts affect each other and the final decision. The model augments original text features with concept-aware representations and automatically induces interpretable logic rules. Across multiple datasets and pre-trained language models, CLMN achieves higher accuracy than existing concept-based methods while improving explanation quality. These results show that integrating neural representations with symbolic reasoning in a unified concept space can yield practical, transparent NLP systems.

Title: Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference

Authors: Hua Cai, Shuang Zhao, Liang Zhang, Xuli Shen, Qing Xu, Weilin Shen, Zihao Wen, Tianke Ban
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10072
Pdf URL: https://arxiv.org/pdf/2510.10072
Copy Paste: [[2510.10072]] Unilaw-R1: A Large Language Model for Legal Reasoning with Reinforcement Learning and Iterative Inference(https://arxiv.org/abs/2510.10072)
Keywords: language model, llm, chain-of-thought
Abstract: Reasoning-focused large language models (LLMs) are rapidly evolving across various domains, yet their capabilities in handling complex legal problems remain underexplored. In this paper, we introduce Unilaw-R1, a large language model tailored for legal reasoning. With a lightweight 7-billion parameter scale, Unilaw-R1 significantly reduces deployment cost while effectively tackling three core challenges in the legal domain: insufficient legal knowledge, unreliable reasoning logic, and weak business generalization. To address these issues, we first construct Unilaw-R1-Data, a high-quality dataset containing 17K distilled and screened chain-of-thought (CoT) samples. Based on this, we adopt a two-stage training strategy combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), which significantly boosts the performance on complex legal reasoning tasks and supports interpretable decision-making in legal AI applications. To assess legal reasoning ability, we also introduce Unilaw-R1-Eval, a dedicated benchmark designed to evaluate models across single- and multi-choice legal tasks. Unilaw-R1 demonstrates strong results on authoritative benchmarks, outperforming all models of similar scale and achieving performance on par with the much larger DeepSeek-R1-Distill-Qwen-32B (54.9%). Following domain-specific training, it also showed significant gains on LawBench and LexEval, exceeding Qwen-2.5-7B-Instruct (46.6%) by an average margin of 6.6%.

Title: A-IPO: Adaptive Intent-driven Preference Optimization

Authors: Wenqing Wang (1), Muhammad Asif Ali (2), Ali Shoker (2), Ruohan Yang (1), Junyang Chen (3), Ying Sha (1), Huan Wang (1) ((1) Huazhong Agricultural University, China,(2) King Abdullah University of Science and Technology, KSA,(3) Shenzhen University, China)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10077
Pdf URL: https://arxiv.org/pdf/2510.10077
Copy Paste: [[2510.10077]] A-IPO: Adaptive Intent-driven Preference Optimization(https://arxiv.org/abs/2510.10077)
Keywords: prompt
Abstract: Human preferences are diverse and dynamic, shaped by regional, cultural, and social factors. Existing alignment methods like Direct Preference Optimization (DPO) and its variants often default to majority views, overlooking minority opinions and failing to capture latent user intentions in prompts. To address these limitations, we introduce Adaptive Intent-driven Preference Optimization (A-IPO). Specifically, A-IPO introduces an intention module that infers the latent intent behind each user prompt and explicitly incorporates this inferred intent into the reward function, encouraging stronger alignment between the preferred model's responses and the user's underlying intentions. We demonstrate, both theoretically and empirically, that incorporating an intention-response similarity term increases the preference margin (by a positive shift of $\lambda\,\Delta\mathrm{sim}$ in the log-odds), resulting in clearer separation between preferred and dispreferred responses compared to DPO. For evaluation, we introduce two new benchmarks, Real-pref and Attack-pref, along with an extended version of an existing dataset, GlobalOpinionQA-Ext, to assess real-world and adversarial preference alignment. Through explicit modeling of diverse user intents, A-IPO facilitates pluralistic preference optimization while simultaneously enhancing adversarial robustness in preference alignment. Comprehensive empirical evaluation demonstrates that A-IPO consistently surpasses existing baselines, yielding substantial improvements across key metrics: up to +24.8 win-rate and +45.6 Response-Intention Consistency on Real-pref; up to +38.6 Response Similarity and +52.2 Defense Success Rate on Attack-pref; and up to +54.6 Intention Consistency Score on GlobalOpinionQA-Ext.
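The claimed margin shift can be illustrated numerically. The sketch below treats the preference probability as a sigmoid of the log-odds and adds the shift lambda * delta_sim described in the abstract. Taking delta_sim to be the intent similarity of the preferred response minus that of the dispreferred one is an assumption made here for illustration, as are all numeric values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_prob(base_margin, sim_preferred, sim_dispreferred, lam):
    """Probability that the preferred response wins.

    base_margin      : DPO-style log-odds before the intent term
    sim_preferred    : similarity between inferred intent and preferred response
    sim_dispreferred : similarity between inferred intent and dispreferred response
    lam              : weight of the similarity term (lambda in the abstract)
    """
    delta_sim = sim_preferred - sim_dispreferred  # assumed definition
    return sigmoid(base_margin + lam * delta_sim)

# Hypothetical values: the preferred response matches the inferred intent better.
without_intent = preference_prob(0.4, 0.8, 0.3, lam=0.0)
with_intent = preference_prob(0.4, 0.8, 0.3, lam=1.0)
print(without_intent, with_intent)
```

When the preferred response is closer to the inferred intent, delta_sim is positive, so any positive `lam` raises the log-odds and widens the separation between the two responses.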
摘要：人类偏好是多种多样的、动态的，受到区域、文化和社会因素的影响。现有的对齐方法（例如直接偏好优化（DPO）及其变体）通常默认为多数观点，忽视了少数意见，并且无法捕获提示中的潜在用户意图。为了解决这些限制，我们引入了\underline{\textbf{A}}自适应\textbf{\underline{I}}目标驱动\textbf{\underline{P}}参考\textbf{\underline{O}}优化（\textbf{A-IPO}）。具体来说，A-IPO 引入了一个意图模块，该模块可以推断每个用户提示背后的潜在意图，并将这种推断的意图明确地纳入奖励函数中，从而鼓励首选模型的响应与用户的潜在意图之间更强的一致性。我们从理论上和经验上证明，合并意图-响应相似性项会增加偏好边际（通过对数几率中 $\lambda\,\Delta\mathrm{sim}$ 的正向偏移），从而与 DPO 相比，首选响应和不首选响应之间的分离更清晰。为了进行评估，我们引入了两个新的基准：Real-pref、Attack-pref 以及现有数据集的扩展版本 GlobalOpinionQA-Ext，以评估现实世界和对抗性偏好的一致性。通过对不同用户意图的显式建模，A-IPO 促进了多元偏好优化，同时增强了偏好调整中的对抗鲁棒性。全面的实证评估表明，A-IPO 始终超越现有基线，在关键指标上取得了实质性改进：真实偏好的胜率高达 +24.8，响应意图一致性高达 +45.6；攻击偏好的响应相似度高达+38.6，防御成功率高达+52.2； GlobalOpinionQA-Ext 上的意图一致性得分高达 +54.6。
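The margin claim in the A-IPO abstract can be illustrated numerically. The following is a minimal sketch, assuming a Bradley-Terry style preference probability (a sigmoid of the log-odds margin); the function names and all numeric values are hypothetical and are not taken from the paper.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function mapping log-odds to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

def shifted_log_odds(base_margin: float, lam: float,
                     sim_preferred: float, sim_dispreferred: float) -> float:
    """Add the intent-similarity shift lambda * delta_sim to a base margin.

    base_margin stands in for the DPO-style log-odds margin (assumed given).
    """
    delta_sim = sim_preferred - sim_dispreferred
    return base_margin + lam * delta_sim

# Illustrative numbers only: the preferred response matches the inferred
# intent better (0.9) than the dispreferred one (0.3).
base = 0.4
shifted = shifted_log_odds(base, lam=0.5, sim_preferred=0.9, sim_dispreferred=0.3)
p_base = sigmoid(base)
p_shifted = sigmoid(shifted)
```

When the similarity gap is positive and lambda is positive, the shifted margin exceeds the base margin, so the modeled probability of choosing the preferred response rises.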

Title: Stop When Enough: Adaptive Early-Stopping for Chain-of-Thought Reasoning

Authors: Renliang Sun, Wei Cheng, Dawei Li, Haifeng Chen, Wei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10103
Pdf URL: https://arxiv.org/pdf/2510.10103
Copy Paste: [[2510.10103]] Stop When Enough: Adaptive Early-Stopping for Chain-of-Thought Reasoning(https://arxiv.org/abs/2510.10103)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) reasoning has driven recent gains of large language models (LLMs) on reasoning-intensive tasks by externalizing intermediate steps. However, excessive or redundant reasoning, so-called overthinking, can increase inference costs and lead LLMs toward incorrect conclusions. In this paper, we present REFRAIN (REFlective-Redundancy for Adaptive INference), a training-free framework that adaptively determines when to stop reasoning to mitigate overthinking. REFRAIN integrates a two-stage stop discriminator to identify reflective yet redundant reasoning and a sliding-window Upper Confidence Bound (SW-UCB) multi-armed bandit controller to dynamically adjust stopping thresholds according to problem difficulty without supervision or fine-tuning. Across four representative benchmarks and two model families, REFRAIN reduces token usage by 20-55% while maintaining or improving accuracy compared to standard CoT prompting. Extensive ablation and robustness analyses demonstrate its stability across models, scorers, and prompt variations. In summary, our findings highlight when-to-stop as a new and practical axis of test-time scaling, enabling models to reason not just more, but just enough.
摘要：思想链 (CoT) 推理通过外部化中间步骤，推动了大型语言模型 (LLM) 在推理密集型任务上的最新进展。然而，过度或冗余的推理（即所谓的过度思考）会增加推理成本并导致法学硕士得出错误的结论。在本文中，我们提出了 REFRAIN ($\underline{REF}$lective-$\underline{R}$edundancy for $\underline{A}$daptive $\underline{IN}$ference)，这是一个无需训练的框架，可以自适应地确定何时停止推理以减轻过度思考。 REFRAIN 集成了一个两级停止鉴别器来识别反射性但冗余的推理，以及一个滑动窗口置信上限 (SW-UCB) 多臂老虎机控制器，可以根据问题难度动态调整停止阈值，无需监督或微调。在四个代表性基准和两个模型系列中，与标准 CoT 提示相比，REFRAIN 将令牌使用量减少了 20-55%，同时保持或提高了准确性。广泛的消融和稳健性分析证明了其在模型、评分器和提示变化中的稳定性。总之，我们的研究结果强调了何时停止作为测试时间缩放的一个新的实用轴——使模型不仅能够推理更多，而且能够推理足够。
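The sliding-window UCB controller mentioned in the REFRAIN abstract can be sketched generically. This is a textbook-style SW-UCB bandit, not the authors' implementation: treating each arm as a candidate stopping threshold, the reward function, the window size, and the exploration constant `c` are all assumptions made for illustration.

```python
import math
from collections import deque

class SlidingWindowUCB:
    """UCB bandit that only remembers the most recent `window` plays."""

    def __init__(self, n_arms: int, window: int, c: float = 1.0):
        self.n_arms = n_arms
        self.c = c
        self.history = deque(maxlen=window)  # (arm, reward) pairs

    def select(self) -> int:
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        # Play any arm that has no observation inside the window.
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm
        t = len(self.history)
        scores = [
            sums[a] / counts[a] + self.c * math.sqrt(math.log(t) / counts[a])
            for a in range(self.n_arms)
        ]
        return max(range(self.n_arms), key=scores.__getitem__)

    def update(self, arm: int, reward: float) -> None:
        self.history.append((arm, reward))

# Hypothetical setup: three candidate stopping thresholds, where only the
# last one yields reward (e.g. correct answer with few tokens).
thresholds = [0.5, 0.7, 0.9]
bandit = SlidingWindowUCB(n_arms=len(thresholds), window=50)
picks = []
for _ in range(200):
    arm = bandit.select()
    reward = 1.0 if arm == 2 else 0.0  # toy, deterministic reward
    bandit.update(arm, reward)
    picks.append(arm)
```

Because old observations fall out of the window, the controller keeps re-exploring occasionally, which is what lets it track a reward signal that drifts as problem difficulty changes.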

Title: LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora

Authors: Luyao Zhuang, Shengyuan Chen, Yilin Xiao, Huachi Zhou, Yujing Zhang, Hao Chen, Qinggang Zhang, Xiao Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10114
Pdf URL: https://arxiv.org/pdf/2510.10114
Copy Paste: [[2510.10114]] LinearRAG: Linear Graph Retrieval Augmented Generation on Large-scale Corpora(https://arxiv.org/abs/2510.10114)
Keywords: language model, llm, hallucination, retrieval augmented generation, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) is widely used to mitigate hallucinations of Large Language Models (LLMs) by leveraging external knowledge. While effective for simple queries, traditional RAG systems struggle with large-scale, unstructured corpora where information is fragmented. Recent advances incorporate knowledge graphs to capture relational structures, enabling more comprehensive retrieval for complex, multi-hop reasoning tasks. However, existing graph-based RAG (GraphRAG) methods rely on unstable and costly relation extraction for graph construction, often producing noisy graphs with incorrect or inconsistent relations that degrade retrieval quality. In this paper, we revisit the pipeline of existing GraphRAG systems and propose LinearRAG (Linear Graph-based Retrieval-Augmented Generation), an efficient framework that enables reliable graph construction and precise passage retrieval. Specifically, LinearRAG constructs a relation-free hierarchical graph, termed Tri-Graph, using only lightweight entity extraction and semantic linking, avoiding unstable relation modeling. This new paradigm of graph construction scales linearly with corpus size and incurs no extra token consumption, providing an economical and reliable indexing of the original passages. For retrieval, LinearRAG adopts a two-stage strategy: (i) relevant entity activation via local semantic bridging, followed by (ii) passage retrieval through global importance aggregation. Extensive experiments on four datasets demonstrate that LinearRAG significantly outperforms baseline models.
摘要：检索增强生成（RAG）广泛用于通过利用外部知识来减轻大型语言模型（LLM）的幻觉。传统的 RAG 系统虽然对于简单查询有效，但在处理信息分散的大规模、非结构化语料库时却遇到了困难。最近的进展结合了知识图来捕获关系结构，从而能够更全面地检索复杂的多跳推理任务。然而，现有的基于图的 RAG (GraphRAG) 方法依赖于不稳定且成本高昂的关系提取来构建图，通常会产生具有不正确或不一致关系的噪声图，从而降低检索质量。在本文中，我们重新审视现有 GraphRAG 系统的流程，并提出 LinearRAG（基于线性图的检索增强生成），这是一种有效的框架，可以实现可靠的图构建和精确的段落检索。具体来说，LinearRAG 仅使用轻量级实体提取和语义链接构建了一个无关系的层次图，称为 Tri-Graph，避免了不稳定的关系建模。这种新的图构建范式随语料库大小线性扩展，并且不会产生额外的令牌消耗，从而为原始段落提供了经济且可靠的索引。对于检索，LinearRAG 采用两阶段策略：（i）通过局部语义桥接相关实体激活，然后（ii）通过全局重要性聚合进行段落检索。对四个数据集的大量实验表明，LinearRAG 的性能显着优于基线模型。

Title: Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task

Authors: Zilong Wang, Xiaoyu Shen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10138
Pdf URL: https://arxiv.org/pdf/2510.10138
Copy Paste: [[2510.10138]] Hybrid OCR-LLM Framework for Enterprise-Scale Document Information Extraction Under Copy-heavy Task(https://arxiv.org/abs/2510.10138)
Keywords: language model, llm
Abstract: Information extraction from copy-heavy documents, characterized by massive volumes of structurally similar content, represents a critical yet understudied challenge in enterprise document processing. We present a systematic framework that strategically combines OCR engines with Large Language Models (LLMs) to optimize the accuracy-efficiency trade-off inherent in repetitive document extraction tasks. Unlike existing approaches that pursue universal solutions, our method exploits document-specific characteristics through intelligent strategy selection. We implement and evaluate 25 configurations across three extraction paradigms (direct, replacement, and table-based) on identity documents spanning four formats (PNG, DOCX, XLSX, PDF). Through table-based extraction methods, our adaptive framework delivers outstanding results: an F1 of 1.0 with 0.97 s latency for structured documents, and an F1 of 0.997 with 0.6 s latency for challenging image inputs when integrated with PaddleOCR, all while maintaining sub-second processing speeds. The 54 times performance improvement compared with multimodal methods over naive approaches, coupled with format-aware routing, enables processing of heterogeneous document streams at production scale. Beyond the specific application to identity extraction, this work establishes a general principle: the repetitive nature of copy-heavy tasks can be transformed from a computational burden into an optimization opportunity through structure-aware method selection.
摘要：从大量复制的文档中提取信息（其特征是大量结构相似的内容）是企业文档处理中的一个关键但尚未得到充分研究的挑战。我们提出了一个系统框架，该框架战略性地将 OCR 引擎与大型语言模型 (LLM) 结合起来，以优化重复性文档提取任务中固有的准确性与效率权衡。与追求通用解决方案的现有方法不同，我们的方法通过智能策略选择来利用文档特定的特征。我们在跨越四种格式（PNG、DOCX、XLSX、PDF）的身份证件上实施并评估了三种提取范例（直接、替换和基于表格）的 25 种配置。通过基于表格的提取方法，我们的自适应框架可提供出色的结果：与 PaddleOCR 集成时，对于结构化文档，F1=1.0 准确度，延迟为 0.97 秒；对于具有挑战性的图像输入，F1=0.997 准确度，延迟为 0.6 秒，同时保持亚秒级处理速度。与简单方法相比，多模式方法的性能提高了 54 倍，再加上格式感知路由，可以在生产规模上处理异构文档流。除了身份提取的具体应用之外，这项工作还建立了一个一般原则：通过结构感知方法选择，可以将大量复制任务的重复性从计算负担转变为优化机会。

Title: DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models

Authors: Tingxu Han, Wei Song, Ziqi Ding, Ziming Li, Chunrong Fang, Yuekang Li, Dongfang Liu, Zhenyu Chen, Zhenting Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10142
Pdf URL: https://arxiv.org/pdf/2510.10142
Copy Paste: [[2510.10142]] DiffHeads: Differential Analysis and Inference-Time Masking of Bias Heads in Large Language Models(https://arxiv.org/abs/2510.10142)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large language models (LLMs) increasingly mediate decisions in domains where unfair treatment of demographic groups is unacceptable. Existing work probes when biased outputs appear, but gives little insight into the mechanisms that generate them, leaving existing mitigations largely fragile. In this paper, we conduct a systematic investigation of LLM unfairness and propose DiffHeads, a lightweight debiasing framework for LLMs. We first compare Direct-Answer (DA) prompting to Chain-of-Thought (CoT) prompting across eight representative open- and closed-source LLMs. DA triggers the inherent bias of the LLM and increases measured unfairness by 534.5%-391.9% in both one-turn and two-turn dialogues. Next, we define a token-to-head contribution score that traces each token's influence back to individual attention heads. This reveals a small cluster of bias heads that activate under DA but stay largely dormant with CoT, providing the first causal link between prompting strategy and bias emergence. Finally, building on this insight, we propose DiffHeads, which identifies bias heads through differential activation analysis between DA and CoT and selectively masks only those heads. DiffHeads reduces unfairness by 49.4% and 40.3% under DA and CoT, respectively, without harming model utility.
摘要：大型语言模型 (LLM) 越来越多地在对人口群体的不公平待遇不可接受的领域中调解决策。现有的工作会探讨何时出现有偏差的输出，但很少深入了解产生它们的机制，从而使现有的缓解措施在很大程度上变得脆弱。在本文中，我们对法学硕士的不公平性进行了系统调查，并提出了 DiffHeads，这是一种针对法学硕士的轻量级去偏见框架。我们首先将八个具有代表性的开源和闭源法学硕士的直接回答（DA）提示与思想链（CoT）提示进行比较。 DA将触发LLM的自然偏见部分，并将一轮和两轮对话中的测量不公平性改善534.5%-391.9%。接下来，我们定义一个 token-to-head 贡献分数，将每个 token 的影响追溯到各个注意力头。这揭示了一小群偏见头，它们在 DA 下激活，但在 CoT 下基本上处于休眠状态，从而提供了提示策略和偏见出现之间的第一个因果关系。最后，基于这一见解，我们提出了 DiffHeads，它通过 DA 和 CoT 之间的差异激活分析来识别偏差头，并选择性地仅屏蔽那些头。 DiffHeads 在 DA 和 CoT 下分别减少了 49.4% 和 40.3% 的不公平性，且不损害模型效用。
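The differential step described in the DiffHeads abstract (flag heads that are active under DA but dormant under CoT, then mask only those) can be sketched as follows. This is an illustrative sketch, not the paper's procedure: the per-head scores, the fixed threshold, and the function names are hypothetical, and the token-to-head contribution score itself is assumed to be computed elsewhere.

```python
def find_bias_heads(da_scores, cot_scores, threshold):
    """Return heads whose score under DA exceeds the CoT score by > threshold.

    Heads are keyed by (layer, head_index); a missing CoT score counts as 0.
    """
    return [
        head for head, da in da_scores.items()
        if da - cot_scores.get(head, 0.0) > threshold
    ]

def build_head_mask(all_heads, bias_heads):
    """Mask value 0.0 disables a head at inference time, 1.0 keeps it."""
    bias = set(bias_heads)
    return {head: 0.0 if head in bias else 1.0 for head in all_heads}

# Made-up contribution scores for three heads.
da_scores = {(0, 0): 0.1, (0, 1): 0.9, (1, 0): 0.2}
cot_scores = {(0, 0): 0.1, (0, 1): 0.1, (1, 0): 0.15}

bias_heads = find_bias_heads(da_scores, cot_scores, threshold=0.5)
mask = build_head_mask(da_scores.keys(), bias_heads)
```

Only the head with a large DA-minus-CoT gap is masked; the others are left untouched, which mirrors the abstract's claim that masking is selective.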

Title: BILLY: Steering Large Language Models via Merging Persona Vectors for Creative Generation

Authors: Tsung-Min Pai, Jui-I Wang, Li-Chun Lu, Shao-Hua Sun, Hung-Yi Lee, Kai-Wei Chang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10157
Pdf URL: https://arxiv.org/pdf/2510.10157
Copy Paste: [[2510.10157]] BILLY: Steering Large Language Models via Merging Persona Vectors for Creative Generation(https://arxiv.org/abs/2510.10157)
Keywords: language model, llm, prompt
Abstract: Multi-LLM systems enhance the creativity of large language models by simulating human collective intelligence but suffer from significant drawbacks, such as high computational costs and inference latency. To address these limitations, we propose BILLY (BlendIng persona vectors for Large Language model creativitY), a training-free framework that captures the benefits of multi-LLM collaboration, i.e., inducing diverse perspectives and specialized expertise, within a single model. BILLY operates by extracting and blending multiple distinct persona vectors directly in the model's activation space. We steer the model's generation process with this merged vector during inference, enabling multi-perspective output without explicit multi-LLM communication. Our experiments across creativity-oriented benchmarks demonstrate that BILLY surpasses single-model prompting and traditional multi-LLM approaches, while substantially reducing inference time and computational costs. Our analyses further reveal that distinct persona vectors can be blended to achieve both effective control over complementary aspects of generation and greater interpretability.
摘要：多法学硕士系统通过模拟人类集体智慧来增强大型语言模型的创造力，但也存在明显的缺点，例如计算成本高和推理延迟。为了解决这些限制，我们提出了 BILLY（用于大型语言模型创造力的混合角色向量），这是一个无需培训的框架，可以捕捉多法学硕士协作的好处，即在单个模型中引入不同的观点和专业知识。 BILLY 通过直接在模型的激活空间中提取和混合多个不同的角色向量来进行操作。我们在推理时使用此合并向量来引导模型的生成过程，从而无需显式的多 LLM 通信即可实现多视角输出。我们在以创造力为导向的基准测试中进行的实验表明，BILLY 超越了单一模型提示和传统的多 LLM 方法，同时大大减少了推理时间和计算成本。我们的分析进一步表明，可以混合不同的人物角色向量，以实现对生成的互补方面的有效控制和更大的可解释性。
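The blend-then-steer idea in the BILLY abstract can be sketched with plain vectors. This is a minimal sketch under stated assumptions: a weighted average is used as the merge rule and an additive shift as the steering rule, both common choices in activation steering but not confirmed by the abstract; all names and values are hypothetical.

```python
def blend_persona_vectors(vectors, weights):
    """Weighted average of equally sized persona vectors (assumed merge rule)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]

def steer(hidden_state, direction, alpha):
    """Shift a hidden state along the merged persona direction by alpha."""
    return [h + alpha * d for h, d in zip(hidden_state, direction)]

# Two toy persona directions in a 2-dimensional activation space.
persona_a = [1.0, 0.0]
persona_b = [0.0, 1.0]

merged = blend_persona_vectors([persona_a, persona_b], weights=[1.0, 1.0])
steered = steer([0.0, 0.0], merged, alpha=2.0)
```

A single forward pass with the merged direction stands in for several separately prompted models, which is where the claimed latency saving would come from.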

Title: BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data

Authors: Jaap Jumelet, Abdellah Fourtassi, Akari Haga, Bastian Bunzeck, Bhargav Shandilya, Diana Galvan-Sosa, Faiz Ghifari Haznitrama, Francesca Padovani, Francois Meyer, Hai Hu, Julen Etxaniz, Laurent Prévot, Linyang He, María Grandury, Mila Marcheva, Negar Foroutan, Nikitas Theodoropoulos, Pouya Sadeghi, Siyuan Song, Suchir Salhan, Susana Zhou, Yurii Paniv, Ziyin Zhang, Arianna Bisazza, Alex Warstadt, Leshem Choshen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10159
Pdf URL: https://arxiv.org/pdf/2510.10159
Copy Paste: [[2510.10159]] BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data(https://arxiv.org/abs/2510.10159)
Keywords: llm
Abstract: We present BabyBabelLM, a multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. We curate developmentally plausible pretraining data aiming to cover the equivalent of 100M English words of content in each of 45 languages. We compile evaluation suites and train baseline models in each language. BabyBabelLM aims to facilitate multilingual pretraining and cognitive modeling.
摘要：我们推出 BabyBabelLM，这是一个多语言数据集集合，对一个人从出生到获得母语期间所观察到的语言进行建模。我们策划了发展上合理的预训练数据，旨在覆盖 45 种语言中每种语言相当于 1 亿个英语单词的内容。我们编译每种语言的评估套件并训练基线模型。 BabyBabelLM 旨在促进多语言预训练和认知建模。

Title: Large Language Model Sourcing: A Survey

Authors: Liang Pang, Kangxi Wu, Sunhao Dai, Zihao Wei, Zenghao Duan, Jia Gu, Xiang Li, Zhiyi Yin, Jun Xu, Huawei Shen, Xueqi Cheng
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10161
Pdf URL: https://arxiv.org/pdf/2510.10161
Copy Paste: [[2510.10161]] Large Language Model Sourcing: A Survey(https://arxiv.org/abs/2510.10161)
Keywords: language model, llm, hallucination
Abstract: The rapid advancement of large language models (LLMs) has revolutionized artificial intelligence, shifting from supporting objective tasks (e.g., recognition) to empowering subjective decision-making (e.g., planning, decision). This marks the dawn of general and powerful AI, with applications spanning a wide range of fields, including programming, education, healthcare, finance, and law. However, their deployment introduces multifaceted risks. Due to the black-box nature of LLMs and the human-like quality of their generated content, issues such as hallucinations, bias, unfairness, and copyright infringement become particularly significant. In this context, sourcing information from multiple perspectives is essential. This survey presents a systematic investigation into provenance tracking for content generated by LLMs, organized around four interrelated dimensions that together capture both model- and data-centric perspectives. From the model perspective, Model Sourcing treats the model as a whole, aiming to distinguish content generated by specific LLMs from content authored by humans. Model Structure Sourcing delves into the internal generative mechanisms, analyzing architectural components that shape the model's outputs. From the data perspective, Training Data Sourcing focuses on internal attribution, tracing the origins of generated content back to the model's training data. In contrast, External Data Sourcing emphasizes external validation, identifying external information used to support or influence the model's responses. Moreover, we propose a dual-paradigm taxonomy that classifies existing sourcing methods into prior-based (proactive traceability embedding) and posterior-based (retrospective inference) approaches. Traceability across these dimensions enhances the transparency, accountability, and trustworthiness of LLM deployment in real-world applications.
摘要：大语言模型（LLM）的快速发展彻底改变了人工智能，从支持客观任务（例如识别）转变为增强主观决策（例如规划、决策）。这标志着通用且强大的人工智能的黎明，其应用涵盖编程、教育、医疗保健、金融和法律等广泛领域。然而，它们的部署带来了多方面的风险。由于法学硕士的黑箱性质及其生成内容的类人品质，幻觉、偏见、不公平和版权侵权等问题变得尤为重要。在这种情况下，从多个角度获取信息至关重要。这项调查对法学硕士生成的内容的来源跟踪进行了系统调查，围绕四个相互关联的维度进行组织，共同捕获以模型和数据为中心的观点。从模型的角度来看，Model Sourcing 将模型视为一个整体，旨在区分特定 LLM 生成的内容与人类创作的内容。模型结构溯源深入研究内部生成机制，分析塑造模型输出的架构组件。从数据角度来看，训练数据溯源侧重于内部归因，将生成内容的起源追溯到模型的训练数据。相比之下，外部数据源强调外部验证，识别用于支持或影响模型响应的外部信息。此外，我们还提出了一种双范式分类法，将现有的采购方法分为基于先验的（主动可追溯性嵌入）和基于后验的（回顾性推理）方法。这些维度的可追溯性增强了法学硕士在实际应用中部署的透明度、责任感和可信度。

Title: A Survey of Inductive Reasoning for Large Language Models

Authors: Kedi Chen, Dezhao Ruan, Yuhao Dan, Yaoting Wang, Siyu Yan, Xuecheng Wu, Yinqi Zhang, Qin Chen, Jie Zhou, Liang He, Biqing Qi, Linyang Li, Qipeng Guo, Xiaoming Shi, Wei Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10182
Pdf URL: https://arxiv.org/pdf/2510.10182
Copy Paste: [[2510.10182]] A Survey of Inductive Reasoning for Large Language Models(https://arxiv.org/abs/2510.10182)
Keywords: language model, llm
Abstract: Reasoning is an important task for large language models (LLMs). Among all the reasoning paradigms, inductive reasoning is one of the fundamental types, which is characterized by its particular-to-general thinking process and the non-uniqueness of its answers. The inductive mode is crucial for knowledge generalization and aligns better with human cognition, so it is a fundamental mode of learning, hence attracting increasing interest. Despite the importance of inductive reasoning, there is no systematic summary of it. Therefore, this paper presents the first comprehensive survey of inductive reasoning for LLMs. First, methods for improving inductive reasoning are categorized into three main areas: post-training, test-time scaling, and data augmentation. Then, current benchmarks of inductive reasoning are summarized, and a unified sandbox-based evaluation approach with the observation coverage metric is derived. Finally, we offer some analyses regarding the source of inductive ability and how simple model architectures and data help with inductive tasks, providing a solid foundation for future research.
摘要：推理是大型语言模型 (LLM) 的一项重要任务。在所有推理范式中，归纳推理是基本类型之一，其特点是思维过程的特殊性到一般性以及答案的非唯一性。归纳模式对于知识泛化至关重要，并且更符合人类认知，因此是一种基本的学习模式，因此引起了越来越多的兴趣。尽管归纳推理很重要，但还没有对其进行系统的总结。因此，本文首次对法学硕士的归纳推理进行了全面的调查。首先，改进归纳推理的方法分为三个主要领域：训练后、测试时间缩放和数据增强。然后，总结了当前归纳推理的基准，并推导了一种基于观察覆盖率度量的统一的基于沙箱的评估方法。最后，我们对归纳能力的来源以及简单的模型架构和数据如何帮助归纳任务进行了一些分析，为未来的研究提供了坚实的基础。

Title: MedAgentAudit: Diagnosing and Quantifying Collaborative Failure Modes in Medical Multi-Agent Systems

Authors: Lei Gu, Yinghao Zhu, Haoran Sang, Zixiang Wang, Dehao Sui, Wen Tang, Ewen Harrison, Junyi Gao, Lequan Yu, Liantao Ma
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2510.10185
Pdf URL: https://arxiv.org/pdf/2510.10185
Copy Paste: [[2510.10185]] MedAgentAudit: Diagnosing and Quantifying Collaborative Failure Modes in Medical Multi-Agent Systems(https://arxiv.org/abs/2510.10185)
Keywords: language model, llm, agent
Abstract: While large language model (LLM)-based multi-agent systems show promise in simulating medical consultations, their evaluation is often confined to final-answer accuracy. This practice treats their internal collaborative processes as opaque "black boxes" and overlooks a critical question: is a diagnostic conclusion reached through a sound and verifiable reasoning pathway? The inscrutable nature of these systems poses a significant risk in high-stakes medical applications, potentially leading to flawed or untrustworthy conclusions. To address this, we conduct a large-scale empirical study of 3,600 cases from six medical datasets and six representative multi-agent frameworks. Through a rigorous, mixed-methods approach combining qualitative analysis with quantitative auditing, we develop a comprehensive taxonomy of collaborative failure modes. Our quantitative audit reveals four dominant failure patterns: flawed consensus driven by shared model deficiencies, suppression of correct minority opinions, ineffective discussion dynamics, and critical information loss during synthesis. This study demonstrates that high accuracy alone is an insufficient measure of clinical or public trust. It highlights the urgent need for transparent and auditable reasoning processes, a cornerstone for the responsible development and deployment of medical AI.
摘要：虽然基于大语言模型（LLM）的多智能体系统在模拟医疗咨询方面显示出前景，但它们的评估通常仅限于最终答案的准确性。这种做法将他们的内部协作流程视为不透明的“黑匣子”，并忽略了一个关键问题：是否通过合理且可验证的推理途径得出诊断结论？这些系统的难以捉摸的性质在高风险的医疗应用中带来了巨大的风险，可能导致有缺陷或不可信的结论。为了解决这个问题，我们对来自 6 个医学数据集和 6 个代表性多智能体框架的 3,600 个病例进行了大规模实证研究。通过严格的混合方法，将定性分析与定量审核相结合，我们开发了协作失败模式的全面分类法。我们的定量审计揭示了四种主要的失败模式：由共享模型缺陷驱动的有缺陷的共识、压制正确的少数意见、无效的讨论动态以及综合过程中的关键信息丢失。这项研究表明，仅靠高精度并不足以衡量临床或公众的信任度。它强调了对透明和可审计的推理过程的迫切需要，这是负责任地开发和部署医疗人工智能的基石。

Title: Weed Out, Then Harvest: Dual Low-Rank Adaptation is an Effective Noisy Label Detector for Noise-Robust Learning

Authors: Bo Yuan, Yulin Chen, Yin Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10208
Pdf URL: https://arxiv.org/pdf/2510.10208
Copy Paste: [[2510.10208]] Weed Out, Then Harvest: Dual Low-Rank Adaptation is an Effective Noisy Label Detector for Noise-Robust Learning(https://arxiv.org/abs/2510.10208)
Keywords: language model, llm
Abstract: Parameter-efficient fine-tuning (PEFT) of large language models (LLMs) has shown impressive performance in various downstream tasks. However, in many real-world scenarios, the collected training data inevitably contains noisy labels. To learn from noisy labels, most solutions select samples with small losses for model training. However, the selected samples, in turn, impact the loss computation in the next iteration. An inaccurate initial selection can create a vicious cycle, leading to suboptimal performance. To break this cycle, we propose Delora, a novel framework that decouples the sample selection from model training. For sample selection, Delora establishes a noisy label detector by introducing clean and noisy LoRA. Benefiting from the memory effect, the clean LoRA is encouraged to memorize clean data, while the noisy LoRA is constrained to memorize mislabeled data, which serves as a learnable threshold for selecting clean and noisy samples. For model training, Delora can use carefully selected samples to fine-tune language models seamlessly. Experimental results on synthetic and real-world noisy datasets demonstrate the effectiveness of Delora in noisy label detection and text classification.
摘要：参数高效微调（PEFT）大语言模型（LLM）在各种下游任务中表现出了令人印象深刻的性能。然而，在许多现实场景中，收集的训练数据不可避免地包含噪声标签。为了从噪声标签中学习，大多数解决方案都会选择损失较小的样本进行模型训练。然而，所选样本反过来会影响下一次迭代中的损失计算。不准确的初始选择可能会造成恶性循环，导致性能不佳。为了打破这个循环，我们提出了 Delora，一种新颖的框架，它将样本选择与模型训练分离。对于样本选择，Delora 通过引入干净和噪声 LoRA 建立了噪声标签检测器。受益于记忆效应，干净的 LoRA 被鼓励记住干净的数据，而噪声的 LoRA 被限制记住错误标记的数据，这作为选择干净和噪声样本的可学习阈值。对于模型训练，Delora 可以使用精心挑选的样本来无缝地微调语言模型。合成和真实世界噪声数据集的实验结果证明了 Delora 在噪声标签检测和文本分类方面的有效性。

Title: You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs

Authors: Yijie Xu, Huizai Yao, Zhiyu Guo, Weiyu Guo, Pengteng Li, Aiwei Liu, Xuming Hu, Hui Xiong
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.10223
Pdf URL: https://arxiv.org/pdf/2510.10223
Copy Paste: [[2510.10223]] You only need 4 extra tokens: Synergistic Test-time Adaptation for LLMs(https://arxiv.org/abs/2510.10223)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly deployed in specialized domains such as finance, medicine, and agriculture, where they face significant distribution shifts from their training data. Domain-specific fine-tuning can mitigate this challenge but relies on high-quality labeled data that is expensive and slow to collect in expertise-limited settings. We study label-free test-time adaptation for language models and present SyTTA, an inference-time framework that adapts models on-the-fly without additional supervision. SyTTA couples two complementary uncertainty signals that arise under distribution shift: input-side perplexity, indicating mismatch with domain-specific terminology and patterns, and output-side predictive entropy, indicating diffuse and unstable token probabilities during generation. Across diverse model architectures and domain-specific benchmarks, SyTTA delivers consistent gains. Notably, on agricultural question answering, SyTTA improves Rouge-LSum by over 120% on Qwen-2.5-7B with only 4 extra tokens per query. These results show that effective test-time adaptation for language models is achievable without labeled examples, supporting deployment in label-scarce domains. The code will be made available upon acceptance.
摘要：大型语言模型 (LLM) 越来越多地部署在金融、医学和农业等专业领域，它们在这些领域面临着训练数据的重大分布变化。特定领域的微调可以缓解这一挑战，但依赖于高质量的标记数据，而在专业知识有限的环境中，这些数据的收集成本昂贵且缓慢。我们研究了语言模型的无标签测试时间适应，并提出了 SyTTA，这是一种推理时间框架，可以在无需额外监督的情况下即时适应模型。 SyTTA 将分布偏移下出现的两个互补的不确定性信号结合起来：输入侧困惑度，表示与特定领域的术语和模式不匹配；输出侧预测熵，表示生成过程中分散且不稳定的令牌概率。在不同的模型架构和特定领域的基准测试中，SyTTA 提供了一致的收益。值得注意的是，在农业问题回答方面，SyTTA 在 Qwen-2.5-7B 上将 Rouge-LSum 提高了 120% 以上，每个查询仅需要 4 个额外标记。这些结果表明，无需标记示例即可实现语言模型的有效测试时适应，支持在标签稀缺领域中的部署。该代码将在接受后提供。
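The two uncertainty signals named in the SyTTA abstract have standard definitions, shown here as a minimal sketch. It computes only the signals themselves; how SyTTA couples them to adapt the model is not described in the abstract and is not modeled. The function names and input values are hypothetical.

```python
import math

def input_perplexity(token_logprobs):
    """Perplexity of an input sequence from its per-token log-probabilities."""
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

def predictive_entropy(next_token_probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in next_token_probs if p > 0.0)

# Toy input: every token was assigned probability 0.5 by the model.
ppl = input_perplexity([math.log(0.5)] * 4)

# A diffuse output distribution versus a confident one.
diffuse = predictive_entropy([0.25, 0.25, 0.25, 0.25])
peaked = predictive_entropy([1.0, 0.0, 0.0, 0.0])
```

High input perplexity suggests unfamiliar domain terminology, while high output entropy suggests unstable generation; the diffuse distribution above has entropy ln 4 and the peaked one has entropy 0.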

Title: Text2Token: Unsupervised Text Representation Learning with Token Target Prediction

Authors: Ruize An, Richong Zhang, Zhijie Nie, Zhanyu Wu, Yanzhao Zhang, Dingkun Long
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.10224
Pdf URL: https://arxiv.org/pdf/2510.10224
Copy Paste: [[2510.10224]] Text2Token: Unsupervised Text Representation Learning with Token Target Prediction(https://arxiv.org/abs/2510.10224)
Keywords: llm
Abstract: Unsupervised text representation learning (TRL) is a fundamental task in natural language processing, which is beneficial for improving search and recommendations with the web's unlabeled texts. A recent empirical study finds that the high-quality representation aligns with the key token of the input text, uncovering the potential connection between representation space and vocabulary space. Inspired by the findings, we revisit the generative tasks and develop an unsupervised generative framework for TRL, Text2Token. The framework is based on the token target prediction task, utilizing carefully constructed target token distribution as supervisory signals. To construct the high-quality target token distribution, we analyze the token-alignment properties with advanced embedders and identify two essential categories of key tokens: (1) the meaningful tokens in the text and (2) semantically derived tokens beyond the text. Based on these insights, we propose two methods -- data-driven and model-derived -- to construct synthetic token targets from data or the LLM backbone. Experiments on the MTEB v2 benchmark demonstrate that Text2Token achieves performance competitive with the state-of-the-art embedder with unsupervised contrastive learning, LLM2Vec. Our analysis further shows that vocabulary and representation spaces optimize together and toward the optimum solution during training, providing new ideas and insights for future work.
摘要：无监督文本表示学习（TRL）是自然语言处理中的一项基本任务，有利于改进网络未标记文本的搜索和推荐。最近的一项实证研究发现，高质量的表示与输入文本的关键标记相一致，揭示了表示空间和词汇空间之间的潜在联系。受到这些发现的启发，我们重新审视了生成任务，并为 TRL 开发了一个无监督生成框架 Text2Token。该框架基于令牌目标预测任务，利用精心构建的目标令牌分布作为监督信号。为了构建高质量的目标令牌分布，我们使用高级嵌入器分析令牌对齐属性，并识别关键令牌的两个基本类别：（1）文本中有意义的令牌和（2）文本之外的语义派生令牌。基于这些见解，我们提出了两种方法——数据驱动和模型衍生——从数据或 LLM 主干构建合成代币目标。 MTEB v2 基准测试表明，Text2Token 的性能可与采用无监督对比学习的最先进的嵌入器 LLM2Vec 相媲美。我们的分析进一步表明，词汇和表示空间在培训期间共同优化并朝着最佳解决方案发展，为未来的工作提供新的想法和见解。

Title: ImCoref-CeS: An Improved Lightweight Pipeline for Coreference Resolution with LLM-based Checker-Splitter Refinement

Authors: Kangyang Luo, Yuzhuo Bai, Shuzheng Si, Cheng Gao, Zhitong Wang, Yingli Shen, Wenhao Li, Zhu Liu, Yufeng Han, Jiayi Wu, Cunliang Kong, Maosong Sun
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.10241
Pdf URL: https://arxiv.org/pdf/2510.10241
Copy Paste: [[2510.10241]] ImCoref-CeS: An Improved Lightweight Pipeline for Coreference Resolution with LLM-based Checker-Splitter Refinement(https://arxiv.org/abs/2510.10241)
Keywords: language model, llm, agent
Abstract: Coreference Resolution (CR) is a critical task in Natural Language Processing (NLP). Current research faces a key dilemma: whether to further explore the potential of supervised neural methods based on small language models, whose detect-then-cluster pipeline still delivers top performance, or embrace the powerful capabilities of Large Language Models (LLMs). However, effectively combining their strengths remains underexplored. To this end, we propose ImCoref-CeS, a novel framework that integrates an enhanced supervised model with LLM-based reasoning. First, we present an improved CR method (ImCoref) to push the performance boundaries of the supervised neural method by introducing a lightweight bridging module to enhance long-text encoding capability, devising a biaffine scorer to comprehensively capture positional information, and invoking a hybrid mention regularization to improve training efficiency. Importantly, we employ an LLM acting as a multi-role Checker-Splitter agent to validate candidate mentions (filtering out invalid ones) and coreference results (splitting erroneous clusters) predicted by ImCoref. Extensive experiments demonstrate the effectiveness of ImCoref-CeS, which achieves superior performance compared to existing state-of-the-art (SOTA) methods.
摘要：共指解析（CR）是自然语言处理（NLP）中的一项关键任务。当前的研究面临着一个关键的困境：是进一步探索基于小语言模型的监督神经方法的潜力（其检测然后集群管道仍然提供顶级性能），还是拥抱大型语言模型（LLM）的强大功能。然而，如何有效地结合它们的优势仍有待探索。为此，我们提出了 \textbf{ImCoref-CeS}，这是一种将增强型监督模型与基于 LLM 的推理相结合的新颖框架。首先，我们提出了一种改进的 CR 方法（\textbf{ImCoref}），通过引入轻量级桥接模块来增强长文本编码能力，设计双仿射评分器来全面捕获位置信息，并调用混合提及正则化来提高训练效率，从而突破监督神经方法的性能界限。重要的是，我们采用 LLM 作为多角色 Checker-Splitter 代理来验证 ImCoref 预测的候选提及（过滤掉无效的提及）和共指结果（分割错误的簇）。大量实验证明了 ImCoref-CeS 的有效性，与现有最先进 (SOTA) 方法相比，它具有卓越的性能。

Title: Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models

Authors: Samir Abdaljalil, Erchin Serpedin, Khalid Qaraqe, Hasan Kurban
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10252
Pdf URL: https://arxiv.org/pdf/2510.10252
Copy Paste: [[2510.10252]] Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models(https://arxiv.org/abs/2510.10252)
Keywords: language model, llm, hallucination, chain-of-thought
Abstract: Large language models (LLMs) often generate reasoning traces that appear coherent but rest on unsupported assumptions, leading to hallucinated conclusions. Prior work mainly addresses factual hallucinations or relies on post-hoc verification, leaving reasoning-induced hallucinations largely unaddressed. We propose Audit-of-Understanding (AoU), a framework that constrains inference to validated premises through three phases: (1) decomposing a query into candidate assumptions, (2) auditing their support, and (3) conditioning inference only on the validated subset. Formally, AoU is posterior-constrained inference, connecting to selective prediction and rejection learning. Our contributions are threefold: (i) theoretical guarantees under perfect validation, (ii) excess-risk bounds under imperfect audits, and (iii) tractability analysis. Empirically, AoU improves both accuracy and faithfulness on GSM8K, MultiArith, and SVAMP, achieving up to +30% gains on GSM8K, +45% on MultiArith, and consistent +20-28% improvements on SVAMP over Chain-of-Thought, Self-Consistency, and CoT-Decoding. Code is available at this https URL.
摘要：大型语言模型 (LLM) 通常会生成看似连贯的推理痕迹，但基于不受支持的假设，从而导致产生幻觉的结论。之前的工作主要解决事实幻觉或依赖事后验证，而推理引起的幻觉基本上没有得到解决。我们提出了理解审核（AoU），这是一个通过三个阶段将推理限制为经过验证的前提的框架：（1）将查询分解为候选假设，（2）审核它们的支持，以及（3）仅在经过验证的子集上限制推理。形式上，AoU 是 \emph{后验约束推理}，与选择性预测和拒绝学习相关。我们的贡献有三个方面：（i）完美验证下的理论保证，（ii）不完美审计下的超额风险界限，以及（iii）可处理性分析。根据经验，AoU 提高了 GSM8K、MultiArith 和 SVAMP 的准确性和忠实度，在 GSM8K 上实现了高达 +30% 的增益，在 MultiArith 上实现了 +45% 的增益，并且在 SVAMP 上相对于思想链、自一致性和 CoT 解码实现了 +20--28% 的持续改进。代码可从此 https URL 获取。

Title: Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models

Authors: Liang Lin, Miao Yu, Moayad Aloqaily, Zhenhong Zhou, Kun Wang, Linsey Pang, Prakhar Mehrotra, Qingsong Wen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10265
Pdf URL: https://arxiv.org/pdf/2510.10265
Copy Paste: [[2510.10265]] Backdoor Collapse: Eliminating Unknown Threats via Known Backdoor Aggregation in Language Models(https://arxiv.org/abs/2510.10265)
Keywords: language model, llm
Abstract: Backdoor attacks are a significant threat to large language models (LLMs), often embedded via public checkpoints, yet existing defenses rely on impractical assumptions about trigger settings. To address this challenge, we propose a defense framework that requires no prior knowledge of trigger settings. Our method is based on the key observation that when deliberately injecting known backdoors into an already-compromised model, both existing unknown and newly injected backdoors aggregate in the representation space. It leverages this through a two-stage process: first, aggregating backdoor representations by injecting known triggers, and then performing recovery fine-tuning to restore benign outputs. Extensive experiments across multiple LLM architectures demonstrate that: (I) our method reduces the average Attack Success Rate to 4.41% across multiple benchmarks, outperforming existing baselines by 28.1% to 69.3%. (II) Clean accuracy and utility are preserved within 0.5% of the original model, ensuring negligible impact on legitimate tasks. (III) The defense generalizes across different types of backdoors, confirming its robustness in practical deployment scenarios.
摘要：后门攻击对大型语言模型 (LLM) 来说是一个重大威胁，通常通过公共检查点嵌入，但现有的防御措施依赖于有关触发设置的不切实际的假设。为了应对这一挑战，我们提出了“我们的方法”，这是一种不需要事先了解触发设置的防御框架。我们的方法基于一个关键观察，即当故意将已知后门注入已经受损的模型时，现有的未知后门和新注入的后门都会在表示空间中聚合。我们的方法通过两个阶段的过程来利用这一点：\textbf{first}，通过注入已知触发器来聚合后门表示，\textbf{then}，执行恢复微调以恢复良性输出。跨多个 LLM 架构的大量实验表明：(I) \我们的方法在多个基准测试中将平均攻击成功率降低至 4.41\%，比现有基准高出 28.1\%$\sim$69.3\%$\uparrow$。 (II) 干净的准确性和效用保持在原始模型的 0.5% 以内，确保对合法任务的影响可以忽略不计。（III）防御泛化于不同类型的后门，证实了其在实际部署场景中的稳健性。

Title: On the Entity-Level Alignment in Crosslingual Consistency

Authors: Yihong Liu, Mingyang Wang, François Yvon, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10280
Pdf URL: https://arxiv.org/pdf/2510.10280
Copy Paste: [[2510.10280]] On the Entity-Level Alignment in Crosslingual Consistency(https://arxiv.org/abs/2510.10280)
Keywords: language model, llm, prompt
Abstract: Multilingual large language models (LLMs) are expected to recall factual knowledge consistently across languages. However, the factors that give rise to such crosslingual consistency -- and its frequent failure -- remain poorly understood. In this work, we hypothesize that these inconsistencies may arise from failures in entity alignment, the process of mapping subject and object entities into a shared conceptual space across languages. To test this, we assess alignment through entity-level (subject and object) translation tasks, and find that consistency is strongly correlated with alignment across all studied models, with misalignment of subjects or objects frequently resulting in inconsistencies. Building on this insight, we propose SubSub and SubInj, two effective methods that integrate English translations of subjects into prompts across languages, leading to substantial gains in both factual recall accuracy and consistency. Finally, our mechanistic analysis reveals that these interventions reinforce the entity representation alignment in the conceptual space through the model's internal pivot-language processing, offering effective and practical strategies for improving multilingual factual prediction.
摘要：多语言大语言模型 (LLM) 有望跨语言一致地回忆事实知识。然而，造成这种跨语言一致性及其频繁失败的因素仍然知之甚少。在这项工作中，我们假设这些不一致可能是由于实体对齐失败而引起的，实体对齐是将主体和客体实体映射到跨语言的共享概念空间的过程。为了测试这一点，我们通过实体级（主体和客体）翻译任务评估一致性，并发现一致性与所有研究模型的一致性密切相关，主体或客体的错位经常导致不一致。基于这一见解，我们提出了 SubSub 和 SubInj 这两种有效的方法，将主题的英语翻译整合到跨语言的提示中，从而在事实回忆的准确性和一致性方面取得实质性进展。最后，我们的机制分析表明，这些干预措施通过模型的内部枢轴语言处理强化了概念空间中的实体表示对齐，为改进多语言事实预测提供了有效且实用的策略。

Title: MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning

Authors: Hongwei Chen, Yishu Lei, Dan Zhang, Bo Ke, Danxiang Zhu, Xuyi Chen, Yuxiang Lu, Zhengjie Huang, Shikun Feng, Jingzhou He, Yu Sun, Hua Wu, Haifeng Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10293
Pdf URL: https://arxiv.org/pdf/2510.10293
Copy Paste: [[2510.10293]] MatryoshkaThinking: Recursive Test-Time Scaling Enables Efficient Reasoning(https://arxiv.org/abs/2510.10293)
Keywords: language model
Abstract: Test-time scaling has emerged as a promising paradigm in language modeling, wherein additional computational resources are allocated during inference to enhance model performance. Recent approaches, such as DeepConf, have demonstrated the efficacy of this strategy; however, they often incur substantial computational overhead to achieve competitive results. In this work, we propose MatryoshkaThinking, a novel method that significantly reduces computational cost while maintaining state-of-the-art performance. Specifically, MatryoshkaThinking attains a score of 99.79 on AIME2025 using only 4% of the computation required by DeepConf. The core of our approach lies in the recursive exploitation of the model's intrinsic capabilities in reasoning, verification, and summarization, which collectively enhance the retention of correct solutions and reduce the disparity between Pass@k and Pass@1. Comprehensive evaluations across multiple open-source models and challenging multi-modal reasoning benchmarks validate the effectiveness and generality of our method. These findings offer new insights into the design of efficient and scalable test-time inference strategies for advanced language models.
摘要：测试时间缩放已成为语言建模中一个有前途的范例，其中在推理过程中分配额外的计算资源以增强模型性能。最近的方法，例如 DeepConf，已经证明了这种策略的有效性，但是，它们通常会产生大量的计算开销来获得有竞争力的结果。在这项工作中，我们提出了 MatryoshkaThinking，这是一种新颖的方法，可以显着降低计算成本，同时保持最先进的性能。具体来说，MatryoshkaThinking 仅使用 DeepConf 所需计算量的 4%，在 AIME2025 上获得了 99.79 的分数。我们方法的核心在于递归地利用模型在推理、验证和总结方面的内在能力，这些能力共同增强了正确解决方案的保留并减少了 Pass@k 和 Pass@1 之间的差异。跨多个开源模型和具有挑战性的多模态推理基准的综合评估验证了我们方法的有效性和通用性。这些发现为高级语言模型的高效且可扩展的测试时推理策略的设计提供了新的见解。
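
To make the Pass@k versus Pass@1 disparity mentioned above concrete, the sketch below computes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n sampled solutions of which c are correct. The abstract does not say which estimator the authors use, so treating this formula as their metric is an assumption; the function name `pass_at_k` and the example numbers are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without
    replacement from n generations (c of them correct) is correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when k > n - c, which yields pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers (not from the paper): 10 samples, 3 correct.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
gap = p5 - p1  # the Pass@k vs Pass@1 disparity for this toy problem
```

A method that reliably keeps correct solutions through its selection steps pushes Pass@1 toward Pass@k, which shrinks `gap`.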

Title: Are LLMs Empathetic to All? Investigating the Influence of Multi-Demographic Personas on a Model's Empathy

Authors: Ananya Malik, Nazanin Sabri, Melissa Karnaze, Mai Elsherief
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10328
Pdf URL: https://arxiv.org/pdf/2510.10328
Copy Paste: [[2510.10328]] Are LLMs Empathetic to All? Investigating the Influence of Multi-Demographic Personas on a Model's Empathy(https://arxiv.org/abs/2510.10328)
Keywords: language model, llm
Abstract: Large Language Models' (LLMs) ability to converse naturally is empowered by their ability to empathetically understand and respond to their users. However, emotional experiences are shaped by demographic and cultural contexts. This raises an important question: Can LLMs demonstrate equitable empathy across diverse user groups? We propose a framework to investigate how LLMs' cognitive and affective empathy vary across user personas defined by intersecting demographic attributes. Our study introduces a novel intersectional analysis spanning 315 unique personas, constructed from combinations of age, culture, and gender, across four LLMs. Results show that attributes profoundly shape a model's empathetic responses. Interestingly, we see that adding multiple attributes at once can attenuate and reverse expected empathy patterns. We show that the models' responses broadly reflect real-world empathetic trends, with notable misalignments for certain groups, such as those from Confucian culture. We complement our quantitative findings with qualitative insights to uncover model behaviour patterns across different demographic groups. Our findings highlight the importance of designing empathy-aware LLMs that account for demographic diversity to promote more inclusive and equitable model behaviour.
摘要：大型语言模型 (LLM) 自然交谈的能力得益于其同理心地理解和响应用户的能力。然而，情感体验是由人口和文化背景决定的。这就提出了一个重要的问题：法学硕士能否在不同的用户群体中表现出公平的同理心？我们提出了一个框架来研究法学硕士的认知和情感同理心如何随着由交叉人口统计属性定义的用户角色而变化。我们的研究引入了一种新颖的交叉分析，涵盖 315 个独特的角色，这些角色是根据年龄、文化和性别的组合构建的，涉及四个法学硕士。结果表明，属性深刻地塑造了模型的移情反应。有趣的是，我们发现一次添加多个属性可以减弱和逆转预期的同理心模式。我们表明，它们广泛反映了现实世界的移情趋势，但对于某些群体（例如来自儒家文化的群体）存在明显的偏差。我们用定性见解来补充我们的定量发现，以揭示不同人口群体的模型行为模式。我们的研究结果强调了设计具有同理心意识的法学硕士的重要性，该法学硕士考虑了人口多样性，以促进更具包容性和公平的模型行为。

Title: End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs

Authors: Nam Luu, Ondřej Bojar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10329
Pdf URL: https://arxiv.org/pdf/2510.10329
Copy Paste: [[2510.10329]] End-to-end Automatic Speech Recognition and Speech Translation: Integration of Speech Foundational Models and LLMs(https://arxiv.org/abs/2510.10329)
Keywords: language model, llm
Abstract: Speech Translation (ST) is a machine translation task that involves converting speech signals from one language to the corresponding text in another language; this task has two different approaches, namely the traditional cascade and the more recent end-to-end. This paper explores a combined end-to-end architecture of pre-trained speech encoders and Large Language Models (LLMs) for performing both Automatic Speech Recognition (ASR) and ST simultaneously. Experiments with the English-to-German language pair show that our best model not only can achieve better translation results than SeamlessM4T, a large foundational end-to-end, multi-modal translation model, but can also match the performance of a cascaded system with Whisper and NLLB, with a score gain of up to 8% on the $\text{COMET}^{\text{DA}}_{22}$ metric.
摘要：语音翻译（ST）是一种机器翻译任务，涉及将语音信号从一种语言转换为另一种语言的相应文本；该任务有两种不同的方法，即传统的级联和最近的端到端。本文探讨了预训练语音编码器和大型语言模型 (LLM) 的组合端到端架构，用于同时执行自动语音识别 (ASR) 和 ST。英语到德语语言对的实验表明，我们的最佳模型不仅可以实现比 SeamlessM4T（一种大型基础端到端、多模态翻译模型）更好的翻译结果，而且还可以与 Whisper 和 NLLB 级联系统的性能相匹配，在 $\text{COMET}^{\text{DA}}_{22}$ 指标中得分提升高达 8%。

Title: RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models

Authors: Aashiq Muhamed, Leonardo F. R. Ribeiro, Markus Dreyer, Virginia Smith, Mona T. Diab
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.10390
Pdf URL: https://arxiv.org/pdf/2510.10390
Copy Paste: [[2510.10390]] RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models(https://arxiv.org/abs/2510.10390)
Keywords: language model
Abstract: The ability of language models in RAG systems to selectively refuse to answer based on flawed context is critical for safety, yet remains a significant failure point. Our large-scale study reveals that even frontier models struggle in this setting, with refusal accuracy dropping below 50% on multi-document tasks, while exhibiting either dangerous overconfidence or overcaution. Static benchmarks fail to reliably evaluate this capability, as models exploit dataset-specific artifacts and memorize test instances. We introduce RefusalBench, a generative methodology that programmatically creates diagnostic test cases through controlled linguistic perturbation. Our framework employs 176 distinct perturbation strategies across six categories of informational uncertainty and three intensity levels. Evaluation of over 30 models uncovers systematic failure patterns: refusal comprises separable detection and categorization skills, and neither scale nor extended reasoning improves performance. We find that selective refusal is a trainable, alignment-sensitive capability, offering a clear path for improvement. We release two benchmarks -- RefusalBench-NQ (single document) and RefusalBench-GaRAGe (multi-document) -- and our complete generation framework to enable continued, dynamic evaluation of this critical capability.
摘要：RAG 系统中的语言模型能够根据有缺陷的上下文选择性地拒绝回答，这对于安全至关重要，但仍然是一个重要的故障点。我们的大规模研究表明，即使是前沿模型在这种情况下也会陷入困境，在多文档任务上拒绝准确率降至 50% 以下，同时表现出危险的过度自信或过度谨慎。静态基准测试无法可靠地评估此功能，因为模型会利用特定于数据集的工件并记住测试实例。我们引入了 RefusalBench，这是一种生成方法，可以通过受控语言扰动以编程方式创建诊断测试用例。我们的框架采用了 176 种不同的扰动策略，涵盖六类信息不确定性和三个强度级别。对 30 多个模型的评估揭示了系统性故障模式：拒绝包括可分离的检测和分类技能，规模和扩展推理都无法提高性能。我们发现选择性拒绝是一种可训练的、对联盟敏感的能力，为改进提供了明确的途径。我们发布了两个基准测试——RefusalBench-NQ（单文档）和RefusalBench-GaRAGe（多文档）——以及我们完整的生成框架，以实现对这一关键功能的持续、动态评估。

Title: STEAM: A Semantic-Level Knowledge Editing Framework for Large Language Models

Authors: Geunyeong Jeong, Juoh Sun, Seonghee Lee, Harksoo Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10398
Pdf URL: https://arxiv.org/pdf/2510.10398
Copy Paste: [[2510.10398]] STEAM: A Semantic-Level Knowledge Editing Framework for Large Language Models(https://arxiv.org/abs/2510.10398)
Keywords: language model
Abstract: Large Language Models store extensive factual knowledge acquired during large-scale pre-training. However, this knowledge is inherently static, reflecting only the state of the world at the time of training. Knowledge editing has emerged as a promising solution for updating outdated or incorrect facts without full retraining. However, most existing locate-and-edit methods primarily focus on token-level likelihood optimization without addressing semantic coherence. Our analysis reveals that such edited knowledge is often encoded as isolated residual streams in the model's latent space, distinct from pre-existing knowledge and bypassing the natural reasoning process. To address this, we propose STEAM, a semantic-level knowledge editing framework that enhances integration of updated knowledge into the model's knowledge structure. STEAM first identifies target representations as semantic anchors for the updated factual association, then guides the internal representation of the edited fact towards these anchors through an alignment loss during optimization. Experimental results demonstrate that STEAM improves the model's ability to reason with edited knowledge and enhances semantic coherence, underscoring the importance of latent-space alignment for reliable and coherent knowledge editing. The code is available at this https URL.
摘要：大型语言模型存储在大规模预训练期间获得的广泛事实知识。然而，这些知识本质上是静态的，仅反映训练时的世界状态。知识编辑已成为一种很有前途的解决方案，可以在不进行全面再培训的情况下更新过时或不正确的事实。然而，大多数现有的定位和编辑方法主要关注标记级似然优化，而不解决语义一致性问题。我们的分析表明，此类编辑的知识通常被编码为模型潜在空间中的孤立残差流，与预先存在的知识不同并绕过自然推理过程。为了解决这个问题，我们提出了 \textsc{Steam}，一个语义级知识编辑框架，可以增强更新知识与模型知识结构的集成。 \textsc{Steam} 首先将目标表示识别为更新的事实关联的语义锚点，然后通过优化过程中的对齐损失将编辑事实的内部表示引导到这些锚点。实验结果表明， \textsc{Steam} 提高了模型利用编辑知识进行推理的能力，并增强了语义连贯性，强调了潜在空间对齐对于可靠且连贯的知识编辑的重要性。该代码可从此 https URL 获取。

Title: LONGQAEVAL: Designing Reliable Evaluations of Long-Form Clinical QA under Resource Constraints

Authors: Federica Bologna, Tiffany Pan, Matthew Wilkens, Yue Guo, Lucy Lu Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10415
Pdf URL: https://arxiv.org/pdf/2510.10415
Copy Paste: [[2510.10415]] LONGQAEVAL: Designing Reliable Evaluations of Long-Form Clinical QA under Resource Constraints(https://arxiv.org/abs/2510.10415)
Keywords: llm
Abstract: Evaluating long-form clinical question answering (QA) systems is resource-intensive and challenging: accurate judgments require medical expertise and achieving consistent human judgments over long-form text is difficult. We introduce LongQAEval, an evaluation framework and set of evaluation recommendations for limited-resource and high-expertise settings. Based on physician annotations of 300 real patient questions answered by physicians and LLMs, we compare coarse answer-level versus fine-grained sentence-level evaluation over the dimensions of correctness, relevance, and safety. We find that inter-annotator agreement (IAA) varies by dimension: fine-grained annotation improves agreement on correctness, coarse improves agreement on relevance, and judgments on safety remain inconsistent. Additionally, annotating only a small subset of sentences can provide reliability comparable to coarse annotations, reducing cost and effort.
摘要：评估长篇临床问答 (QA) 系统需要大量资源且具有挑战性：准确的判断需要医学专业知识，并且对长篇文本实现一致的人类判断很困难。我们引入了 LongQAEval，这是一个针对资源有限和高专业知识环境的评估框架和一套评估建议。根据医生和法学硕士回答的 300 个真实患者问题的医生注释，我们在正确性、相关性和安全性维度上比较粗略答案级别与细粒度句子级别评估。我们发现注释者间一致性（IAA）因维度而异：细粒度注释提高了正确性的一致性，粗粒度的注释提高了相关性的一致性，而对安全性的判断仍然不一致。此外，仅注释一小部分句子可以提供与粗略注释相当的可靠性，从而减少成本和工作量。

Title: Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance

Authors: Jingyi Chen, Zhimeng Guo, Jiyun Chun, Pichao Wang, Andrew Perrault, Micha Elsner
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10444
Pdf URL: https://arxiv.org/pdf/2510.10444
Copy Paste: [[2510.10444]] Do Audio LLMs Really LISTEN, or Just Transcribe? Measuring Lexical vs. Acoustic Emotion Cues Reliance(https://arxiv.org/abs/2510.10444)
Keywords: language model, llm
Abstract: Understanding emotion from speech requires sensitivity to both lexical and acoustic cues. However, it remains unclear whether large audio language models (LALMs) genuinely process acoustic information or rely primarily on lexical content. We present LISTEN (Lexical vs. Acoustic Speech Test for Emotion in Narratives), a controlled benchmark designed to disentangle lexical reliance from acoustic sensitivity in emotion understanding. Across evaluations of six state-of-the-art LALMs, we observe a consistent lexical dominance. Models predict "neutral" when lexical cues are neutral or absent, show limited gains under cue alignment, and fail to classify distinct emotions under cue conflict. In paralinguistic settings, performance approaches chance. These results indicate that current LALMs largely "transcribe" rather than "listen," relying heavily on lexical semantics while underutilizing acoustic cues. LISTEN offers a principled framework for assessing emotion understanding in multimodal models.
摘要：从言语中理解情感需要对词汇和声音线索敏感。然而，目前尚不清楚大型音频语言模型（LALM）是否真正处理声学信息或主要依赖于词汇内容。我们提出了 LISTEN（叙事情感的词汇与声学言语测试），这是一个受控基准，旨在将情感理解中的词汇依赖与声学敏感性分开。在对六种最先进的 LALM 的评估中，我们观察到一致的词汇优势。当词汇提示为中性或不存在时，模型会预测“中性”，在提示对齐下显示出有限的增益，并且无法对提示冲突下的不同情绪进行分类。在副语言环境中，表现接近偶然。这些结果表明，当前的 LALM 主要是“转录”而不是“聆听”，严重依赖词汇语义，而未充分利用声学线索。 LISTEN 提供了一个评估多模态模型中情绪理解的原则框架。

Title: RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation

Authors: Zhichao Xu, Minheng Wang, Yawei Wang, Wenqian Ye, Yuntao Du, Yunpu Ma, Yijun Tian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10448
Pdf URL: https://arxiv.org/pdf/2510.10448
Copy Paste: [[2510.10448]] RECON: Reasoning with Condensation for Efficient Retrieval-Augmented Generation(https://arxiv.org/abs/2510.10448)
Keywords: llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) systems trained using reinforcement learning (RL) with reasoning are hampered by inefficient context management, where long, noisy retrieved documents increase costs and degrade performance. We introduce RECON (REasoning with CONdensation), a framework that integrates an explicit summarization module to compress evidence within the reasoning loop. Our summarizer is trained via a two-stage process: relevance pretraining on QA datasets, followed by multi-aspect distillation from proprietary LLMs to ensure factuality and clarity. Integrated into the Search-R1 pipeline, RECON reduces total context length by 35%, leading to improved training speed and inference latency, while simultaneously improving RAG performance on downstream QA benchmarks. Notably, it boosts the average EM score of the 3B model by 14.5% and the 7B model by 3.0%, showing particular strength in multi-hop QA. RECON demonstrates that learned context compression is essential for building practical, scalable, and performant RAG systems. Our code implementation is made available at this https URL.
摘要：使用强化学习 (RL) 进行推理训练的检索增强生成 (RAG) 系统受到低效的上下文管理的阻碍，其中长而嘈杂的检索文档会增加成本并降低性能。我们引入 RECON（REasoning with CONdensation），这是一个集成显式摘要模块以压缩推理循环内证据的框架。我们的摘要器通过两个阶段的过程进行培训：对 QA 数据集进行相关性预训练，然后从专有的法学硕士进行多方面蒸馏，以确保事实性和清晰度。 RECON 集成到 Search-R1 管道中，将上下文总长度减少了 35%，从而提高了训练速度和推理延迟，同时提高了下游 QA 基准上的 RAG 性能。值得注意的是，它将 3B 模型的平均 EM 分数提高了 14.5%，7B 模型的平均 EM 分数提高了 3.0%，在多跳 QA 中显示出特殊的优势。 RECON 证明了学习上下文压缩对于构建实用、可扩展且高性能的 RAG 系统至关重要。我们的代码实现可通过此 https URL 获取。
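
The idea of compressing evidence inside the reasoning loop, as described in the abstract above, can be sketched as a search loop in which each batch of retrieved documents is summarized before it enters the context. This is a rough sketch under stated assumptions, not the RECON or Search-R1 code: `policy`, `search`, and `summarize` are hypothetical callables, and the toy versions below replace the trained summarizer and the RL-trained policy with trivial rules.

```python
from typing import Callable, List, Optional, Tuple

# An action is ("search", query) or ("answer", text); this encoding is an assumption.
Action = Tuple[str, str]


def reason_with_condensation(
    question: str,
    policy: Callable[[str, List[str]], Action],
    search: Callable[[str], List[str]],
    summarize: Callable[[str, List[str]], str],
    max_turns: int = 4,
) -> Tuple[Optional[str], List[str]]:
    """Alternate between searching and answering, storing only summaries."""
    evidence: List[str] = []
    for _ in range(max_turns):
        kind, payload = policy(question, evidence)
        if kind == "answer":
            return payload, evidence
        docs = search(payload)
        # Condense the raw documents before they are added to the context.
        evidence.append(summarize(payload, docs))
    return None, evidence  # no answer within the turn budget


# Toy stand-ins (assumptions for illustration only).
def toy_search(query: str) -> List[str]:
    return [
        "Paris is the capital of France. It hosts many museums and cafes.",
        "France borders Spain. Its capital is Paris, on the Seine.",
    ]


def toy_summarize(query: str, docs: List[str]) -> str:
    # Keep only the first sentence of the top document.
    return docs[0].split(". ")[0] + "."


def toy_policy(question: str, evidence: List[str]) -> Action:
    if not evidence:
        return ("search", "capital of France")
    return ("answer", evidence[-1])


raw_chars = sum(len(d) for d in toy_search("capital of France"))
answer, evidence = reason_with_condensation(
    "What is the capital of France?", toy_policy, toy_search, toy_summarize
)
```

Because only the summary is appended to `evidence`, the context the policy sees grows by one short string per search turn instead of by the full text of every retrieved document.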

Title: Steering Over-refusals Towards Safety in Retrieval Augmented Generation

Authors: Utsav Maskey, Mark Dras, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10452
Pdf URL: https://arxiv.org/pdf/2510.10452
Copy Paste: [[2510.10452]] Steering Over-refusals Towards Safety in Retrieval Augmented Generation(https://arxiv.org/abs/2510.10452)
Keywords: language model, llm, retrieval augmented generation, retrieval-augmented generation
Abstract: Safety alignment in large language models (LLMs) induces over-refusals -- where LLMs decline benign requests due to aggressive safety filters. We analyze this phenomenon in retrieval-augmented generation (RAG), where both the query intent and retrieved context properties influence refusal behavior. We construct RagRefuse, a domain-stratified benchmark spanning medical, chemical, and open domains, pairing benign and harmful queries with controlled context contamination patterns and sizes. Our analysis shows that context arrangement / contamination, domain of query and context, and harmful-text density trigger refusals even on benign queries, with effects depending on model-specific alignment choices. To mitigate over-refusals, we introduce SafeRAG-Steering, a model-centric embedding intervention that steers the embedding regions towards the confirmed safe, non-refusing output regions at inference time. This reduces over-refusals in contaminated RAG pipelines while preserving legitimate refusals.
摘要：大型语言模型 (LLM) 中的安全对齐会导致过度拒绝，其中 LLM 由于积极的安全过滤器而拒绝良性请求。我们在检索增强生成（RAG）中分析了这种现象，其中查询意图和检索的上下文属性都会影响拒绝行为。我们构建了 RagRefuse，这是一个跨越医学、化学和开放领域的领域分层基准，将良性和有害查询与受控上下文污染模式和大小配对。我们的分析表明，即使在良性查询上，上下文排列/污染、查询和上下文的域以及有害文本密度也会触发拒绝，其影响取决于特定于模型的对齐选择。为了减轻过度拒绝，我们引入了 \textsc{SafeRAG-Steering}，这是一种以模型为中心的嵌入干预，可在推理时将嵌入区域引导至已确认的安全、非拒绝输出区域。这减少了受污染的 RAG 管道中的过度拒绝，同时保留了合法的拒绝。

Title: Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?

Authors: Shaobo Wang, Cong Wang, Wenjie Fu, Yue Min, Mingquan Feng, Isabel Guan, Xuming Hu, Conghui He, Cunxiang Wang, Kexin Yang, Xingzhang Ren, Fei Huang, Dayiheng Liu, Linfeng Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.10457
Pdf URL: https://arxiv.org/pdf/2510.10457
Copy Paste: [[2510.10457]] Rethinking LLM Evaluation: Can We Evaluate LLMs with 200x Less Data?(https://arxiv.org/abs/2510.10457)
Keywords: llm
Abstract: As the demand for comprehensive evaluations of diverse model capabilities steadily increases, benchmark suites have correspondingly grown significantly in scale. Despite notable advances in redundancy reduction and subset-level performance prediction, a systematic framework that effectively integrates these methods to ensure both prediction accuracy and ranking consistency is still largely elusive. In this paper, we first perform a sample-level analysis of benchmark redundancy and identify several highly similar samples that can be eliminated. We also frame benchmark compression as an optimization problem with the aim of score reconstruction. Building on these, we then propose EssenceBench, a coarse-to-fine framework utilizing an iterative Genetic Algorithm (GA), which takes advantage of fitness-based subset search and attribution-based sample search. Compared to previous methods, our approach yields superior compression results with lower reconstruction error and markedly higher efficiency. In particular, on the HellaSwag benchmark (10K samples), our method preserves the ranking of all models shifting within 5% using 25x fewer samples, and achieves 95% ranking preservation shifting within 5% using 200x fewer samples.
摘要：随着对多种模型能力综合评估的需求稳步增长，基准套件的规模也相应显着增长。尽管在冗余减少和子集级性能预测方面取得了显着进展，但有效集成这些方法以确保预测准确性和排名一致性的系统框架仍然在很大程度上难以实现。在本文中，我们首先对基准冗余进行样本级分析，并识别出几个可以消除的高度相似的样本。此外，我们将基准压缩视为一个优化问题，目的是重建分数。在此基础上，我们提出了 EssenceBench，这是一个利用迭代遗传算法 (GA) 的从粗到细的框架，它利用了基于适应度的子集搜索和基于归因的样本搜索的优点。与以前的方法相比，我们的方法产生了优异的压缩结果，具有更低的重建误差和显着更高的效率。特别是，在 HellaSwag 基准（10K 样本）上，我们的方法使用减少 25 倍的样本将所有模型的排名保持在 5% 以内，并且仅使用 200 倍的样本就实现了 95% 的排名保持在 5% 以内。
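
Framing benchmark compression as a subset search scored by reconstruction error, as the abstract above describes, can be illustrated with a small genetic algorithm. This is a simplified sketch and not EssenceBench itself: it uses subset accuracy as the reconstructed score (the paper's predictor is not specified in the abstract), omits the attribution-based sample search, and runs on synthetic correctness data; all function names and parameters are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def reconstruction_error(scores: Sequence[Sequence[int]], subset: Sequence[int]) -> float:
    """RMSE between each model's accuracy on the subset and on the full benchmark."""
    n_samples = len(scores[0])
    total = 0.0
    for row in scores:  # one row of 0/1 correctness flags per model
        full_acc = sum(row) / n_samples
        sub_acc = sum(row[i] for i in subset) / len(subset)
        total += (full_acc - sub_acc) ** 2
    return (total / len(scores)) ** 0.5


def crossover(a: List[int], b: List[int], k: int, rng: random.Random) -> List[int]:
    # Child draws k distinct sample indices from the union of both parents.
    return rng.sample(sorted(set(a) | set(b)), k)


def mutate(subset: List[int], n_samples: int, rng: random.Random) -> List[int]:
    # Swap one selected index for an unselected one (requires k < n_samples).
    child = list(subset)
    chosen = set(child)
    outside = [i for i in range(n_samples) if i not in chosen]
    child[rng.randrange(len(child))] = rng.choice(outside)
    return child


def ga_select(scores, k, pop_size=20, generations=50, seed=0) -> List[int]:
    rng = random.Random(seed)
    n_samples = len(scores[0])
    population = [rng.sample(range(n_samples), k) for _ in range(pop_size)]
    for _ in range(generations):
        # Fitness-based selection: keep the better half unchanged (elitism).
        population.sort(key=lambda s: reconstruction_error(scores, s))
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, k, rng), n_samples, rng))
        population = parents + children
    return min(population, key=lambda s: reconstruction_error(scores, s))


# Synthetic data: 6 hypothetical models with different skill levels, 200 samples.
data_rng = random.Random(1)
demo_scores = [
    [1 if data_rng.random() < skill else 0 for _ in range(200)]
    for skill in (0.3, 0.4, 0.5, 0.6, 0.7, 0.8)
]
best = ga_select(demo_scores, k=20)
best_err = reconstruction_error(demo_scores, best)
```

Keeping the better half of the population unchanged each generation means the best subset found never gets worse, so `best_err` is at most the error of any subset in the initial random population.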

Title: NIM: Neuro-symbolic Ideographic Metalanguage for Inclusive Communication

Authors: Prawaal Sharma, Poonam Goyal, Navneet Goyal, Vidisha Sharma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10459
Pdf URL: https://arxiv.org/pdf/2510.10459
Copy Paste: [[2510.10459]] NIM: Neuro-symbolic Ideographic Metalanguage for Inclusive Communication(https://arxiv.org/abs/2510.10459)
Keywords: language model, llm
Abstract: Digital communication has become the cornerstone of modern interaction, enabling rapid, accessible, and interactive exchanges. However, individuals with lower academic literacy often face significant barriers, exacerbating the "digital divide". In this work, we introduce a novel, universal ideographic metalanguage designed as an innovative communication framework that transcends academic, linguistic, and cultural boundaries. Our approach leverages principles of Neuro-symbolic AI, combining neural-based large language models (LLMs) enriched with world knowledge and symbolic knowledge heuristics grounded in the linguistic theory of Natural Semantic Metalanguage (NSM). This enables the semantic decomposition of complex ideas into simpler, atomic concepts. Adopting a human-centric, collaborative methodology, we engaged over 200 semi-literate participants in defining the problem, selecting ideographs, and validating the system. With over 80% semantic comprehensibility, an accessible learning curve, and universal adaptability, our system effectively serves underprivileged populations with limited formal education.
摘要：数字通信已成为现代互动的基石，实现快速、便捷、互动的交流。然而，学术素养较低的个人往往面临重大障碍，加剧了“数字鸿沟”。在这项工作中，我们介绍了一种新颖的、通用的表意元语言，它被设计为超越学术、语言和文化界限的创新交流框架。我们的方法利用神经符号人工智能的原理，结合基于神经的大语言模型（LLM），丰富了世界知识和基于自然语义元语言（NSM）语言理论的符号知识启发法。这使得复杂的想法能够在语义上分解为更简单的原子概念。我们采用以人为本的协作方法，让 200 多名半文盲参与者参与定义问题、选择表意文字和验证系统。我们的系统具有超过 80% 的语义可理解性、易于理解的学习曲线和普遍的适应性，有效地服务于正规教育有限的贫困人群。

Title: FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth

Authors: Qiran Zou, Hou Hei Lam, Wenhao Zhao, Yiming Tang, Tingting Chen, Samson Yu, Tianyi Zhang, Chang Liu, Xiangyang Ji, Dianbo Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10472
Pdf URL: https://arxiv.org/pdf/2510.10472
Copy Paste: [[2510.10472]] FML-bench: A Benchmark for Automatic ML Research Agents Highlighting the Importance of Exploration Breadth(https://arxiv.org/abs/2510.10472)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have sparked growing interest in automatic machine learning research agents. Among them, agents capable of autonomously proposing ideas and conducting machine learning experiments are particularly promising, as they maximize research automation and accelerate scientific progress by iteratively refining ideas based on experimental results. However, comprehensively evaluating such agents remains challenging. Existing benchmarks tend to overemphasize engineering aspects while neglecting academic rigor, creating barriers that obscure a clear assessment of an agent's scientific capabilities in machine learning research. They also suffer from limited task diversity, an overemphasis on application-oriented tasks over fundamental research problems, and limited scalability to realistic research settings. To address these limitations, we introduce FML-bench, a benchmark designed to evaluate automatic machine learning research agents on 8 diverse and fundamental machine learning research problems. It reduces coding burden, emphasizes fundamental problems rather than specific use cases, offers high task diversity, and is extensible to real-world machine learning GitHub repositories. Furthermore, we present a unified evaluation framework with five complementary metrics, designed to comprehensively assess agent performance on our benchmark. We evaluate state-of-the-art automatic research agents on FML-bench, and find that agents employing broad research exploration strategies outperform those focusing on narrow but deep exploration. These findings suggest that emphasizing the breadth of exploration may lead to more effective research outcomes than focusing solely on incremental refinement. Our benchmark is available at this https URL.
摘要：大型语言模型（LLM）引发了人们对自动机器学习研究代理日益增长的兴趣。其中，能够自主提出想法和进行机器学习实验的智能体尤其有前途，因为它们可以根据实验结果迭代地完善想法，从而最大限度地提高研究自动化并加速科学进步。然而，全面评估此类药物仍然具有挑战性。现有的基准往往过分强调工程方面，而忽视了学术严谨性，从而造成了障碍，阻碍了对代理在机器学习研究中的科学能力的清晰评估。它们还面临任务多样性有限、过分强调面向应用的任务而不是基础研究问题以及现实研究环境的可扩展性有限等问题。为了解决这些限制，我们引入了 FML-bench，这是一个旨在评估自动机器学习研究代理在 8 个不同的基本机器学习研究问题上的基准。它减少了编码负担，强调基本问题而不是特定用例，提供高度的任务多样性，并且可扩展到现实世界的机器学习 GitHub 存储库。此外，我们提出了一个具有五个补充指标的统一评估框架，旨在根据我们的基准全面评估代理的表现。我们在 FML-bench 上评估了最先进的自动研究代理，发现采用广泛研究探索策略的代理优于那些专注于狭窄但深入探索的代理。这些发现表明，强调探索的广度可能比仅仅关注渐进式细化带来更有效的研究成果。我们的基准测试可以在这个 https URL 上找到。

Title: Assessing Large Language Models for Structured Medical Order Extraction

Authors: A H M Rezaul Karim, Ozlem Uzuner
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10475
Pdf URL: https://arxiv.org/pdf/2510.10475
Copy Paste: [[2510.10475]] Assessing Large Language Models for Structured Medical Order Extraction(https://arxiv.org/abs/2510.10475)
Keywords: language model, llm, prompt
Abstract: Medical order extraction is essential for structuring actionable clinical information, supporting decision-making, and enabling downstream applications such as documentation and workflow automation. Orders may be embedded in diverse sources, including electronic health records, discharge summaries, and multi-turn doctor-patient dialogues, and can span categories such as medications, laboratory tests, imaging studies, and follow-up actions. The MEDIQA-OE 2025 shared task focuses on extracting structured medical orders from extended conversational transcripts, requiring the identification of order type, description, reason, and provenance. We present the MasonNLP submission, which ranked 5th among 17 participating teams with 105 total submissions. Our approach uses a general-purpose, instruction-tuned LLaMA-4 17B model without domain-specific fine-tuning, guided by a single in-context example. This few-shot configuration achieved an average F1 score of 37.76, with notable improvements in reason and provenance accuracy. These results demonstrate that large, non-domain-specific LLMs, when paired with effective prompt engineering, can serve as strong, scalable baselines for specialized clinical NLP tasks.
摘要：医疗医嘱提取对于构建可操作的临床信息、支持决策以及支持文档和工作流程自动化等下游应用程序至关重要。医嘱可以嵌入多种来源，包括电子健康记录、出院摘要和多轮医患对话，并且可以涵盖药物、实验室检查、影像学研究和后续行动等类别。 MEDIQA-OE 2025 共享任务侧重于从扩展对话记录中提取结构化医疗医嘱，需要识别医嘱类型、描述、原因和出处。我们展示了 MasonNLP 提交的作品，在 17 个参赛团队的 105 份提交中排名第五。我们的方法使用通用的、指令调整的 LLaMA-4 17B 模型，没有特定于领域的微调，由单个上下文示例引导。这种少数镜头配置的平均 F1 分数为 37.76，在推理和来源准确性方面显着提高。这些结果表明，大型非特定领域的法学硕士与有效的即时工程相结合，可以作为专业临床 NLP 任务的强大、可扩展的基线。

Title: UltraLLaDA: Scaling the Context Length to 128K for Diffusion Large Language Models

Authors: Guangxin He, Shen Nie, Fengqi Zhu, Yuankang Zhao, Tianyi Bai, Ran Yan, Jie Fu, Chongxuan Li, Binhang Yuan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10481
Pdf URL: https://arxiv.org/pdf/2510.10481
Copy Paste: [[2510.10481]] UltraLLaDA: Scaling the Context Length to 128K for Diffusion Large Language Models(https://arxiv.org/abs/2510.10481)
Keywords: language model, llm
Abstract: Diffusion LLMs have attracted growing interest, with plenty of recent work emphasizing their great potential in various downstream tasks; yet the long-context behavior of diffusion LLMs remains largely uncharted. We present a case study of post-training techniques for extending the context window of diffusion LLMs (i.e., LLaDA) without retraining from scratch. We show that a simple modification to the standard Rotary Positional Embeddings (RoPE) extension effectively accommodates the probabilistic modeling inherent in the diffusion process, enabling stable scaling to longer context ranges. We further compare masking strategies used during post-training and analyze their impact on optimization stability and long-range recall. Instantiating these insights, we introduce UltraLLaDA, a diffusion LLM with a 128K-token context window that, in our empirical evaluation on long-context tasks, significantly outperforms training-free baselines. Our experimental results highlight the special positional extension as a key lever for scaling diffusion LLMs to extended contexts and offer practical guidance for practitioners seeking 128K-scale context via efficient post-training.

Title: Merlin's Whisper: Enabling Efficient Reasoning in LLMs via Black-box Adversarial Prompting

Authors: Heming Xia, Cunxiao Du, Rui Li, Chak Tou Leong, Yongqi Li, Wenjie Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.10528
Pdf URL: https://arxiv.org/pdf/2510.10528
Copy Paste: [[2510.10528]] Merlin's Whisper: Enabling Efficient Reasoning in LLMs via Black-box Adversarial Prompting(https://arxiv.org/abs/2510.10528)
Keywords: llm, prompt
Abstract: Large reasoning models (LRMs) have demonstrated remarkable proficiency in tackling complex reasoning tasks through step-by-step thinking. However, such a lengthy reasoning process incurs substantial computational and latency overheads, hindering the practical deployment of these models. In this work, we present a new perspective on mitigating overthinking in LRMs via black-box adversarial prompting. By treating both open-source LRMs and closed-source APIs as black-box communicators, we investigate how to elicit concise responses without sacrificing accuracy. We introduce AdvPrompt, an iterative refinement framework that generates high-quality adversarial prompts from diverse perspectives. Experiments across multiple benchmarks demonstrate that AdvPrompt consistently reduces token usage while preserving performance. Notably, AdvPrompt achieves a 3x reduction in average response length on simple GSM8K questions for the Qwen3 model series, and delivers an average ~40% token reduction across four benchmarks. For closed-source APIs, AdvPrompt reduces token usage on MATH-500 by 35% for Claude-3.7 and 47% for Gemini-2.5. Further analysis reveals the generalizability of AdvPrompt across various model scales and families, underscoring the potential of black-box prompting as a practical and effective strategy for enhancing LRM efficiency.

Title: Detecting Hallucinations in Authentic LLM-Human Interactions

Authors: Yujie Ren, Niklas Gruhlke, Anne Lauscher
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10539
Pdf URL: https://arxiv.org/pdf/2510.10539
Copy Paste: [[2510.10539]] Detecting Hallucinations in Authentic LLM-Human Interactions(https://arxiv.org/abs/2510.10539)
Keywords: language model, llm, hallucination
Abstract: As large language models (LLMs) are increasingly applied in sensitive domains such as medicine and law, hallucination detection has become a critical task. Although numerous benchmarks have been proposed to advance research in this area, most of them are artificially constructed--either through deliberate hallucination induction or simulated interactions--rather than derived from genuine LLM-human dialogues. Consequently, these benchmarks fail to fully capture the characteristics of hallucinations that occur in real-world usage. To address this limitation, we introduce AuthenHallu, the first hallucination detection benchmark built entirely from authentic LLM-human interactions. For AuthenHallu, we select and annotate samples from genuine LLM-human dialogues, thereby providing a faithful reflection of how LLMs hallucinate in everyday user interactions. Statistical analysis shows that hallucinations occur in 31.4% of the query-response pairs in our benchmark, and this proportion increases dramatically to 60.0% in challenging domains such as Math & Number Problems. Furthermore, we explore the potential of using vanilla LLMs themselves as hallucination detectors and find that, despite some promise, their current performance remains insufficient in real-world scenarios.

Title: BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices

Authors: Euhid Aman, Esteban Carlin, Hsing-Kuo Pao, Giovanni Beltrame, Ghaluh Indah Permata Sari, Yie-Tarng Chen
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2510.10560
Pdf URL: https://arxiv.org/pdf/2510.10560
Copy Paste: [[2510.10560]] BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices(https://arxiv.org/abs/2510.10560)
Keywords: language model
Abstract: Cross-attention transformers and other multimodal vision-language models excel at grounding and generation; however, their extensive, full-precision backbones make it challenging to deploy them on edge devices. Memory-augmented architectures enhance the utilization of past context; however, most works rarely pair them with aggressive edge-oriented quantization. We introduce BitMar, a quantized multimodal transformer that proposes an external human-like episodic memory for effective image-text generation on hardware with limited resources. BitMar utilizes 1.58-bit encoders, one for text (BitNet-style) and one for vision (DiNOv2-based), to create compact embeddings that are combined and used to query a fixed-size key-value episodic memory. During vector retrieval, the BitNet decoder applies per-layer conditioning, which increases the contextual relevance of generated content. The decoder also employs attention sinks with a sliding-window mechanism to process long or streaming inputs under tight memory budgets. The combination of per-layer conditioning and sliding-window attention achieves a strong quality-speed trade-off, delivering competitive captioning and multimodal understanding at low latency with a small model footprint. These characteristics make BitMar well-suited for edge deployment.
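Example: the 1.58-bit figure corresponds to ternary weights, since log2(3) is about 1.58. The sketch below shows absmean ternary quantization in the style described for BitNet b1.58. Whether BitMar uses exactly this scheme is an assumption, and `ternary_quantize` and its `eps` parameter are hypothetical names chosen for illustration.

```python
def ternary_quantize(weights, eps=1e-8):
    """Map float weights to {-1, 0, 1} plus one shared scale.

    Assumption: absmean scaling (divide by the mean absolute weight),
    then round and clip to the ternary range.
    """
    # Shared scale: mean absolute value of the weights (eps avoids division by zero).
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    # Round each scaled weight to the nearest integer and clip to [-1, 1].
    quantized = [max(-1, min(1, round(w / scale))) for w in weights]
    return quantized, scale


q, s = ternary_quantize([0.9, -0.05, 0.4, -1.2])
# q == [1, 0, 1, -1]; an approximate reconstruction is [v * s for v in q]
```

Each weight is stored as one of three values plus a single shared scale, which is where the memory saving over a full-precision backbone would come from.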

Title: Dynamic Topic Evolution with Temporal Decay and Attention in Large Language Models

Authors: Di Wu, Shuaidong Pan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10613
Pdf URL: https://arxiv.org/pdf/2510.10613
Copy Paste: [[2510.10613]] Dynamic Topic Evolution with Temporal Decay and Attention in Large Language Models(https://arxiv.org/abs/2510.10613)
Keywords: language model
Abstract: This paper proposes a modeling framework for dynamic topic evolution based on temporal large language models. The method first uses a large language model to obtain contextual embeddings of text and then introduces a temporal decay function and an attention mechanism. These components allow the model to adjust the importance of semantic units according to time intervals and capture topic variations across different periods. The temporal representations are then mapped into a latent topic space, where a state transition matrix is applied to describe the dynamic evolution of topics. A joint optimization objective constrains both semantic modeling and temporal consistency, ensuring diversity and smoothness in topic generation. The design emphasizes the unified modeling of semantic representation and temporal evolution, which improves topic coherence and diversity while enhancing stability and interpretability over time. Experiments on real-world corpora show that the framework effectively captures the generation, expansion, and decline of topics and outperforms existing models across multiple metrics. Overall, the proposed method provides a systematic solution for understanding dynamic semantic patterns in large-scale text, enriches the research paradigm of topic modeling, and supports complex text analysis tasks in multiple domains.
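Example: a minimal sketch of how a temporal decay term can be combined with attention weights. The abstract does not give the functional form, so the exponential decay, the dot-product scoring, and the names `decayed_attention` and `decay_rate` are all illustrative assumptions.

```python
import math


def decayed_attention(query, keys, ages, decay_rate=0.1):
    """Softmax attention weights over `keys`, down-weighted by age.

    Assumption: exponential decay exp(-decay_rate * age); the paper's
    actual decay function is not stated in the abstract.
    """
    # Dot-product relevance between the query and each key embedding.
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    # Subtracting decay_rate * age from a score multiplies its softmax
    # numerator by exp(-decay_rate * age).
    logits = [s - decay_rate * age for s, age in zip(scores, ages)]
    # Numerically stable softmax.
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Two identical embeddings, one recent (age 0) and one older (age 10).
weights = decayed_attention([1.0, 0.0], [[1.0, 0.0], [1.0, 0.0]], ages=[0.0, 10.0])
```

With identical embeddings, the only difference between the two weights is the decay factor, so the recent item receives the larger weight.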

Title: Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization

Authors: Bowei He, Lihao Yin, Huiling Zhen, Shuqi Liu, Han Wu, Xiaokun Zhang, Mingxuan Yuan, Chen Ma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10618
Pdf URL: https://arxiv.org/pdf/2510.10618
Copy Paste: [[2510.10618]] Preserving LLM Capabilities through Calibration Data Curation: From Analysis to Optimization(https://arxiv.org/abs/2510.10618)
Keywords: language model, llm
Abstract: Post-training compression is a widely used approach to scale down large language models (LLMs) and enable efficient inference. In many compression methods, including pruning and quantization, calibration data plays a vital role by informing weight importance and activation dynamic ranges. However, how calibration data affects LLM capabilities after compression is less explored. The few existing works, while recognizing the significance of this question, investigate only the degradation of language modeling or commonsense reasoning performance, and only from limited angles such as data sources or sample amounts. More systematic research is still needed to examine the impact on different LLM capabilities in terms of the compositional properties and domain correspondence of calibration data. In this work, we aim to bridge this gap and further analyze the underlying influencing mechanisms from the activation pattern perspective. In particular, we explore the impact of calibration data on high-level complex reasoning capabilities, such as math problem solving and code generation. Delving into the underlying mechanism, we find that representativeness and diversity in activation space more fundamentally determine the quality of calibration data. Finally, we propose a calibration data curation framework based on these observations and analyses, enhancing the performance of existing post-training compression methods in preserving critical LLM capabilities. Our code is available at this https URL.

Title: AGENTIQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation

Authors: Omid Reza Heidari, Siobhan Reid, Yassine Yaakoubi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10661
Pdf URL: https://arxiv.org/pdf/2510.10661
Copy Paste: [[2510.10661]] AGENTIQL: An Agent-Inspired Multi-Expert Framework for Text-to-SQL Generation(https://arxiv.org/abs/2510.10661)
Keywords: gpt, llm, agent
Abstract: LLMs have advanced text-to-SQL generation, yet monolithic architectures struggle with complex reasoning and schema diversity. We propose AGENTIQL, an agent-inspired multi-expert framework that combines a reasoning agent for question decomposition, a coding agent for sub-query generation, and a refinement step for column selection. An adaptive router further balances efficiency and accuracy by selecting between our modular pipeline and a baseline parser. Several steps in the pipeline can be executed in parallel, making the framework scalable to larger workloads. Evaluated on the Spider benchmark, AGENTIQL improves execution accuracy and interpretability and achieves up to 86.07% EX with 14B models using the Planner&Executor merging strategy. The attained performance is contingent upon the efficacy of the routing mechanism, thereby narrowing the gap to GPT-4-based SOTA (89.65% EX) while using much smaller open-source LLMs. Beyond accuracy, AGENTIQL enhances transparency by exposing intermediate reasoning steps, offering a robust, scalable, and interpretable approach to semantic parsing.

Title: BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions

Authors: Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang, Yuxuan Zhou, Jiabing Yang, Ping Nie, Yan Huang, Wenhu Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10666
Pdf URL: https://arxiv.org/pdf/2510.10666
Copy Paste: [[2510.10666]] BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions(https://arxiv.org/abs/2510.10666)
Keywords: llm, agent
Abstract: Efficiently solving real-world problems with LLMs increasingly hinges on their ability to interact with dynamic web environments and autonomously acquire external information. While recent research like Search-R1 and WebDancer demonstrates strong performance in solving web tasks, they heavily rely on additional tools to convert the interactive web environment into static text content. This is in contrast to human browsing behaviors, which involve diverse interactions with the browser, such as scrolling, clicking, and typing. In this paper, we propose BrowserAgent, a more interactive agent that solves complex tasks through human-inspired browser actions. BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We adopt a two-stage training (Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT)) to improve the model's generalization abilities. Despite using significantly less training data than Search-R1, BrowserAgent achieves more competitive results across different Open-QA tasks. Additionally, we introduce an explicit memory mechanism to store key conclusions across steps, further enhancing the model's reasoning capabilities for long-horizon tasks. Notably, BrowserAgent-7B can achieve around 20% improvement over Search-R1 on multi-hop QA tasks like HotpotQA, 2Wiki, and Bamboogle. These results indicate that BrowserAgent can serve as a more advanced framework for more interactive and scalable web agents.

Title: Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data

Authors: Zhuowei Chen, Bowei Zhang, Nankai Lin, Tian Hou, Lianxi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10677
Pdf URL: https://arxiv.org/pdf/2510.10677
Copy Paste: [[2510.10677]] Unlocking LLM Safeguards for Low-Resource Languages via Reasoning and Alignment with Minimal Training Data(https://arxiv.org/abs/2510.10677)
Keywords: llm
Abstract: Recent advances in LLMs have enhanced AI capabilities, but also increased the risk posed by malicious requests, highlighting the need for effective LLM safeguards to detect such queries. Existing approaches largely rely on classifier-based methods that lack interpretability and perform poorly on low-resource languages. To address these limitations, we propose ConsistentGuard, a novel reasoning-based multilingual safeguard, which enhances explainability via reasoning and boosts knowledge transfer between languages through alignment. With only 1,000 training samples, our method demonstrates superior performance on three datasets across six languages, outperforming larger models trained with significantly more data, and exhibits strong interpretability and generalization ability. We also contribute a multilingual benchmark extension and release our codes to support future research.

Title: RePro: Training Language Models to Faithfully Recycle the Web for Pretraining

Authors: Zichun Yu, Chenyan Xiong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10681
Pdf URL: https://arxiv.org/pdf/2510.10681
Copy Paste: [[2510.10681]] RePro: Training Language Models to Faithfully Recycle the Web for Pretraining(https://arxiv.org/abs/2510.10681)
Keywords: language model, llm, prompt
Abstract: High-quality pretraining data is the fossil fuel of large language models (LLMs), yet its reserves are running low for frontier models. In this paper, we introduce RePro, a novel web recycling method that trains a relatively small LM with reinforcement learning to generate effective and faithful rephrasings of pretraining data. Specifically, we design one quality reward and three faithfulness rewards, optimizing the LM rephraser to convert organic data into high-quality rephrasings while maintaining its core semantics and structure. In our experiment, we train a 4B rephraser to recycle 72B tokens sampled from DCLM-RefinedWeb. Pretraining results on 400M and 1.4B models demonstrate that RePro delivers 4.7%-14.0% relative accuracy gains over organic-only baseline on 22 downstream tasks. RePro also outperforms ReWire, the state-of-the-art web recycling method that prompts a 70B rephraser, as well as the organic baseline with a 4x larger data pool. Experiments with different amounts of recycled data highlight that RePro improves organic data efficiency by 2-3x. Individual and distributional analyses validate that RePro preserves more critical information and faithfully reflects the characteristics of organic data compared to prompting-based methods. Together, these results show that RePro provides an efficient and controllable path to effectively harness the fossil fuel of LLM pretraining. We open-source our code, rephraser, and recycled data at this https URL.

Title: Sarcasm Detection Using Deep Convolutional Neural Networks: A Modular Deep Learning Framework

Authors: Manas Zambre, Sarika Bobade (Supervisor)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10729
Pdf URL: https://arxiv.org/pdf/2510.10729
Copy Paste: [[2510.10729]] Sarcasm Detection Using Deep Convolutional Neural Networks: A Modular Deep Learning Framework(https://arxiv.org/abs/2510.10729)
Keywords: chat
Abstract: Sarcasm is a nuanced and often misinterpreted form of communication, especially in text, where tone and body language are absent. This paper proposes a modular deep learning framework for sarcasm detection, leveraging Deep Convolutional Neural Networks (DCNNs) and contextual models such as BERT to analyze linguistic, emotional, and contextual cues. The system integrates sentiment analysis, contextual embeddings, linguistic feature extraction, and emotion detection through a multi-layer architecture. While the model is in the conceptual stage, it demonstrates feasibility for real-world applications such as chatbots and social media analysis.

Title: Large Language Models for Full-Text Methods Assessment: A Case Study on Mediation Analysis

Authors: Wenqing Zhang, Trang Nguyen, Elizabeth A. Stuart, Yiqun T. Chen
Subjects: cs.CL, stat.AP
Abstract URL: https://arxiv.org/abs/2510.10762
Pdf URL: https://arxiv.org/pdf/2510.10762
Copy Paste: [[2510.10762]] Large Language Models for Full-Text Methods Assessment: A Case Study on Mediation Analysis(https://arxiv.org/abs/2510.10762)
Keywords: language model, llm
Abstract: Systematic reviews are crucial for synthesizing scientific evidence but remain labor-intensive, especially when extracting detailed methodological information. Large language models (LLMs) offer potential for automating methodological assessments, promising to transform evidence synthesis. Here, using causal mediation analysis as a representative methodological domain, we benchmarked state-of-the-art LLMs against expert human reviewers across 180 full-text scientific articles. Model performance closely correlated with human judgments (accuracy correlation 0.71; F1 correlation 0.97), achieving near-human accuracy on straightforward, explicitly stated methodological criteria. However, accuracy sharply declined on complex, inference-intensive assessments, lagging expert reviewers by up to 15%. Errors commonly resulted from superficial linguistic cues -- for instance, models frequently misinterpreted keywords like "longitudinal" or "sensitivity" as automatic evidence of rigorous methodological approaches, leading to systematic misclassifications. Longer documents yielded lower model accuracy, whereas publication year showed no significant effect. Our findings highlight an important pattern for practitioners using LLMs for methods review and synthesis from full texts: current LLMs excel at identifying explicit methodological features but require human oversight for nuanced interpretations. Integrating automated information extraction with targeted expert review thus provides a promising approach to enhance efficiency and methodological rigor in evidence synthesis across diverse scientific fields.

Title: Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG

Authors: Zhichao Wang, Cheng Wan, Dong Nie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10787
Pdf URL: https://arxiv.org/pdf/2510.10787
Copy Paste: [[2510.10787]] Review of Inference-Time Scaling Strategies: Reasoning, Search and RAG(https://arxiv.org/abs/2510.10787)
Keywords: llm
Abstract: The performance gains of LLMs have historically been driven by scaling up model size and training data. However, the rapidly diminishing availability of high-quality training data is introducing a fundamental bottleneck, shifting the focus of research toward inference-time scaling. This paradigm uses additional computation at the time of deployment to substantially improve LLM performance on downstream tasks without costly model re-training. This review systematically surveys the diverse techniques contributing to this new era of inference-time scaling, organizing the rapidly evolving field into two comprehensive perspectives: Output-focused and Input-focused methods. Output-focused techniques encompass complex, multi-step generation strategies, including reasoning (e.g., CoT, ToT, ReAct), various search and decoding methods (e.g., MCTS, beam search), training for long CoT (e.g., RLVR, GRPO), and model ensemble methods. Input-focused techniques are primarily categorized by few-shot and RAG, with RAG as the central focus. The RAG section is further detailed through a structured examination of query expansion, data, retrieval and reranker, LLM generation methods, and multi-modal RAG.

Title: Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-based Structures

Authors: Mihir Gupte, Paolo Giusto, Ramesh S
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2510.10806
Pdf URL: https://arxiv.org/pdf/2510.10806
Copy Paste: [[2510.10806]] Is Implicit Knowledge Enough for LLMs? A RAG Approach for Tree-based Structures(https://arxiv.org/abs/2510.10806)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Large Language Models (LLMs) are adept at generating responses based on information within their context. While this ability is useful for interacting with structured data like code files, another popular method, Retrieval-Augmented Generation (RAG), retrieves relevant documents to augment the model's in-context learning. However, it is not well-explored how to best represent this retrieved knowledge for generating responses on structured data, particularly hierarchical structures like trees. In this work, we propose a novel bottom-up method to linearize knowledge from tree-like structures (like a GitHub repository) by generating implicit, aggregated summaries at each hierarchical level. This approach enables the knowledge to be stored in a knowledge base and used directly with RAG. We then compare our method to using RAG on raw, unstructured code, evaluating the accuracy and quality of the generated responses. Our results show that while response quality is comparable across both methods, our approach generates over 68% fewer documents in the retriever, a significant gain in efficiency. This finding suggests that leveraging implicit, linearized knowledge may be a highly effective and scalable strategy for handling complex, hierarchical data structures.
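Example: the bottom-up linearization described above can be sketched as a post-order traversal that emits one aggregated summary document per directory. This is only an illustration under stated assumptions: `summarize` is a placeholder for an LLM call, the nested-dict tree format is hypothetical, and indexing one document per internal node (rather than per file) is an assumed design, not the authors' confirmed procedure.

```python
def summarize(text, limit=60):
    # Stand-in for an LLM summarizer (assumption): collapse whitespace, truncate.
    flat = " ".join(text.split())
    return flat if len(flat) <= limit else flat[: limit - 3] + "..."


def linearize(tree, path=""):
    """Return (summary, documents) for a nested-dict tree.

    Leaves are strings (file contents); internal nodes are dicts (directories).
    Each directory yields one document aggregating its children's summaries.
    """
    if isinstance(tree, str):
        # Leaves contribute a summary to their parent, not a document.
        return summarize(tree), []
    docs, child_summaries = [], []
    for name in sorted(tree):
        child_path = f"{path}/{name}" if path else name
        child_summary, child_docs = linearize(tree[name], child_path)
        docs.extend(child_docs)  # children are emitted before their parent
        child_summaries.append(f"{name}: {child_summary}")
    summary = summarize("; ".join(child_summaries), limit=200)
    docs.append({"path": path or ".", "text": summary})
    return summary, docs


repo = {
    "src": {"parse.py": "def parse(s): ...", "emit.py": "def emit(ast): ..."},
    "README.md": "Toy compiler.",
}
_, documents = linearize(repo)
# Three files in two directories produce two retriever documents: "src" and ".".
```

Because summaries are aggregated upward, the knowledge base holds one entry per level of the hierarchy, which is one way a retriever could end up with fewer documents than raw per-file indexing.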

Title: DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models

Authors: Kaixuan Ren, Preslav Nakov, Usman Naseem
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10846
Pdf URL: https://arxiv.org/pdf/2510.10846
Copy Paste: [[2510.10846]] DUAL-Bench: Measuring Over-Refusal and Robustness in Vision-Language Models(https://arxiv.org/abs/2510.10846)
Keywords: language model, gpt
Abstract: As vision-language models become increasingly capable, maintaining a balance between safety and usefulness remains a central challenge. Safety mechanisms, while essential, can backfire, causing over-refusal, where models decline benign requests out of excessive caution. Yet, no existing benchmark has systematically addressed over-refusal in the visual modality. This setting introduces unique challenges, such as dual-use cases where an instruction is harmless, but the accompanying image contains harmful content. Models frequently fail in such scenarios, either refusing too conservatively or completing tasks unsafely, which highlights the need for more fine-grained alignment. The ideal behavior is safe completion, i.e., fulfilling the benign parts of a request while explicitly warning about any potentially harmful elements. To address this, we present DUAL-Bench, the first multimodal benchmark focused on over-refusal and safe completion in VLMs. We evaluated 18 VLMs across 12 hazard categories, with a focus on their robustness under semantics-preserving visual perturbations. The results reveal substantial room for improvement: GPT-5-Nano achieves 12.9% safe completion, GPT-5 models average 7.9%, and Qwen models only 3.9%. We hope that DUAL-Bench will foster the development of more nuanced alignment strategies that ensure models remain both safe and useful in complex multimodal settings.

Title: Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks

Authors: Jiajing Guo, Kenil Patel, Jorge Piazentin Ono, Wenbin He, Liu Ren
Subjects: cs.CL, cs.DB
Abstract URL: https://arxiv.org/abs/2510.10885
Pdf URL: https://arxiv.org/pdf/2510.10885
Copy Paste: [[2510.10885]] Rethinking Agentic Workflows: Evaluating Inference-Based Test-Time Scaling Strategies in Text2SQL Tasks(https://arxiv.org/abs/2510.10885)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) are increasingly powering Text-to-SQL (Text2SQL) systems, enabling non-expert users to query industrial databases using natural language. While test-time scaling strategies have shown promise in LLM-based solutions, their effectiveness in real-world applications, especially with the latest reasoning models, remains uncertain. In this work, we benchmark six lightweight, industry-oriented test-time scaling strategies and four LLMs, including two reasoning models, evaluating their performance on the BIRD Mini-Dev benchmark. Beyond standard accuracy metrics, we also report inference latency and token consumption, providing insights relevant for practical system deployment. Our findings reveal that Divide-and-Conquer prompting and few-shot demonstrations consistently enhance performance for both general-purpose and reasoning-focused LLMs. However, introducing additional workflow steps yields mixed results, and base model selection plays a critical role. This work sheds light on the practical trade-offs between accuracy, efficiency, and complexity when deploying Text2SQL systems.

Title: LLM$\times$MapReduce-V3: Enabling Interactive In-Depth Survey Generation through a MCP-Driven Hierarchically Modular Agent System

Authors: Yu Chao, Siyu Lin, Xiaorong Wang, Zhu Zhang, Zihan Zhou, Haoyu Wang, Shuo Wang, Jie Zhou, Zhiyuan Liu, Maosong Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10890
Pdf URL: https://arxiv.org/pdf/2510.10890
Copy Paste: [[2510.10890]] LLM$\times$MapReduce-V3: Enabling Interactive In-Depth Survey Generation through a MCP-Driven Hierarchically Modular Agent System(https://arxiv.org/abs/2510.10890)
Keywords: llm, agent
Abstract: We introduce LLM x MapReduce-V3, a hierarchically modular agent system designed for long-form survey generation. Building on the prior work, LLM x MapReduce-V2, this version incorporates a multi-agent architecture where individual functional components, such as skeleton initialization, digest construction, and skeleton refinement, are implemented as independent model-context-protocol (MCP) servers. These atomic servers can be aggregated into higher-level servers, creating a hierarchically structured system. A high-level planner agent dynamically orchestrates the workflow by selecting appropriate modules based on their MCP tool descriptions and the execution history. This modular decomposition facilitates human-in-the-loop intervention, affording users greater control and customization over the research process. Through a multi-turn interaction, the system precisely captures the intended research perspectives to generate a comprehensive skeleton, which is then developed into an in-depth survey. Human evaluations demonstrate that our system surpasses representative baselines in both content depth and length, highlighting the strength of MCP-based modular planning.

Title: ADVICE: Answer-Dependent Verbalized Confidence Estimation

Authors: Ki Jung Seo, Sehun Lim, Taeuk Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10913
Pdf URL: https://arxiv.org/pdf/2510.10913
Copy Paste: [[2510.10913]] ADVICE: Answer-Dependent Verbalized Confidence Estimation(https://arxiv.org/abs/2510.10913)
Keywords: language model, llm
Abstract: Recent progress in large language models (LLMs) has enabled them to express their confidence in natural language, enhancing transparency and reliability. However, this verbalized confidence is often too high, and the cause of this overconfidence remains poorly understood. In this work, we conduct a detailed analysis of the dynamics underlying verbalized confidence and identify answer-independence as a key factor, defined as the model's failure to condition confidence on its own answer. To address this, we propose ADVICE (Answer-Dependent Verbalized Confidence Estimation), a fine-tuning framework that facilitates answer-grounded confidence estimation. Extensive experiments show that ADVICE substantially improves confidence calibration while preserving task performance. Further analyses confirm that ADVICE strengthens answer-groundedness, leading to more balanced and well-calibrated confidence distributions. Our findings shed light on the origin of overconfidence and establish a framework for more trustworthy confidence verbalization.

Title: Evaluating Language Models' Evaluations of Games

Authors: Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10930
Pdf URL: https://arxiv.org/pdf/2510.10930
Copy Paste: [[2510.10930]] Evaluating Language Models' Evaluations of Games(https://arxiv.org/abs/2510.10930)
Keywords: language model, agent
Abstract: Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems have primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.

Title: KOTOX: A Korean Toxic Dataset for Deobfuscation and Detoxification

Authors: Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10961
Pdf URL: https://arxiv.org/pdf/2510.10961
Copy Paste: [[2510.10961]] KOTOX: A Korean Toxic Dataset for Deobfuscation and Detoxification(https://arxiv.org/abs/2510.10961)
Keywords: language model, llm
Abstract: Toxic content has become an increasingly critical social issue with the rapid expansion of online communication. While numerous studies have explored methods for detecting and detoxifying such content, most have focused primarily on English, leaving low-resource languages underrepresented. Consequently, Large Language Models (LLMs) often struggle to identify and neutralize toxic expressions in these languages. This challenge becomes even more pronounced when users employ obfuscation techniques to evade detection systems. To address this issue, we propose KOTOX: Korean Toxic Dataset, a dataset for deobfuscation and detoxification. We categorize various obfuscation approaches based on linguistic characteristics of Korean and define a set of transformation rules grounded in real-world examples. Using these rules, we construct three dataset versions (easy, normal, and hard) representing different levels of obfuscation difficulty. This is the first dataset that simultaneously supports deobfuscation and detoxification for the Korean language. We expect it to facilitate better understanding and mitigation of obfuscated toxic content by LLMs for low-resource languages. Our code and data are available at this https URL.

Title: Judge Before Answer: Can MLLM Discern the False Premise in Question?

Authors: Jidong Li, Lingyong Fang, Haodong Zhao, Sufeng Duan, Gongshen Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10965
Pdf URL: https://arxiv.org/pdf/2510.10965
Copy Paste: [[2510.10965]] Judge Before Answer: Can MLLM Discern the False Premise in Question?(https://arxiv.org/abs/2510.10965)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) have witnessed astonishing advancements in recent years. Despite these successes, MLLMs remain vulnerable to false premise problems. However, existing benchmarks targeting this issue are limited in scope: they often lack fine-grained categorization, exhibit insufficient coverage, and thus fail to provide a rigorous evaluation of the ability of models to recognize false premises. To bridge this gap, we introduce a fully automated pipeline for constructing a comprehensive benchmark of false premise questions. Our method systematically categorizes the premises into three main types and thirteen subtypes according to the abilities required to identify the premises, resulting in the JBA benchmark. Results show that current MLLMs still struggle with false premise recognition. Building upon this benchmark, we further propose a recognition enhancement framework tailored to strengthen the robustness of MLLMs to detect false premises. Extensive experiments demonstrate that models trained with our framework achieve significant improvements in false premise recognition.

Title: Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning

Authors: Zhiwen Ruan, Yixia Li, He Zhu, Yun Chen, Peng Li, Yang Liu, Guanhua Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.10974
Pdf URL: https://arxiv.org/pdf/2510.10974
Copy Paste: [[2510.10974]] Enhancing Large Language Model Reasoning via Selective Critical Token Fine-Tuning(https://arxiv.org/abs/2510.10974)
Keywords: language model, llm
Abstract: Large language models (LLMs) primarily rely on supervised fine-tuning (SFT) as a key method to adapt pre-trained models to domain-specific tasks such as mathematical reasoning. However, standard SFT uniformly penalizes all tokens, neglecting that only a small subset of critical tokens determines reasoning correctness. This uniform supervision often causes reduced output diversity and limited generalization. We propose Critical Token Fine-tuning (CFT), a simple yet effective approach that updates only tokens identified as functionally indispensable via counterfactual perturbations. By focusing gradient signals on these decisive reasoning steps while preserving the diversity of non-critical tokens, CFT can enhance both generation and diversity. Extensive experiments on five models across three families (Qwen, OLMo, LLaMA) and eleven mathematical reasoning benchmarks show that CFT, despite fine-tuning on less than 12% of tokens, consistently outperforms standard SFT. Moreover, CFT enables test-time scaling through improved sampling diversity and provides a stronger initialization for reinforcement learning, sustaining performance gains in later training stages while maintaining higher entropy for better exploration. These results highlight CFT as a practical and general framework for efficient and robust LLM fine-tuning.
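
The selective-supervision idea in the CFT abstract above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not describe the perturbation procedure, so the helper names `critical_mask` and `masked_nll`, the toy `checker`, and the per-token probabilities are all hypothetical. The sketch only shows the general pattern of (1) flagging a token as critical when a counterfactual swap breaks correctness and (2) averaging the loss over the flagged tokens only.

```python
import math
from typing import Callable, List, Sequence


def critical_mask(
    tokens: Sequence[str],
    alternatives: Sequence[str],
    still_correct: Callable[[List[str]], bool],
) -> List[bool]:
    """Mark position i as critical if swapping in alternatives[i] breaks the answer."""
    mask = []
    for i, alt in enumerate(alternatives):
        perturbed = list(tokens)
        perturbed[i] = alt  # counterfactual perturbation at a single position
        mask.append(not still_correct(perturbed))
    return mask


def masked_nll(token_probs: Sequence[float], mask: Sequence[bool]) -> float:
    """Mean negative log-likelihood over critical positions only."""
    losses = [-math.log(p) for p, keep in zip(token_probs, mask) if keep]
    return sum(losses) / len(losses) if losses else 0.0


# Toy data (assumed): only the operator and the result decide correctness here.
tokens = ["2", "+", "3", "=", "5"]
alternatives = ["two", "-", "three", "is", "6"]
checker = lambda toks: toks[1] == "+" and toks[4] == "5"

mask = critical_mask(tokens, alternatives, checker)
loss = masked_nll([0.9, 0.5, 0.9, 0.9, 0.25], mask)
```

In this toy setup only two of the five positions contribute to the loss, which mirrors the abstract's point that a small subset of tokens carries the supervision signal.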

Title: DeepResearchGuard: Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety

Authors: Wei-Chieh Huang, Henry Peng Zou, Yaozu Wu, Dongyuan Li, Yankai Chen, Weizhi Zhang, Yangning Li, Angelo Zangari, Jizhou Guo, Chunyu Miao, Liancheng Fang, Langzhou He, Renhe Jiang, Philip S. Yu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.10994
Pdf URL: https://arxiv.org/pdf/2510.10994
Copy Paste: [[2510.10994]] DeepResearchGuard: Deep Research with Open-Domain Evaluation and Multi-Stage Guardrails for Safety(https://arxiv.org/abs/2510.10994)
Keywords: gpt, llm
Abstract: Deep research frameworks have shown promising capabilities in synthesizing comprehensive reports from web sources. While deep research possesses significant potential to address complex issues through planning and research cycles, existing frameworks lack sufficient evaluation procedures and stage-specific protections. They typically treat evaluation as exact match accuracy of question-answering, but overlook crucial aspects of report quality such as credibility, coherence, breadth, depth, and safety. This oversight may result in hazardous or malicious sources being integrated into the final report. To address these issues, we introduce DEEPRESEARCHGUARD, a comprehensive framework featuring four-stage safeguards with open-domain evaluation of references and reports. We assess performance across multiple metrics, e.g., defense success rate and over-refusal rate, and five key report dimensions. In the absence of a suitable safety benchmark, we introduce DRSAFEBENCH, a stage-wise benchmark for deep research safety. Our evaluation spans diverse state-of-the-art LLMs, including GPT-4o, Gemini-2.5-flash, DeepSeek-v3, and o4-mini. DEEPRESEARCHGUARD achieves an average defense success rate improvement of 18.16% while reducing over-refusal rate by 6%. The input guard provides the most substantial early-stage protection by filtering out obvious risks, while the plan and research guards enhance citation discipline and source credibility. Through extensive experiments, we show that DEEPRESEARCHGUARD enables comprehensive open-domain evaluation and stage-aware defenses that effectively block harmful content propagation, while systematically improving report quality without excessive over-refusal rates. The code can be found via this https URL.

Title: ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios

Authors: Mahika Phutane, Hayoung Jung, Matthew Kim, Tanushree Mitra, Aditya Vashistha
Subjects: cs.CL, cs.AI, cs.CY, cs.HC, cs.LG
Abstract URL: https://arxiv.org/abs/2510.10998
Pdf URL: https://arxiv.org/pdf/2510.10998
Copy Paste: [[2510.10998]] ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios(https://arxiv.org/abs/2510.10998)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly under scrutiny for perpetuating identity-based discrimination in high-stakes domains such as hiring, particularly against people with disabilities (PwD). However, existing research remains largely Western-centric, overlooking how intersecting forms of marginalization--such as gender and caste--shape experiences of PwD in the Global South. We conduct a comprehensive audit of six LLMs across 2,820 hiring scenarios spanning diverse disability, gender, nationality, and caste profiles. To capture subtle intersectional harms and biases, we introduce ABLEIST (Ableism, Inspiration, Superhumanization, and Tokenism), a set of five ableism-specific and three intersectional harm metrics grounded in disability studies literature. Our results reveal significant increases in ABLEIST harms towards disabled candidates--harms that many state-of-the-art models failed to detect. These harms were further amplified by sharp increases in intersectional harms (e.g., Tokenism) for gender and caste-marginalized disabled candidates, highlighting critical blind spots in current safety tools and the need for intersectional safety evaluations of frontier models in high-stakes domains like hiring.

Title: DND: Boosting Large Language Models with Dynamic Nested Depth

Authors: Tieyuan Chen, Xiaodong Chen, Haoxing Chen, Zhenzhong Lan, Weiyao Lin, Jianguo Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11001
Pdf URL: https://arxiv.org/pdf/2510.11001
Copy Paste: [[2510.11001]] DND: Boosting Large Language Models with Dynamic Nested Depth(https://arxiv.org/abs/2510.11001)
Keywords: language model, llm
Abstract: We introduce Dynamic Nested Depth (DND), a novel method that improves performance for off-the-shelf LLMs by selecting critical tokens to reprocess in a nested depth manner. Specifically, at the end of the given transformer layer, DND identifies more critical tokens with a router and feeds them back for an extra round of processing, effectively "reviewing" difficult tokens while avoiding redundant computation for easier ones. The dynamic selection mechanism is tailored for precise control via two novel strategies: a router controlling loss to enhance token selection distinguishability, and a threshold control scheme to ensure selection stability. We demonstrate the effectiveness of DND by directly integrating it into pre-trained dense and MoE models during a post-training phase. On diverse benchmarks, this approach boosts the performance of the dense Qwen3-1.7B by 1.88% and the MoE Qwen3-30B-A3B by 0.87%, all with a minimal increase in parameters and computation.
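
The routing step described in the DND abstract above can be pictured with a deliberately simplified sketch. Everything here is an assumption made for illustration: hidden states are plain floats, the layer acts on each token independently (a real transformer layer mixes tokens through attention), and `nested_depth_pass`, `layer`, `router`, and `threshold` are hypothetical names, not the paper's API.

```python
from typing import Callable, List, Sequence, Tuple


def nested_depth_pass(
    hidden: Sequence[float],
    layer: Callable[[float], float],
    router: Callable[[float], float],
    threshold: float,
) -> Tuple[List[float], List[bool]]:
    """Run one layer, then re-run it only on tokens whose router score exceeds the threshold."""
    first_pass = [layer(h) for h in hidden]
    selected = [router(h) > threshold for h in first_pass]
    # Selected ("critical") tokens get an extra round; the rest pass through unchanged.
    output = [layer(h) if s else h for h, s in zip(first_pass, selected)]
    return output, selected


# Toy layer and router (assumed): the layer adds 1, the router scales the state.
out, selected = nested_depth_pass(
    hidden=[0.0, 2.0, -0.5],
    layer=lambda h: h + 1.0,
    router=lambda h: h / 4.0,
    threshold=0.5,
)
```

Only the second token crosses the threshold in this toy run, so it alone receives the second pass; the extra compute scales with the number of selected tokens rather than with sequence length.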

Title: LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models

Authors: Yiwei Liu, Yucheng Li, Xiao Li, Gong Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11031
Pdf URL: https://arxiv.org/pdf/2510.11031
Copy Paste: [[2510.11031]] LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models(https://arxiv.org/abs/2510.11031)
Keywords: language model, llm
Abstract: Joint logical-numerical reasoning remains a major challenge for language models, yet existing datasets rely on fixed rule sets and offer limited control over task complexity, constraining their generalizability for evaluation and training. We present LogiNumSynth, a flexible natural language problem synthesizer that synthesizes tasks requiring proficiency in joint logical reasoning (e.g., rule-based reasoning) and numerical reasoning (e.g., arithmetic computation). LogiNumSynth supports fine-grained control over reasoning world richness, logical reasoning depth, and the complexity of numerical computations, enabling flexible data synthesis across difficulty levels. We demonstrate three key contributions: (1) Synthesizer -- synthesizing fully controllable joint reasoning tasks over natural language; (2) Evaluation & Process Analysis -- evaluating both process accuracy and answer accuracy; (3) Targeted Training -- using synthesized data to enhance LLMs' reasoning performance. Experiments with multiple LLMs highlight persistent weaknesses in logical-numerical reasoning, showing that LogiNumSynth can serve as both a diagnostic tool and a source of targeted supervision for advancing integrated reasoning skills.

Title: Enabling Doctor-Centric Medical AI with LLMs through Workflow-Aligned Tasks and Benchmarks

Authors: Wenya Xie, Qingying Xiao, Yu Zheng, Xidong Wang, Junying Chen, Ke Ji, Anningzhe Gao, Prayag Tiwari, Xiang Wan, Feng Jiang, Benyou Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11040
Pdf URL: https://arxiv.org/pdf/2510.11040
Copy Paste: [[2510.11040]] Enabling Doctor-Centric Medical AI with LLMs through Workflow-Aligned Tasks and Benchmarks(https://arxiv.org/abs/2510.11040)
Keywords: language model, llm
Abstract: The rise of large language models (LLMs) has transformed healthcare by offering clinical guidance, yet their direct deployment to patients poses safety risks due to limited domain expertise. To mitigate this, we propose repositioning LLMs as clinical assistants that collaborate with experienced physicians rather than interacting with patients directly. We conduct a two-stage inspiration-feedback survey to identify real-world needs in clinical workflows. Guided by this, we construct DoctorFLAN, a large-scale Chinese medical dataset comprising 92,000 Q&A instances across 22 clinical tasks and 27 specialties. To evaluate model performance in doctor-facing applications, we introduce DoctorFLAN-test (550 single-turn Q&A items) and DotaBench (74 multi-turn conversations). Experimental results with over ten popular LLMs demonstrate that DoctorFLAN notably improves the performance of open-source LLMs in medical contexts, facilitating their alignment with physician workflows and complementing existing patient-oriented models. This work contributes a valuable resource and framework for advancing doctor-centered medical LLM development.

Title: Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States

Authors: Qinglin Zhu, Yizhen Yao, Runcong Zhao, Yanzheng Xiang, Amrutha Saseendran, Chen Jin, Philip Alexander Teare, Bin Liang, Yulan He, Lin Gui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11052
Pdf URL: https://arxiv.org/pdf/2510.11052
Copy Paste: [[2510.11052]] Latent Refinement Decoding: Enhancing Diffusion-Based Language Models by Refining Belief States(https://arxiv.org/abs/2510.11052)
Keywords: language model
Abstract: Autoregressive (AR) models remain the standard for natural language generation but still suffer from high latency due to strictly sequential decoding. Recent diffusion-inspired approaches, such as LLaDA and Dream, mitigate this by generating in parallel, yet they suffer from two core limitations: information loss, as predictive distributions for non-finalized tokens are discarded at each step, and premature commitment, where local decisions are made without sufficient global coordination. We introduce Latent Refinement Decoding (LRD), a two-stage framework with Latent Refinement and a Predictive Feedback Loop. The first stage maintains masked positions as distributional mixtures of predicted tokens and the mask embedding, allowing the model to establish more globally consistent beliefs. The second stage progressively finalizes confident tokens while retaining uncertain ones for iterative feedback. KL-divergence dynamics provide a principled and reliable criterion for convergence and early stopping. Experiments across coding (HumanEval +6.3, MBPP +2.6) and reasoning (GSM8K +2.9, MATH500 +3.8) show that LRD improves accuracy while delivering speedups of up to 10.6x, making it a strong and versatile alternative for parallel sequence generation.
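
Two ingredients named in the LRD abstract above, the mixture of predicted-token embeddings with the mask embedding and the KL-based stopping test, can be sketched as follows. The abstract does not give the mixing rule or the exact stopping criterion, so the mixing weight `alpha`, the tolerance `tol`, the direction of the KL term, and all function names are assumptions for illustration only.

```python
import math
from typing import List, Sequence


def mixed_embedding(
    probs: Sequence[float],
    token_embs: Sequence[Sequence[float]],
    mask_emb: Sequence[float],
    alpha: float,
) -> List[float]:
    """Blend the expected token embedding under `probs` with the mask embedding."""
    dim = len(mask_emb)
    expected = [sum(p * e[d] for p, e in zip(probs, token_embs)) for d in range(dim)]
    return [alpha * expected[d] + (1.0 - alpha) * mask_emb[d] for d in range(dim)]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def has_converged(prev: Sequence[float], curr: Sequence[float], tol: float) -> bool:
    """Stop refining a position once its belief barely changes between steps."""
    return kl_divergence(curr, prev) < tol


# Toy two-token vocabulary with one-hot embeddings and a zero mask embedding (assumed).
soft = mixed_embedding([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], alpha=0.5)
stable = has_converged([0.5, 0.5], [0.5, 0.5], tol=1e-6)
still_moving = has_converged([0.9, 0.1], [0.5, 0.5], tol=1e-3)
```

Keeping the soft embedding instead of a hard mask token is what lets the predictive distribution survive between steps, which is the information-loss issue the abstract attributes to earlier diffusion-style decoders.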

Title: Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization

Authors: Junjie Lu, Yuliang Liu, Chaofeng Qu, Wei Shen, Zhouhan Lin, Min Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11104
Pdf URL: https://arxiv.org/pdf/2510.11104
Copy Paste: [[2510.11104]] Enhancing LLM Reasoning via Non-Human-Like Reasoning Path Preference Optimization(https://arxiv.org/abs/2510.11104)
Keywords: llm
Abstract: Current approaches for strengthening LLM reasoning tend to introduce a training bias toward human-like reasoning trajectories. In step-wise preference optimization, in particular, dependence on human or higher-capacity model annotations for intermediate steps limits exploration of alternative, non-human-like reasoning paths and thus constrains achievable performance. Furthermore, through a small-scale pilot study, we observed that in approximately 75% of cases, the model's first erroneous step occurs after the lowest-confidence point. This suggests that guiding the model at its lowest-confidence point before an error provides more accurate supervision than locating the first explicit error. In this paper, we propose Confidence-Guided Reasoning Path Preference Optimization (CGPO), a method that leverages a confidence signal to identify points of maximal uncertainty in the model's reasoning process and applies self-generated, non-human-like reasoning-path guidance to mitigate trajectory drift. Our experiments span diverse models applied to both code and mathematical reasoning tasks. The results show that, with the same amount of training data, our method using data generated by a small model achieves better performance in most cases than approaches using data generated by a strong model or annotated by humans.
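
The step-selection idea in the CGPO abstract above, locating the point of lowest confidence rather than the first explicit error, can be sketched in a few lines. The abstract does not define the confidence signal, so scoring a step by its mean token probability is an assumption, as are the function names and the toy trajectory.

```python
from typing import List, Sequence, Tuple


def step_confidence(token_probs: Sequence[float]) -> float:
    """Assumed confidence signal: mean probability of the tokens in one step."""
    return sum(token_probs) / len(token_probs)


def lowest_confidence_step(step_token_probs: Sequence[Sequence[float]]) -> int:
    """Index of the reasoning step with the lowest confidence."""
    confidences = [step_confidence(p) for p in step_token_probs]
    return min(range(len(confidences)), key=confidences.__getitem__)


def split_at_lowest_confidence(
    steps: Sequence[str], step_token_probs: Sequence[Sequence[float]]
) -> Tuple[List[str], str]:
    """Return the prefix to keep and the pivot step where guidance would be applied."""
    idx = lowest_confidence_step(step_token_probs)
    return list(steps[:idx]), steps[idx]


# Toy trajectory (assumed): the second step is the least confident one.
steps = ["Let x be the unknown.", "Assume x is even.", "So x = 4."]
probs = [[0.9, 0.8], [0.4, 0.6], [0.7, 0.9]]
prefix, pivot = split_at_lowest_confidence(steps, probs)
```

In a preference-optimization setup, the kept prefix would be shared by the alternative continuations compared at the pivot; how those continuations are generated and ranked is outside what the abstract specifies.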

Title: TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code

Authors: Alexander Sternfeld, Andrei Kucharavy, Ljiljana Dolamic
Subjects: cs.CL, cs.CR
Abstract URL: https://arxiv.org/abs/2510.11151
Pdf URL: https://arxiv.org/pdf/2510.11151
Copy Paste: [[2510.11151]] TypePilot: Leveraging the Scala Type System for Secure LLM-generated Code(https://arxiv.org/abs/2510.11151)
Keywords: language model, llm, prompt, agent
Abstract: Large language models (LLMs) have shown remarkable proficiency in code generation tasks across various programming languages. However, their outputs often contain subtle but critical vulnerabilities, posing significant risks when deployed in security-sensitive or mission-critical systems. This paper introduces TypePilot, an agentic AI framework designed to enhance the security and robustness of LLM-generated code by leveraging strongly typed and verifiable languages, using Scala as a representative example. We evaluate the effectiveness of our approach in two settings: formal verification with the Stainless framework and general-purpose secure code generation. Our experiments with leading open-source LLMs reveal that, while direct code generation often fails to enforce safety constraints, as does naive prompting for more secure code, our type-focused agentic pipeline substantially mitigates input validation and injection vulnerabilities. The results demonstrate the potential of structured, type-guided LLM workflows to advance the state of the art in trustworthy automated code generation in high-assurance domains.

Title: Bridging Gaps in Hate Speech Detection: Meta-Collections and Benchmarks for Low-Resource Iberian Languages

Authors: Paloma Piot, José Ramom Pichel Campos, Javier Parapar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11167
Pdf URL: https://arxiv.org/pdf/2510.11167
Copy Paste: [[2510.11167]] Bridging Gaps in Hate Speech Detection: Meta-Collections and Benchmarks for Low-Resource Iberian Languages(https://arxiv.org/abs/2510.11167)
Keywords: language model
Abstract: Hate speech poses a serious threat to social cohesion and individual well-being, particularly on social media, where it spreads rapidly. While research on hate speech detection has progressed, it remains largely focused on English, resulting in limited resources and benchmarks for low-resource languages. Moreover, many of these languages have multiple linguistic varieties, a factor often overlooked in current approaches. At the same time, large language models require substantial amounts of data to perform reliably, a requirement that low-resource languages often cannot meet. In this work, we address these gaps by compiling a meta-collection of hate speech datasets for European Spanish, standardised with unified labels and metadata. This collection is based on a systematic analysis and integration of existing resources, aiming to bridge the data gap and support more consistent and scalable hate speech detection. We extended this collection by translating it into European Portuguese and into a Galician standard that is more convergent with Spanish and another Galician variant that is more convergent with Portuguese, creating aligned multilingual corpora. Using these resources, we establish new benchmarks for hate speech detection in Iberian languages. We evaluate state-of-the-art large language models in zero-shot, few-shot, and fine-tuning settings, providing baseline results for future research. Moreover, we perform a cross-lingual analysis with our target languages. Our findings underscore the importance of multilingual and variety-aware approaches in hate speech detection and offer a foundation for improved benchmarking in underrepresented European languages.
摘要：仇恨言论对社会凝聚力和个人福祉构成严重威胁，尤其是在社交媒体上，仇恨言论迅速传播。尽管仇恨言论检测的研究取得了进展，但它仍然主要集中在英语上，导致低资源语言的资源和基准有限。此外，其中许多语言具有多种语言变体，这是当前方法中经常忽视的一个因素。同时，大型语言模型需要大量数据才能可靠地执行，而资源匮乏的语言通常无法满足这一要求。在这项工作中，我们通过编译欧洲西班牙语仇恨言论数据集的元集合来解决这些差距，并使用统一的标签和元数据进行标准化。该集合基于对现有资源的系统分析和整合，旨在弥合数据差距并支持更加一致和可扩展的仇恨言论检测。我们通过将这个集合翻译成欧洲葡萄牙语和与西班牙语更加趋同的加利西亚语标准以及与葡萄牙语更加趋同的另一种加利西亚语变体来扩展该集合，从而创建一致的多语言语料库。利用这些资源，我们建立了伊比利亚语言仇恨言论检测的新基准。我们在零样本、少样本和微调设置中评估最先进的大型语言模型，为未来的研究提供基线结果。此外，我们还对目标语言进行跨语言分析。我们的研究结果强调了多语言和多样性感知方法在仇恨言论检测中的重要性，并为改进代表性不足的欧洲语言的基准测试奠定了基础。

Title: Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations

Authors: Johannes Moll, Markus Graf, Tristan Lemke, Nicolas Lenhart, Daniel Truhn, Jean-Benoit Delbrouck, Jiazhen Pan, Daniel Rueckert, Lisa C. Adams, Keno K. Bressem
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.11196
Pdf URL: https://arxiv.org/pdf/2510.11196
Copy Paste: [[2510.11196]] Evaluating Reasoning Faithfulness in Medical Vision-Language Models using Multimodal Perturbations(https://arxiv.org/abs/2510.11196)
Keywords: language model, chain-of-thought
Abstract: Vision-language models (VLMs) often produce chain-of-thought (CoT) explanations that sound plausible yet fail to reflect the underlying decision process, undermining trust in high-stakes clinical use. Existing evaluations rarely catch this misalignment, prioritizing answer accuracy or adherence to formats. We present a clinically grounded framework for chest X-ray visual question answering (VQA) that probes CoT faithfulness via controlled text and image modifications across three axes: clinical fidelity, causal attribution, and confidence calibration. In a reader study (n=4), evaluator-radiologist correlations fall within the observed inter-radiologist range for all axes, with strong alignment for attribution (Kendall's $\tau_b=0.670$), moderate alignment for fidelity ($\tau_b=0.387$), and weak alignment for confidence tone ($\tau_b=0.091$), which we report with caution. Benchmarking six VLMs shows that answer accuracy and explanation quality are decoupled, acknowledging injected cues does not ensure grounding, and text cues shift explanations more than visual cues. While some open-source models match final answer accuracy, proprietary models score higher on attribution (25.0% vs. 1.4%) and often on fidelity (36.1% vs. 31.7%), highlighting deployment risks and the need to evaluate beyond final answer accuracy.
摘要：视觉语言模型 (VLM) 通常会产生听起来似乎合理的思维链 (CoT) 解释，但这些解释无法反映潜在的决策过程，从而破坏了对高风险临床使用的信任。现有的评估很少发现这种不一致，而是优先考虑答案的准确性或格式的遵守。我们提出了一个基于临床的胸部 X 射线视觉问答 (VQA) 框架，通过跨三个轴的受控文本和图像修改来探测 CoT 忠实度：临床保真度、因果归因和置信度校准。在一项读者研究 (n=4) 中，评估者与放射科医生之间的相关性在所有轴上都落在观察到的放射科医生间范围内，其中归因强对齐 (Kendall's $\tau_b=0.670$)、保真度中等对齐 ($\tau_b=0.387$)、置信度语气弱对齐 ($\tau_b=0.091$)，对最后一项结果我们谨慎报告。对六个 VLM 进行基准测试表明，答案准确性和解释质量是脱钩的，承认注入的线索并不能保证解释有据可依，并且文本线索比视觉线索更能改变解释。虽然一些开源模型与最终答案的准确性相匹配，但专有模型在归因方面得分更高（25.0% vs. 1.4%），并且通常在保真度方面得分更高（36.1% vs. 31.7%），这凸显了部署风险以及超越最终答案准确性进行评估的必要性。
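
The abstract above reports evaluator-radiologist agreement as Kendall's $\tau_b$. As a rough illustration of what that statistic measures, the sketch below computes $\tau_b$ (the tie-corrected variant) for two lists of ordinal ratings. The function name and the example ratings are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b for two equal-length rating lists, with tie correction."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both lists: excluded from both denominator terms
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


# Hypothetical ratings from an automatic evaluator and one radiologist.
evaluator = [1, 2, 2, 3]
radiologist = [1, 2, 3, 3]
print(round(kendall_tau_b(evaluator, radiologist), 3))
```

Values near 1 mean the two raters order the items the same way, values near 0 mean the orderings are essentially unrelated, and negative values mean the orderings tend to be reversed.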

Title: Discursive Circuits: How Do Language Models Understand Discourse Relations?

Authors: Yisong Miao, Min-Yen Kan
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.11210
Pdf URL: https://arxiv.org/pdf/2510.11210
Copy Paste: [[2510.11210]] Discursive Circuits: How Do Language Models Understand Discourse Relations?(https://arxiv.org/abs/2510.11210)
Keywords: language model, gpt
Abstract: Which components in transformer language models are responsible for discourse understanding? We hypothesize that sparse computational graphs, termed discursive circuits, control how models process discourse relations. Unlike simpler tasks, discourse relations involve longer spans and complex reasoning. To make circuit discovery feasible, we introduce a task called Completion under Discourse Relation (CuDR), where a model completes a discourse given a specified relation. To support this task, we construct a corpus of minimal contrastive pairs tailored for activation patching in circuit discovery. Experiments show that sparse circuits ($\approx 0.2\%$ of a full GPT-2 model) recover discourse understanding in the English PDTB-based CuDR task. These circuits generalize well to unseen discourse frameworks such as RST and SDRT. Further analysis shows lower layers capture linguistic features such as lexical semantics and coreference, while upper layers encode discourse-level abstractions. Feature utility is consistent across frameworks (e.g., coreference supports Expansion-like relations).
摘要：Transformer 语言模型中的哪些组件负责话语理解？我们假设稀疏计算图（称为话语电路）控制模型处理话语关系的方式。与简单的任务不同，话语关系涉及更长的跨度和复杂的推理。为了使电路发现变得可行，我们引入了一项称为“话语关系下的完成”（CuDR）的任务，其中模型在给定指定关系的情况下完成话语。为了支持这项任务，我们构建了一个最小对比对的语料库，专门用于电路发现中的激活修补。实验表明，稀疏电路（完整 GPT-2 模型的 $\approx 0.2\%$）在基于英语 PDTB 的 CuDR 任务中恢复了话语理解。这些电路可以很好地推广到看不见的话语框架，例如 RST 和 SDRT。进一步的分析表明，较低层捕获词汇语义和共指等语言特征，而较高层则编码话语级抽象。功能实用性在各个框架之间是一致的（例如，共指支持类似扩展的关系）。

Title: Domain-Specific Data Generation Framework for RAG Adaptation

Authors: Chris Xing Tian, Weihao Xie, Zhen Chen, Zhengyuan Yi, Hui Liu, Haoliang Li, Shiqi Wang, Siwei Ma
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11217
Pdf URL: https://arxiv.org/pdf/2510.11217
Copy Paste: [[2510.11217]] Domain-Specific Data Generation Framework for RAG Adaptation(https://arxiv.org/abs/2510.11217)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) combines the language understanding and reasoning power of large language models (LLMs) with external retrieval to enable domain-grounded responses. Effectively adapting RAG systems to domain-specific settings requires specialized, context-rich training data beyond general-purpose question-answering. Here, we propose RAGen, a scalable and modular framework for generating domain-grounded question-answer-context (QAC) triples tailored to diverse RAG adaptation approaches. RAGen produces these QAC triples by identifying key concepts in documents, generating diverse questions guided by Bloom's Taxonomy-inspired principles, and pairing them with precise answers extracted from relevant contexts. RAGen supports multiple RAG adaptation strategies, including the optimization of key components such as the LLM, retriever, and embedding model. Its modular pipeline features semantic chunking, hierarchical concept extraction, and multi-chunk retrieval, along with the introduction of curated distractor contexts to promote robust reasoning. Designed for scalability, RAGen efficiently handles large and evolving document corpora without redundant processing, making it especially suitable for dynamically evolving domains such as scientific research and enterprise knowledge bases.
摘要：检索增强生成 (RAG) 将大型语言模型 (LLM) 的语言理解和推理能力与外部检索相结合，以实现基于领域的响应。有效地使 RAG 系统适应特定领域的设置需要除通用问答之外的专门的、上下文丰富的训练数据。在这里，我们提出了 RAGen，这是一个可扩展的模块化框架，用于生成针对不同 RAG 适应方法的基于领域的问答上下文 (QAC) 三元组。 RAGen 通过识别文档中的关键概念、根据 Bloom 分类学启发原则生成各种问题，并将它们与从相关上下文中提取的精确答案配对来生成这些 QAC 三元组。 RAGen 支持多种 RAG 适应策略，包括 LLM、检索器和嵌入模型等关键组件的优化。其模块化管道具有语义分块、分层概念提取和多块检索的功能，并引入精心策划的干扰上下文以促进鲁棒推理。 RAGen 专为可扩展性而设计，可有效处理大型且不断发展的文档语料库，无需冗余处理，使其特别适合动态发展的领域，例如科学研究和企业知识库。

Title: The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers

Authors: Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11218
Pdf URL: https://arxiv.org/pdf/2510.11218
Copy Paste: [[2510.11218]] The Curious Case of Factual (Mis)Alignment between LLMs' Short- and Long-Form Answers(https://arxiv.org/abs/2510.11218)
Keywords: language model, llm
Abstract: Large language models (LLMs) can correctly answer "When was Einstein born?" yet fail to provide the same date when writing about Einstein's life, revealing a fundamental inconsistency in how models access factual knowledge across task complexities. While models display impressive accuracy on factual question-answering benchmarks, the reliability gap between simple and complex queries remains poorly understood, eroding their trustworthiness. In this work, we introduce Short-Long Form Alignment for Factual Question Answering (SLAQ), a controlled evaluation framework that compares LLMs' answers to the same factual questions asked (a) in isolation (short) vs. (b) integrated into complex queries (long). Looking at 16 LLMs across 600 queries, we find a systematic misalignment of answers to the corresponding short and long queries. We further uncover position-dependent accuracy loss and momentum effects where consecutive correct or incorrect answers create self-reinforcing patterns. Through mechanistic analysis, we find that aligned facts activate overlapping model internals, and that metrics based on mechanistic similarity can predict short-long answer alignment with up to 78% accuracy. Our work establishes factual consistency over query complexity as an important aspect of LLMs' trustworthiness and challenges current evaluation practices, which implicitly assume that good performance for simple factual queries implies reliability in more complex knowledge-seeking tasks too.
摘要：大型语言模型（LLM）可以正确回答“爱因斯坦何时出生？”但在撰写爱因斯坦的一生时却未能提供相同的日期，揭示了模型如何跨任务复杂性获取事实知识的根本不一致。虽然模型在事实问答基准上显示出令人印象深刻的准确性，但简单查询和复杂查询之间的可靠性差距仍然知之甚少，从而削弱了它们的可信度。在这项工作中，我们引入了事实问答的短长形式对齐（SLAQ），这是一种受控评估框架，用于比较 LLM 对相同事实问题在（a）单独提问（短）与（b）嵌入复杂查询（长）两种情形下的答案。通过查看 16 个 LLM 的 600 个查询，我们发现相应的短查询和长查询的答案存在系统性偏差。我们进一步揭示了位置相关的准确性损失和动量效应，其中连续的正确或错误答案会创建自我强化模式。通过机制分析，我们发现一致的事实会激活重叠的模型内部结构，并且基于机制相似性的指标可以预测短长答案对齐，准确率高达 78%。我们的工作将查询复杂性上的事实一致性作为 LLM 可信度的一个重要方面，并对当前的评估实践提出了挑战，这些实践隐含地假设简单事实查询的良好性能也意味着更复杂的知识寻求任务的可靠性。

Title: WebRouter: Query-specific Router via Variational Information Bottleneck for Cost-sensitive Web Agent

Authors: Tao Li, Jinlong Hu, Yang Wang, Junfeng Liu, Xuejun Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11221
Pdf URL: https://arxiv.org/pdf/2510.11221
Copy Paste: [[2510.11221]] WebRouter: Query-specific Router via Variational Information Bottleneck for Cost-sensitive Web Agent(https://arxiv.org/abs/2510.11221)
Keywords: gpt, llm, prompt, agent
Abstract: LLM-brained web agents offer powerful capabilities for web automation but face a critical cost-performance trade-off. The challenge is amplified by web agents' inherently complex prompts that include goals, action histories, and environmental states, leading to degraded LLM ensemble performance. To address this, we introduce WebRouter, a novel query-specific router trained from an information-theoretic perspective. Our core contribution is a cost-aware Variational Information Bottleneck (ca-VIB) objective, which learns a compressed representation of the input prompt while explicitly penalizing the expected operational cost. Experiments on five real-world websites from the WebVoyager benchmark show that WebRouter reduces operational costs by a striking 87.8% compared to a GPT-4o baseline, while incurring only a 3.8% accuracy drop.
摘要：以 LLM 为大脑的网络代理为网络自动化提供了强大的功能，但面临着关键的成本与性能权衡。网络代理固有的复杂提示（包括目标、行动历史和环境状态）放大了这一挑战，导致 LLM 集成的性能下降。为了解决这个问题，我们引入了 WebRouter，这是一种从信息论角度训练的新型查询特定路由器。我们的核心贡献是成本感知的变分信息瓶颈（ca-VIB）目标，它学习输入提示的压缩表示，同时明确惩罚预期的运营成本。在 WebVoyager 基准测试的五个真实网站上进行的实验表明，与 GPT-4o 基线相比，WebRouter 将运营成本降低了 87.8%，而准确率仅下降了 3.8%。
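
The WebRouter abstract describes its objective only in words: compress the prompt while penalizing expected operational cost. One plausible shape for such a loss, written here as a generic variational information bottleneck with an added cost term, is sketched below. This is an assumption for illustration, not the paper's stated formula: $x$ is the input prompt, $z$ its compressed representation, $m$ a candidate LLM, $m^\ast$ a hypothetical label for the preferred model, $c(m)$ a hypothetical per-call cost, $r(z)$ a fixed prior, and $\beta$, $\lambda$ trade-off weights.

```latex
\mathcal{L}(\theta,\phi)
  = \underbrace{\mathbb{E}_{(x,m^\ast)}\,\mathbb{E}_{z\sim p_\theta(z\mid x)}
      \bigl[-\log q_\phi(m^\ast\mid z)\bigr]}_{\text{routing accuracy}}
  + \beta\,\underbrace{\mathbb{E}_{x}\,
      \mathrm{KL}\bigl(p_\theta(z\mid x)\,\|\,r(z)\bigr)}_{\text{compression}}
  + \lambda\,\underbrace{\mathbb{E}_{x}\,\mathbb{E}_{z\sim p_\theta(z\mid x)}
      \Bigl[\sum_{m} q_\phi(m\mid z)\,c(m)\Bigr]}_{\text{expected cost}}
```

Under this reading, raising $\lambda$ pushes the router toward cheaper models and raising $\beta$ forces it to discard more of the prompt. The exact terms and weighting used by ca-VIB should be taken from the paper itself.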

Title: A Theorem-Proving-Based Evaluation of Neural Semantic Parsing

Authors: Hayate Funakura, Hyunsoo Kim, Koji Mineshima
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11225
Pdf URL: https://arxiv.org/pdf/2510.11225
Copy Paste: [[2510.11225]] A Theorem-Proving-Based Evaluation of Neural Semantic Parsing(https://arxiv.org/abs/2510.11225)
Keywords: gpt
Abstract: Graph-matching metrics such as Smatch are the de facto standard for evaluating neural semantic parsers, yet they capture surface overlap rather than logical equivalence. We reassess evaluation by pairing graph-matching with automated theorem proving. We compare two approaches to building parsers: supervised fine-tuning (T5-Small/Base) and few-shot in-context learning (GPT-4o/4.1/5), under normalized and unnormalized targets. We evaluate outputs using graph-matching, bidirectional entailment between source and target formulas with a first-order logic theorem prover, and well-formedness. Across settings, we find that models performing well on graph-matching often fail to produce logically equivalent formulas. Normalization reduces incidental target variability, improves well-formedness, and strengthens logical adequacy. Error analysis shows performance degrades with increasing formula complexity and with coordination, prepositional phrases, and passive voice; the dominant failures involve variable binding and indexing, and predicate naming. These findings highlight limits of graph-based metrics for reasoning-oriented applications and motivate logic-sensitive evaluation and training objectives together with simplified, normalized target representations. All code and data for our experiments are publicly available.
摘要：Smatch 等图形匹配指标是评估神经语义解析器的事实上的标准，但它们捕获的是表面重叠而不是逻辑等价。我们通过将图匹配与自动定理证明配对来重新评估评估。我们比较了构建解析器的两种方法：在标准化和非标准化目标下的监督微调（T5-Small/Base）和少量上下文学习（GPT-4o/4.1/5）。我们使用图匹配、源公式和目标公式之间的双向蕴涵以及一阶逻辑定理证明器以及格式良好性来评估输出。在各种设置中，我们发现在图匹配方面表现良好的模型通常无法生成逻辑上等效的公式。标准化减少了偶然的目标变异性，提高了格式良好性，并增强了逻辑充分性。错误分析表明，随着公式复杂性的增加以及协调性、介词短语和被动语态的增加，性能会下降；主要的失败涉及变量绑定和索引以及谓词命名。这些发现突出了基于图形的指标对于面向推理的应用程序的局限性，并激发了逻辑敏感的评估和训练目标以及简化的标准化目标表示。我们实验的所有代码和数据都是公开的。
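
The evaluation idea in the abstract above, bidirectional entailment between a predicted and a gold formula, can be illustrated on a much smaller scale. The sketch below swaps the first-order theorem prover for a brute-force truth-table check over propositional formulas, which is only a stand-in for the technique. The formulas and names are hypothetical examples.

```python
from itertools import product


def entails(premise, conclusion, variables):
    """True if every truth assignment that satisfies `premise` also satisfies `conclusion`."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if premise(env) and not conclusion(env):
            return False
    return True


def logically_equivalent(f, g, variables):
    """Bidirectional entailment: f entails g and g entails f."""
    return entails(f, g, variables) and entails(g, f, variables)


VARS = ["p", "q"]
gold = lambda e: e["q"] if e["p"] else True      # p -> q
predicted = lambda e: (not e["p"]) or e["q"]     # (not p) or q: different surface form
near_miss = lambda e: e["p"] and e["q"]          # shares symbols, different meaning

print(logically_equivalent(gold, predicted, VARS))  # equivalent despite different form
print(logically_equivalent(gold, near_miss, VARS))  # entailment holds in one direction only
```

A surface-overlap metric could score `near_miss` highly because it reuses the same symbols, while the entailment check separates it from the truly equivalent `predicted` formula.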

Title: CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis

Authors: Jinyuan Xu, Tian Lan, Xintao Yu, Xue He, Hezhi Zhang, Ying Wang, Pierre Magistry, Mathieu Valette, Lei Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11233
Pdf URL: https://arxiv.org/pdf/2510.11233
Copy Paste: [[2510.11233]] CNSocialDepress: A Chinese Social Media Dataset for Depression Risk Detection and Structured Analysis(https://arxiv.org/abs/2510.11233)
Keywords: language model
Abstract: Depression is a pressing global public health issue, yet publicly available Chinese-language resources for risk detection remain scarce and are mostly limited to binary classification. To address this limitation, we release CNSocialDepress, a benchmark dataset for depression risk detection from Chinese social media posts. The dataset contains 44,178 texts from 233 users, within which psychological experts annotated 10,306 depression-related segments. CNSocialDepress provides binary risk labels together with structured multi-dimensional psychological attributes, enabling interpretable and fine-grained analysis of depressive signals. Experimental results demonstrate its utility across a wide range of NLP tasks, including structured psychological profiling and fine-tuning of large language models for depression detection. Comprehensive evaluations highlight the dataset's effectiveness and practical value for depression risk identification and psychological analysis, thereby providing insights to mental health applications tailored for Chinese-speaking populations.
摘要：抑郁症是一个紧迫的全球公共卫生问题，但用于风险检测的公开中文资源仍然稀缺，并且大多仅限于二元分类。为了解决这一限制，我们发布了 CNSocialDepress，这是一个从中国社交媒体帖子中检测抑郁症风险的基准数据集。该数据集包含来自 233 位用户的 44,178 条文本，其中心理专家注释了 10,306 个与抑郁症相关的片段。 CNSocialDepress 提供二元风险标签以及结构化多维心理属性，从而能够对抑郁信号进行可解释和细粒度的分析。实验结果证明了它在广泛的 NLP 任务中的实用性，包括结构化心理分析和用于抑郁症检测的大型语言模型的微调。综合评估凸显了该数据集对于抑郁症风险识别和心理分析的有效性和实用价值，从而为针对华语人群量身定制的心理健康应用提供见解。

Title: XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression

Authors: Haoqi Yang, Yao Yao, Zuchao Li, Baoyuan Qi, Guoming Liu, Hai Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11236
Pdf URL: https://arxiv.org/pdf/2510.11236
Copy Paste: [[2510.11236]] XQuant: Achieving Ultra-Low Bit KV Cache Quantization with Cross-Layer Compression(https://arxiv.org/abs/2510.11236)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks. However, their extensive memory requirements, particularly due to KV cache growth during long-text understanding and generation, present significant challenges for deployment in resource-constrained environments. Quantization has emerged as a promising solution to reduce memory consumption while preserving historical information. We propose XQuant, a training-free and plug-and-play framework that achieves ultra-low equivalent bit-width KV cache quantization. XQuant introduces two key innovations: a computationally negligible data-free calibration method and cross-layer KV cache compression, enabling quantization to sub-1.4 bits. Extensive experiments on TruthfulQA and LongBench demonstrate that XQuant outperforms state-of-the-art methods (e.g., KIVI-2bit and AsymKV-1.5bit) by achieving lower bit-width while maintaining superior performance, establishing a better trade-off between memory efficiency and model accuracy.
摘要：大型语言模型 (LLM) 在各种自然语言处理任务中表现出了卓越的能力。然而，它们大量的内存需求，特别是由于长文本理解和生成过程中 KV 缓存的增长，给资源受限环境中的部署带来了重大挑战。量化已成为一种有前途的解决方案，可以在保留历史信息的同时减少内存消耗。我们提出了 XQuant，一种免训练、即插即用的框架，可实现超低等效位宽 KV 缓存量化。 XQuant 引入了两项关键创新：计算量可忽略的无数据校准方法和跨层 KV 缓存压缩，可实现低于 1.4 位的量化。在 TruthfulQA 和 LongBench 上进行的大量实验表明，XQuant 在保持卓越性能的同时实现更低的位宽，从而在内存效率和模型精度之间建立更好的权衡，从而优于最先进的方法（例如 KIVI-2bit 和 AsymKV-1.5bit）。

Title: Attacks by Content: Automated Fact-checking is an AI Security Issue

Authors: Michael Schlichtkrull
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11238
Pdf URL: https://arxiv.org/pdf/2510.11238
Copy Paste: [[2510.11238]] Attacks by Content: Automated Fact-checking is an AI Security Issue(https://arxiv.org/abs/2510.11238)
Keywords: prompt, agent
Abstract: When AI agents retrieve and reason over external documents, adversaries can manipulate the data they receive to subvert their behaviour. Previous research has studied indirect prompt injection, where the attacker injects malicious instructions. We argue that injection of instructions is not necessary to manipulate agents - attackers could instead supply biased, misleading, or false information. We term this an attack by content. Existing defenses, which focus on detecting hidden commands, are ineffective against attacks by content. To defend themselves and their users, agents must critically evaluate retrieved information, corroborating claims with external evidence and evaluating source trustworthiness. We argue that this is analogous to an existing NLP task, automated fact-checking, which we propose to repurpose as a cognitive self-defense tool for agents.
摘要：当人工智能代理检索外部文档并进行推理时，对手可以操纵他们收到的数据来颠覆他们的行为。之前的研究研究了间接提示注入，即攻击者注入恶意指令。我们认为，注入指令并不是操纵代理所必需的——攻击者可以提供有偏见的、误导性的或虚假的信息。我们将此称为内容攻击。现有的防御措施侧重于检测隐藏命令，对于内容攻击无效。为了保护自己和用户，代理必须严格评估检索到的信息，用外部证据证实主张并评估来源的可信度。我们认为这类似于现有的 NLP 任务，即自动事实检查，我们建议将其重新用作代理的认知自卫工具。

Title: Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality

Authors: Jana Jung, Marlene Lutz, Indira Sen, Markus Strohmaier
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11254
Pdf URL: https://arxiv.org/pdf/2510.11254
Copy Paste: [[2510.11254]] Do Psychometric Tests Work for Large Language Models? Evaluation of Tests on Sexism, Racism, and Morality(https://arxiv.org/abs/2510.11254)
Keywords: language model, llm, prompt
Abstract: Psychometric tests are increasingly used to assess psychological constructs in large language models (LLMs). However, it remains unclear whether these tests -- originally developed for humans -- yield meaningful results when applied to LLMs. In this study, we systematically evaluate the reliability and validity of human psychometric tests for three constructs: sexism, racism, and morality. We find moderate reliability across multiple item and prompt variations. Validity is evaluated through both convergent (i.e., testing theory-based inter-test correlations) and ecological approaches (i.e., testing the alignment between test scores and behavior in real-world downstream tasks). Crucially, we find that psychometric test scores do not align with, and in some cases even negatively correlate with, model behavior in downstream tasks, indicating low ecological validity. Our results highlight that systematic evaluation of psychometric tests is essential before interpreting their scores. They also suggest that psychometric tests designed for humans cannot be applied directly to LLMs without adaptation.
摘要：心理测量测试越来越多地用于评估大型语言模型（LLM）中的心理结构。然而，目前尚不清楚这些最初为人类开发的测试在应用于 LLM 时是否会产生有意义的结果。在这项研究中，我们系统地评估了人类心理测试对性别歧视、种族主义和道德这三种概念的可靠性和有效性。我们发现多个项目和提示变化的可靠性中等。通过收敛（即测试基于理论的测试间相关性）和生态方法（即测试测试分数与现实下游任务中的行为之间的一致性）来评估有效性。至关重要的是，我们发现心理测试分数与下游任务中的模型行为不一致，在某些情况下甚至呈负相关，这表明生态有效性较低。我们的结果强调，在解释心理测试的分数之前，对心理测试进行系统评估至关重要。结果还表明，为人类设计的心理测试如果不进行调整就不能直接应用于 LLM。

Title: Towards Real-Time Fake News Detection under Evidence Scarcity

Authors: Guangyu Wei, Ke Han, Yueming Lyu, Yu Luo, Yue Jiang, Caifeng Shan, Nicu Sebe
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11277
Pdf URL: https://arxiv.org/pdf/2510.11277
Copy Paste: [[2510.11277]] Towards Real-Time Fake News Detection under Evidence Scarcity(https://arxiv.org/abs/2510.11277)
Keywords: language model, llm
Abstract: Fake news detection becomes particularly challenging in real-time scenarios, where emerging events often lack sufficient supporting evidence. Existing approaches often rely heavily on external evidence and therefore struggle to generalize under evidence scarcity. To address this issue, we propose Evaluation-Aware Selection of Experts (EASE), a novel framework for real-time fake news detection that dynamically adapts its decision-making process according to the assessed sufficiency of available evidence. EASE introduces a sequential evaluation mechanism comprising three independent perspectives: (1) Evidence-based evaluation, which assesses evidence and incorporates it into decision-making only when the evidence is sufficiently supportive; (2) Reasoning-based evaluation, which leverages the world knowledge of large language models (LLMs) and applies them only when their reliability is adequately established; and (3) Sentiment-based fallback, which integrates sentiment cues when neither evidence nor reasoning is reliable. To enhance the accuracy of evaluation processes, EASE employs instruction tuning with pseudo labels to guide each evaluator in justifying its perspective-specific knowledge through interpretable reasoning. Furthermore, the expert modules integrate the evaluators' justified assessments with the news content to enable evaluation-aware decision-making, thereby enhancing overall detection accuracy. Moreover, we introduce RealTimeNews-25, a new benchmark comprising recent news for evaluating model generalization on emerging news with limited evidence. Extensive experiments demonstrate that EASE not only achieves state-of-the-art performance across multiple benchmarks, but also significantly improves generalization to real-time news. The code and dataset are publicly available.
摘要：在实时场景中，假新闻检测变得尤其具有挑战性，因为新出现的事件往往缺乏足够的支持证据。现有的方法通常严重依赖外部证据，因此在证据稀缺的情况下很难推广。为了解决这个问题，我们提出了评估感知专家选择（EASE），这是一种实时假新闻检测的新颖框架，可以根据评估的可用证据的充分性动态调整其决策过程。 EASE引入了由三个独立视角组成的序贯评估机制：（1）基于证据的评估，只有在证据足够支持的情况下才评估证据并将其纳入决策；（2）基于推理的评估，利用大语言模型（LLM）的世界知识，并仅在其可靠性充分建立时才应用它们；（3）基于情感的回退，当证据和推理都不可靠时，它会整合情感线索。为了提高评估过程的准确性，EASE 采用带有伪标签的指令调整来指导每个评估者通过可解释的推理来证明其特定观点知识的合理性。此外，专家模块将评估者的合理评估与新闻内容相结合，以实现评估感知决策，从而提高整体检测准确性。此外，我们还引入了 RealTimeNews-25，这是一个包含最新新闻的新基准，用于评估证据有限的新兴新闻的模型泛化能力。大量实验表明，EASE 不仅在多个基准测试中实现了最先进的性能，而且还显着提高了对实时新闻的泛化能力。代码和数据集已公开。
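
The sequential mechanism described for EASE (try evidence first, then LLM reasoning, then fall back to sentiment) follows a familiar cascade pattern. The sketch below shows only that control flow. The toy evaluators, their reliability rules, and all names are hypothetical placeholders, not the paper's trained evaluators or expert modules.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Assessment:
    reliable: bool  # did this perspective judge itself trustworthy for the item?
    label: str      # "fake" or "real"


Evaluator = Callable[[str], Assessment]


def sequential_decide(
    news: str,
    stages: List[Tuple[str, Evaluator]],
    fallback: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (perspective used, label) from the first stage that reports itself reliable."""
    for name, evaluate in stages:
        assessment = evaluate(news)
        if assessment.reliable:
            return name, assessment.label
    return "sentiment", fallback(news)


# Toy stand-ins (assumptions): real evaluators would be learned models.
def evidence_eval(news: str) -> Assessment:
    has_evidence = "[evidence:" in news
    return Assessment(has_evidence, "real" if "confirmed" in news else "fake")


def reasoning_eval(news: str) -> Assessment:
    implausible = "cures all diseases" in news
    return Assessment(implausible, "fake")


def sentiment_fallback(news: str) -> str:
    return "fake" if news.count("!") >= 3 else "real"


stages = [("evidence", evidence_eval), ("reasoning", reasoning_eval)]
print(sequential_decide("Bridge reopened [evidence: confirmed by city]", stages, sentiment_fallback))
print(sequential_decide("New pill cures all diseases", stages, sentiment_fallback))
print(sequential_decide("SHOCKING!!! You won't believe this", stages, sentiment_fallback))
```

The ordering encodes a preference for the most grounded signal available, so the weaker sentiment cue is consulted only when both earlier perspectives decline to answer.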

Title: Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

Authors: Nikita Afonin, Nikita Andriyanov, Nikhil Bageshpura, Kyle Liu, Kevin Zhu, Sunishchal Dev, Ashwinee Panda, Alexander Panchenko, Oleg Rogov, Elena Tutubalina, Mikhail Seleznyov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11288
Pdf URL: https://arxiv.org/pdf/2510.11288
Copy Paste: [[2510.11288]] Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs(https://arxiv.org/abs/2510.11288)
Keywords: llm, chain-of-thought
Abstract: Recent work has shown that narrow finetuning can produce broadly misaligned LLMs, a phenomenon termed emergent misalignment (EM). While concerning, these findings were limited to finetuning and activation steering, leaving out in-context learning (ICL). We therefore ask: does EM emerge in ICL? We find that it does: across three datasets, three frontier models produce broadly misaligned responses at rates between 2% and 17% given 64 narrow in-context examples, and up to 58% with 256 examples. We also examine mechanisms of EM by eliciting step-by-step reasoning (while leaving in-context examples unchanged). Manual analysis of the resulting chain-of-thought shows that 67.5% of misaligned traces explicitly rationalize harmful outputs by adopting a reckless or dangerous "persona", echoing prior results on finetuning-induced EM.
摘要：最近的研究表明，狭隘的微调可能会产生广泛的 LLM 错位，这种现象被称为紧急错位 (EM)。虽然令人担忧，但这些发现仅限于微调和激活指导，而忽略了上下文学习（ICL）。因此我们要问：ICL 中是否会出现 EM？我们发现确实如此：在三个数据集中，三个前沿模型在给出 64 个狭窄的上下文示例时产生了 2% 到 17% 之间的广泛错位响应，而在 256 个示例中则高达 58%。我们还通过引出逐步推理来检查 EM 机制（同时保持上下文示例不变）。对由此产生的思想链的手动分析表明，67.5% 的未对齐痕迹通过采用鲁莽或危险的“角色”明确地合理化有害输出，这与之前关于微调引起的 EM 的结果相呼应。

Title: Are Large Language Models Effective Knowledge Graph Constructors?

Authors: Ruirui Chen, Weifeng Jiang, Chengwei Qin, Bo Xiong, Fiona Liausvia, Dongkyu Choi, Boon Kiat Quek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11297
Pdf URL: https://arxiv.org/pdf/2510.11297
Copy Paste: [[2510.11297]] Are Large Language Models Effective Knowledge Graph Constructors?(https://arxiv.org/abs/2510.11297)
Keywords: language model, llm, hallucination
Abstract: Knowledge graphs (KGs) are vital for knowledge-intensive tasks and have shown promise in reducing hallucinations in large language models (LLMs). However, constructing high-quality KGs remains difficult, requiring accurate information extraction and structured representations that support interpretability and downstream utility. Existing LLM-based approaches often focus narrowly on entity and relation extraction, limiting coverage to sentence-level contexts or relying on predefined schemas. We propose a hierarchical extraction framework that organizes information at multiple levels, enabling the creation of semantically rich and well-structured KGs. Using state-of-the-art LLMs, we extract and construct knowledge graphs and evaluate them comprehensively from both structural and semantic perspectives. Our results highlight the strengths and shortcomings of current LLMs in KG construction and identify key challenges for future work. To advance research in this area, we also release a curated dataset of LLM-generated KGs derived from research papers on children's mental well-being. This resource aims to foster more transparent, reliable, and impactful applications in high-stakes domains such as healthcare.
摘要：知识图（KG）对于知识密集型任务至关重要，并且在减少大型语言模型（LLM）中的幻觉方面表现出了希望。然而，构建高质量的知识图谱仍然很困难，需要准确的信息提取和支持可解释性和下游实用性的结构化表示。现有的基于 LLM 的方法通常狭隘地关注实体和关系提取，将覆盖范围限制在句子级上下文或依赖于预定义的模式。我们提出了一个分层提取框架，可以在多个级别组织信息，从而能够创建语义丰富且结构良好的知识图谱。使用最先进的 LLM，我们提取和构建知识图，并从结构和语义的角度对其进行全面评估。我们的结果突出了当前 LLM 在知识图谱构建方面的优势和劣势，并确定了未来工作的主要挑战。为了推进这一领域的研究，我们还发布了一个由 LLM 生成的 KG 的精选数据集，该数据集源自有关儿童心理健康的研究论文。该资源旨在促进医疗保健等高风险领域更加透明、可靠和有影响力的应用程序。

Title: FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks

Authors: Sabrina McCallum, Amit Parekh, Alessandro Suglia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11307
Pdf URL: https://arxiv.org/pdf/2510.11307
Copy Paste: [[2510.11307]] FOSSIL: Harnessing Feedback on Suboptimal Samples for Data-Efficient Generalisation with Imitation Learning for Embodied Vision-and-Language Tasks(https://arxiv.org/abs/2510.11307)
Keywords: agent
Abstract: Current approaches to embodied AI tend to learn policies from expert demonstrations. However, without a mechanism to evaluate the quality of demonstrated actions, they are limited to learning from optimal behaviour, or they risk replicating errors and inefficiencies. While reinforcement learning offers one alternative, the associated exploration typically results in sacrificing data efficiency. This work explores how agents trained with imitation learning can learn robust representations from both optimal and suboptimal demonstrations when given access to constructive language feedback as a means to contextualise different modes of behaviour. We directly provide language feedback embeddings as part of the input sequence into a Transformer-based policy, and optionally complement the traditional next action prediction objective with auxiliary self-supervised learning objectives for feedback prediction. We test our approach on a range of embodied Vision-and-Language tasks in our custom BabyAI-XGen environment and show significant improvements in agents' compositional generalisation abilities and robustness, suggesting that our data-efficient method allows models to successfully convert suboptimal behaviour into learning opportunities. Overall, our results suggest that language feedback is a competitive and intuitive alternative to intermediate scalar rewards for language-specified embodied tasks.
摘要：当前的具体人工智能方法倾向于从专家演示中学习策略。然而，如果没有评估所展示行动质量的机制，他们就只能从最佳行为中学习，否则就会面临重复错误和低效率的风险。虽然强化学习提供了一种替代方案，但相关的探索通常会导致数据效率的牺牲。这项工作探讨了通过模仿学习训练的智能体如何在获得建设性语言反馈作为将不同行为模式置于情境中的手段时，如何从最佳和次优演示中学习稳健的表征。我们直接将语言反馈嵌入作为输入序列的一部分提供到基于 Transformer 的策略中，并可选地通过辅助自监督学习目标来补充传统的下一步动作预测目标以进行反馈预测。我们在自定义 BabyAI-XGen 环境中对一系列具体视觉和语言任务测试了我们的方法，并显示代理的组合泛化能力和鲁棒性有了显着改善，这表明我们的数据高效方法允许模型成功地将次优行为转化为学习机会。总的来说，我们的结果表明，对于语言指定的具体任务，语言反馈是一种有竞争力的、直观的替代中间标量奖励的方法。

Title: Template-Based Text-to-Image Alignment for Language Accessibility: A Study on Visualizing Text Simplifications

Authors: Belkiss Souayed, Sarah Ebling, Yingqiang Gao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11314
Pdf URL: https://arxiv.org/pdf/2510.11314
Copy Paste: [[2510.11314]] Template-Based Text-to-Image Alignment for Language Accessibility: A Study on Visualizing Text Simplifications(https://arxiv.org/abs/2510.11314)
Keywords: language model, prompt
Abstract: Individuals with intellectual disabilities often have difficulties in comprehending complex texts. While many text-to-image models prioritize aesthetics over accessibility, it is not clear how visual illustrations relate to text simplifications (TS) generated from them. This paper presents a structured vision-language model (VLM) prompting framework for generating accessible images from simplified texts. We designed five prompt templates, i.e., Basic Object Focus, Contextual Scene, Educational Layout, Multi-Level Detail, and Grid Layout, each following distinct spatial arrangements while adhering to accessibility constraints such as object count limits, spatial separation, and content restrictions. Using 400 sentence-level simplifications from four established TS datasets (OneStopEnglish, SimPA, Wikipedia, and ASSET), we conducted a two-phase evaluation: Phase 1 assessed prompt template effectiveness with CLIPScores, and Phase 2 involved human annotation of generated images across ten visual styles by four accessibility experts. Results show that the Basic Object Focus prompt template achieved the highest semantic alignment, indicating that visual minimalism enhances language accessibility. Expert evaluation further identified Retro style as the most accessible and Wikipedia as the most effective data source. Inter-annotator agreement varied across dimensions, with Text Simplicity showing strong reliability and Image Quality proving more subjective. Overall, our framework offers practical guidelines for accessible content generation and underscores the importance of structured prompting in AI-generated visual accessibility tools.
摘要：智力障碍人士通常难以理解复杂的文本。虽然许多文本到图像模型优先考虑美观性而不是可访问性，但尚不清楚视觉插图与由它们生成的文本简化 (TS) 有何关系。本文提出了一种结构化视觉语言模型（VLM）提示框架，用于从简化文本生成可访问的图像。我们设计了五个提示模板，即基本对象焦点、上下文场景、教育布局、多层次细节和网格布局，每个提示模板都遵循不同的空间安排，同时遵守对象数量限制、空间分离和内容限制等可访问性约束。使用来自四个已建立的 TS 数据集（OneStopEnglish、SimPA、Wikipedia 和 ASSET）的 400 个句子级简化，我们进行了两阶段评估：第一阶段使用 CLIPScore 评估提示模板的有效性，第二阶段由四位无障碍专家对十种视觉风格的生成图像进行人工注释。结果表明，基本对象焦点提示模板实现了最高的语义对齐，表明视觉极简主义增强了语言的可访问性。专家评估进一步确定复古风格是最容易获取的，维基百科是最有效的数据源。注释者间的一致性在不同维度上有所不同，文本简单性表现出很强的可靠性，而图像质量则更加主观。总体而言，我们的框架为无障碍内容生成提供了实用指南，并强调了人工智能生成的视觉无障碍工具中结构化提示的重要性。

Title: Do LLMs "Feel"? Emotion Circuits Discovery and Control

Authors: Chenxi Wang, Yixuan Zhang, Ruiji Yu, Yufei Zheng, Lang Gao, Zirui Song, Zixiang Xu, Gus Xia, Huishuai Zhang, Dongyan Zhao, Xiuying Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11328
Pdf URL: https://arxiv.org/pdf/2510.11328
Copy Paste: [[2510.11328]] Do LLMs "Feel"? Emotion Circuits Discovery and Control(https://arxiv.org/abs/2510.11328)
Keywords: language model, llm, prompt
Abstract: As the demand for emotional intelligence in large language models (LLMs) grows, a key challenge lies in understanding the internal mechanisms that give rise to emotional expression and in controlling emotions in generated text. This study addresses three core questions: (1) Do LLMs contain context-agnostic mechanisms shaping emotional expression? (2) What form do these mechanisms take? (3) Can they be harnessed for universal emotion control? We first construct a controlled dataset, SEV (Scenario-Event with Valence), to elicit comparable internal states across emotions. Subsequently, we extract context-agnostic emotion directions that reveal consistent, cross-context encoding of emotion (Q1). We identify neurons and attention heads that locally implement emotional computation through analytical decomposition and causal analysis, and validate their causal roles via ablation and enhancement interventions. Next, we quantify each sublayer's causal influence on the model's final emotion representation and integrate the identified local components into coherent global emotion circuits that drive emotional expression (Q2). Directly modulating these circuits achieves 99.65% emotion-expression accuracy on the test set, surpassing prompting- and steering-based methods (Q3). To our knowledge, this is the first systematic study to uncover and validate emotion circuits in LLMs, offering new insights into interpretability and controllable emotional intelligence.
摘要：随着大语言模型（LLM）对情商的需求不断增长，一个关键的挑战在于理解引发情感表达的内部机制以及控制生成文本中的情感。这项研究解决了三个核心问题：（1）法学硕士是否包含塑造情绪表达的情境不可知机制？（2）这些机制采取什么形式？（3）它们可以用于普遍的情绪控制吗？我们首先构建一个受控数据集 SEV（带效价的场景事件），以得出不同情绪之间可比较的内部状态。随后，我们提取与上下文无关的情感方向，揭示一致的、跨上下文的情感编码（Q1）。我们通过分析分解和因果分析来识别局部执行情感计算的神经元和注意力头，并通过消融和增强干预来验证它们的因果作用。接下来，我们量化每个子层对模型最终情感表示的因果影响，并将识别的局部组件集成到驱动情感表达的连贯的全局情感电路中（Q2）。直接调节这些电路在测试集上实现了 99.65% 的情绪表达准确度，超越了基于提示和引导的方法（Q3）。据我们所知，这是第一个揭示和验证法学硕士情绪回路的系统研究，为可解释性和可控情商提供了新的见解。

Title: LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation

Authors: Hengran Zhang, Keping Bi, Jiafeng Guo, Jiaming Zhang, Shuaiqiang Wang, Dawei Yin, Xueqi Cheng
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.11358
Pdf URL: https://arxiv.org/pdf/2510.11358
Copy Paste: [[2510.11358]] LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation(https://arxiv.org/abs/2510.11358)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. While traditional retrieval focuses on relevance, RAG's effectiveness depends on the utility of retrieved passages, i.e., the usefulness in facilitating the generation of an accurate and comprehensive answer. Existing studies often treat utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage due to variations in internal knowledge and comprehension ability. In this work, we introduce and systematically investigate the notion of LLM-specific utility. Through large-scale experiments across multiple datasets and LLMs, we demonstrate that human-annotated passages are not optimal for LLMs and that ground-truth utilitarian passages are not transferable across different LLMs. These findings highlight the necessity of adopting the LLM-specific utility in RAG research. Our findings indicate that some human-annotated passages are not ground-truth utilitarian passages for specific LLMs, partially due to the varying readability of queries and passages for LLMs, a tendency for which perplexity is a key metric. Based on these findings, we propose a benchmarking procedure for LLM-specific utility judgments. We evaluate existing utility judgment methods on six datasets and find that while verbalized methods using pseudo-answers perform robustly, LLMs struggle to assess utility effectively, failing to reject all passages for known queries and to select truly useful ones for unknown queries.
摘要：检索增强生成 (RAG) 通过整合外部知识来增强大型语言模型 (LLM)。传统检索注重相关性，而 RAG 的有效性取决于检索到的段落的效用，即促进生成准确且全面的答案的有用性。现有的研究通常将效用视为通用属性，忽略了这样一个事实：由于内部知识和理解能力的差异，不同的法学硕士可能从同一篇文章中受益不同。在这项工作中，我们介绍并系统地研究了法学硕士特定效用的概念。通过跨多个数据集和法学硕士的大规模实验，我们证明人工注释的段落对于法学硕士来说并不是最佳的，并且真实的功利主义段落不能在不同的法学硕士之间转移。这些发现强调了在 RAG 研究中采用 LLM 特定实用程序的必要性。我们的研究结果表明，一些人工注释的段落并不是特定法学硕士的真实实用段落，部分原因是法学硕士的查询和段落的可读性不同，而困惑度是这一趋势的关键指标。基于这些发现，我们提出了针对法学硕士特定效用判断的基准程序。我们在六个数据集上评估了现有的效用判断方法，发现虽然使用伪答案的语言化方法表现稳健，但法学硕士很难有效地评估效用——无法拒绝已知查询的所有段落并为未知查询选择真正有用的段落。

Title: Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers

Authors: Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, Fuli Luo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.11370
Pdf URL: https://arxiv.org/pdf/2510.11370
Copy Paste: [[2510.11370]] Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers(https://arxiv.org/abs/2510.11370)
Keywords: language model
Abstract: Reinforcement learning (RL) has emerged as a crucial approach for enhancing the capabilities of large language models. However, in Mixture-of-Experts (MoE) models, the routing mechanism often introduces instability, even leading to catastrophic RL training collapse. We analyze the training-inference consistency of MoE models and identify a notable discrepancy in routing behaviors between the two phases. Moreover, even under identical conditions, the routing framework can yield divergent expert selections across repeated forward passes. To address this foundational inconsistency, we propose Rollout Routing Replay (R3), a method that records routing distributions from the inference engine and replays them during training. R3 significantly reduces training-inference policy KL divergence and mitigates extreme discrepancies without compromising training speed. Extensive experiments on various settings confirm that R3 succeeds in stabilizing RL training, preventing collapse and outperforming methods such as GSPO and TIS. We believe this work can offer a new solution for stabilizing RL in MoE models.
摘要：强化学习（RL）已成为增强大型语言模型能力的重要方法。然而，在专家混合 (MoE) 模型中，路由机制常常会带来不稳定，甚至导致灾难性的 RL 训练崩溃。我们分析了 MoE 模型的训练-推理一致性，并发现两个阶段之间的路由行为存在显着差异。此外，即使在相同的条件下，路由框架也可以在重复的前向传递中产生不同的专家选择。为了解决这种根本性的不一致问题，我们提出了 Rollout Routing Replay (R3)，这是一种记录来自推理引擎的路由分布并在训练期间重播它们的方法。 R3 显着减少了训练-推理策略 KL 散度，并在不影响训练速度的情况下减轻了极端差异。在各种设置上进行的大量实验证实，R3 成功地稳定了 RL 训练，防止崩溃，并且性能优于 GSPO 和 TIS 等方法。我们相信这项工作可以为稳定 MoE 模型中的 RL 提供新的解决方案。
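Illustrative sketch (not from the paper): the abstract describes recording routing decisions in the inference engine and replaying them during training. The toy example below assumes that what gets replayed is the set of selected expert indices, and that gate weights are recomputed from the training-side logits; both are assumptions, and all function names and logit values are hypothetical.

```python
import math

def top_k_indices(logits, k):
    """Indices of the k largest router logits (ties broken by lower index)."""
    return sorted(range(len(logits)), key=lambda i: (-logits[i], i))[:k]

def gate_weights(logits, selected):
    """Softmax over the selected experts only; unselected experts get weight 0."""
    m = max(logits[i] for i in selected)
    exps = {i: math.exp(logits[i] - m) for i in selected}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def route(train_logits, k, replayed=None):
    """Select experts from the training logits unless a recorded selection is replayed."""
    selected = replayed if replayed is not None else top_k_indices(train_logits, k)
    return selected, gate_weights(train_logits, selected)

# Hypothetical router logits for one token, 4 experts, top-2 routing.
infer_logits = [2.00, 1.01, 1.00, -1.0]  # as seen by the inference engine
train_logits = [2.00, 1.00, 1.01, -1.0]  # as seen by the trainer (tiny numerical drift)

recorded = top_k_indices(infer_logits, k=2)   # stored during rollout
free_sel, _ = route(train_logits, k=2)        # training without replay
replay_sel, replay_w = route(train_logits, k=2, replayed=recorded)

print("inference selection:", recorded)
print("training selection without replay:", free_sel)
print("training selection with replay:", replay_sel)
```

In this toy case a drift of 0.01 in two logits is enough to flip the second selected expert, which is the kind of training-inference mismatch the replay step is meant to remove.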

Title: Early Detection and Reduction of Memorisation for Domain Adaptation and Instruction Tuning

Authors: Dean L. Slack, Noura Al Moubayed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11372
Pdf URL: https://arxiv.org/pdf/2510.11372
Copy Paste: [[2510.11372]] Early Detection and Reduction of Memorisation for Domain Adaptation and Instruction Tuning(https://arxiv.org/abs/2510.11372)
Keywords: language model
Abstract: Although large language models excel across many tasks, they can memorise training data and thereby expose private or copyrighted text. Most defences target the pre-training stage, leaving memorisation during fine-tuning, especially for domain adaptation and instruction tuning, poorly understood. We fine-tune Pythia, Llama3, and Mistral models spanning 1.4B-70B parameters on common evaluation datasets and track verbatim memorisation throughout training. We find that memorisation increases dramatically in the first few epochs, often significantly before either validation perplexity or evaluation performance is optimised. We use a simple but effective n-gram memorisation score which reliably precedes verbatim memorisation; using it as an early-stopping criterion mitigates memorisation with minimal performance loss. Further, we introduce an n-gram-aware loss regulariser and show that it reduces memorisation across all model families tested by up to 40% while minimising evaluation performance trade-offs when compared to an existing memorisation mitigation strategy. These results yield practical, scalable insights into memorisation dynamics during language model fine-tuning.
摘要：尽管大型语言模型在许多任务中表现出色，但它们可以记住训练数据，从而暴露私有或受版权保护的文本。大多数防御措施都针对预训练阶段，而对微调过程中的记忆，尤其是领域适应和指令调整，却知之甚少。我们在通用评估数据集上对涵盖 1.4B-70B 参数的 Pythia、Llama3 和 Mistral 模型进行微调，并在整个训练过程中跟踪逐字记忆。我们发现，记忆力在最初的几个时期中急剧增加，通常在验证困惑或评估性能优化之前显着增加。我们使用简单但有效的 n-gram 记忆分数，该分数可靠地领先于逐字记忆；使用它作为提前停止的标准可以减轻记忆，同时将性能损失降到最低。此外，我们引入了 n-gram 感知损失正则化器，并表明与现有的记忆缓解策略相比，它可以将所有测试的模型系列的记忆减少高达 40%，同时最大限度地减少评估性能权衡。这些结果为语言模型微调期间的记忆动态提供了实用的、可扩展的见解。
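Illustrative sketch (not from the paper): the abstract does not define its n-gram memorisation score, so the version below is an assumption. It scores a generated continuation by the fraction of its n-grams that also occur in the reference training text, and uses a hypothetical threshold rule for early stopping.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_memorisation_score(generated, reference, n=4):
    """Fraction of generated n-grams that also appear in the reference (0.0 if none exist)."""
    gen = ngrams(generated, n)
    if not gen:
        return 0.0
    ref = set(ngrams(reference, n))
    return sum(g in ref for g in gen) / len(gen)

def first_epoch_over(scores, threshold):
    """Index of the first epoch whose score exceeds the threshold, else None."""
    for epoch, score in enumerate(scores):
        if score > threshold:
            return epoch
    return None

reference = "the quick brown fox jumps over the lazy dog".split()
generated = "the quick brown fox sleeps under the lazy dog".split()

partial = ngram_memorisation_score(generated, reference, n=3)
verbatim = ngram_memorisation_score(reference, reference, n=3)
stop_at = first_epoch_over([0.1, 0.2, 0.6, 0.9], threshold=0.5)  # hypothetical per-epoch scores
```

A partial-overlap score like this can rise before any continuation is reproduced verbatim, which is why it could plausibly serve as an early warning signal.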

Title: Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies

Authors: Zirui Song, Yuan Huang, Junchang Liu, Haozhe Luo, Chenxi Wang, Lang Gao, Zixiang Xu, Mingfei Han, Xiaojun Chang, Xiuying Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11389
Pdf URL: https://arxiv.org/pdf/2510.11389
Copy Paste: [[2510.11389]] Beyond Survival: Evaluating LLMs in Social Deduction Games with Human-Aligned Strategies(https://arxiv.org/abs/2510.11389)
Keywords: llm, agent
Abstract: Social deduction games like Werewolf combine language, reasoning, and strategy, providing a testbed for studying natural language and social intelligence. However, most studies reduce the game to LLM-based self-play, yielding templated utterances and anecdotal cases that overlook the richness of social gameplay. Evaluation further relies on coarse metrics such as survival time or subjective scoring due to the lack of quality reference data. To address these gaps, we curate a high-quality, human-verified multimodal Werewolf dataset containing over 100 hours of video, 32.4M utterance tokens, and 15 rule variants. Based on this dataset, we propose a novel strategy-alignment evaluation that leverages the winning faction's strategies as ground truth in two stages: 1) Speech evaluation, formulated as multiple-choice-style tasks that assess whether the model can adopt appropriate stances across five dimensions of social ability; and 2) Decision evaluation, which assesses the model's voting choices and opponent-role inferences. This framework enables a fine-grained evaluation of models' linguistic and reasoning capabilities, while capturing their ability to generate strategically coherent gameplay. Our experiments show that state-of-the-art LLMs exhibit diverse performance, with roughly half remaining below 0.50, revealing clear gaps in deception and counterfactual reasoning. We hope our dataset further inspires research on language, reasoning, and strategy in multi-agent interaction.
摘要：像《狼人杀》这样的社交演绎游戏将语言、推理和策略结合在一起，为研究自然语言和社交智能提供了一个测试平台。然而，大多数研究将游戏简化为基于法学硕士的自我游戏，产生模板化的话语和轶事案例，忽视了社交游戏的丰富性。由于缺乏质量参考数据，评估进一步依赖于生存时间或主观评分等粗略指标。为了解决这些差距，我们策划了一个高质量的、经过人工验证的多模式狼人数据集，其中包含超过 100 小时的视频、3240 万个话语标记和 15 个规则变体。基于该数据集，我们提出了一种新颖的策略一致性评估，该评估利用获胜派系的策略作为两个阶段的基本事实：1）语音评估，制定为多项选择式任务，评估模型是否可以在社交能力的五个维度上采取适当的立场； 2）决策评估，评估模型的投票选择和对手角色推断。该框架可以对模型的语言和推理能力进行细粒度评估，同时捕获它们生成战略上连贯的游戏玩法的能力。我们的实验表明，最先进的法学硕士表现出多样化的表现，大约有一半保持在 0.50 以下，揭示了欺骗和反事实推理方面的明显差距。我们希望我们的数据集进一步激发多智能体交互中的语言、推理和策略的研究。

Title: KnowRL: Teaching Language Models to Know What They Know

Authors: Sahil Kale, Devendra Singh Dhami
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11407
Pdf URL: https://arxiv.org/pdf/2510.11407
Copy Paste: [[2510.11407]] KnowRL: Teaching Language Models to Know What They Know(https://arxiv.org/abs/2510.11407)
Keywords: language model, llm
Abstract: Truly reliable AI requires more than simply scaling up knowledge; it demands the ability to know what it knows and when it does not. Yet recent research shows that even the best LLMs misjudge their own competence in more than one in five cases, making any response born of such internal uncertainty impossible to fully trust. Inspired by self-improvement reinforcement learning techniques that require minimal data, we present a simple but powerful framework KnowRL that strengthens a model's internal understanding of its own feasibility boundaries, enabling safer and more responsible behaviour. Our framework combines two components: (i) introspection, where the model generates and classifies tasks it judges feasible or infeasible, and (ii) consensus-based rewarding, where stability of self-knowledge assessment is reinforced through internal agreement. By using internally generated data, this design strengthens consistency in self-knowledge and entirely avoids costly external supervision. In experiments on LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL steadily improved self-knowledge, validated by both intrinsic self-consistency and extrinsic benchmarking. With nothing more than a small seed set and no external supervision, our method drove gains as high as 28% in accuracy and 12% in F1, outperforming baselines in just a few iterations. Our framework essentially unlocks the untapped capacity of LLMs to self-improve their knowledge awareness, opening the door to reliable, more accountable AI and safer deployment in critical applications. Owing to its simplicity and independence from external effort, we encourage applying this reliability-enhancing process to all future models.
摘要：真正可靠的人工智能需要的不仅仅是扩展知识；它要求有能力知道什么是它知道的，什么时候它不知道的。然而最近的研究表明，即使是最好的法学硕士也会在超过五分之一的情况下误判自己的能力，使得这种内部不确定性产生的任何反应都无法完全信任。受到需要最少数据的自我改进强化学习技术的启发，我们提出了一个简单但功能强大的框架 KnowRL，它加强了模型对其自身可行性边界的内部理解，从而实现更安全、更负责任的行为。我们的框架结合了两个组成部分：（i）内省，模型生成并分类它判断可行或不可行的任务，以及（ii）基于共识的奖励，其中通过内部协议加强自我知识评估的稳定性。通过使用内部生成的数据，这种设计增强了自我知识的一致性，并完全避免了昂贵的外部监督。在 LLaMA-3.1-8B 和 Qwen-2.5-7B 的实验中，KnowRL 稳步提高了自我知识，并通过内在的自我一致性和外在的基准测试进行了验证。只需要一个小的种子集并且没有外部监督，我们的方法就将准确率提高了 28%，在 F1 中提高了 12%，在几次迭代中就超越了基线。我们的框架本质上释放了法学硕士未开发的能力，以自我提高他们的知识意识，为关键应用程序中可靠、更负责任的人工智能和更安全的部署打开了大门。由于其简单性和独立于外部努力，我们鼓励将这种可靠性增强过程应用于所有未来的模型。
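Illustrative sketch (not from the paper): one simple way to read "consensus-based rewarding" is to sample the model's feasibility judgment for the same task several times and reward agreement with the majority label. The function name, label strings, and reward definition below are assumptions made for illustration.

```python
from collections import Counter

def consensus_reward(judgments):
    """Return the majority label and the fraction of samples that agree with it."""
    if not judgments:
        raise ValueError("need at least one judgment")
    label, count = Counter(judgments).most_common(1)[0]
    return label, count / len(judgments)

# Hypothetical repeated self-assessments of a single task.
samples = ["feasible", "feasible", "infeasible", "feasible"]
majority, reward = consensus_reward(samples)
print(majority, reward)
```

Under this reading, a model whose self-assessments are stable across samples earns a reward close to 1, while inconsistent self-assessments are penalised without any external labels.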

Title: Valid Survey Simulations with Limited Human Data: The Roles of Prompting, Fine-Tuning, and Rectification

Authors: Stefan Krsteski, Giuseppe Russo, Serina Chang, Robert West, Kristina Gligorić
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11408
Pdf URL: https://arxiv.org/pdf/2510.11408
Copy Paste: [[2510.11408]] Valid Survey Simulations with Limited Human Data: The Roles of Prompting, Fine-Tuning, and Rectification(https://arxiv.org/abs/2510.11408)
Keywords: language model, llm, prompt
Abstract: Surveys provide valuable insights into public opinion and behavior, but their execution is costly and slow. Large language models (LLMs) have been proposed as a scalable, low-cost substitute for human respondents, but their outputs are often biased and yield invalid estimates. We study the interplay between synthesis methods that use LLMs to generate survey responses and rectification methods that debias population estimates, and explore how human responses are best allocated between them. Using two panel surveys with questions on nutrition, politics, and economics, we find that synthesis alone introduces substantial bias (24-86%), whereas combining it with rectification reduces bias below 5% and increases effective sample size by up to 14%. Overall, we challenge the common practice of using all human responses for fine-tuning, showing that under a fixed budget, allocating most to rectification results in far more effective estimation.
摘要：调查为公众舆论和行为提供了宝贵的见解，但其执行成本高昂且缓慢。大型语言模型 (LLM) 已被提议作为人类受访者的可扩展、低成本替代品，但其输出往往存在偏差并产生无效的估计。我们研究使用法学硕士生成调查响应的综合方法与消除总体估计偏差的校正方法之间的相互作用，并探索如何在它们之间最好地分配人类响应。使用两项有关营养、政治和经济问题的小组调查，我们发现综合本身会带来很大的偏差 (24-86%)，而将其与纠正相结合可将偏差降低到 5% 以下，并将有效样本量增加高达 14%。总体而言，我们对使用所有人类反应进行微调的常见做法提出了挑战，表明在固定预算下，将大部分资金分配给纠正会导致更有效的估计。

Title: Who are you, ChatGPT? Personality and Demographic Style in LLM-Generated Content

Authors: Dana Sotto Porat, Ella Rabinovich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11434
Pdf URL: https://arxiv.org/pdf/2510.11434
Copy Paste: [[2510.11434]] Who are you, ChatGPT? Personality and Demographic Style in LLM-Generated Content(https://arxiv.org/abs/2510.11434)
Keywords: language model, gpt, llm, chat, agent
Abstract: Generative large language models (LLMs) have become central to everyday life, producing human-like text across diverse domains. A growing body of research investigates whether these models also exhibit personality- and demographic-like characteristics in their language. In this work, we introduce a novel, data-driven methodology for assessing LLM personality without relying on self-report questionnaires, instead applying automatic personality and gender classifiers to model replies to open-ended questions collected from Reddit. Comparing six widely used models to human-authored responses, we find that LLMs systematically express higher Agreeableness and lower Neuroticism, reflecting cooperative and stable conversational tendencies. Gendered language patterns in model text broadly resemble those of human writers, though with reduced variation, echoing prior findings on automated agents. We contribute a new dataset of human and model responses, along with large-scale comparative analyses, shedding new light on the topic of personality and demographic patterns of generative AI.
摘要：生成式大语言模型 (LLM) 已成为日常生活的核心，可在不同领域生成类似人类的文本。越来越多的研究调查这些模型是否也在其语言中表现出类似人格和人口统计的特征。在这项工作中，我们引入了一种新颖的、数据驱动的方法来评估法学硕士的性格，而不依赖于自我报告问卷，而是应用自动性格和性别分类器来对从 Reddit 收集的开放式问题的回复进行建模。将六种广泛使用的模型与人类撰写的回答进行比较，我们发现法学硕士系统地表现出较高的宜人性和较低的神经质，反映了合作和稳定的对话倾向。模型文本中的性别语言模式与人类作家的语言模式大致相似，尽管变化较少，这与之前关于自动化代理的发现相呼应。我们贡献了人类和模型反应的新数据集，以及大规模比较分析，为生成人工智能的个性和人口统计模式主题提供了新的线索。

Title: Investigating Large Language Models' Linguistic Abilities for Text Preprocessing

Authors: Marco Braga, Gian Carlo Milanese, Gabriella Pasi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11482
Pdf URL: https://arxiv.org/pdf/2510.11482
Copy Paste: [[2510.11482]] Investigating Large Language Models' Linguistic Abilities for Text Preprocessing(https://arxiv.org/abs/2510.11482)
Keywords: language model, llm, prompt
Abstract: Text preprocessing is a fundamental component of Natural Language Processing, involving techniques such as stopword removal, stemming, and lemmatization to prepare text as input for further processing and analysis. Despite the context-dependent nature of the above techniques, traditional methods usually ignore contextual information. In this paper, we investigate the idea of using Large Language Models (LLMs) to perform various preprocessing tasks, due to their ability to take context into account without requiring extensive language-specific annotated resources. Through a comprehensive evaluation on web-sourced data, we compare LLM-based preprocessing (specifically stopword removal, lemmatization and stemming) to traditional algorithms across multiple text classification tasks in six European languages. Our analysis indicates that LLMs are capable of replicating traditional stopword removal, lemmatization, and stemming methods with accuracies reaching 97%, 82%, and 74%, respectively. Additionally, we show that ML algorithms trained on texts preprocessed by LLMs achieve an improvement of up to 6% with respect to the $F_1$ measure compared to traditional techniques. Our code, prompts, and results are publicly available at this https URL.
摘要：文本预处理是自然语言处理的基本组成部分，涉及停用词删除、词干提取和词形还原等技术，以准备文本作为进一步处理和分析的输入。尽管上述技术具有上下文相关的性质，但传统方法通常忽略上下文信息。在本文中，我们研究了使用大型语言模型（LLM）执行各种预处理任务的想法，因为它们能够考虑上下文，而不需要大量特定于语言的注释资源。通过对网络来源数据的综合评估，我们将基于 LLM 的预处理（特别是停用词去除、词形还原和词干提取）与传统算法在六种欧洲语言的多个文本分类任务中进行比较。我们的分析表明，法学硕士能够复制传统的停用词删除、词形还原和词干提取方法，准确率分别达到 97%、82% 和 74%。此外，我们还表明，与传统技术相比，在由法学硕士预处理的文本上训练的 ML 算法在 $F_1$ 度量方面实现了高达 6% 的改进。我们的代码、提示和结果可通过此 https URL 公开获取。

Title: Hallucination Detection via Internal States and Structured Reasoning Consistency in Large Language Models

Authors: Yusheng Song, Lirong Qiu, Xi Zhang, Zhihao Tang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11529
Pdf URL: https://arxiv.org/pdf/2510.11529
Copy Paste: [[2510.11529]] Hallucination Detection via Internal States and Structured Reasoning Consistency in Large Language Models(https://arxiv.org/abs/2510.11529)
Keywords: language model, llm, hallucination, chain-of-thought
Abstract: The detection of sophisticated hallucinations in Large Language Models (LLMs) is hampered by a "Detection Dilemma": methods probing internal states (Internal State Probing) excel at identifying factual inconsistencies but fail on logical fallacies, while those verifying externalized reasoning (Chain-of-Thought Verification) show the opposite behavior. This schism creates a task-dependent blind spot: Chain-of-Thought Verification fails on fact-intensive tasks like open-domain QA where reasoning is ungrounded, while Internal State Probing is ineffective on logic-intensive tasks like mathematical reasoning where models are confidently wrong. We resolve this with a unified framework that bridges this critical gap. However, unification is hindered by two fundamental challenges: the Signal Scarcity Barrier, as coarse symbolic reasoning chains lack signals directly comparable to fine-grained internal states, and the Representational Alignment Barrier, a deep-seated mismatch between their underlying semantic spaces. To overcome these, we introduce a multi-path reasoning mechanism to obtain more comparable, fine-grained signals, and a segment-aware temporalized cross-attention module to adaptively fuse these now-aligned representations, pinpointing subtle dissonances. Extensive experiments on three diverse benchmarks and two leading LLMs demonstrate that our framework consistently and significantly outperforms strong baselines. Our code is available: this https URL.
摘要：大型语言模型（LLM）中复杂幻觉的检测受到“检测困境”的阻碍：探测内部状态的方法（内部状态探测）擅长识别事实不一致，但在逻辑谬误上失败，而验证外化推理的方法（思想链验证）则表现出相反的行为。这种分裂造成了一个依赖于任务的盲点：思想链验证在开放域 QA 等事实密集型任务上失败，其中推理是没有根据的，而内部状态探测在数学推理等逻辑密集型任务上无效，其中模型肯定是错误的。我们通过一个统一的框架来解决这个问题，以弥补这一关键差距。然而，统一受到两个基本挑战的阻碍：信号稀缺障碍，因为粗略的符号推理链缺乏与细粒度内部状态直接可比的信号；以及表征对齐障碍，它们的底层语义空间之间根深蒂固的不匹配。为了克服这些问题，我们引入了多路径推理机制来获得更具可比性的细粒度信号，以及分段感知的时间化交叉注意模块来自适应地融合这些现在对齐的表示，精确定位微妙的不和谐之处。对三个不同基准和两个领先的法学硕士进行的广泛实验表明，我们的框架始终显着优于强大的基线。我们的代码可用：此 https URL。

Title: Information-Preserving Reformulation of Reasoning Traces for Antidistillation

Authors: Jiayu Ding, Lei Cui, Li Dong, Nanning Zheng, Furu Wei
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11545
Pdf URL: https://arxiv.org/pdf/2510.11545
Copy Paste: [[2510.11545]] Information-Preserving Reformulation of Reasoning Traces for Antidistillation(https://arxiv.org/abs/2510.11545)
Keywords: language model, llm
Abstract: Recent advances in Large Language Models (LLMs) show that extending the length of reasoning chains significantly improves performance on complex tasks. While revealing these reasoning traces helps users better follow, verify, and learn from the model's problem-solving process, it also makes them highly vulnerable to unauthorized distillation. To mitigate this risk, proprietary model providers often adopt aggressive protection strategies, such as replacing detailed reasoning with brief summaries, which deprive users of valuable intermediate information. To address this trade-off, we propose PART, an information-preserving antidistillation reformulation of reasoning traces. Motivated by the difference between how humans understand reasoning traces and how LLMs exploit them for supervised fine-tuning, we design a simple but effective two-step reformulation: removing self-talk behaviors and reordering sub-conclusions. A small auxiliary model is trained to perform this reformulation, incurring minimal computational overhead. Extensive experiments demonstrate that PART consistently disrupts distillation across student models of different sizes and types on various reasoning benchmarks. For instance, when training on reformulated traces, even the performance of a large 32B student model decreases from 54.17 to 46.88 on AIME 2024, corresponding to a 13.5% degradation.
摘要：大型语言模型 (LLM) 的最新进展表明，延长推理链的长度可以显着提高复杂任务的性能。虽然揭示这些推理痕迹可以帮助用户更好地跟踪、验证模型的问题解决过程并从中学习，但也使他们极易受到未经授权的蒸馏的影响。为了减轻这种风险，专有模型提供商通常采取积极的保护策略，例如用简短的摘要代替详细的推理，这剥夺了用户宝贵的中间信息。为了解决这种权衡问题，我们提出了 PART，一种推理轨迹的信息保留反蒸馏重构。受人类理解推理痕迹的方式与法学硕士如何利用推理痕迹进行监督微调之间差异的启发，我们设计了一个简单但有效的两步重新表述：消除自言自语行为并重新排序子结论。训练一个小型辅助模型来执行此重构，从而产生最小的计算开销。大量实验表明，PART 在各种推理基准上始终能够破坏不同大小和类型的学生模型的蒸馏。例如，在重新制定的轨迹上进行训练时，即使是大型 32B 学生模型的性能在 AIME 2024 上也从 54.17 下降到 46.88，相当于下降了 13.5%。
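Worked check (illustrative): the 13.5% figure in the abstract is a relative degradation, not an absolute one. The short calculation below uses only the two AIME 2024 scores quoted in the abstract; the variable names are hypothetical.

```python
before = 54.17  # 32B student score without reformulation, as quoted in the abstract
after = 46.88   # 32B student score when trained on reformulated traces

absolute_drop = before - after                # in score points
relative_drop = absolute_drop / before * 100  # as a percentage of the original score

print(f"absolute drop: {absolute_drop:.2f} points")
print(f"relative drop: {relative_drop:.1f}%")
```

The absolute drop is about 7.29 points, which corresponds to the reported relative degradation of roughly 13.5%.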

Title: Invisible Languages of the LLM Universe

Authors: Saurabh Khanna, Xinxu Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11557
Pdf URL: https://arxiv.org/pdf/2510.11557
Copy Paste: [[2510.11557]] Invisible Languages of the LLM Universe(https://arxiv.org/abs/2510.11557)
Keywords: language model, llm
Abstract: Large Language Models are trained on massive multilingual corpora, yet this abundance masks a profound crisis: of the world's 7,613 living languages, approximately 2,000 languages with millions of speakers remain effectively invisible in digital ecosystems. We propose a critical framework connecting empirical measurements of language vitality (real world demographic strength) and digitality (online presence) with postcolonial theory and epistemic injustice to explain why linguistic inequality in AI systems is not incidental but structural. Analyzing data across all documented human languages, we identify four categories: Strongholds (33%, high vitality and digitality), Digital Echoes (6%, high digitality despite declining vitality), Fading Voices (36%, low on both dimensions), and critically, Invisible Giants (27%, high vitality but near-zero digitality) - languages spoken by millions yet absent from the LLM universe. We demonstrate that these patterns reflect continuities from colonial-era linguistic hierarchies to contemporary AI development, constituting what we term digital epistemic injustice. Our analysis reveals that English dominance in AI is not a technical necessity but an artifact of power structures that systematically exclude marginalized linguistic knowledge. We conclude with implications for decolonizing language technology and democratizing access to AI benefits.
摘要：大型语言模型是在海量多语言语料库上进行训练的，但这种丰富性掩盖了一场深刻的危机：在世界上 7,613 种现存语言中，大约 2,000 种拥有数百万使用者的语言在数字生态系统中实际上仍然不可见。我们提出了一个关键框架，将语言活力（现实世界人口实力）和数字性（在线存在）的实证测量与后殖民理论和认知不公正联系起来，以解释为什么人工智能系统中的语言不平等不是偶然的而是结构性的。通过分析所有记录在案的人类语言的数据，我们确定了四个类别：据点（33%，高活力和数字化）、数字回声（6%，尽管活力下降但数字化程度较高）、衰落之声（36%，两个维度都较低），以及至关重要的隐形巨人（27%，高活力但数字化程度接近于零）——数百万人使用的语言，但 LLM 领域中并不存在。我们证明，这些模式反映了从殖民时代的语言等级制度到当代人工智能发展的连续性，构成了我们所说的数字认知不公正。我们的分析表明，英语在人工智能中的主导地位并不是技术上的必然，而是系统性排除边缘化语言知识的权力结构的产物。最后，我们得出了对语言技术去殖民化和人工智能优势民主化的影响。

Title: Culturally-Aware Conversations: A Framework & Benchmark for LLMs

Authors: Shreya Havaldar, Sunny Rai, Young-Min Cho, Lyle Ungar
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11563
Pdf URL: https://arxiv.org/pdf/2510.11563
Copy Paste: [[2510.11563]] Culturally-Aware Conversations: A Framework & Benchmark for LLMs(https://arxiv.org/abs/2510.11563)
Keywords: llm
Abstract: Existing benchmarks that measure cultural adaptation in LLMs are misaligned with the actual challenges these models face when interacting with users from diverse cultural backgrounds. In this work, we introduce the first framework and benchmark designed to evaluate LLMs in realistic, multicultural conversational settings. Grounded in sociocultural theory, our framework formalizes how linguistic style - a key element of cultural communication - is shaped by situational, relational, and cultural context. We construct a benchmark dataset based on this framework, annotated by culturally diverse raters, and propose a new set of desiderata for cross-cultural evaluation in NLP: conversational framing, stylistic sensitivity, and subjective correctness. We evaluate today's top LLMs on our benchmark and show that these models struggle with cultural adaptation in a conversational setting.
摘要：衡量法学硕士文化适应性的现有基准与这些模型在与来自不同文化背景的用户互动时面临的实际挑战不一致。在这项工作中，我们介绍了第一个框架和基准，旨在在现实的多元文化对话环境中评估法学硕士。我们的框架以社会文化理论为基础，正式阐述了语言风格（文化传播的关键要素）是如何由情境、关系和文化背景塑造的。我们基于这个框架构建了一个基准数据集，由不同文化的评估者进行注释，并提出了 NLP 中跨文化评估的一组新的需求：对话框架、风格敏感性和主观正确性。我们根据我们的基准评估当今顶级的法学硕士，并表明这些模型在对话环境中难以适应文化。

Title: LLMAtKGE: Large Language Models as Explainable Attackers against Knowledge Graph Embeddings

Authors: Ting Li, Yang Yang, Yipeng Yu, Liang Yao, Guoqing Chao, Ruifeng Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11584
Pdf URL: https://arxiv.org/pdf/2510.11584
Copy Paste: [[2510.11584]] LLMAtKGE: Large Language Models as Explainable Attackers against Knowledge Graph Embeddings(https://arxiv.org/abs/2510.11584)
Keywords: language model, llm, prompt
Abstract: Adversarial attacks on knowledge graph embeddings (KGE) aim to disrupt the model's link prediction ability by removing or inserting triples. A recent black-box method has attempted to incorporate textual and structural information to enhance attack performance. However, it is unable to generate human-readable explanations, and exhibits poor generalizability. In the past few years, large language models (LLMs) have demonstrated powerful capabilities in text comprehension, generation, and reasoning. In this paper, we propose LLMAtKGE, a novel LLM-based framework that selects attack targets and generates human-readable explanations. To provide the LLM with sufficient factual context under limited input constraints, we design a structured prompting scheme that explicitly formulates the attack as multiple-choice questions while incorporating KG factual evidence. To address the context-window limitation and hesitation issues, we introduce semantics-based and centrality-based filters, which compress the candidate set while preserving high recall of attack-relevant information. Furthermore, to efficiently integrate both semantic and structural information into the filter, we precompute high-order adjacency and fine-tune the LLM with a triple classification task to enhance filtering performance. Experiments on two widely used knowledge graph datasets demonstrate that our attack outperforms the strongest black-box baselines and provides explanations via reasoning, while showing competitive performance compared with white-box methods. Comprehensive ablation and case studies further validate its capability to generate explanations.
摘要：对知识图嵌入（KGE）的对抗性攻击旨在通过删除或插入三元组来破坏模型的链接预测能力。最近的黑盒方法尝试合并文本和结构信息以增强攻击性能。然而，它无法生成人类可读的解释，并且普遍性较差。在过去的几年里，大型语言模型（LLM）在文本理解、生成和推理方面展现出了强大的能力。在本文中，我们提出了LLMAtKGE，这是一种基于LLM的新型框架，可以选择攻击目标并生成人类可读的解释。为了在有限的输入约束下为法学硕士提供足够的事实背景，我们设计了一种结构化提示方案，将攻击明确地表述为多项选择问题，同时结合知识图谱事实证据。为了解决上下文窗口限制和犹豫问题，我们引入了基于语义和基于中心性的过滤器，它压缩候选集，同时保留攻击相关信息的高召回率。此外，为了有效地将语义和结构信息集成到过滤器中，我们预先计算高阶邻接并使用三重分类任务对 LLM 进行微调，以增强过滤性能。对两个广泛使用的知识图数据集的实验表明，我们的攻击优于最强的黑盒基线，并通过推理提供解释，并显示与白盒方法相比的竞争性能。全面的消融和案例研究进一步验证了其生成解释的能力。

Title: Survey Response Generation: Generating Closed-Ended Survey Responses In-Silico with Large Language Models

Authors: Georg Ahnert, Anna-Carolina Haensch, Barbara Plank, Markus Strohmaier
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2510.11586
Pdf URL: https://arxiv.org/pdf/2510.11586
Copy Paste: [[2510.11586]] Survey Response Generation: Generating Closed-Ended Survey Responses In-Silico with Large Language Models(https://arxiv.org/abs/2510.11586)
Keywords: language model, llm
Abstract: Many in-silico simulations of human survey responses with large language models (LLMs) focus on generating closed-ended survey responses, whereas LLMs are typically trained to generate open-ended text instead. Previous research has used a diverse range of methods for generating closed-ended survey responses with LLMs, and a standard practice remains to be identified. In this paper, we systematically investigate the impact that various Survey Response Generation Methods have on predicted survey responses. We present the results of 32 million simulated survey responses across 8 Survey Response Generation Methods, 4 political attitude surveys, and 10 open-weight language models. We find significant differences between the Survey Response Generation Methods in both individual-level and subpopulation-level alignment. Our results show that Restricted Generation Methods perform best overall, and that reasoning output does not consistently improve alignment. Our work underlines the significant impact that Survey Response Generation Methods have on simulated survey responses, and we develop practical recommendations on the application of Survey Response Generation Methods.

Title: MeTA-LoRA: Data-Efficient Multi-Task Fine-Tuning for Large Language Models

Authors: Bo Cheng, Xu Wang, Jinda Liu, Yi Chang, Yuan Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11598
Pdf URL: https://arxiv.org/pdf/2510.11598
Copy Paste: [[2510.11598]] MeTA-LoRA: Data-Efficient Multi-Task Fine-Tuning for Large Language Models(https://arxiv.org/abs/2510.11598)
Keywords: language model, llm
Abstract: Low-Rank Adaptation (LoRA) has emerged as one of the most widely used parameter-efficient fine-tuning (PEFT) methods for adapting large language models (LLMs) to downstream tasks. While highly effective in single-task settings, it struggles to efficiently leverage inter-task knowledge in complex multi-task learning scenarios, often requiring substantial task-specific data to achieve optimal performance. To address this limitation, we introduce MeTA-LoRA, a two-stage optimization framework that significantly improves data efficiency in multi-task adaptation. In the first stage, task-specific LoRA adapters are learned using only a few samples from each involved dataset, enabling rapid adaptation without large-scale supervision. In the second stage, the shared LoRA adapter is updated by aggregating gradients from multiple tasks to promote knowledge transfer across tasks, further reducing data usage by leveraging common patterns. In both multi-task learning and multilingual learning scenarios, our method matches or surpasses the performance of traditional full-data LoRA fine-tuning approaches, while using significantly less task-specific data.

Title: Deconstructing Attention: Investigating Design Principles for Effective Language Modeling

Authors: Huiyin Xue, Nafise Sadat Moosavi, Nikolaos Aletras
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.11602
Pdf URL: https://arxiv.org/pdf/2510.11602
Copy Paste: [[2510.11602]] Deconstructing Attention: Investigating Design Principles for Effective Language Modeling(https://arxiv.org/abs/2510.11602)
Keywords: language model
Abstract: The success of Transformer language models is widely credited to their dot-product attention mechanism, which interweaves a set of key design principles: mixing information across positions (enabling multi-token interactions), sequence-dependent activations (where attention weights adapt to each input), a specific mathematical form (dot-product similarities plus softmax weighting), and coupling of queries and keys to evolving hidden states (grounding attention in the current layer). However, the necessity of each of these principles remains largely untested. In this work, we systematically deconstruct attention by designing controlled variants that selectively relax these principles, applied both uniformly across all layers and in hybrid architectures where only some layers retain standard attention. Our empirical analysis reveals that mechanisms for mixing tokens are indispensable, as their absence collapses models to near-random behavior, while the exact mathematical form and sequence dependency can be substantially relaxed, especially when preserved in just a subset of layers. Surprisingly, even variants that fail in isolation can achieve robust performance when interleaved with standard attention, highlighting a cooperative effect. These findings deepen our understanding of what truly underpins attention's effectiveness and open new avenues for simplifying language models without sacrificing performance.
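
For reference, the "specific mathematical form" named in the abstract above (dot-product similarities plus softmax weighting) can be sketched in a few lines of pure Python. This is a minimal single-head illustration of standard scaled dot-product attention, not the authors' code; the function names `softmax` and `attention`, the 1/sqrt(d) scaling, and the toy inputs are assumptions chosen for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors.

    Each output row is a weighted average of the value vectors, so
    information is mixed across positions (token mixing), and the
    weights depend on the input itself (sequence dependence).
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Dot-product similarity between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted sum of value vectors, one output dimension at a time.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Toy usage: two identical keys give uniform weights, so the output
# is the plain average of the two value vectors.
result = attention([[1.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]], [[0.0, 2.0], [4.0, 6.0]])
print(result)
```

The controlled variants studied in the paper relax pieces of this computation; the sketch only shows the standard baseline they start from.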

Title: LLM-Oriented Token-Adaptive Knowledge Distillation

Authors: Xurong Xie, Zhucun Xue, Jiafu Wu, Jian Li, Yabiao Wang, Xiaobin Hu, Yong Liu, Jiangning Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.11615
Pdf URL: https://arxiv.org/pdf/2510.11615
Copy Paste: [[2510.11615]] LLM-Oriented Token-Adaptive Knowledge Distillation(https://arxiv.org/abs/2510.11615)
Keywords: language model, llm
Abstract: Knowledge distillation (KD) is a key technique for compressing large-scale language models (LLMs), yet prevailing logit-based methods typically employ static strategies that are misaligned with the dynamic learning process of student models. These methods typically treat all tokens indiscriminately and apply a single, fixed temperature, resulting in suboptimal knowledge transfer. To address these limitations, we propose LLM-Oriented Token-Adaptive Knowledge Distillation (AdaKD), a novel framework that adapts the distillation process to the real-time learning state of each token. AdaKD consists of two synergistic modules driven by a unified token difficulty metric. First, our Loss-Driven Adaptive Token Focusing (LATF) module dynamically adjusts the distillation focus by monitoring the student's learning stability, concentrating computational resources on the most valuable tokens at each training phase. Second, we introduce Inverse Difficulty Temperature Scaling (IDTS), a counterintuitive yet effective token-level temperature strategy. It employs low temperatures for difficult tokens for targeted error correction, and high temperatures for easy tokens to encourage students to learn from the teacher's complete and smooth output distribution, thereby enhancing generalization. As a plug-and-play framework, AdaKD can consistently improve the performance of various distillation methods on multiple model architectures and benchmarks.
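
The token-level temperature idea in the abstract above (low temperature for difficult tokens, high temperature for easy ones) can be illustrated with a small sketch. The abstract does not give the difficulty metric or the temperature formula, so everything here is a hypothetical stand-in: `inverse_difficulty_temperature` is an assumed linear mapping from a difficulty score in [0, 1] to a temperature, and `t_min` and `t_max` are illustrative values, not the paper's settings.

```python
import math

def softened_probs(logits, temperature):
    """Temperature-scaled softmax: higher temperature gives a flatter distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def inverse_difficulty_temperature(difficulty, t_min=0.5, t_max=2.0):
    """Hypothetical mapping (an assumption, not the paper's formula).

    difficulty = 0.0 (easy token) -> t_max, a smooth teacher distribution
    difficulty = 1.0 (hard token) -> t_min, a sharp teacher distribution
    """
    d = min(max(difficulty, 0.0), 1.0)  # clamp to [0, 1]
    return t_max - d * (t_max - t_min)

# Toy usage: the same teacher logits softened for an easy and a hard token.
teacher_logits = [2.0, 1.0, 0.0]
for difficulty in (0.0, 1.0):
    t = inverse_difficulty_temperature(difficulty)
    print(difficulty, t, softened_probs(teacher_logits, t))
```

In a distillation loss, the softened teacher and student distributions for each token would be compared (for example with a KL divergence) using that token's own temperature instead of one fixed global value.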

Title: StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models

Authors: Zehao Chen, Rong Pan, Haoran Li
Subjects: cs.CL, cs.MA
Abstract URL: https://arxiv.org/abs/2510.11618
Pdf URL: https://arxiv.org/pdf/2510.11618
Copy Paste: [[2510.11618]] StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models(https://arxiv.org/abs/2510.11618)
Keywords: language model, agent
Abstract: Human writers often begin their stories with an overarching mental scene, where they envision the interactions between characters and their environment. Inspired by this creative process, we propose a novel approach to long-form story generation, termed hybrid bottom-up long-form story generation, using multi-agent simulations. In our method, agents interact within a dynamic sandbox environment, where their behaviors and interactions with one another and the environment generate emergent events. These events form the foundation for the story, enabling organic character development and plot progression. Unlike traditional top-down approaches that impose rigid structures, our hybrid bottom-up approach allows for the natural unfolding of events, fostering more spontaneous and engaging storytelling. The system is capable of generating stories exceeding 10,000 words while maintaining coherence and consistency, addressing some of the key challenges faced by current story generation models. We achieve state-of-the-art performance across several metrics. This approach offers a scalable and innovative solution for creating dynamic, immersive long-form stories that evolve organically from agent-driven interactions.

Title: Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation

Authors: Siheng Xiong, Ali Payani, Faramarz Fekri
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11620
Pdf URL: https://arxiv.org/pdf/2510.11620
Copy Paste: [[2510.11620]] Enhancing Long Chain-of-Thought Reasoning through Multi-Path Plan Aggregation(https://arxiv.org/abs/2510.11620)
Keywords: language model, chain-of-thought
Abstract: Inference-time scaling enhances the reasoning ability of a language model (LM) by extending its chain-of-thought (CoT). However, existing approaches typically generate the entire reasoning chain in a single forward pass, which often leads to CoT derailment, i.e., the reasoning trajectory drifting off course due to compounding errors. This problem is particularly severe for smaller LMs with long CoTs due to their limited capacity. To address this, we analyze raw long CoTs and uncover a reasoning hierarchy consisting of planning and execution steps. Our analysis reveals that most reasoning errors stem from incorrect planning. Motivated by this observation, we propose Multi-Path Plan Aggregation (MPPA), a framework that augments single-pass reasoning with plan exploration and aggregation. Following a variable interval schedule based on the token position, MPPA generates multiple candidate plans and aggregates them into a refined planning step. To maintain efficiency, we adopt a minimal design in which the base LM serves as the primary policy, while a lightweight LoRA module implements the plan aggregation policy. We further observe that outcome-reward RL is inefficient for long trajectories (e.g., exceeding 4K tokens). To overcome this, we introduce online Step-DPO, a process-level preference optimization scheme that leverages Twisted Sequential Monte Carlo (TSMC) to provide scalable stepwise supervision using small LMs. This yields more efficient training, improved stability, and higher accuracy. Extensive experiments on challenging math, science, and logical reasoning benchmarks demonstrate that, with only 10% SFT data and 5% of preference pairs, our method outperforms both the DeepSeek-R1 distillation baseline and the outcome-reward RL baseline across multiple base models and tasks.

Title: ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems

Authors: Xin Gui, King Zhu, JinCheng Ren, Qianben Chen, Zekun Moore Wang, Yizhi LI, Xinpeng Liu, Xiaowan Li, Wenli Ren, Linyu Miao, Tianrui Qin, Ziqi Shu, He Zhu, Xiangru Tang, Dingfeng Shi, Jiaheng Liu, Yuchen Eleanor Jiang, Minghao Liu, Ge Zhang, Wangchunshu Zhou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11652
Pdf URL: https://arxiv.org/pdf/2510.11652
Copy Paste: [[2510.11652]] ACADREASON: Exploring the Limits of Reasoning Models with Academic Research Problems(https://arxiv.org/abs/2510.11652)
Keywords: language model, gpt, llm, agent
Abstract: In recent years, the research focus of large language models (LLMs) and agents has shifted increasingly from demonstrating novel capabilities to complex reasoning and tackling challenging tasks. However, existing evaluations focus mainly on math/code contests or general tasks, while existing multi-domain academic benchmarks lack sufficient reasoning depth, leaving the field without a rigorous benchmark for high-level reasoning. To fill this gap, we introduce the Acadreason benchmark, designed to evaluate the ability of LLMs and agents to acquire and reason over academic knowledge. It consists of 50 expert-annotated academic problems across five high-reasoning domains, including computer science, economics, law, mathematics, and philosophy. All questions are sourced from top-tier publications in recent years and undergo rigorous annotation and quality control to ensure they are both challenging and answerable. We conduct systematic evaluations of over 10 mainstream LLMs and agents. The results show that most LLMs scored below 20 points, with even the cutting-edge GPT-5 achieving only 16 points. While agents achieved higher scores, none exceeded 40 points. This demonstrates the current capability gap between LLMs and agents in super-intelligent academic research tasks and highlights the challenges of Acadreason.

Title: Scaling Language-Centric Omnimodal Representation Learning

Authors: Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2510.11693
Pdf URL: https://arxiv.org/pdf/2510.11693
Copy Paste: [[2510.11693]] Scaling Language-Centric Omnimodal Representation Learning(https://arxiv.org/abs/2510.11693)
Keywords: language model, llm
Abstract: Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities evolves as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Code, models, and resources are available at this https URL.

Title: When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents

Authors: Lingfei Qian, Xueqing Peng, Yan Wang, Vincent Jim Zhang, Huan He, Hanley Smith, Yi Han, Yueru He, Haohang Li, Yupeng Cao, Yangyang Yu, Alejandro Lopez-Lira, Peng Lu, Jian-Yun Nie, Guojun Xiong, Jimin Huang, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11695
Pdf URL: https://arxiv.org/pdf/2510.11695
Copy Paste: [[2510.11695]] When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents(https://arxiv.org/abs/2510.11695)
Keywords: language model, gpt, llm, agent
Abstract: Although Large Language Model (LLM)-based agents are increasingly used in financial trading, it remains unclear whether they can reason and adapt in live markets, as most studies test models instead of agents, cover limited periods and assets, and rely on unverified data. To address these gaps, we introduce Agent Market Arena (AMA), the first lifelong, real-time benchmark for evaluating LLM-based trading agents across multiple markets. AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework, enabling fair and continuous comparison under real conditions. It implements four agents, including InvestorAgent as a single-agent baseline, TradeAgent and HedgeFundAgent with different risk styles, and DeepFundAgent with memory-based reasoning, and evaluates them across GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash. Live experiments on both cryptocurrency and stock markets demonstrate that agent frameworks display markedly distinct behavioral patterns, spanning from aggressive risk-taking to conservative decision-making, whereas model backbones contribute less to outcome variation. AMA thus establishes a foundation for rigorous, reproducible, and continuously evolving evaluation of financial reasoning and trading intelligence in LLM-based agents.

Title: Demystifying Reinforcement Learning in Agentic Reasoning

Authors: Zhaochen Yu, Ling Yang, Jiaru Zou, Shuicheng Yan, Mengdi Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.11701
Pdf URL: https://arxiv.org/pdf/2510.11701
Copy Paste: [[2510.11701]] Demystifying Reinforcement Learning in Agentic Reasoning(https://arxiv.org/abs/2510.11701)
Keywords: llm, agent
Abstract: Recently, the emergence of agentic RL has showcased that RL could also effectively improve the agentic reasoning ability of LLMs, yet the key design principles and optimal practices remain unclear. In this work, we conduct a comprehensive and systematic investigation to demystify reinforcement learning in agentic reasoning from three key perspectives: data, algorithm, and reasoning mode. We highlight our key insights: (i) Replacing stitched synthetic trajectories with real end-to-end tool-use trajectories yields a far stronger SFT initialization; high-diversity, model-aware datasets sustain exploration and markedly improve RL performance. (ii) Exploration-friendly techniques, such as clip higher, overlong reward shaping, and maintaining adequate policy entropy, are crucial for agentic RL and could improve the training efficiency. (iii) A deliberative strategy with fewer tool calls outperforms frequent tool calls or verbose self-reasoning, improving tool efficiency and final accuracy. Together, these simple practices consistently enhance agentic reasoning and training efficiency, achieving strong results on challenging benchmarks with smaller models, and establishing a practical baseline for future agentic RL research. Beyond these empirical insights, we further contribute a high-quality, real end-to-end agentic SFT dataset along with a high-quality RL dataset, and demonstrate the effectiveness of our insights in boosting the agentic reasoning ability of LLMs across four challenging benchmarks, including AIME2024/AIME2025, GPQA-Diamond, and LiveCodeBench-v6. With our recipes, 4B-sized models could also achieve superior agentic reasoning performance compared to 32B-sized models. Code and models: this https URL
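
The "clip higher" technique mentioned in insight (ii) above is usually understood as a PPO-style clipped objective whose upper clipping bound is looser than its lower one, which leaves more room to raise the probability of initially unlikely actions. The sketch below shows that per-token objective under this reading. It is not the authors' implementation; the function name `clipped_surrogate` and the default bounds `eps_low=0.2` and `eps_high=0.28` are illustrative assumptions and are not taken from this abstract.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped policy-gradient objective with asymmetric bounds.

    ratio     : new_policy_prob / old_policy_prob for the sampled token
    advantage : estimated advantage of that token
    eps_low   : lower clipping width (assumed value)
    eps_high  : upper clipping width; setting it above eps_low is the
                "clip higher" modification (assumed value)
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # Pessimistic choice between the unclipped and clipped terms, as in PPO.
    return min(ratio * advantage, clipped_ratio * advantage)

# Toy usage: with a positive advantage and a large ratio, the asymmetric
# bound allows a larger objective than the symmetric bound does.
asymmetric = clipped_surrogate(1.5, 1.0)
symmetric = clipped_surrogate(1.5, 1.0, eps_high=0.2)
print(asymmetric, symmetric)
```

Setting `eps_high` equal to `eps_low` recovers the usual symmetric PPO clipping, so the two variants can be compared by changing a single argument.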