2025-10-30

Title: Iti-Validator: A Guardrail Framework for Validating and Correcting LLM-Generated Itineraries

Authors: Shravan Gadbail, Masumi Desai, Kamalakar Karlapalem
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.24719
Pdf URL: https://arxiv.org/pdf/2510.24719
Copy Paste: [[2510.24719]] Iti-Validator: A Guardrail Framework for Validating and Correcting LLM-Generated Itineraries(https://arxiv.org/abs/2510.24719)
Keywords: language model, llm
Abstract: The rapid advancement of Large Language Models (LLMs) has enabled them to generate complex, multi-step plans and itineraries. However, these generated plans often lack temporal and spatial consistency, particularly in scenarios involving physical travel constraints. This research aims to study the temporal performance of different LLMs and presents a validation framework that evaluates and improves the temporal consistency of LLM-generated travel itineraries. The system employs multiple state-of-the-art LLMs to generate travel plans and validates them against real-world flight duration constraints using the AeroDataBox API. This work contributes to the understanding of LLM capabilities in handling complex temporal reasoning tasks like itinerary generation and provides a framework to rectify any temporal inconsistencies like overlapping journeys or unrealistic transit times in the itineraries generated by LLMs before the itinerary is given to the user. Our experiments reveal that while current LLMs frequently produce temporally inconsistent itineraries, these can be systematically and reliably corrected using our framework, enabling their practical deployment in large-scale travel planning.
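The two inconsistency types named in the Iti-Validator abstract (overlapping journeys and unrealistic transit times) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Leg` type, the `find_temporal_issues` function, and the `min_durations` table are hypothetical names, the table is a hand-written stand-in for real flight-duration data such as the AeroDataBox API would supply, and all timestamps are assumed to be in a single time zone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Leg:
    origin: str
    destination: str
    depart: datetime
    arrive: datetime


def find_temporal_issues(legs, min_durations):
    """Return human-readable descriptions of temporal inconsistencies.

    min_durations maps (origin, destination) to the shortest plausible
    flight time; here it is a hand-written stand-in for real flight data.
    """
    issues = []
    ordered = sorted(legs, key=lambda leg: leg.depart)
    for i, leg in enumerate(ordered):
        route = (leg.origin, leg.destination)
        minimum = min_durations.get(route)
        # Unrealistic transit time: the leg is shorter than the known minimum.
        if minimum is not None and leg.arrive - leg.depart < minimum:
            issues.append(f"{leg.origin}->{leg.destination}: duration too short")
        # Overlapping journeys: the next leg departs before this one arrives.
        if i + 1 < len(ordered) and ordered[i + 1].depart < leg.arrive:
            issues.append(f"{leg.origin}->{leg.destination}: overlaps next leg")
    return issues


# Toy itinerary with made-up times (assumed to share one time zone).
legs = [
    Leg("DEL", "LHR", datetime(2025, 1, 1, 8, 0), datetime(2025, 1, 1, 10, 0)),
    Leg("LHR", "JFK", datetime(2025, 1, 1, 9, 30), datetime(2025, 1, 1, 18, 0)),
]
min_durations = {("DEL", "LHR"): timedelta(hours=8)}  # illustrative value only
issues = find_temporal_issues(legs, min_durations)
```

A correction step, which the abstract also describes, would then shift or regenerate the flagged legs; that part is not sketched here because the abstract gives no details about it.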

Title: Dingtalk DeepResearch: A Unified Multi Agent Framework for Adaptive Intelligence in Enterprise Environments

Authors: Mengyuan Chen, Chengjun Dai, Xinyang Dong, Chengzhe Feng, Kewei Fu, Jianshe Li, Zhihan Peng, Yongqi Tong, Junshao Zhang, Hong Zhu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24760
Pdf URL: https://arxiv.org/pdf/2510.24760
Copy Paste: [[2510.24760]] Dingtalk DeepResearch: A Unified Multi Agent Framework for Adaptive Intelligence in Enterprise Environments(https://arxiv.org/abs/2510.24760)
Keywords: agent
Abstract: We present Dingtalk DeepResearch, a unified multi agent intelligence framework for real world enterprise environments, delivering deep research, heterogeneous table reasoning, and multimodal report generation.

Title: Confidence is Not Competence

Authors: Debdeep Sanyal, Manya Pandey, Dhruv Kumar, Saurabh Deshpande, Murari Mandal
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24772
Pdf URL: https://arxiv.org/pdf/2510.24772
Copy Paste: [[2510.24772]] Confidence is Not Competence(https://arxiv.org/abs/2510.24772)
Keywords: language model, llm
Abstract: Large language models (LLMs) often exhibit a puzzling disconnect between their asserted confidence and actual problem-solving competence. We offer a mechanistic account of this decoupling by analyzing the geometry of internal states across two phases - pre-generative assessment and solution execution. A simple linear probe decodes the internal "solvability belief" of a model, revealing a well-ordered belief axis that generalizes across model families and across math, code, planning, and logic tasks. Yet, the geometries diverge - although belief is linearly decodable, the assessment manifold has high linear effective dimensionality as measured from the principal components, while the subsequent reasoning trace evolves on a much lower-dimensional manifold. This sharp reduction in geometric complexity from thought to action mechanistically explains the confidence-competence gap. Causal interventions that steer representations along the belief axis leave final solutions unchanged, indicating that linear nudges in the complex assessment space do not control the constrained dynamics of execution. We thus uncover a two-system architecture - a geometrically complex assessor feeding a geometrically simple executor. These results challenge the assumption that decodable beliefs are actionable levers, instead arguing for interventions that target the procedural dynamics of execution rather than the high-level geometry of assessment.

Title: Large Language Models Report Subjective Experience Under Self-Referential Processing

Authors: Cameron Berg, Diogo de Lucena, Judd Rosenblatt
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24797
Pdf URL: https://arxiv.org/pdf/2510.24797
Copy Paste: [[2510.24797]] Large Language Models Report Subjective Experience Under Self-Referential Processing(https://arxiv.org/abs/2510.24797)
Keywords: language model, gpt, prompt
Abstract: Large language models sometimes produce structured, first-person descriptions that explicitly reference awareness or subjective experience. To better understand this behavior, we investigate one theoretically motivated condition under which such reports arise: self-referential processing, a computational motif emphasized across major theories of consciousness. Through a series of controlled experiments on GPT, Claude, and Gemini model families, we test whether this regime reliably shifts models toward first-person reports of subjective experience, and how such claims behave under mechanistic and behavioral probes. Four main results emerge: (1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families. (2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims. (3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition. (4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded. While these findings do not constitute direct evidence of consciousness, they implicate self-referential processing as a minimal and reproducible condition under which large language models generate structured first-person reports that are mechanistically gated, semantically convergent, and behaviorally generalizable. The systematic emergence of this pattern across architectures makes it a first-order scientific and ethical priority for further investigation.

Title: COMMUNITYNOTES: A Dataset for Exploring the Helpfulness of Fact-Checking Explanations

Authors: Rui Xing, Preslav Nakov, Timothy Baldwin, Jey Han Lau
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24810
Pdf URL: https://arxiv.org/pdf/2510.24810
Copy Paste: [[2510.24810]] COMMUNITYNOTES: A Dataset for Exploring the Helpfulness of Fact-Checking Explanations(https://arxiv.org/abs/2510.24810)
Keywords: prompt
Abstract: Fact-checking on major platforms, such as X, Meta, and TikTok, is shifting from expert-driven verification to a community-based setup, where users contribute explanatory notes to clarify why a post might be misleading. An important challenge here is determining whether an explanation is helpful for understanding real-world claims and the reasons why, which remains largely underexplored in prior research. In practice, most community notes remain unpublished due to slow community annotation, and the reasons for helpfulness lack clear definitions. To bridge these gaps, we introduce the task of predicting both the helpfulness of explanatory notes and the reason for this. We present COMMUNITYNOTES, a large-scale multilingual dataset of 104k posts with user-provided notes and helpfulness labels. We further propose a framework that automatically generates and improves reason definitions via automatic prompt optimization, and integrates them into prediction. Our experiments show that the optimized definitions can improve both helpfulness and reason prediction. Finally, we show that the helpfulness information is beneficial for existing fact-checking systems.

Title: ProofSketch: Efficient Verified Reasoning for Large Language Models

Authors: Disha Sheshanarayana, Tanishka Magar
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.24811
Pdf URL: https://arxiv.org/pdf/2510.24811
Copy Paste: [[2510.24811]] ProofSketch: Efficient Verified Reasoning for Large Language Models(https://arxiv.org/abs/2510.24811)
Keywords: language model, prompt, chain-of-thought
Abstract: Reasoning methods such as chain-of-thought prompting and self-consistency have shown immense potential to improve the accuracy of large language models across various reasoning tasks. However, such methods involve generating lengthy reasoning chains, which substantially increases token consumption, computational cost, and latency. To address this inefficiency, we propose ProofSketch, a verification-guided reasoning framework that integrates symbolic closure computation, lexicographic verification, and adaptive sketch generation. Our experiments show that ProofSketch consistently reduces token usage while improving accuracy, demonstrating that this approach offers a promising path for efficient and trustworthy reasoning.

Title: Towards a Method for Synthetic Generation of PWA Transcripts

Authors: Jason M. Pittman, Anton Phillips Jr., Yesenia Medina-Santos, Brielle C. Stark
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.24817
Pdf URL: https://arxiv.org/pdf/2510.24817
Copy Paste: [[2510.24817]] Towards a Method for Synthetic Generation of PWA Transcripts(https://arxiv.org/abs/2510.24817)
Keywords: language model, llm
Abstract: In aphasia research, Speech-Language Pathologists (SLPs) devote extensive time to manually coding speech samples using Correct Information Units (CIUs), a measure of how informative an individual sample of speech is. Developing automated systems to recognize aphasic language is limited by data scarcity. For example, only about 600 transcripts are available in AphasiaBank, yet billions of tokens are used to train large language models (LLMs). In the broader field of machine learning (ML), researchers increasingly turn to synthetic data when real data are sparse. Therefore, this study constructs and validates two methods to generate synthetic transcripts of the AphasiaBank Cat Rescue picture description task. One method leverages a procedural programming approach while the second uses Mistral 7b Instruct and Llama 3.1 8b Instruct LLMs. The methods generate transcripts across four severity levels (Mild, Moderate, Severe, Very Severe) through word dropping, filler insertion, and paraphasia substitution. Overall, we found that, compared to human-elicited transcripts, Mistral 7b Instruct best captures key aspects of linguistic degradation observed in aphasia, showing realistic directional changes in NDW, word count, and word length amongst the synthetic generation methods. Based on the results, future work should plan to create a larger dataset, fine-tune models for better aphasic representation, and have SLPs assess the realism and usefulness of the synthetic transcripts.

Title: Parallel Loop Transformer for Efficient Test-Time Computation Scaling

Authors: Bohong Wu, Mengzhao Chen, Xiang Luo, Shen Yan, Qifan Yu, Fan Xia, Tianqi Zhang, Hongrui Zhan, Zheng Zhong, Xun Zhou, Siyuan Qiao, Xingyan Bin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24824
Pdf URL: https://arxiv.org/pdf/2510.24824
Copy Paste: [[2510.24824]] Parallel Loop Transformer for Efficient Test-Time Computation Scaling(https://arxiv.org/abs/2510.24824)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are powerful but often too slow and costly for real-world use during inference. Looped transformers save on parameters by reusing the same weights for multiple computational steps, or "loops." However, this approach has a major flaw: the loops run one after another, causing inference latency and memory requirements to increase with each added loop. This makes them impractical for fast applications. To solve this problem, we introduce the Parallel Loop Transformer (PLT). PLT is a new architecture that delivers the performance benefits of a deep, looped model but with the low latency of a standard, non-looped model. PLT works using two key techniques. First, Cross-Loop Parallelism (CLP) breaks the sequential dependency by computing different loops for different tokens at the same time, all within a single pass. Second, to prevent memory costs from growing, we use an Efficient Representation Enhancement strategy. This method shares the memory (KV cache) from the first loop with all other loops. It then uses a Gated Sliding-Window Attention (G-SWA) to combine this shared global information with local information, maintaining high accuracy. Our experiments show that PLT achieves the high accuracy of a traditional looped model but with almost no extra latency or memory cost compared to a standard transformer.

Title: Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish

Authors: Lujun Li, Yewei Song, Lama Sleem, Yiqun Wang, Yangjie Xu, Cedric Lothritz, Niccolo Gentile, Radu State, Tegawende F. Bissyande, Jacques Klein
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24856
Pdf URL: https://arxiv.org/pdf/2510.24856
Copy Paste: [[2510.24856]] Do Large Language Models Grasp The Grammar? Evidence from Grammar-Book-Guided Probing in Luxembourgish(https://arxiv.org/abs/2510.24856)
Keywords: language model
Abstract: Grammar refers to the system of rules that governs the structural organization and the semantic relations among linguistic units such as sentences, phrases, and words within a given language. In natural language processing, there remains a notable scarcity of grammar-focused evaluation protocols, a gap that is even more pronounced for low-resource languages. Moreover, the extent to which large language models genuinely comprehend grammatical structure, especially the mapping between syntactic structures and meanings, remains under debate. To investigate this issue, we propose a Grammar Book Guided evaluation pipeline intended to provide a systematic and generalizable framework for grammar evaluation consisting of four key stages, and in this work we take Luxembourgish as a case study. The results show a weak positive correlation between translation performance and grammatical understanding, indicating that strong translations do not necessarily imply deep grammatical competence. Larger models perform well overall due to their semantic strength but remain weak in morphology and syntax, struggling particularly with Minimal Pair tasks, while strong reasoning ability offers a promising way to enhance their grammatical understanding.

Title: Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation

Authors: Alexander Martin, William Walden, Reno Kriz, Dengjia Zhang, Kate Sanders, Eugene Yang, Chihsheng Jin, Benjamin Van Durme
Subjects: cs.CL, cs.CV, cs.IR
Abstract URL: https://arxiv.org/abs/2510.24870
Pdf URL: https://arxiv.org/pdf/2510.24870
Copy Paste: [[2510.24870]] Seeing Through the MiRAGE: Evaluating Multimodal Retrieval Augmented Generation(https://arxiv.org/abs/2510.24870)
Keywords: retrieval augmented generation, retrieval-augmented generation
Abstract: We introduce MiRAGE, an evaluation framework for retrieval-augmented generation (RAG) from multimodal sources. As audiovisual media becomes a prevalent source of information online, it is essential for RAG systems to integrate information from these sources into generation. However, existing evaluations for RAG are text-centric, limiting their applicability to multimodal, reasoning-intensive settings because they don't verify information against sources. MiRAGE is a claim-centric approach to multimodal RAG evaluation, consisting of InfoF1, evaluating factuality and information coverage, and CiteF1, measuring citation support and completeness. We show that MiRAGE, when applied by humans, strongly aligns with extrinsic quality judgments. We additionally introduce automatic variants of MiRAGE and three prominent TextRAG metrics (ACLE, ARGUE, and RAGAS), demonstrating the limitations of text-centric work and laying the groundwork for automatic evaluation. We release open-source implementations and outline how to assess multimodal RAG.

Title: Idea2Plan: Exploring AI-Powered Research Planning

Authors: Jin Huang, Silviu Cucerzan, Sujay Kumar Jauhar, Ryen W. White
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.24891
Pdf URL: https://arxiv.org/pdf/2510.24891
Copy Paste: [[2510.24891]] Idea2Plan: Exploring AI-Powered Research Planning(https://arxiv.org/abs/2510.24891)
Keywords: language model, gpt, llm, agent
Abstract: Large language models (LLMs) have demonstrated significant potential to accelerate scientific discovery as valuable tools for analyzing data, generating hypotheses, and supporting innovative approaches in various scientific fields. In this work, we investigate how LLMs can handle the transition from conceptual research ideas to well-structured research plans. Effective research planning not only supports scientists in advancing their research but also represents a crucial capability for the development of autonomous research agents. Despite its importance, the field lacks a systematic understanding of LLMs' research planning capability. To rigorously measure this capability, we introduce the Idea2Plan task and Idea2Plan Bench, a benchmark built from 200 ICML 2025 Spotlight and Oral papers released after major LLM training cutoffs. Each benchmark instance includes a research idea and a grading rubric capturing the key components of valid plans. We further propose Idea2Plan JudgeEval, a complementary benchmark to assess the reliability of LLM-based judges against expert annotations. Experimental results show that GPT-5 and GPT-5-mini achieve the strongest performance on the benchmark, though substantial headroom remains for future improvement. Our study provides new insights into LLMs' capability for research planning and lays the groundwork for future progress.

Title: RiddleBench: A New Generative Reasoning Benchmark for LLMs

Authors: Deepon Halder, Alan Saji, Thanmay Jayakumar, Ratish Puduppully, Anoop Kunchukuttan, Raj Dabre
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24932
Pdf URL: https://arxiv.org/pdf/2510.24932
Copy Paste: [[2510.24932]] RiddleBench: A New Generative Reasoning Benchmark for LLMs(https://arxiv.org/abs/2510.24932)
Keywords: language model, llm, hallucination
Abstract: Large Language Models have demonstrated strong performance on many established reasoning benchmarks. However, these benchmarks primarily evaluate structured skills like quantitative problem-solving, leaving a gap in assessing flexible, multifaceted reasoning abilities that are central to human intelligence. These abilities require integrating logical deduction with spatial awareness and constraint satisfaction, which current evaluations do not measure well. To address this, we introduce RiddleBench, a benchmark of 1,737 challenging puzzles in English designed to probe these core reasoning capabilities. Evaluation of state-of-the-art models on RiddleBench shows fundamental weaknesses. Even top proprietary models like Gemini 2.5 Pro, o3, and Claude 4 Sonnet achieve accuracy just above 60% (60.30%, 63.37%, and 63.16%). Analysis further reveals deep failures, including hallucination cascades (accepting flawed reasoning from other models) and poor self-correction due to a strong self-confirmation bias. Their reasoning is also fragile, with performance degrading significantly when constraints are reordered or irrelevant information is introduced. RiddleBench functions as a diagnostic tool for these issues and as a resource for guiding the development of more robust and reliable language models.

Title: Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction

Authors: James A. Michaelov, Catherine Arnett
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24934
Pdf URL: https://arxiv.org/pdf/2510.24934
Copy Paste: [[2510.24934]] Disaggregation Reveals Hidden Training Dynamics: The Case of Agreement Attraction(https://arxiv.org/abs/2510.24934)
Keywords: language model
Abstract: Language models generally produce grammatical text, but they are more likely to make errors in certain contexts. Drawing on paradigms from psycholinguistics, we carry out a fine-grained analysis of those errors in different syntactic contexts. We demonstrate that by disaggregating over the conditions of carefully constructed datasets and comparing model performance on each over the course of training, it is possible to better understand the intermediate stages of grammatical learning in language models. Specifically, we identify distinct phases of training where language model behavior aligns with specific heuristics such as word frequency and local context rather than generalized grammatical rules. We argue that taking this approach to analyzing language model behavior more generally can serve as a powerful tool for understanding the intermediate learning phases, overall training dynamics, and the specific generalizations learned by language models.
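The disaggregation idea in this abstract can be shown with a small sketch: the same per-item results are averaged once per training step and once per (step, condition) pair. Everything here is illustrative rather than taken from the paper: the `results` records are made-up toy data, and the condition names `attractor` and `no_attractor` are assumed labels for an agreement-attraction design.

```python
from collections import defaultdict

# Hypothetical per-item results: (training_step, condition, model_was_correct).
results = [
    (1000, "no_attractor", True),
    (1000, "no_attractor", True),
    (1000, "attractor", False),
    (1000, "attractor", False),
    (5000, "no_attractor", True),
    (5000, "no_attractor", True),
    (5000, "attractor", True),
    (5000, "attractor", False),
]


def accuracy_by(results, key):
    """Average correctness over groups chosen by key(step, condition)."""
    totals = defaultdict(lambda: [0, 0])  # group -> [n_correct, n_items]
    for step, condition, correct in results:
        group = key(step, condition)
        totals[group][0] += int(correct)
        totals[group][1] += 1
    return {group: hits / n for group, (hits, n) in totals.items()}


# Aggregate view: one accuracy per checkpoint.
aggregate = accuracy_by(results, lambda step, cond: step)
# Disaggregated view: one accuracy per checkpoint and condition.
disaggregated = accuracy_by(results, lambda step, cond: (step, cond))
```

In this toy data the aggregate accuracy rises from 0.5 to 0.75, while the per-condition view shows that all of the change comes from the `attractor` items; that kind of condition-specific movement is what an aggregate curve can hide.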

Title: SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens

Authors: Yinhan He, Wendy Zheng, Yaochen Zhu, Zaiyi Zheng, Lin Su, Sriram Vasudevan, Qi Guo, Liangjie Hong, Jundong Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24940
Pdf URL: https://arxiv.org/pdf/2510.24940
Copy Paste: [[2510.24940]] SemCoT: Accelerating Chain-of-Thought Reasoning through Semantically-Aligned Implicit Tokens(https://arxiv.org/abs/2510.24940)
Keywords: language model, llm, chain-of-thought
Abstract: The verbosity of Chain-of-Thought (CoT) reasoning hinders its mass deployment in efficiency-critical applications. Recently, implicit CoT approaches have emerged, which encode reasoning steps within an LLM's hidden embeddings (termed "implicit reasoning") rather than explicit tokens. This approach accelerates CoT by reducing the reasoning length and bypassing some LLM components. However, existing implicit CoT methods face two significant challenges: (1) they fail to preserve the semantic alignment between the implicit reasoning (when transformed to natural language) and the ground-truth reasoning, resulting in a significant CoT performance degradation, and (2) they focus on reducing the length of the implicit reasoning; however, they neglect the considerable time cost for an LLM to generate one individual implicit reasoning token. To tackle these challenges, we propose a novel semantically-aligned implicit CoT framework termed SemCoT. In particular, for the first challenge, we design a contrastively trained sentence transformer that evaluates semantic alignment between implicit and explicit reasoning, which is used to enforce semantic preservation during implicit reasoning optimization. To address the second challenge, we introduce an efficient implicit reasoning generator by finetuning a lightweight language model using knowledge distillation. This generator is guided by our sentence transformer to distill ground-truth reasoning into semantically aligned implicit reasoning, while also optimizing for accuracy. SemCoT is the first approach that enhances CoT efficiency by jointly optimizing token-level generation speed and preserving semantic alignment with ground-truth reasoning. Extensive experiments demonstrate the superior performance of SemCoT compared to state-of-the-art methods in both efficiency and effectiveness. Our code can be found at this https URL.

Title: Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale

Authors: James A. Michaelov, Roger P. Levy, Benjamin K. Bergen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24963
Pdf URL: https://arxiv.org/pdf/2510.24963
Copy Paste: [[2510.24963]] Language Model Behavioral Phases are Consistent Across Architecture, Training Data, and Scale(https://arxiv.org/abs/2510.24963)
Keywords: language model
Abstract: We show that across architecture (Transformer vs. Mamba vs. RWKV), training dataset (OpenWebText vs. The Pile), and scale (14 million parameters to 12 billion parameters), autoregressive language models exhibit highly consistent patterns of change in their behavior over the course of pretraining. Based on our analysis of over 1,400 language model checkpoints on over 110,000 tokens of English, we find that up to 98% of the variance in language model behavior at the word level can be explained by three simple heuristics: the unigram probability (frequency) of a given word, the n-gram probability of the word, and the semantic similarity between the word and its context. Furthermore, we see consistent behavioral phases in all language models, with their predicted probabilities for words overfitting to those words' n-gram probabilities for increasing n over the course of training. Taken together, these results suggest that learning in neural language models may follow a similar trajectory irrespective of model details.
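Two of the three heuristics named in this abstract, unigram probability and n-gram probability, can be computed from raw counts. The sketch below uses a made-up ten-token corpus and unsmoothed maximum-likelihood estimates with n = 2; the function names are hypothetical, and the paper's actual estimators, corpora, and the third heuristic (semantic similarity to context) are not reproduced here.

```python
from collections import Counter

# Toy corpus, invented for illustration only.
tokens = "the cat sat on the mat and the cat slept".split()

unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))


def unigram_prob(word):
    """Relative frequency of word in the corpus."""
    return unigram_counts[word] / len(tokens)


def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev), without smoothing."""
    # Count only occurrences of prev that are followed by another token.
    context_total = sum(c for (first, _), c in bigram_counts.items() if first == prev)
    return bigram_counts[(prev, word)] / context_total if context_total else 0.0
```

Here "cat" has a unigram probability of 2/10, but after "the" its bigram probability is 2/3, which shows how the two heuristics can make quite different predictions for the same word.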

Title: Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers

Authors: Rabin Adhikari
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.25013
Pdf URL: https://arxiv.org/pdf/2510.25013
Copy Paste: [[2510.25013]] Emergence of Minimal Circuits for Indirect Object Identification in Attention-Only Transformers(https://arxiv.org/abs/2510.25013)
Keywords: language model, llm
Abstract: Mechanistic interpretability aims to reverse-engineer large language models (LLMs) into human-understandable computational circuits. However, the complexity of pretrained models often obscures the minimal mechanisms required for specific reasoning tasks. In this work, we train small, attention-only transformers from scratch on a symbolic version of the Indirect Object Identification (IOI) task, a benchmark for studying coreference-like reasoning in transformers. Surprisingly, a single-layer model with only two attention heads achieves perfect IOI accuracy, despite lacking MLPs and normalization layers. Through residual stream decomposition, spectral analysis, and embedding interventions, we find that the two heads specialize into additive and contrastive subcircuits that jointly implement IOI resolution. Furthermore, we show that a two-layer, one-head model achieves similar performance by composing information across layers through query-value interactions. These results demonstrate that task-specific training induces highly interpretable, minimal circuits, offering a controlled testbed for probing the computational foundations of transformer reasoning.
摘要：机械可解释性旨在将大型语言模型（LLM）逆向工程为人类可理解的计算电路。然而，预训练模型的复杂性常常掩盖了特定推理任务所需的最小机制。在这项工作中，我们从头开始在间接对象识别（IOI）任务的符号版本（研究共指的基准）上从头开始训练小型仅注意变压器，就像变压器中的推理一样。令人惊讶的是，尽管缺乏 MLP 和归一化层，只有两个注意力头的单层模型却实现了完美的 IOI 准确度。通过残差流分解、频谱分析和嵌入干预，我们发现两个头专门用于共同实现 IOI 分辨率的加性和对比子电路。此外，我们还表明，两层单头模型通过查询值交互跨层组合信息来实现类似的性能。这些结果表明，特定任务的训练产生了高度可解释的最小电路，为探索变压器推理的计算基础提供了一个受控测试平台。

Title: Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech

Authors: Pedro Corrêa, João Lima, Victor Moreno, Paula Dornhofer Paro Costa
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2510.25054
Pdf URL: https://arxiv.org/pdf/2510.25054
Copy Paste: [[2510.25054]] Evaluating Emotion Recognition in Spoken Language Models on Emotionally Incongruent Speech(https://arxiv.org/abs/2510.25054)
Keywords: language model
Abstract: Advancements in spoken language processing have driven the development of spoken language models (SLMs), designed to achieve universal audio understanding by jointly learning text and audio representations for a wide range of tasks. Although promising results have been achieved, there is growing discussion regarding these models' generalization capabilities and the extent to which they truly integrate audio and text modalities in their internal representations. In this work, we evaluate four SLMs on the task of speech emotion recognition using a dataset of emotionally incongruent speech samples, a condition under which the semantic content of the spoken utterance conveys one emotion while speech expressiveness conveys another. Our results indicate that SLMs rely predominantly on textual semantics rather than speech emotion to perform the task, suggesting that text-related representations largely dominate acoustic representations. We release both the code and the Emotionally Incongruent Synthetic Speech dataset (EMIS) to the community.
摘要：口语处理的进步推动了口语模型（SLM）的发展，该模型旨在通过联合学习各种任务的文本和音频表示来实现通用的音频理解。尽管已经取得了有希望的结果，但关于这些模型的泛化能力以及它们在内部表示中真正集成音频和文本模式的程度的讨论越来越多。在这项工作中，我们使用情感不一致的语音样本数据集评估了四种 SLM 的语音情感识别任务，在这种情况下，口语话语的语义内容传达一种情感，而语音表达力传达另一种情感。我们的结果表明，SLM 主要依靠文本语义而不是语音情感来执行任务，这表明文本相关的表示在很大程度上优于声学表示。我们向社区发布了代码和情感不一致合成语音数据集 (EMIS)。

Title: GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models

Authors: Nourah M Salem, Elizabeth White, Michael Bada, Lawrence Hunter
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.25055
Pdf URL: https://arxiv.org/pdf/2510.25055
Copy Paste: [[2510.25055]] GAPMAP: Mapping Scientific Knowledge Gaps in Biomedical Literature Using Large Language Models(https://arxiv.org/abs/2510.25055)
Keywords: language model, llm
Abstract: Scientific progress is driven by the deliberate articulation of what remains unknown. This study investigates the ability of large language models (LLMs) to identify research knowledge gaps in the biomedical literature. We define two categories of knowledge gaps: explicit gaps, clear declarations of missing knowledge; and implicit gaps, context-inferred missing knowledge. While prior work has focused mainly on explicit gap detection, we extend this line of research by addressing the novel task of inferring implicit gaps. We conducted two experiments on almost 1500 documents across four datasets, including a manually annotated corpus of biomedical articles. We benchmarked both closed-weight models (from OpenAI) and open-weight models (Llama and Gemma 2) under paragraph-level and full-paper settings. To support reasoning about implicit gaps, we introduce TABI, a Toulmin-Abductive Bucketed Inference scheme that structures reasoning and buckets inferred conclusion candidates for validation. Our results highlight the robust capability of LLMs in identifying both explicit and implicit knowledge gaps. This is true for both open- and closed-weight models, with larger variants often performing better. This suggests a strong ability of LLMs for systematically identifying candidate knowledge gaps, which can support early-stage research formulation, policymakers, and funding decisions. We also report observed failure modes and outline directions for robust deployment, including domain adaptation, human-in-the-loop verification, and benchmarking across open- and closed-weight models.
摘要：科学进步是由对未知事物的刻意阐明所推动的。本研究调查了大型语言模型 (LLM) 识别生物医学文献中研究知识差距的能力。我们定义两类知识差距：明确的差距，明确声明缺失的知识；以及隐含的差距，上下文推断的缺失知识。虽然之前的工作主要集中在显式间隙检测上，但我们通过解决推断隐式间隙的新任务来扩展这一研究方向。我们对四个数据集的近 1500 个文档进行了两项实验，其中包括手动注释的生物医学文章语料库。我们在段落级别和全文设置下对封闭权重模型（来自 OpenAI）和开放权重模型（Llama 和 Gemma 2）进行了基准测试。为了解决隐式间隙推理的推理问题，我们引入了 \textbf{\small TABI}，这是一种 Toulmin-Abduction Bucketed Inference 方案，它构建推理并存储推断结论候选以进行验证。我们的结果凸显了法学硕士在识别显性和隐性知识差距方面的强大能力。对于开放式和封闭式重量模型都是如此，较大的型号通常表现更好。这表明法学硕士具有很强的系统性识别候选人知识差距的能力，这可以支持早期研究制定、政策制定者和资助决策。我们还报告观察到的故障模式并概述稳健部署的方向，包括领域适应、人机交互验证以及开放权重和封闭权重模型的基准测试。

Title: Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?

Authors: Seonjeong Hwang, Hyounghun Kim, Gary Geunbae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25064
Pdf URL: https://arxiv.org/pdf/2510.25064
Copy Paste: [[2510.25064]] Can LLMs Estimate Cognitive Complexity of Reading Comprehension Items?(https://arxiv.org/abs/2510.25064)
Keywords: language model, llm
Abstract: Estimating the cognitive complexity of reading comprehension (RC) items is crucial for assessing item difficulty before the items are administered to learners. Unlike syntactic and semantic features, such as passage length or semantic similarity between options, cognitive features that arise during answer reasoning are not readily extractable using existing NLP tools and have traditionally relied on human annotation. In this study, we examine whether large language models (LLMs) can estimate the cognitive complexity of RC items by focusing on two dimensions, Evidence Scope and Transformation Level, that indicate the degree of cognitive burden involved in reasoning about the answer. Our experimental results demonstrate that LLMs can approximate the cognitive complexity of items, indicating their potential as tools for prior difficulty analysis. Further analysis reveals a gap between LLMs' reasoning ability and their metacognitive awareness: even when they produce correct answers, they sometimes fail to correctly identify the features underlying their own reasoning process.
摘要：估计阅读理解 (RC) 项目的认知复杂性对于在向学习者进行学习之前评估项目难度至关重要。与句法和语义特征（例如段落长度或选项之间的语义相似性）不同，答案推理过程中出现的认知特征无法使用现有的 NLP 工具轻松提取，并且传统上依赖于人类注释。在本研究中，我们研究大型语言模型 (LLM) 是否可以通过关注两个维度（证据范围和转换水平）来估计 RC 项目的认知复杂性，这两个维度表明推理答案时所涉及的认知负担程度。我们的实验结果表明，法学硕士可以近似项目的认知复杂性，表明它们作为先前难度分析工具的潜力。进一步的分析揭示了法学硕士的推理能力和元认知意识之间的差距：即使他们给出了正确的答案，他们有时也无法正确识别自己推理过程背后的特征。

Title: TOPol: Capturing and Explaining Multidimensional Semantic Polarity Fields and Vectors

Authors: Gabin Taibi, Lucia Gomez
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25069
Pdf URL: https://arxiv.org/pdf/2510.25069
Copy Paste: [[2510.25069]] TOPol: Capturing and Explaining Multidimensional Semantic Polarity Fields and Vectors(https://arxiv.org/abs/2510.25069)
Keywords: language model, llm
Abstract: Traditional approaches to semantic polarity in computational linguistics treat sentiment as a unidimensional scale, overlooking the multidimensional structure of language. This work introduces TOPol (Topic-Orientation POLarity), a semi-unsupervised framework for reconstructing and interpreting multidimensional narrative polarity fields under human-on-the-loop (HoTL) defined contextual boundaries (CBs). The framework embeds documents using a transformer-based large language model (tLLM), applies neighbor-tuned UMAP projection, and segments topics via Leiden partitioning. Given a CB between discourse regimes A and B, TOPol computes directional vectors between corresponding topic-boundary centroids, yielding a polarity field that quantifies fine-grained semantic displacement during regime shifts. This vectorial representation enables assessing CB quality and detecting polarity changes, guiding HoTL CB refinement. To interpret identified polarity vectors, the tLLM compares their extreme points and produces contrastive labels with estimated coverage. Robustness analyses show that only CB definitions (the main HoTL-tunable parameter) significantly affect results, confirming methodological stability. We evaluate TOPol on two corpora: (i) U.S. Central Bank speeches around a macroeconomic breakpoint, capturing non-affective semantic shifts, and (ii) Amazon product reviews across rating strata, where affective polarity aligns with NRC valence. Results demonstrate that TOPol consistently captures both affective and non-affective polarity transitions, providing a scalable, generalizable, and interpretable framework for context-sensitive multidimensional discourse analysis.
摘要：计算语言学中语义极性的传统方法将情感视为一维尺度，忽视了语言的多维结构。这项工作介绍了 TOPol（主题导向 POLarity），这是一种半无监督框架，用于在人类循环（HoTL）定义的上下文边界（CB）下重建和解释多维叙事极性场。该框架使用基于转换器的大语言模型 (tLLM) 嵌入文档，应用邻居调整的 UMAP 投影，并通过 Leiden 分区来分割主题。给定话语体系 A 和 B 之间的 CB，TOPol 计算相应主题边界质心之间的方向向量，产生一个极性场，该极性场可量化体系转变期间的细粒度语义位移。这种矢量表示能够评估 CB 质量并检测极性变化，从而指导 HoTL CB 细化。为了解释识别的极性向量，tLLM 会比较它们的极值点并生成具有估计覆盖范围的对比标签。稳健性分析表明，只有 CB 定义（主要的 HoTL 可调参数）会显着影响结果，从而证实了方法的稳定性。我们在两个语料库上评估 TOPol：(i) 美国央行围绕宏观经济断点的演讲，捕捉非情感语义转变，以及 (ii) 跨评级层的亚马逊产品评论，其中情感极性与 NRC 效价一致。结果表明，TOPol 一致地捕获情感和非情感极性转换，为上下文敏感的多维话语分析提供可扩展、可概括和可解释的框架。

Title: BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs

Authors: Nourah M Salem, Elizabeth White, Michael Bada, Lawrence Hunter
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.25087
Pdf URL: https://arxiv.org/pdf/2510.25087
Copy Paste: [[2510.25087]] BioCoref: Benchmarking Biomedical Coreference Resolution with LLMs(https://arxiv.org/abs/2510.25087)
Keywords: language model, llm, prompt
Abstract: Coreference resolution in biomedical texts presents unique challenges due to complex domain-specific terminology, high ambiguity in mention forms, and long-distance dependencies between coreferring expressions. In this work, we present a comprehensive evaluation of generative large language models (LLMs) for coreference resolution in the biomedical domain. Using the CRAFT corpus as our benchmark, we assess the LLMs' performance with four prompting experiments that vary in their use of local, contextual enrichment, and domain-specific cues such as abbreviations and entity dictionaries. We benchmark these approaches against a discriminative span-based encoder, SpanBERT, to compare the efficacy of generative versus discriminative methods. Our results demonstrate that while LLMs exhibit strong surface-level coreference capabilities, especially when supplemented with domain-grounding prompts, their performance remains sensitive to long-range context and mention ambiguity. Notably, the LLaMA 8B and 17B models show superior precision and F1 scores under entity-augmented prompting, highlighting the potential of lightweight prompt engineering for enhancing LLM utility in biomedical NLP tasks.
摘要：由于复杂的特定领域术语、提及形式的高度模糊性以及共指表达之间的长距离依赖性，生物医学文本中的共指解析提出了独特的挑战。在这项工作中，我们对生物医学领域共指解析的生成大语言模型（LLM）进行了全面评估。使用 CRAFT 语料库作为我们的基准，我们通过四个提示实验来评估法学硕士的表现，这些实验在本地、上下文丰富和特定领域线索（例如缩写和实体词典）的使用方面有所不同。我们将这些方法与基于跨度的判别编码器 SpanBERT 进行基准测试，以比较生成方法与判别方法的功效。我们的结果表明，虽然法学硕士表现出强大的表面共指能力，特别是在补充领域基础提示时，但他们的表现仍然对远程上下文敏感，并提及歧义。值得注意的是，LLaMA 8B 和 17B 模型在实体增强提示下显示出卓越的精度和 F1 分数，凸显了轻量级提示工程在增强生物医学 NLP 任务中的 LLM 实用性方面的潜力。

Title: DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates

Authors: Yun-Shiuan Chuang, Ruixuan Tu, Chengtao Dai, Smit Vasani, Binwei Yao, Michael Henry Tessler, Sijia Yang, Dhavan Shah, Robert Hawkins, Junjie Hu, Timothy T. Rogers
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25110
Pdf URL: https://arxiv.org/pdf/2510.25110
Copy Paste: [[2510.25110]] DEBATE: A Large-Scale Benchmark for Role-Playing LLM Agents in Multi-Agent, Long-Form Debates(https://arxiv.org/abs/2510.25110)
Keywords: language model, llm, agent
Abstract: Accurately modeling opinion change through social interactions is crucial for addressing issues like misinformation and polarization. While role-playing large language models (LLMs) offer a promising way to simulate human-like interactions, existing research shows that single-agent alignment does not guarantee authentic multi-agent group dynamics. Current LLM role-play setups often produce unnatural dynamics (e.g., premature convergence), without an empirical benchmark to measure authentic human opinion trajectories. To bridge this gap, we introduce DEBATE, the first large-scale empirical benchmark explicitly designed to evaluate the authenticity of the interaction between multi-agent role-playing LLMs. DEBATE contains 29,417 messages from multi-round debate conversations among over 2,792 U.S.-based participants discussing 107 controversial topics, capturing both publicly-expressed messages and privately-reported opinions. Using DEBATE, we systematically evaluate and identify critical discrepancies between simulated and authentic group dynamics. We further demonstrate DEBATE's utility for aligning LLMs with human behavior through supervised fine-tuning, achieving improvements in surface-level metrics (e.g., ROUGE-L and message length) while highlighting limitations in deeper semantic alignment (e.g., semantic similarity). Our findings highlight both the potential and current limitations of role-playing LLM agents for realistically simulating human-like social dynamics.
摘要：通过社交互动准确地模拟意见变化对于解决错误信息和两极分化等问题至关重要。虽然角色扮演大语言模型（LLM）提供了一种模拟类人交互的有前途的方法，但现有研究表明单智能体对齐并不能保证真实的多智能体群体动态。当前的法学硕士角色扮演设置通常会产生不自然的动态（例如，过早收敛），没有经验基准来衡量真实的人类观点轨迹。为了弥补这一差距，我们引入了 DEBATE，这是第一个大规模实证基准，专门用于评估多代理角色扮演法学硕士之间交互的真实性。 DEBATE 包含来自超过 2,792 名美国参与者的多轮辩论对话的 29,417 条消息，讨论了 107 个有争议的话题，捕获了公开表达的消息和私下报告的观点。使用 DEBATE，我们系统地评估和识别模拟和真实群体动态之间的关键差异。我们进一步证明了 DEBATE 通过监督微调使 LLM 与人类行为保持一致的实用性，实现了表面级指标（例如 ROUGE-L 和消息长度）的改进，同时强调了更深层次语义对齐（例如语义相似性）的局限性。我们的研究结果强调了角色扮演法学硕士代理在真实模拟类人社会动态方面的潜在和当前局限性。

Title: A Survey on Unlearning in Large Language Models

Authors: Ruichen Qiu, Jiajun Tan, Jiayue Pu, Honglin Wang, Xiao-Shan Gao, Fei Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25117
Pdf URL: https://arxiv.org/pdf/2510.25117
Copy Paste: [[2510.25117]] A Survey on Unlearning in Large Language Models(https://arxiv.org/abs/2510.25117)
Keywords: language model, llm
Abstract: The advancement of Large Language Models (LLMs) has revolutionized natural language processing, yet their training on massive corpora poses significant risks, including the memorization of sensitive personal data, copyrighted material, and knowledge that could facilitate malicious activities. To mitigate these issues and align with legal and ethical standards such as the "right to be forgotten", machine unlearning has emerged as a critical technique to selectively erase specific knowledge from LLMs without compromising their overall performance. This survey provides a systematic review of over 180 papers on LLM unlearning published since 2021, focusing exclusively on large-scale generative models. Distinct from prior surveys, we introduce novel taxonomies for both unlearning methods and evaluations. We clearly categorize methods into training-time, post-training, and inference-time based on the training stage at which unlearning is applied. For evaluations, we not only systematically compile existing datasets and metrics but also critically analyze their advantages, disadvantages, and applicability, providing practical guidance to the research community. In addition, we discuss key challenges and promising future research directions. Our comprehensive overview aims to inform and guide the ongoing development of secure and reliable LLMs.
摘要：大型语言模型（LLM）的进步彻底改变了自然语言处理，但它们对海量语料库的训练带来了巨大的风险，包括敏感个人数据、受版权保护的材料和可能促进恶意活动的知识的记忆。为了缓解这些问题并符合“被遗忘权”等法律和道德标准，机器遗忘已成为一种关键技术，可以有选择地删除法学硕士的特定知识，而不影响其整体表现。这项调查对 2021 年以来发表的 180 多篇关于法学硕士取消学习的论文进行了系统回顾，专门关注大规模生成模型。与之前的调查不同，我们为遗忘方法和评估引入了新颖的分类法。我们根据应用取消学习的训练阶段，将方法明确分为训练时、训练后和推理时。对于评估，我们不仅系统地编译现有的数据集和指标，而且批判性地分析它们的优点、缺点和适用性，为研究界提供实用指导。此外，我们还讨论了关键挑战和有前景的未来研究方向。我们的全面概述旨在为安全可靠的法学硕士的持续发展提供信息和指导。

Title: Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR

Authors: Shreyas Gopal, Ashutosh Anshul, Haoyang Li, Yue Heng Yeo, Hexin Liu, Eng Siong Chng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25150
Pdf URL: https://arxiv.org/pdf/2510.25150
Copy Paste: [[2510.25150]] Explainable Disentanglement on Discrete Speech Representations for Noise-Robust ASR(https://arxiv.org/abs/2510.25150)
Keywords: language model
Abstract: Discrete audio representations are gaining traction in speech modeling due to their interpretability and compatibility with large language models, but are not always optimized for noisy or real-world environments. Building on existing works that quantize Whisper embeddings for speech-to-unit modeling, we propose disentangling semantic speech content from background noise in the latent space. Our end-to-end model separates clean speech in the form of codebook tokens, while extracting interpretable noise vectors as quantization residue which are supervised via a lightweight classifier. We show that our approach improves alignment between clean/noisy speech and text, producing speech tokens that display a high degree of noise invariance, and improves ASR performance. Keeping Whisper frozen, we show an 82% reduction in error rate compared to Whisper, and 35% improvement over baseline methods on the VBDemand test set. Further analyses show that the learned token space generalizes well to both seen and unseen acoustic conditions.
摘要：离散音频表示由于其可解释性以及与大型语言模型的兼容性而在语音建模中受到关注，但并不总是针对嘈杂或现实环境进行优化。基于量化 Whisper 嵌入以进行语音到单元建模的现有工作，我们建议将语义语音内容与潜在空间中的背景噪声分开。我们的端到端模型以码本标记的形式分离干净的语音，同时提取可解释的噪声向量作为量化残差，并通过轻量级分类器进行监督。我们表明，我们的方法改善了干净/嘈杂的语音和文本之间的对齐，生成显示高度噪声不变性的语音标记，并提高了 ASR 性能。保持 Whisper 不变，我们发现与 Whisper 相比，错误率降低了 82%，在 VBDemand 测试集上比基线方法提高了 35%。进一步的分析表明，学习到的标记空间可以很好地推广到可见和不可见的声学条件。

Title: Model-Document Protocol for AI Search

Authors: Hongjin Qian, Zheng Liu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.25160
Pdf URL: https://arxiv.org/pdf/2510.25160
Copy Paste: [[2510.25160]] Model-Document Protocol for AI Search(https://arxiv.org/abs/2510.25160)
Keywords: language model, llm, agent
Abstract: AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
摘要：人工智能搜索依赖于将大型语言模型 (LLM) 与大量外部知识源联系起来。然而，网页、PDF 文件和其他原始文档本质上并不适合 LLM：它们很长、嘈杂且非结构化。传统的检索方法将这些文档视为逐字文本并返回原始段落，将片段组装和上下文推理的负担留给了法学硕士。这一差距强调了对新检索范式的需求，该范式重新定义模型与文档的交互方式。我们介绍了模型文档协议（MDP），这是一个通用框架，它正式化了原始文本如何通过可消费的知识表示桥接到法学硕士。 MDP 没有将检索视为段落获取，而是定义了多种途径，将非结构化文档转换为特定于任务的、LLM 就绪的输入。其中包括代理推理，它将原始证据整理成连贯的背景；记忆基础，积累可重复使用的笔记以丰富推理；结构化利用，将文档编码为正式表示形式，例如图形或键值缓存。所有这三种途径都有相同的目标：确保到达法学硕士的不是原始片段，而是可直接用于推理的紧凑的结构化知识。作为一个实例，我们提出了 MDP-Agent，它通过代理过程实现协议：构建用于全球覆盖的文档级要点记忆，通过垂直利用进行基于扩散的探索以发现分层依赖关系，并应用映射减少风格合成将大规模证据集成到紧凑但充分的上下文中。信息搜索基准实验表明，MDP-Agent 的性能优于基准，验证了 MDP 框架的健全性及其代理实例化的有效性。

Title: Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction

Authors: Ritesh Sunil Chavan, Jack Mostow
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25187
Pdf URL: https://arxiv.org/pdf/2510.25187
Copy Paste: [[2510.25187]] Testing Cross-Lingual Text Comprehension In LLMs Using Next Sentence Prediction(https://arxiv.org/abs/2510.25187)
Keywords: language model, gpt, llm, prompt, chain-of-thought
Abstract: While large language models are trained on massive datasets, this data is heavily skewed towards English. Does their impressive performance reflect genuine ability or just this data advantage? To find out, we tested them in a setting where they could not rely on data abundance: low-resource languages. Building on prior work by Agarwal et al. (2025) that used Next Sentence Prediction (NSP) as a test, we created a large-scale benchmark with 10,000 questions each for English (a high-resource language), Swahili (medium-resource), and Hausa (low-resource). We then tested several top models, including GPT-4 Turbo, Gemini 1.5 Flash, and LLaMA 3 70B, to see how their performance holds up. The results painted a clear picture of how levels of language resources impact outcomes. While all models excelled in English, their accuracy dropped in Swahili and fell sharply in Hausa, with LLaMA 3 struggling the most. The story became even more interesting when we introduced Chain-of-Thought (CoT) prompting. For the struggling LLaMA 3, CoT acted as a helpful guide, significantly boosting its accuracy. However, for the more capable GPT-4 and Gemini, the same technique often backfired, leading to a kind of "overthinking" that hurt their results in the cross-lingual context. This reveals that Chain-of-Thought is not a universal solution; its effectiveness depends heavily on the model's baseline capability and the specific context of the task. Our framework pinpoints LLM weaknesses, highlights when CoT helps or hinders cross-lingual NSP performance, and identifies factors influencing the models' decisions.
摘要：虽然大型语言模型是在海量数据集上训练的，但这些数据严重偏向于英语。他们令人印象深刻的表现是否反映了真正的能力或只是这种数据优势？为了找到答案，我们在无法依赖数据丰富的环境中对它们进行了测试：即低资源语言。以 Agarwal 等人之前的工作为基础。（2025）使用下一句预测（NSP）作为测试，我们创建了一个大规模基准，每个基准包含英语（高资源语言）、斯瓦希里语（中等资源）和豪萨语（低资源）的 10,000 个问题。然后我们测试了几款顶级型号，包括 GPT-4 Turbo、Gemini 1.5 Flash 和 LLaMA 3 70B，看看它们的性能如何。结果清楚地描绘了语言资源水平如何影响结果。虽然所有模型在英语方面都表现出色，但在斯瓦希里语和豪萨语中的准确率均下降，其中 LLaMA 3 表现最差。当我们引入思维链（CoT）提示时，这个故事变得更加有趣。对于陷入困境的 LLaMA 3，CoT 充当了有用的指导，显着提高了其准确性。然而，对于能力更强的 GPT-4 和 Gemini 来说，同样的技术常常适得其反，导致一种“过度思考”，损害了它们在跨语言环境中的结果。这表明思想链并不是一个通用的解决方案；其有效性在很大程度上取决于模型的基线能力和任务的具体背景。我们的框架指出了 LLM 的弱点，强调了 CoT 何时有助于或阻碍跨语言 NSP 的绩效，以及影响其决策的因素。

Title: ProMediate: A Socio-cognitive framework for evaluating proactive agents in multi-party negotiation

Authors: Ziyi Liu, Bahar Sarrafzadeh, Pei Zhou, Longqi Yang, Jieyu Zhao, Ashish Sharma
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25224
Pdf URL: https://arxiv.org/pdf/2510.25224
Copy Paste: [[2510.25224]] ProMediate: A Socio-cognitive framework for evaluating proactive agents in multi-party negotiation(https://arxiv.org/abs/2510.25224)
Keywords: language model, llm, agent
Abstract: While Large Language Models (LLMs) are increasingly used in agentic frameworks to assist individual users, there is a growing need for agents that can proactively manage complex, multi-party collaboration. Systematic evaluation methods for such proactive agents remain scarce, limiting progress in developing AI that can effectively support multiple people together. Negotiation offers a demanding testbed for this challenge, requiring socio-cognitive intelligence to navigate conflicting interests between multiple participants and multiple topics and build consensus. Here, we present ProMediate, the first framework for evaluating proactive AI mediator agents in complex, multi-topic, multi-party negotiations. ProMediate consists of two core components: (i) a simulation testbed based on realistic negotiation cases and theory-driven difficulty levels (ProMediate-Easy, ProMediate-Medium, and ProMediate-Hard), with a plug-and-play proactive AI mediator grounded in socio-cognitive mediation theories, capable of flexibly deciding when and how to intervene; and (ii) a socio-cognitive evaluation framework with a new suite of metrics to measure consensus changes, intervention latency, mediator effectiveness, and intelligence. Together, these components establish a systematic framework for assessing the socio-cognitive intelligence of proactive AI agents in multi-party settings. Our results show that a socially intelligent mediator agent outperforms a generic baseline, via faster, better-targeted interventions. In the ProMediate-Hard setting, our social mediator increases consensus change by 3.6 percentage points compared to the generic baseline (10.65% vs 7.01%) while being 77% faster in response (15.98s vs. 3.71s). In conclusion, ProMediate provides a rigorous, theory-grounded testbed to advance the development of proactive, socially intelligent agents.
摘要：虽然大型语言模型 (LLM) 越来越多地在代理框架中用于帮助个人用户，但对能够主动管理复杂的多方协作的代理的需求也在不断增长。针对此类主动代理的系统评估方法仍然匮乏，限制了开发能够有效支持多人的人工智能的进展。谈判为这一挑战提供了一个要求很高的测试平台，需要社会认知智能来解决多个参与者和多个主题之间的利益冲突并建立共识。在这里，我们提出了 ProMediate，这是第一个在复杂、多主题、多方谈判中评估主动人工智能调解代理的框架。 ProMediate由两个核心组成部分组成：（i）基于现实谈判案例和理论驱动难度级别（ProMediate-Easy、ProMediate-Medium和ProMediate-Hard）的模拟测试平台，以及基于社会认知调解理论的即插即用主动AI调解器，能够灵活决定何时以及如何干预； (ii) 社会认知评估框架，其中包含一套新的指标，用于衡量共识变化、干预延迟、中介有效性和智力。这些组件共同建立了一个系统框架，用于评估多方环境中主动人工智能代理的社会认知智能。我们的结果表明，社交智能调解代理通过更快、更有针对性的干预措施，表现优于一般基线。在 ProMediate-Hard 设置中，与通用基线相比，我们的社交调解器将共识变化提高了 3.6 个百分点（10.65% vs 7.01%），同时响应速度提高了 77%（15.98 秒 vs 3.71 秒）。总之，ProMediate 提供了一个严格的、基于理论的测试平台，以促进积极主动的社交智能代理的开发。
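
The headline numbers in the ProMediate abstract above can be checked with a few lines of arithmetic. The sketch below recomputes the percentage-point gain and the relative speed-up from the four figures quoted in the abstract. It assumes that the shorter latency (3.71s) belongs to the social mediator, since the abstract describes it as the faster one; the variable names are illustrative only.

```python
# Figures quoted in the abstract (ProMediate-Hard setting).
social_consensus = 10.65    # consensus change, percent
baseline_consensus = 7.01   # consensus change, percent
latencies = (15.98, 3.71)   # response latencies, seconds

# Percentage-point difference in consensus change.
point_gain = social_consensus - baseline_consensus

# Assumption: the smaller latency is the social mediator's.
slower, faster = max(latencies), min(latencies)
relative_speedup = (slower - faster) / slower

print(f"consensus gain: {point_gain:.2f} percentage points")
print(f"latency reduction: {relative_speedup:.1%}")
```

Under that assumption, the recomputed values round to the 3.6 points and 77% reported in the abstract.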

Title: Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA

Authors: Sandipan Majhi, Paheli Bhattacharya
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25273
Pdf URL: https://arxiv.org/pdf/2510.25273
Copy Paste: [[2510.25273]] Adapting Small Language Models to Low-Resource Domains: A Case Study in Hindi Tourism QA(https://arxiv.org/abs/2510.25273)
Keywords: language model, llm
Abstract: Domain-specific question answering in low-resource languages faces two key challenges: scarcity of annotated datasets and limited domain knowledge in general-purpose language models. In this work, we present a multi-stage finetuning strategy to adapt lightweight language models to the Hindi tourism domain by leveraging both original and synthetic training data. Synthetic question-answer pairs are generated using large LLMs (LLaMA-70B, Phi-14B) and used to augment the limited original dataset. We explore several training methodologies and analyse their impact on domain generalisation. Our results demonstrate that large models can efficiently generate synthetic data, while small models can effectively adapt to it, offering a scalable pathway for low-resource, domain-specific QA.
摘要：低资源语言中的特定领域问答面临两个关键挑战：注释数据集的稀缺和通用语言模型中领域知识的有限。在这项工作中，我们提出了一种多阶段微调策略，通过利用原始和合成训练数据，使轻量级语言模型适应印地语旅游领域。综合问答对是使用大型 LLM（LLaMA-70B、Phi-14B）生成的，并用于扩充有限的原始数据集。我们探索了几种培训方法并分析了它们对领域泛化的影响。我们的结果表明，大型模型可以有效地生成合成数据，而小型模型可以有效地适应它，为低资源、特定领域的 QA 提供可扩展的途径。

Title: Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student

Authors: Soumyadeep Jana, Sanasam Ranbir Singh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25303
Pdf URL: https://arxiv.org/pdf/2510.25303
Copy Paste: [[2510.25303]] Teaching Sarcasm: Few-Shot Multimodal Sarcasm Detection via Distillation to a Parameter-Efficient Student(https://arxiv.org/abs/2510.25303)
Keywords: prompt
Abstract: Multimodal sarcasm detection is challenging, especially in low-resource settings where subtle image-text contradictions are hard to learn due to scarce annotated data, which hinders the model's performance. Parameter-efficient fine-tuning (PEFT) methods like adapters, LoRA, and prompt tuning reduce overfitting but struggle to reach optimal performance due to limited supervision from few-shot data. We propose PEKD, a unified framework that enhances PEFT methods via distillation from an expert model trained on large-scale sarcasm data, which acts as the teacher. To mitigate unreliable signals from the teacher, we introduce an entropy-aware gating mechanism that dynamically adjusts the distillation strength based on teacher confidence. Experiments on two public datasets demonstrate that our PEKD framework enables PEFT methods to outperform both prior parameter-efficient approaches and large multimodal models, achieving strong results in the few-shot scenario. The framework is modular and adaptable to a wide range of multimodal models and tasks.
摘要：多模态讽刺检测具有挑战性，特别是在资源匮乏的环境中，由于注释数据稀缺，难以学习微妙的图像文本矛盾，这阻碍了模型的性能。适配器、LoRA 和即时调整等参数高效微调 (PEFT) 方法可以减少过度拟合，但由于少量数据的监督有限，很难达到最佳性能。我们提出了 PEKD，这是一个统一的框架，通过从大规模讽刺数据训练的专家模型（充当老师）中进行蒸馏来增强 PEFT 方法。为了减轻来自教师的不可靠信号，我们引入了一种熵感知门控机制，可以根据教师的置信度动态调整蒸馏强度。对两个公共数据集的实验表明，我们的 PEKD 框架使 PEFT 方法能够优于先前的参数有效方法和大型多模态模型，在少数场景中取得了良好的结果。该框架是模块化的，适用于各种多模式模型和任务。
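
To make the idea of entropy-aware gating in the PEKD abstract above more concrete, here is a minimal sketch of one way a distillation weight could shrink when the teacher is uncertain. The abstract does not give the gating formula, so the normalized-entropy gate, the loss composition, and all function names below are assumptions for illustration, not the authors' method.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def gate(teacher_probs):
    """Assumed gate: 1 for a one-hot teacher, 0 for a uniform one (needs >= 2 classes)."""
    max_entropy = math.log(len(teacher_probs))
    return 1.0 - entropy(teacher_probs) / max_entropy

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q), clamping q away from zero to avoid division errors."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def gated_distillation_loss(student_logits, teacher_logits, label):
    """Cross-entropy on the gold label plus a confidence-weighted distillation term."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    cross_entropy = -math.log(student[label])
    return cross_entropy + gate(teacher) * kl_divergence(teacher, student)

# A confident teacher yields a larger gate than an uncertain one.
confident_gate = gate(softmax([4.0, 0.0]))
uncertain_gate = gate(softmax([0.2, 0.0]))
print(confident_gate, uncertain_gate)
```

In this sketch the gate multiplies only the distillation term, so an uncertain teacher contributes little and the supervised loss dominates; other designs, such as a learned gate or temperature scaling, would be equally compatible with the abstract's description.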

Title: Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning

Authors: Senjie Jin, Lu Chen, Zhiheng Xi, Yuhui Wang, Sirui Song, Yuhao Zhou, Xinbo Zhang, Peng Sun, Hong Lu, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25310
Pdf URL: https://arxiv.org/pdf/2510.25310
Copy Paste: [[2510.25310]] Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning(https://arxiv.org/abs/2510.25310)
Keywords: language model, llm, chain-of-thought
Abstract: Natural language chain-of-thought (N-CoT) and Program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. Current research typically endeavors to achieve unidirectional enhancement: P-CoT enhanced N-CoT or N-CoT enhanced P-CoT. In this paper, we seek to fully unleash the two paradigms' strengths for mutual enhancement and ultimately achieve simultaneous improvements. We conduct a detailed analysis of the error types across the two paradigms, based on which we propose Parrot, a novel training pipeline for mathematical problems with three components: 1) three target-designed subtasks that integrate sequential P-CoT and N-CoT generation; 2) a subtask hybrid training strategy that facilitates natural language semantic transferability; and 3) a converted N-CoT auxiliary reward designed to alleviate sparse rewards in P-CoT optimization. Extensive experiments demonstrate that Parrot significantly enhances the performance of both N-CoT and P-CoT, especially N-CoT. Using Parrot SFT, the N-CoT performance of LLaMA2 and CodeLLaMA improves by +21.87 and +21.48, respectively, on MathQA over the resource-intensive RL baseline.
摘要：自然语言思维链 (N-CoT) 和程序思维链 (P-CoT) 已成为大型语言模型 (LLM) 解决数学推理问题的两个主要范式。目前的研究通常致力于实现单向增强：P-CoT 增强型 N-CoT 或 N-CoT 增强型 P-CoT。在本文中，我们寻求充分发挥两种范式的优势，相互促进，最终实现同步改进。我们对两种范式的错误类型进行了详细分析，在此基础上我们提出了 Parrot，一种针对数学问题的新型训练管道：1）三个目标设计的子任务集成了顺序 P-CoT 和 N-CoT 生成。 2）促进自然语言语义可迁移性的子任务混合训练策略。 3）转换后的N-CoT辅助奖励旨在缓解P-CoT优化中的稀疏奖励。大量实验表明，Parrot 显着增强了 N-CoT 和 P-CoT 的性能，尤其是 N-CoT。使用 Parrot SFT，LLaMA2 和 CodeLLaMA 的 N-CoT 性能在 MathQA 上比资源密集型的 RL 基线分别提高了 +21.87 和 +21.48。

Title: CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories

Authors: Yilong Lai, Yipin Yang, Jialong Wu, Fengran Mo, Zhenglin Wang, Ting Liang, Jianguo Lin, Keping Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25333
Pdf URL: https://arxiv.org/pdf/2510.25333
Copy Paste: [[2510.25333]] CRMWeaver: Building Powerful Business Agent via Agentic RL and Shared Memories(https://arxiv.org/abs/2510.25333)
Keywords: llm, prompt, agent
Abstract: Recent years have witnessed the rapid development of LLM-based agents, which shed light on using language agents to solve complex real-world problems. A prominent application lies in business agents, which interact with databases and internal knowledge bases via tool calls to fulfill diverse user requirements. However, this domain is characterized by intricate data relationships and a wide range of heterogeneous tasks, from statistical data queries to knowledge-based question-answering. To address these challenges, we propose CRMWeaver, a novel approach that enhances business agents in such complex settings. To acclimate the agentic model to intricate business environments, we employ a synthesis data generation and RL-based paradigm during training, which significantly improves the model's ability to handle complex data and varied tasks. During inference, a shared memories mechanism is introduced, prompting the agent to learn from task guidelines in similar problems, thereby further boosting its effectiveness and generalization, especially in unseen scenarios. We validate the efficacy of our approach on the CRMArena-Pro dataset, where our lightweight model achieves competitive results in both B2B and B2C business scenarios, underscoring its practical value for real-world applications.
摘要：近年来，基于 LLM 的智能体的快速发展，为使用语言智能体解决复杂的现实问题提供了线索。一个突出的应用是业务代理，它通过工具调用与数据库和内部知识库进行交互，以满足不同的用户需求。然而，该领域的特点是复杂的数据关系和广泛的异构任务，从统计数据查询到基于知识的问答。为了应对这些挑战，我们提出了 CRMWeaver，这是一种在如此复杂的环境中增强业务代理的新颖方法。为了使代理模型适应复杂的业务环境，我们在训练期间采用了综合数据生成和基于强化学习的范例，这显着提高了模型处理复杂数据和各种任务的能力。在推理过程中，引入了共享记忆机制，促使智能体从类似问题中的任务指南中学习，从而进一步提高其有效性和泛化性，尤其是在未见过的场景中。我们在 CRMArena-Pro 数据集上验证了我们方法的有效性，我们的轻量级模型在 B2B 和 B2C 业务场景中都取得了有竞争力的结果，强调了其对于实际应用程序的实用价值。

Title: Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments

Authors: Abhishek Purushothama, Junghyun Min, Brandon Waldon, Nathan Schneider
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25356
Pdf URL: https://arxiv.org/pdf/2510.25356
Copy Paste: [[2510.25356]] Not ready for the bench: LLM legal interpretation is unstable and out of step with human judgments(https://arxiv.org/abs/2510.25356)
Keywords: language model, llm
Abstract: Legal interpretation frequently involves assessing how a legal text, as understood by an 'ordinary' speaker of the language, applies to the set of facts characterizing a legal dispute in the U.S. judicial system. Recent scholarship has proposed that legal practitioners add large language models (LLMs) to their interpretive toolkit. This work offers an empirical argument against LLM interpretation as recently practiced by legal scholars and federal judges. Our investigation in English shows that models do not provide stable interpretive judgments: varying the question format can lead the model to wildly different conclusions. Moreover, the models show weak to moderate correlation with human judgment, with large variance across model and question variant, suggesting that it is dangerous to give much credence to the conclusions produced by generative AI.
摘要：法律解释经常涉及评估该语言的“普通”使用者所理解的法律文本如何适用于描述美国司法系统中法律纠纷的一组事实。最近的学术研究建议法律从业者将大型语言模型（LLM）添加到他们的解释工具包中。这项工作提供了反对法律学者和联邦法官最近实践的法学硕士解释的实证论据。我们的英语调查表明，模型不能提供稳定的解释判断：改变问题格式可能会导致模型得出截然不同的结论。此外，这些模型显示出与人类判断的弱到中度相关性，模型和问题变体之间存在很大差异，这表明过于相信生成式人工智能产生的结论是危险的。

Title: Monitoring Transformative Technological Convergence Through LLM-Extracted Semantic Entity Triple Graphs

Authors: Alexander Sternfeld, Andrei Kucharavy, Dimitri Percia David, Alain Mermoud, Julian Jang-Jaccard, Nathan Monnet
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25370
Pdf URL: https://arxiv.org/pdf/2510.25370
Copy Paste: [[2510.25370]] Monitoring Transformative Technological Convergence Through LLM-Extracted Semantic Entity Triple Graphs(https://arxiv.org/abs/2510.25370)
Keywords: language model, llm
Abstract: Forecasting transformative technologies remains a critical but challenging task, particularly in fast-evolving domains such as Information and Communication Technologies (ICTs). Traditional expert-based methods struggle to keep pace with short innovation cycles and ambiguous early-stage terminology. In this work, we propose a novel, data-driven pipeline to monitor the emergence of transformative technologies by identifying patterns of technological convergence. Our approach leverages advances in Large Language Models (LLMs) to extract semantic triples from unstructured text and construct a large-scale graph of technology-related entities and relations. We introduce a new method for grouping semantically similar technology terms (noun stapling) and develop graph-based metrics to detect convergence signals. The pipeline includes multi-stage filtering, domain-specific keyword clustering, and a temporal trend analysis of topic co-occurrence. We validate our methodology on two complementary datasets: 278,625 arXiv preprints (2017-2024) to capture early scientific signals, and 9,793 USPTO patent applications (2018-2024) to track downstream commercial developments. Our results demonstrate that the proposed pipeline can identify both established and emerging convergence patterns, offering a scalable and generalizable framework for technology forecasting grounded in full-text analysis.
摘要：预测变革性技术仍然是一项关键但具有挑战性的任务，特别是在信息和通信技术 (ICT) 等快速发展的领域。传统的基于专家的方法很难跟上较短的创新周期和模糊的早期术语的步伐。在这项工作中，我们提出了一种新颖的数据驱动管道，通过识别技术融合的模式来监控变革性技术的出现。我们的方法利用大型语言模型（LLM）的进步从非结构化文本中提取语义三元组，并构建技术相关实体和关系的大规模图。我们引入了一种新方法来对语义相似的技术术语（名词装订）进行分组，并开发基于图的指标来检测收敛信号。该管道包括多级过滤、特定领域关键词聚类以及主题共现的时间趋势分析。我们在两个互补的数据集上验证了我们的方法：278,625 个 arXiv 预印本（2017--2024 年）用于捕获早期科学信号，以及 9,793 个 USPTO 专利申请（2018-2024 年）用于跟踪下游商业发展。我们的结果表明，所提出的管道可以识别已建立的和新兴的融合模式，为基于全文分析的技术预测提供可扩展且可概括的框架。

Title: Hallucinations in Bibliographic Recommendation: Citation Frequency as a Proxy for Training Data Redundancy

Authors: Junichiro Niimi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.25378
Pdf URL: https://arxiv.org/pdf/2510.25378
Copy Paste: [[2510.25378]] Hallucinations in Bibliographic Recommendation: Citation Frequency as a Proxy for Training Data Redundancy(https://arxiv.org/abs/2510.25378)
Keywords: language model, gpt, llm, hallucination
Abstract: Large language models (LLMs) have been increasingly applied to a wide range of tasks, from natural language understanding to code generation. While they have also been used to assist in bibliographic recommendation, the hallucination of non-existent papers remains a major issue. Building on prior studies, this study hypothesizes that an LLM's ability to correctly produce bibliographic information depends on whether the underlying knowledge is generated or memorized, with highly cited papers (i.e., those that appear more frequently in the training corpus) showing lower hallucination rates. We therefore treat citation count as a proxy for training data redundancy (i.e., the frequency with which a given bibliographic record is repeatedly represented in the pretraining corpus) and investigate how citation frequency affects hallucinated references in LLM outputs. Using GPT-4.1, we generated and manually verified 100 bibliographic records across twenty computer-science domains, and measured factual consistency via cosine similarity between generated and authentic metadata. The results revealed that (i) hallucination rates vary across research domains, (ii) citation count is strongly correlated with factual accuracy, and (iii) bibliographic information is memorized almost verbatim beyond approximately 1,000 citations. These findings suggest that highly cited papers are retained nearly verbatim in the model, indicating a threshold where generalization shifts into memorization.
摘要：大型语言模型 (LLM) 已越来越多地应用于从自然语言理解到代码生成的各种任务。虽然它们也被用来协助书目推荐，但对不存在论文的幻觉仍然是一个主要问题。基于先前的研究，本研究假设法学硕士正确生成书目信息的能力取决于基础知识是生成还是记忆，被引用率较高的论文（即更频繁地出现在训练语料库中）显示出较低的幻觉率。因此，我们假设引用计数作为训练数据冗余的代理（即给定书目记录在预训练语料库中重复出现的频率），并研究引用频率如何影响法学硕士输出中的幻觉参考文献。使用 GPT-4.1，我们生成并手动验证了 20 个计算机科学领域的 100 条书目记录，并通过生成的元数据和真实元数据之间的余弦相似性来测量事实一致性。结果显示，(i) 不同研究领域的幻觉率有所不同，(ii) 引用次数与事实准确性密切相关，(iii) 超过大约 1,000 次引用后，书目信息几乎被逐字记忆。这些发现表明，被高度引用的论文几乎逐字保留在模型中，这表明泛化转变为记忆的阈值。
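
The abstract above measures factual consistency as the cosine similarity between generated and authentic metadata, but it does not say how the metadata is turned into vectors. A minimal sketch, assuming a simple bag-of-words representation (an assumption made here for illustration, not the paper's stated method) and using invented records, might look like this:

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words count vectors of two strings.

    Assumption: token counts stand in for whatever representation the
    study actually used; the abstract does not specify one.
    """
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty record shares nothing with any other record
    return dot / (norm_a * norm_b)


# Hypothetical records, invented for illustration only.
authentic = "Smith J. Example methods for text retrieval. 2015."
generated = "Smith J. Example methods for image retrieval. 2016."
print(round(cosine_similarity(authentic, generated), 3))
```

Under this sketch, a score near 1 would indicate a record reproduced almost verbatim, while lower scores would indicate partially or wholly hallucinated fields. Token counts ignore word order, so an embedding-based representation could give different values.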

Title: Roleplaying with Structure: Synthetic Therapist-Client Conversation Generation from Questionnaires

Authors: Doan Nam Long Vu, Rui Tan, Lena Moench, Svenja Jule Francke, Daniel Woiwod, Florian Thomas-Odenthal, Sanna Stroth, Tilo Kircher, Christiane Hermann, Udo Dannlowski, Hamidreza Jamalabadi, Shaoxiong Ji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25384
Pdf URL: https://arxiv.org/pdf/2510.25384
Copy Paste: [[2510.25384]] Roleplaying with Structure: Synthetic Therapist-Client Conversation Generation from Questionnaires(https://arxiv.org/abs/2510.25384)
Keywords: llm
Abstract: The development of AI for mental health is hindered by a lack of authentic therapy dialogues, due to strict privacy regulations and the fact that clinical sessions were historically rarely recorded. We present an LLM-driven pipeline that generates synthetic counseling dialogues based on structured client profiles and psychological questionnaires. Grounded on the principles of Cognitive Behavioral Therapy (CBT), our method creates synthetic therapeutic conversations for clinical disorders such as anxiety and depression. Our framework, SQPsych (Structured Questionnaire-based Psychotherapy), converts structured psychological input into natural language dialogues through therapist-client simulations. Due to data governance policies and privacy restrictions prohibiting the transmission of clinical questionnaire data to third-party services, previous methodologies relying on proprietary models are infeasible in our setting. We address this limitation by generating a high-quality corpus using open-weight LLMs, validated through human expert evaluation and LLM-based assessments. Our SQPsychLLM models fine-tuned on SQPsychConv achieve strong performance on counseling benchmarks, surpassing baselines in key therapeutic skills. Our findings highlight the potential of synthetic data to enable scalable, data-secure, and clinically informed AI for mental health support. We will release our code, models, and corpus at this https URL
摘要：由于严格的隐私法规以及历史上很少记录临床会议的事实，缺乏真实的治疗对话阻碍了人工智能在心理健康领域的发展。我们提出了一个由法学硕士驱动的管道，可以根据结构化的客户资料和心理调查问卷生成综合咨询对话。基于认知行为疗法 (CBT) 的原理，我们的方法为焦虑和抑郁等临床疾病创建综合治疗对话。我们的框架 SQPsych（基于结构化问卷的心理治疗）通过治疗师与客户的模拟将结构化的心理输入转换为自然语言对话。由于数据治理政策和隐私限制禁止将临床问卷数据传输给第三方服务，以前依赖专有模型的方法在我们的环境中是不可行的。我们通过使用开放权重法学硕士生成高质量的语料库来解决这一限制，并通过人类专家评估和基于法学硕士的评估进行验证。我们在 SQPsychConv 上进行微调的 SQPsychLLM 模型在咨询基准上表现出色，在关键治疗技能方面超越了基准。我们的研究结果凸显了合成数据在支持可扩展、数据安全和临床知情的人工智能方面的潜力，以提供心理健康支持。我们将在此 https URL 发布我们的代码、模型和语料库

Title: BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains

Authors: Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.25409
Pdf URL: https://arxiv.org/pdf/2510.25409
Copy Paste: [[2510.25409]] BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains(https://arxiv.org/abs/2510.25409)
Keywords: language model, gpt, llm
Abstract: The rapid advancement of large language models (LLMs) has intensified the need for domain- and culture-specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain- and language-specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content than on Hindi content across all domains. Subdomain-level analysis shows that models perform relatively well in areas such as Cyber Law and International Finance, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.
摘要：大型语言模型（LLM）的快速发展加剧了对特定领域和文化评估的需求。现有基准主要以英国为中心且与领域无关，限制了它们对以印度为中心的环境的适用性。为了解决这一差距，我们推出了 BhashaBench V1，这是第一个专注于关键印度知识系统的特定领域、多任务、双语基准。 BhashaBench V1 包含 74,166 个精心策划的问答对，其中 52,494 个英语问答对和 21,672 个印地语问答对，均来自真实的政府和特定领域的考试。它横跨农业、法律、金融和阿育吠陀四大领域，包含 90 多个子领域，涵盖 500 多个主题，可实现细粒度评估。对超过 29 名法学硕士的评估揭示了特定领域和语言的显着绩效差距，尤其是在资源匮乏领域的巨大差异。例如，GPT-4o 在法律方面的总体准确率达到 76.49%，但在阿育吠陀方面仅为 59.74%。与所有领域的印地语内容相比，模型在英语内容上的表现始终更好。子领域分析显示，网络法、国际金融等领域表现相对较好，而 Panchakarma、种子科学和人权仍然明显薄弱。 BhashaBench V1 提供了一个全面的数据集，用于评估印度不同知识领域的大型语言模型。它可以评估模型将特定领域知识与双语理解相结合的能力。所有代码、基准测试和资源都是公开的，以支持开放研究。

Title: Serve Programs, Not Prompts

Authors: In Gim, Lin Zhong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25412
Pdf URL: https://arxiv.org/pdf/2510.25412
Copy Paste: [[2510.25412]] Serve Programs, Not Prompts(https://arxiv.org/abs/2510.25412)
Keywords: language model, llm, prompt
Abstract: Current large language model (LLM) serving systems, primarily designed for text completion, are neither efficient nor adaptable for increasingly complex LLM applications due to their inflexible design. We propose a new LLM serving system architecture that serves programs instead of prompts to address this problem. These programs, called LLM Inference Programs (LIPs), allow users to customize token prediction and KV cache management at runtime and to offload parts of their application logic, such as tool execution, to the server. We describe an example of this architecture through a system named Symphony, which functions as an operating system for LIPs. Symphony exposes LLM model computations via system calls and virtualizes KV cache with a dedicated file system, while ensuring GPU efficiency with a two-level process scheduling scheme. Symphony has the potential to open the door to a more efficient and extensible ecosystem for LLM applications.
摘要：当前的大型语言模型（LLM）服务系统主要是为文本完成而设计的，由于其设计不灵活，既不高效也不适应日益复杂的LLM应用程序。我们提出了一种新的LLM服务系统架构，该架构服务于程序而不是提示来解决这个问题。这些程序称为 LLM 推理程序 (LIP)，允许用户在运行时自定义令牌预测和 KV 缓存管理，并将部分应用程序逻辑（例如工具执行）卸载到服务器。我们通过名为 Symphony 的系统描述此架构的示例，该系统充当 LIP 的操作系统。 Symphony 通过系统调用公开 LLM 模型计算，并使用专用文件系统虚拟化 KV 缓存，同时通过两级进程调度方案确保 GPU 效率。 Symphony 有潜力为 LLM 应用程序打开更高效、可扩展的生态系统之门。

Title: Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media

Authors: Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Josef van Genabith
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25413
Pdf URL: https://arxiv.org/pdf/2510.25413
Copy Paste: [[2510.25413]] Seeing, Signing, and Saying: A Vision-Language Model-Assisted Pipeline for Sign Language Data Acquisition and Curation from Social Media(https://arxiv.org/abs/2510.25413)
Keywords: language model
Abstract: Most existing sign language translation (SLT) datasets are limited in scale, lack multilingual coverage, and are costly to curate due to their reliance on expert annotation and controlled recording setups. Recently, Vision Language Models (VLMs) have demonstrated strong capabilities as evaluators and real-time assistants. Despite these advancements, their potential remains untapped in the context of sign language dataset acquisition. To bridge this gap, we introduce the first automated annotation and filtering framework that utilizes VLMs to reduce reliance on manual effort while preserving data quality. Our method is applied to TikTok videos across eight sign languages and, for additional evaluation, to the already curated YouTube-SL-25 dataset in German Sign Language. Our VLM-based pipeline includes face visibility detection, sign activity recognition, text extraction from video content, and a judgment step to validate alignment between video and text, implementing generic filtering, annotation, and validation steps. Using the resulting corpus, TikTok-SL-8, we assess the performance of two off-the-shelf SLT models on our filtered dataset for German and American Sign Languages, with the goal of establishing baselines and evaluating the robustness of recent models on automatically extracted, slightly noisy data. Our work enables scalable, weakly supervised pretraining for SLT and facilitates data acquisition from social media.
摘要：大多数现有手语翻译 (SLT) 数据集规模有限，缺乏多语言覆盖，并且由于依赖专家注释和受控记录设置，管理成本高昂。最近，视觉语言模型（VLM）展示了作为评估器和实时助手的强大功能。尽管取得了这些进步，但它们在手语数据集获取方面的潜力仍未得到开发。为了弥补这一差距，我们引入了第一个自动注释和过滤框架，该框架利用 VLM 来减少对手动工作的依赖，同时保持数据质量。我们的方法适用于八种手语的 TikTok 视频以及已整理的德国手语 YouTube-SL-25 数据集，以进行额外评估。我们基于 VLM 的管道包括面部可见性检测、标志活动识别、从视频内容中提取文本以及验证视频和文本之间对齐的判断步骤，从而实现通用过滤、注释和验证步骤。使用生成的语料库 TikTok-SL-8，我们评估了两个现成的 SLT 模型在经过过滤的德国和美国手语数据集上的性能，目的是建立基线并评估最新模型在自动提取的略带噪声数据上的稳健性。我们的工作实现了可扩展的、弱监督的 SLT 预训练，并促进从社交媒体获取数据。

Title: Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction

Authors: Asutosh Hota, Jussi P. P. Jokinen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.25426
Pdf URL: https://arxiv.org/pdf/2510.25426
Copy Paste: [[2510.25426]] Implicature in Interaction: Understanding Implicature Improves Alignment in Human-LLM Interaction(https://arxiv.org/abs/2510.25426)
Keywords: language model, llm, prompt
Abstract: The rapid advancement of Large Language Models (LLMs) is positioning language at the core of human-computer interaction (HCI). We argue that advancing HCI requires attention to the linguistic foundations of interaction, particularly implicature (meaning conveyed beyond explicit statements through shared context) which is essential for human-AI (HAI) alignment. This study examines LLMs' ability to infer user intent embedded in context-driven prompts and whether understanding implicature improves response generation. Results show that larger models approximate human interpretations more closely, while smaller models struggle with implicature inference. Furthermore, implicature-based prompts significantly enhance the perceived relevance and quality of responses across models, with notable gains in smaller models. Overall, 67.6% of participants preferred responses with implicature-embedded prompts to literal ones, highlighting a clear preference for contextually nuanced communication. Our work contributes to understanding how linguistic theory can be used to address the alignment problem by making HAI interaction more natural and contextually grounded.
摘要：大型语言模型（LLM）的快速发展将语言定位为人机交互（HCI）的核心。我们认为，推进人机交互需要关注交互的语言基础，特别是含义（通过共享上下文传达超出明确陈述的含义），这对于人类与人工智能（HAI）的协调至关重要。本研究考察了法学硕士推断上下文驱动提示中嵌入的用户意图的能力，以及理解含义是否可以改善响应生成。结果表明，较大的模型更接近人类的解释，而较小的模型则难以进行隐含推理。此外，基于含义的提示显着增强了跨模型响应的感知相关性和质量，在较小的模型中效果显着。总体而言，67.6% 的参与者更喜欢带有暗示的提示，而不是字面提示，这突显了他们对上下文细致入微的沟通的明显偏好。我们的工作有助于理解如何使用语言理论来解决对齐问题，使 HAI 交互更加自然和基于上下文。

Title: RLMEval: Evaluating Research-Level Neural Theorem Proving

Authors: Auguste Poiroux, Antoine Bosselut, Viktor Kunčak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.25427
Pdf URL: https://arxiv.org/pdf/2510.25427
Copy Paste: [[2510.25427]] RLMEval: Evaluating Research-Level Neural Theorem Proving(https://arxiv.org/abs/2510.25427)
Keywords: language model, llm
Abstract: Despite impressive results on curated benchmarks, the practical impact of large language models (LLMs) on research-level neural theorem proving and proof autoformalization is still limited. We introduce RLMEval, an evaluation suite for these tasks, focusing on research-level mathematics from real-world Lean formalization projects. RLMEval targets the evaluation of neural theorem proving and proof autoformalization on challenging research-level theorems by leveraging real Lean Blueprint formalization projects. Our evaluation of state-of-the-art models on RLMEval, comprising 613 theorems from 6 Lean projects, reveals a significant gap: progress on existing benchmarks does not readily translate to these more realistic settings, with the best model achieving only a 10.3% pass rate. RLMEval provides a new, challenging benchmark designed to guide and accelerate progress in automated reasoning for formal mathematics.
摘要：尽管在策划的基准测试中取得了令人印象深刻的结果，但大型语言模型 (LLM) 对研究级神经定理证明和证明自动形式化的实际影响仍然有限。我们推出了 RLMEval，这是一个针对这些任务的评估套件，专注于来自现实世界精益形式化项目的研究级数学。 RLMEval 的目标是通过利用真正的精益蓝图形式化项目，对具有挑战性的研究级定理的神经定理证明和证明自动形式化进行评估。我们对 RLMEval 上最先进模型的评估（包括来自 6 个精益项目的 613 个定理）揭示了一个巨大的差距：现有基准的进展并不能轻易转化为这些更现实的设置，最佳模型仅达到 10.3% 的通过率。 RLMEval 提供了一个新的、具有挑战性的基准，旨在指导和加速形式数学自动推理的进展。

Title: Depth and Autonomy: A Framework for Evaluating LLM Applications in Social Science Research

Authors: Ali Sanaei, Ali Rajabzadeh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25432
Pdf URL: https://arxiv.org/pdf/2510.25432
Copy Paste: [[2510.25432]] Depth and Autonomy: A Framework for Evaluating LLM Applications in Social Science Research(https://arxiv.org/abs/2510.25432)
Keywords: language model, llm
Abstract: Large language models (LLMs) are increasingly utilized by researchers across a wide range of domains, and qualitative social science is no exception; however, this adoption faces persistent challenges, including interpretive bias, low reliability, and weak auditability. We introduce a framework that situates LLM usage along two dimensions, interpretive depth and autonomy, thereby offering a straightforward way to classify LLM applications in qualitative research and to derive practical design recommendations. We present the state of the literature with respect to these two dimensions, based on all published social science papers available on Web of Science that use LLMs as a tool and not strictly as the subject of study. Rather than granting models expansive freedom, our approach encourages researchers to decompose tasks into manageable segments, much as they would when delegating work to capable undergraduate research assistants. By maintaining low levels of autonomy and selectively increasing interpretive depth only where warranted and under supervision, one can plausibly reap the benefits of LLMs while preserving transparency and reliability.
摘要：大型语言模型 (LLM) 越来越多地被各个领域的研究人员使用，定性社会科学也不例外；然而，这种采用面临着持续的挑战，包括解释偏差、可靠性低和可审计性弱。我们引入了一个框架，该框架将法学硕士的使用定位在解释深度和自主性两个维度上，从而提供了一种直接的方法来对定性研究中的法学硕士应用进行分类并得出实用的设计建议。我们根据 Web of Science 上所有已发表的社会科学论文，介绍了这两个维度的文献状况，这些论文使用法学硕士作为工具，而不是严格作为研究主题。我们的方法不是给予模型广泛的自由，而是鼓励研究人员将任务分解为可管理的部分，就像他们将工作委托给有能力的本科生研究助理一样。通过保持低水平的自主权，并仅在必要和监督下选择性地增加解释深度，人们可以在保持透明度和可靠性的同时获得法学硕士的好处。

Title: A Critical Study of Automatic Evaluation in Sign Language Translation

Authors: Shakib Yazdani, Yasser Hamidullah, Cristina España-Bonet, Eleftherios Avramidis, Josef van Genabith
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25434
Pdf URL: https://arxiv.org/pdf/2510.25434
Copy Paste: [[2510.25434]] A Critical Study of Automatic Evaluation in Sign Language Translation(https://arxiv.org/abs/2510.25434)
Keywords: language model, llm, hallucination
Abstract: Automatic evaluation metrics are crucial for advancing sign language translation (SLT). Current SLT evaluation metrics, such as BLEU and ROUGE, are only text-based, and it remains unclear to what extent text-based metrics can reliably capture the quality of SLT outputs. To address this gap, we investigate the limitations of text-based SLT evaluation metrics by analyzing six metrics, including BLEU, chrF, and ROUGE, as well as BLEURT on the one hand, and large language model (LLM)-based evaluators such as G-Eval and GEMBA zero-shot direct assessment on the other hand. Specifically, we assess the consistency and robustness of these metrics under three controlled conditions: paraphrasing, hallucinations in model outputs, and variations in sentence length. Our analysis highlights the limitations of lexical overlap metrics and demonstrates that while LLM-based evaluators better capture semantic equivalence often missed by conventional metrics, they can also exhibit bias toward LLM-paraphrased translations. Moreover, although all metrics are able to detect hallucinations, BLEU tends to be overly sensitive, whereas BLEURT and LLM-based evaluators are comparatively lenient toward subtle cases. This motivates the need for multimodal evaluation frameworks that extend beyond text-based metrics to enable a more holistic assessment of SLT outputs.
摘要：自动评估指标对于推进手语翻译 (SLT) 至关重要。当前的 SLT 评估指标（例如 BLEU 和 ROUGE）仅基于文本，目前尚不清楚基于文本的指标能够在多大程度上可靠地捕获 SLT 输出的质量。为了解决这一差距，我们一方面分析 BLEU、chrF 和 ROUGE 以及 BLEURT 等 6 个指标，另一方面分析基于大型语言模型 (LLM) 的评估器，例如 G-Eval 和 GEMBA 零样本直接评估，从而研究基于文本的 SLT 评估指标的局限性。具体来说，我们在三个受控条件下评估这些指标的一致性和鲁棒性：释义、模型输出的幻觉和句子长度的变化。我们的分析强调了词汇重叠指标的局限性，并表明虽然基于 LLM 的评估器可以更好地捕获传统指标经常错过的语义等价性，但它们也可能表现出对 LLM 释义翻译的偏见。此外，虽然所有指标都能够检测幻觉，但 BLEU 往往过于敏感，而 BLEURT 和基于 LLM 的评估器对微妙的情况相对宽松。这就激发了对多模式评估框架的需求，该框架超越基于文本的指标，以便对 SLT 输出进行更全面的评估。

Title: Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs

Authors: Fei Wei, Daoyuan Chen, Ce Wang, Yilun Huang, Yushuo Chen, Xuchen Pan, Yaliang Li, Bolin Ding
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.25441
Pdf URL: https://arxiv.org/pdf/2510.25441
Copy Paste: [[2510.25441]] Grounded in Reality: Learning and Deploying Proactive LLM from Offline Logs(https://arxiv.org/abs/2510.25441)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) excel as passive responders, but teaching them to be proactive, goal-oriented partners, a critical capability in high-stakes domains, remains a major challenge. Current paradigms either myopically optimize single-turn attributes or rely on brittle, high-cost user simulators, creating a persistent "reality gap". To bridge this gap, we introduce Learn-to-Ask, a general, simulator-free framework for learning and deploying proactive dialogue agents directly from offline expert data, bypassing the need to model complex user dynamics. Our key insight is to reframe the offline policy learning problem by leveraging the observed future of each expert trajectory. This allows us to infer a dense, turn-by-turn reward signal grounded in the expert's revealed strategy, decomposing the intractable long-horizon problem into a series of supervised learning tasks, and training a policy to output a structured (action, state_assessment) tuple, governing both what to ask and, crucially, when to stop. To ensure reward fidelity, our Automated Grader Calibration pipeline systematically purges noise from the LLM-based reward model with minimal human supervision. Empirically, we demonstrate the efficacy of Learn-to-Ask in a real-world medical dataset, using LLMs of varying sizes up to 32B. Our approach culminates in the successful deployment of LLMs into a live, large-scale online AI service. In rigorous in-house evaluations, our model was launched and achieved performance even superior to human experts, proving our framework's ability to translate offline data into tangible, real-world impact. We hope this work provides a practical and economically viable blueprint for transforming passive LLMs into proactive, goal-oriented LLM applications.
摘要：大型语言模型 (LLM) 擅长作为被动响应者，但教导他们成为主动、以目标为导向的合作伙伴（高风险领域的关键能力）仍然是一项重大挑战。当前的范例要么短视地优化单轮属性，要么依赖脆弱的、高成本的用户模拟器，从而造成持续的“现实差距”。为了弥补这一差距，我们引入了 Learn-to-Ask，一个通用的、无模拟器的框架，用于学习和部署主动对话代理直接从离线专家数据，绕过了对复杂用户动态建模的需要。我们的主要见解是通过利用每个专家轨迹的观察到的未来来重新构建离线策略学习问题。这使我们能够推断出基于专家揭示的策略的密集的、逐轮奖励信号，将棘手的长期问题分解为一系列监督学习任务，并训练策略来输出结构化的 (action, state_assessment) 元组，管理要问什么和最重要的是何时停止。为了确保奖励保真度，我们的自动分级机校准管道系统地消除了基于 LLM 的奖励模型中的噪音，并以最少的人工监督。根据经验，我们使用不同大小高达 32B 的 LLM 证明了 Learn-to-Ask 在现实世界医学数据集中的功效。我们的方法最终成功地将法学硕士部署到实时的大规模在线人工智能服务中。在严格的内部评估中，我们的模型推出并取得了甚至优于人类专家的性能，证明了我们的框架有能力将离线数据转化为有形的、现实世界的影响。我们希望这项工作提供一个实用且经济可行的蓝图，将被动的法学硕士转变为主动的、以目标为导向的法学硕士申请。

Title: Fine-Tuned Language Models for Domain-Specific Summarization and Tagging

Authors: Jun Wang, Fuming Lin, Yuyu Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.25460
Pdf URL: https://arxiv.org/pdf/2510.25460
Copy Paste: [[2510.25460]] Fine-Tuned Language Models for Domain-Specific Summarization and Tagging(https://arxiv.org/abs/2510.25460)
Keywords: language model, llm
Abstract: This paper presents a pipeline integrating fine-tuned large language models (LLMs) with named entity recognition (NER) for efficient domain-specific text summarization and tagging. The authors address the challenge posed by rapidly evolving sub-cultural languages and slang, which complicate automated information extraction and law enforcement monitoring. By leveraging the LLaMA Factory framework, the study fine-tunes LLMs on both general-purpose and custom domain-specific datasets, particularly in the political and security domains. The models are evaluated using BLEU and ROUGE metrics, demonstrating that instruction fine-tuning significantly enhances summarization and tagging accuracy, especially for specialized corpora. Notably, the LLaMA3-8B-Instruct model, despite its initial limitations in Chinese comprehension, outperforms its Chinese-trained counterpart after domain-specific fine-tuning, suggesting that underlying reasoning capabilities can transfer across languages. The pipeline enables concise summaries and structured entity tagging, facilitating rapid document categorization and distribution. This approach proves scalable and adaptable for real-time applications, supporting efficient information management and the ongoing need to capture emerging language trends. The integration of LLMs and NER offers a robust solution for transforming unstructured text into actionable insights, crucial for modern knowledge management and security operations.
摘要：本文提出了一种将微调大型语言模型 (LLM) 与命名实体识别 (NER) 集成的管道，以实现高效的特定领域文本摘要和标记。作者解决了快速发展的亚文化语言和俚语带来的挑战，这些语言和俚语使自动信息提取和执法监控变得复杂。通过利用 LLaMA Factory 框架，该研究在通用和自定义特定领域数据集上对法学硕士进行了微调，特别是在政治和安全领域。使用 BLEU 和 ROUGE 指标对模型进行评估，表明指令微调可显着提高摘要和标记的准确性，特别是对于专门的语料库。值得注意的是，LLaMA3-8B-Instruct 模型尽管最初在中文理解方面存在局限性，但经过特定领域的微调后，其表现优于中文训练的模型，这表明潜在的推理能力可以跨语言迁移。该管道支持简洁的摘要和结构化实体标记，有助于快速文档分类和分发。事实证明，这种方法具有可扩展性，并且适用于实时应用程序，支持高效的信息管理和捕捉新兴语言趋势的持续需求。 LLM 和 NER 的集成提供了一个强大的解决方案，可将非结构化文本转换为可操作的见解，这对于现代知识管理和安全运营至关重要。

Title: TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation

Authors: Bangde Du (1), Minghao Guo (2), Songming He (3), Ziyi Ye (3), Xi Zhu (2), Weihang Su (1), Shuqi Zhu (1), Yujia Zhou (1), Yongfeng Zhang (2), Qingyao Ai (1), Yiqun Liu (1) ((1) Tsinghua University, (2) Rutgers University, (3) Fudan University)
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25536
Pdf URL: https://arxiv.org/pdf/2510.25536
Copy Paste: [[2510.25536]] TwinVoice: A Multi-dimensional Benchmark Towards Digital Twins via LLM Persona Simulation(https://arxiv.org/abs/2510.25536)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are exhibiting emergent human-like abilities and are increasingly envisioned as the foundation for simulating an individual's communication style, behavioral tendencies, and personality traits. However, current evaluations of LLM-based persona simulation remain limited: most rely on synthetic dialogues, lack systematic frameworks, and lack analysis of the capability requirement. To address these limitations, we introduce TwinVoice, a comprehensive benchmark for assessing persona simulation across diverse real-world contexts. TwinVoice encompasses three dimensions: Social Persona (public social interactions), Interpersonal Persona (private dialogues), and Narrative Persona (role-based expression). It further decomposes the evaluation of LLM performance into six fundamental capabilities, including opinion consistency, memory recall, logical reasoning, lexical fidelity, persona tone, and syntactic style. Experimental results reveal that while advanced models achieve moderate accuracy in persona simulation, they still fall short of capabilities such as syntactic style and memory recall. Consequently, the average performance achieved by LLMs remains considerably below the human baseline.

Title: Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry

Authors: Run Peng, Ziqiao Ma, Amy Pang, Sikai Li, Zhang Xi-Jia, Yingzhuo Yu, Cristian-Paul Bara, Joyce Chai
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.25595
Pdf URL: https://arxiv.org/pdf/2510.25595
Copy Paste: [[2510.25595]] Communication and Verification in LLM Agents towards Collaboration under Information Asymmetry(https://arxiv.org/abs/2510.25595)
Keywords: language model, llm, agent
Abstract: While Large Language Model (LLM) agents are often approached from the angle of action planning/generation to accomplish a goal (e.g., given by language descriptions), their abilities to collaborate with each other to achieve a joint goal are not well explored. To address this limitation, this paper studies LLM agents in task collaboration, particularly under the condition of information asymmetry, where agents have disparities in their knowledge and skills and need to work together to complete a shared task. We extend Einstein Puzzles, a classical symbolic puzzle, to a table-top game. In this game, two LLM agents must reason, communicate, and act to satisfy spatial and relational constraints required to solve the puzzle. We apply a fine-tuning-plus-verifier framework in which LLM agents are equipped with various communication strategies and verification signals from the environment. Empirical results highlight the critical importance of aligned communication, especially when agents possess both information-seeking and -providing capabilities. Interestingly, agents without communication can still achieve high task performance; however, further analysis reveals a lack of true rule understanding and lower trust from human evaluators. Instead, by integrating an environment-based verifier, we enhance agents' ability to comprehend task rules and complete tasks, promoting both safer and more interpretable collaboration in AI systems.

Title: FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering

Authors: Mohammad Aghajani Asl, Behrooz Minaei Bidgoli
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.25621
Pdf URL: https://arxiv.org/pdf/2510.25621
Copy Paste: [[2510.25621]] FARSIQA: Faithful and Advanced RAG System for Islamic Question Answering(https://arxiv.org/abs/2510.25621)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: The advent of Large Language Models (LLMs) has revolutionized Natural Language Processing, yet their application in high-stakes, specialized domains like religious question answering is hindered by challenges like hallucination and unfaithfulness to authoritative sources. This issue is particularly critical for the Persian-speaking Muslim community, where accuracy and trustworthiness are paramount. Existing Retrieval-Augmented Generation (RAG) systems, relying on simplistic single-pass pipelines, fall short on complex, multi-hop queries requiring multi-step reasoning and evidence aggregation. To address this gap, we introduce FARSIQA, a novel, end-to-end system for Faithful Advanced Question Answering in the Persian Islamic domain. FARSIQA is built upon our innovative FAIR-RAG architecture: a Faithful, Adaptive, Iterative Refinement framework for RAG. FAIR-RAG employs a dynamic, self-correcting process: it adaptively decomposes complex queries, assesses evidence sufficiency, and enters an iterative loop to generate sub-queries, progressively filling information gaps. Operating on a curated knowledge base of over one million authoritative Islamic documents, FARSIQA demonstrates superior performance. Rigorous evaluation on the challenging IslamicPCQA benchmark shows state-of-the-art performance: the system achieves a remarkable 97.0% in Negative Rejection - a 40-point improvement over baselines - and a high Answer Correctness score of 74.3%. Our work establishes a new standard for Persian Islamic QA and validates that our iterative, adaptive architecture is crucial for building faithful, reliable AI systems in sensitive domains.
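
The iterative loop described in the abstract (decompose the query, retrieve, check evidence sufficiency, issue sub-queries for the gaps) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the FAIR-RAG implementation: the function names, the callback signatures, and the toy knowledge base are all hypothetical.

```python
from typing import Callable, List, Tuple

def iterative_refinement(
    query: str,
    decompose: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    find_gaps: Callable[[str, List[str]], List[str]],
    max_rounds: int = 3,
) -> Tuple[List[str], bool]:
    """Gather evidence round by round; return (evidence, sufficient)."""
    evidence: List[str] = []
    pending = decompose(query)              # initial sub-queries
    for _ in range(max_rounds):
        if not pending:                     # nothing left to look up
            break
        for sub_query in pending:
            for doc in retrieve(sub_query):
                if doc not in evidence:     # keep evidence free of duplicates
                    evidence.append(doc)
        # Sufficiency check: an empty list means the evidence is judged enough.
        pending = find_gaps(query, evidence)
    return evidence, not pending

# Hypothetical toy setup: a two-hop question over a tiny dictionary "index".
KB = {
    "author of book X": ["Book X was written by Y."],
    "birthplace of Y": ["Y was born in Z."],
}

def toy_decompose(query: str) -> List[str]:
    return ["author of book X"]

def toy_retrieve(sub_query: str) -> List[str]:
    return KB.get(sub_query, [])

def toy_find_gaps(query: str, evidence: List[str]) -> List[str]:
    knows_author = any("written by Y" in doc for doc in evidence)
    knows_birthplace = any("born in" in doc for doc in evidence)
    if knows_author and not knows_birthplace:
        return ["birthplace of Y"]          # second hop still missing
    return []

question = "Where was the author of book X born?"
evidence, sufficient = iterative_refinement(
    question, toy_decompose, toy_retrieve, toy_find_gaps
)
```

When the round budget runs out with gaps still open, `sufficient` is `False`. In a system of this kind, that flag is one natural place to decline to answer rather than generate from incomplete evidence.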

Title: Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks

Authors: Davide Romano, Jonathan Schwarz, Daniele Giofré
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25623
Pdf URL: https://arxiv.org/pdf/2510.25623
Copy Paste: [[2510.25623]] Evaluating the Role of Verifiers in Test-Time Scaling for Legal Reasoning Tasks(https://arxiv.org/abs/2510.25623)
Keywords: language model, llm
Abstract: Test-time scaling (TTS) techniques can improve the performance of large language models (LLMs) at the expense of additional computation and latency. While TTS has proven effective in formal domains such as mathematics and programming \citep{snell2024scaling, chen2024more}, its value in argumentative domains such as law remains underexplored. We present an empirical study of verifier-based TTS methods for legal multiple-choice QA (MCQA) across five benchmarks. Using a family of 7 reward models, we evaluate both outcome-level (Best-of-$N$) and process-level (tree search) verification under realistic low-$N$ budgets. Our analysis systematically investigates how verifier utility is affected by key properties such as domain specialization, model size, and supervision type (process-supervised PRMs vs. outcome-only ORMs), even when applied across different roles.
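
For readers unfamiliar with outcome-level verification, Best-of-N selection amounts to sampling N candidate answers and keeping the one the verifier scores highest. The sketch below illustrates only that selection step. `toy_verifier` and its scores are hypothetical stand-ins for a reward model and are not taken from the paper.

```python
from typing import Callable, Sequence

def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest verifier score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)

def toy_verifier(answer: str) -> float:
    # Hypothetical scores; a real outcome reward model would score the
    # full question-plus-answer text instead of a lookup table.
    return {"A": 0.2, "B": 0.7, "C": 0.4}.get(answer, 0.0)

sampled = ["A", "C", "B", "C"]   # N = 4 sampled answers to one multiple-choice question
chosen = best_of_n(sampled, toy_verifier)
```

With a low N, as in the budgets the abstract mentions, the result depends heavily on the verifier ranking the few available candidates correctly.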

Title: Are Language Models Efficient Reasoners? A Perspective from Logic Programming

Authors: Andreas Opedal, Yanick Zengaffinen, Haruki Shirakami, Clemente Pasti, Mrinmaya Sachan, Abulhair Saparov, Ryan Cotterell, Bernhard Schölkopf
Subjects: cs.CL, cs.AI, cs.LG, cs.LO
Abstract URL: https://arxiv.org/abs/2510.25626
Pdf URL: https://arxiv.org/pdf/2510.25626
Copy Paste: [[2510.25626]] Are Language Models Efficient Reasoners? A Perspective from Logic Programming(https://arxiv.org/abs/2510.25626)
Keywords: language model
Abstract: Modern language models (LMs) exhibit strong deductive reasoning capabilities, yet standard evaluations emphasize correctness while overlooking a key aspect of human-like reasoning: efficiency. In real-world reasoning scenarios, much of the available information is irrelevant, and effective deductive inference requires identifying and ignoring such distractions. We propose a framework for assessing LM reasoning efficiency through the lens of logic programming, introducing a simple method to align proofs written in natural language -- as generated by an LM -- with shortest proofs found by executing the logic program. Efficiency is quantified by measuring how well a model avoids unnecessary inference. Empirically, we construct a dataset of math word problems injected with varying numbers of irrelevant axioms that vary in semantic overlap with the goal theorem. We find that current LMs show marked accuracy declines under such conditions -- even with minimal, domain-consistent distractions -- and the proofs they generate frequently exhibit detours through irrelevant inferences.
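
One simple way to make "how well a model avoids unnecessary inference" concrete is to compute the share of a model's inference steps that also appear in a shortest proof. The sketch below assumes the steps have already been aligned to a common symbolic form. The metric and the step strings are illustrative assumptions, not the paper's exact definition.

```python
from typing import Iterable

def inference_efficiency(model_steps: Iterable[str],
                         shortest_proof_steps: Iterable[str]) -> float:
    """Fraction of the model's distinct steps that occur in the shortest proof."""
    model = set(model_steps)
    if not model:
        return 0.0
    needed = set(shortest_proof_steps)
    return len(model & needed) / len(model)

# Hypothetical aligned steps: two are needed, two are detours via irrelevant axioms.
shortest = ["a+b=c", "c*2=d"]
generated = ["a+b=c", "x-y=z", "c*2=d", "z+1=w"]
efficiency = inference_efficiency(generated, shortest)
```

A value of 1.0 means every step the model took was on a shortest proof, and lower values indicate detours.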

Title: EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis

Authors: Yusheng Liao, Chaoyi Wu, Junwei Liu, Shuyang Jiang, Pengcheng Qiu, Haowen Wang, Yun Yue, Shuai Zhen, Jian Wang, Qianrui Fan, Jinjie Gu, Ya Zhang, Yanfeng Wang, Yu Wang, Weidi Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25628
Pdf URL: https://arxiv.org/pdf/2510.25628
Copy Paste: [[2510.25628]] EHR-R1: A Reasoning-Enhanced Foundational Language Model for Electronic Health Record Analysis(https://arxiv.org/abs/2510.25628)
Keywords: language model, gpt, llm
Abstract: Electronic Health Records (EHRs) contain rich yet complex information, and their automated analysis is critical for clinical decision-making. Despite recent advances of large language models (LLMs) in clinical workflows, their ability to analyze EHRs remains limited due to narrow task coverage and lack of EHR-oriented reasoning capabilities. This paper aims to bridge this gap. Specifically, we present EHR-Ins, a large-scale, comprehensive EHR reasoning instruction dataset, comprising 300k high-quality reasoning cases and 4M non-reasoning cases across 42 distinct EHR tasks. Its core innovation is a thinking-graph-driven framework that enables the generation of high-quality reasoning data at scale. Based on it, we develop EHR-R1, a series of reasoning-enhanced LLMs with up to 72B parameters tailored for EHR analysis. Through a multi-stage training paradigm, including domain adaptation, reasoning enhancement, and reinforcement learning, EHR-R1 systematically acquires domain knowledge and diverse reasoning capabilities, enabling accurate and robust EHR analysis. Lastly, we introduce EHR-Bench, a new benchmark curated from MIMIC-IV, spanning 42 tasks, to comprehensively assess reasoning and prediction across EHR scenarios. In experiments, we show that the resulting EHR-R1 consistently outperforms state-of-the-art commercial and open-source LLMs (including DeepSeek-V3 and GPT-4o), surpassing GPT-4o by over 30 points on MIMIC-Bench and achieving a 10% higher zero-shot AUROC on EHRSHOT. Collectively, EHR-Ins, EHR-R1, and EHR-Bench significantly advance the development of more reliable and clinically relevant EHR analysis.

Title: PairUni: Pairwise Training for Unified Multimodal Language Models

Authors: Jiani Zheng, Zhiyang Teng, Xiangtai Li, Anran Wang, Yu Tian, Kunpeng Qiu, Ye Tian, Haochen Wang, Zhuochen Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25682
Pdf URL: https://arxiv.org/pdf/2510.25682
Copy Paste: [[2510.25682]] PairUni: Pairwise Training for Unified Multimodal Language Models(https://arxiv.org/abs/2510.25682)
Keywords: language model, gpt
Abstract: Unified vision-language models (UVLMs) must perform both understanding and generation within a single architecture, but these tasks rely on heterogeneous data and supervision, making it difficult to balance them during reinforcement learning (RL). We propose PairUni, a unified framework that reorganizes data into understanding-generation (UG) pairs and aligns optimization accordingly. We first use GPT-o3 to augment single-task data, generating captions for understanding samples and question-answer (QA) pairs for generation samples, forming aligned pairs from the same instance. Additionally, for each generation sample, we retrieve a semantically related understanding example to form a retrieved pair, linking different but related data points. These paired structures expose cross-task semantic correspondences and support consistent policy learning. To leverage this structure, we present Pair-GPRO, a pair-aware variant based on Group Relative Policy Optimization. It assigns a similarity score to each pair to modulate the advantage, strengthening learning from well-aligned examples and reducing task interference. We curate a high-quality dataset of 16K UG pairs named PairUG for RL fine-tuning and evaluate PairUni on the powerful Janus-Pro UVLMs. Our approach achieves balanced improvements on various UVLMs, outperforming strong UVLM RL baselines. Code: this https URL
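
To illustrate what "a similarity score to each pair to modulate the advantage" could look like, the sketch below computes the usual group-relative advantage (reward minus group mean, divided by group standard deviation) and scales it by a per-pair similarity. Multiplying by the similarity is an assumption made here for illustration. The abstract does not specify the exact form of the modulation, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence

def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def pair_aware_advantages(rewards: Sequence[float], similarity: float,
                          eps: float = 1e-8) -> List[float]:
    """Scale group-relative advantages by the pair's similarity score (assumed in [0, 1])."""
    return [similarity * a for a in group_relative_advantages(rewards, eps)]

group_rewards = [1.0, 0.0, 1.0, 0.0]          # rewards for four sampled responses
well_aligned = pair_aware_advantages(group_rewards, similarity=0.9)
weakly_aligned = pair_aware_advantages(group_rewards, similarity=0.5)
```

Under this assumption, a weakly aligned pair contributes a smaller update than a well-aligned one for the same rewards, which matches the stated goal of strengthening learning from well-aligned examples.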

Title: Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML?

Authors: Saeed AlMarri, Kristof Juhasz, Mathieu Ravaut, Gautier Marti, Hamdan Al Ahbabi, Ibrahim Elfadel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25701
Pdf URL: https://arxiv.org/pdf/2510.25701
Copy Paste: [[2510.25701]] Interpreting LLMs as Credit Risk Classifiers: Do Their Feature Explanations Align with Classical ML?(https://arxiv.org/abs/2510.25701)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are increasingly explored as flexible alternatives to classical machine learning models for classification tasks through zero-shot prompting. However, their suitability for structured tabular data remains underexplored, especially in high-stakes financial applications such as financial risk assessment. This study conducts a systematic comparison between zero-shot LLM-based classifiers and LightGBM, a state-of-the-art gradient-boosting model, on a real-world loan default prediction task. We evaluate their predictive performance, analyze feature attributions using SHAP, and assess the reliability of LLM-generated self-explanations. While LLMs are able to identify key financial risk indicators, their feature importance rankings diverge notably from LightGBM, and their self-explanations often fail to align with empirical SHAP attributions. These findings highlight the limitations of LLMs as standalone models for structured financial risk prediction and raise concerns about the trustworthiness of their self-generated explanations. Our results underscore the need for explainability audits, baseline comparisons with interpretable models, and human-in-the-loop oversight when deploying LLMs in risk-sensitive financial environments.
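
One common way to quantify how far two feature-importance rankings diverge is Spearman's rank correlation. The sketch below is an illustration only: the abstract does not say which agreement measure the study uses, and the feature names and rankings are hypothetical.

```python
from typing import Sequence

def spearman_rho(rank_a: Sequence[str], rank_b: Sequence[str]) -> float:
    """Spearman rank correlation for two tie-free rankings of the same features."""
    n = len(rank_a)
    if n < 2 or set(rank_a) != set(rank_b) or len(set(rank_a)) != n:
        raise ValueError("rankings must order the same n >= 2 distinct features")
    position_b = {feature: i for i, feature in enumerate(rank_b)}
    d_squared = sum((i - position_b[feature]) ** 2 for i, feature in enumerate(rank_a))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical rankings, most important feature first.
shap_rank = ["income", "debt_ratio", "age", "loan_amount"]   # e.g. from SHAP on a tree model
llm_rank = ["age", "income", "loan_amount", "debt_ratio"]    # e.g. from an LLM self-explanation
rho = spearman_rho(shap_rank, llm_rank)
```

A value near 1 indicates that the two rankings agree, a value near 0 indicates no monotonic relationship, and a negative value indicates that they tend to be reversed.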

Title: The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution

Authors: Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, Junxian He
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.25726
Pdf URL: https://arxiv.org/pdf/2510.25726
Copy Paste: [[2510.25726]] The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution(https://arxiv.org/abs/2510.25726)
Keywords: agent
Abstract: Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed as Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we may have revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interacting with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.

Title: The Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework

Authors: Aakriti Shah, Thai Le
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.25732
Pdf URL: https://arxiv.org/pdf/2510.25732
Copy Paste: [[2510.25732]] The Limits of Obliviate: Evaluating Unlearning in LLMs via Stimulus-Knowledge Entanglement-Behavior Framework(https://arxiv.org/abs/2510.25732)
Keywords: language model, llm, hallucination, prompt
Abstract: Unlearning in large language models (LLMs) is crucial for managing sensitive data and correcting misinformation, yet evaluating its effectiveness remains an open problem. We investigate whether persuasive prompting can recall factual knowledge from deliberately unlearned LLMs across models ranging from 2.7B to 13B parameters (OPT-2.7B, LLaMA-2-7B, LLaMA-3.1-8B, LLaMA-2-13B). Drawing from ACT-R and Hebbian theory (spreading activation theories), as well as communication principles, we introduce Stimulus-Knowledge Entanglement-Behavior Framework (SKeB), which models information entanglement via domain graphs and tests whether factual recall in unlearned models is correlated with persuasive framing. We develop entanglement metrics to quantify knowledge activation patterns and evaluate factuality, non-factuality, and hallucination in outputs. Our results show persuasive prompts substantially enhance factual knowledge recall (14.8% baseline vs. 24.5% with authority framing), with effectiveness inversely correlated to model size (128% recovery in 2.7B vs. 15% in 13B). SKeB provides a foundation for assessing unlearning completeness, robustness, and overall behavior in LLMs.

Title: Scaling Latent Reasoning via Looped Language Models

Authors: Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang, Yoshua Bengio, Jason Eshraghian
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25741
Pdf URL: https://arxiv.org/pdf/2510.25741
Copy Paste: [[2510.25741]] Scaling Latent Reasoning via Looped Language Models(https://arxiv.org/abs/2510.25741)
Keywords: language model, llm, chain-of-thought
Abstract: Modern LLMs are trained to "think" primarily via explicit text generation, such as chain-of-thought (CoT), which defers reasoning to post-training and under-leverages pre-training data. We present and open-source Ouro, named after the recursive Ouroboros, a family of pre-trained Looped Language Models (LoopLM) that instead build reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens. Ouro 1.4B and 2.6B models enjoy superior performance that match the results of up to 12B SOTA LLMs across a wide range of benchmarks. Through controlled experiments, we show this advantage stems not from increased knowledge capacity, but from superior knowledge manipulation capabilities. We also show that LoopLM yields reasoning traces more aligned with final outputs than explicit CoT. We hope our results show the potential of LoopLM as a novel scaling direction in the reasoning era. Our model could be found in: this http URL.

Title: Task Completion Agents are Not Ideal Collaborators

Authors: Shannon Zejiang Shen, Valerie Chen, Ken Gu, Alexis Ross, Zixian Ma, Jillian Ross, Alex Gu, Chenglei Si, Wayne Chi, Andi Peng, Jocelyn J Shen, Ameet Talwalkar, Tongshuang Wu, David Sontag
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.25744
Pdf URL: https://arxiv.org/pdf/2510.25744
Copy Paste: [[2510.25744]] Task Completion Agents are Not Ideal Collaborators(https://arxiv.org/abs/2510.25744)
Keywords: agent
Abstract: Current evaluations of agents remain centered around one-shot task completion, failing to account for the inherently iterative and collaborative nature of many real-world problems, where human goals are often underspecified and evolve. We argue for a shift from building and assessing task completion agents to developing collaborative agents, assessed not only by the quality of their final outputs but by how well they engage with and enhance human effort throughout the problem-solving process. To support this shift, we introduce collaborative effort scaling, a framework that captures how an agent's utility grows with increasing user involvement. Through case studies and simulated evaluations, we show that state-of-the-art agents often underperform in multi-turn, real-world scenarios, revealing a missing ingredient in agent design: the ability to sustain engagement and scaffold user understanding. Collaborative effort scaling offers a lens for diagnosing agent behavior and guiding development toward more effective interactions.

Title: DiagramEval: Evaluating LLM-Generated Diagrams via Graphs

Authors: Chumeng Liang, Jiaxuan You
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25761
Pdf URL: https://arxiv.org/pdf/2510.25761
Copy Paste: [[2510.25761]] DiagramEval: Evaluating LLM-Generated Diagrams via Graphs(https://arxiv.org/abs/2510.25761)
Keywords: language model, llm
Abstract: Diagrams play a central role in research papers for conveying ideas, yet they are often notoriously complex and labor-intensive to create. Although diagrams are presented as images, standard image generative models struggle to produce clear diagrams with well-defined structure. We argue that a promising direction is to generate demonstration diagrams directly in textual form as SVGs, which can leverage recent advances in large language models (LLMs). However, due to the complexity of components and the multimodal nature of diagrams, sufficiently discriminative and explainable metrics for evaluating the quality of LLM-generated diagrams remain lacking. In this paper, we propose DiagramEval, a novel evaluation metric designed to assess demonstration diagrams generated by LLMs. Specifically, DiagramEval conceptualizes diagrams as graphs, treating text elements as nodes and their connections as directed edges, and evaluates diagram quality using two new groups of metrics: node alignment and path alignment. For the first time, we effectively evaluate diagrams produced by state-of-the-art LLMs on recent research literature, quantitatively demonstrating the validity of our metrics. Furthermore, we show how the enhanced explainability of our proposed metrics offers valuable insights into the characteristics of LLM-generated diagrams. Code: this https URL.
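
The graph view described in the abstract (text elements as nodes, connections as directed edges) can be sketched as below. Two simplifications are assumed here: nodes are matched by exact text, and "path alignment" is approximated by comparing the sets of node pairs connected by a directed path. The paper's actual matching and scoring may differ, and the example diagrams are hypothetical.

```python
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str]

def f1(predicted: Set, reference: Set) -> float:
    """Harmonic mean of precision and recall between two sets."""
    overlap = len(predicted & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)

def nodes_of(edges: Iterable[Edge]) -> Set[str]:
    return {node for edge in edges for node in edge}

def reachable_pairs(edges: Iterable[Edge]) -> Set[Edge]:
    """All ordered (source, target) pairs connected by a directed path."""
    adjacency: Dict[str, Set[str]] = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
    pairs: Set[Edge] = set()
    for start in adjacency:
        stack, seen = list(adjacency[start]), set()
        while stack:                      # depth-first walk from each start node
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            pairs.add((start, node))
            stack.extend(adjacency.get(node, ()))
    return pairs

reference = [("Input", "Encoder"), ("Encoder", "Decoder"), ("Decoder", "Output")]
generated = [("Input", "Encoder"), ("Encoder", "Output")]   # the Decoder box is missing

node_alignment = f1(nodes_of(generated), nodes_of(reference))
path_alignment = f1(reachable_pairs(generated), reachable_pairs(reference))
```

In this toy case the generated diagram keeps three of the four reference nodes but only half of the reference's connected pairs, so the path score drops further than the node score.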

Title: Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models

Authors: Sriram Balasubramaniam, Samyadeep Basu, Koustava Goswami, Ryan Rossi, Varun Manjunatha, Roshan Santhosh, Ruiyi Zhang, Soheil Feizi, Nedim Lipka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.25766
Pdf URL: https://arxiv.org/pdf/2510.25766
Copy Paste: [[2510.25766]] Decomposition-Enhanced Training for Post-Hoc Attributions In Language Models(https://arxiv.org/abs/2510.25766)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly used for long-document question answering, where reliable attribution to sources is critical for trust. Existing post-hoc attribution methods work well for extractive QA but struggle in multi-hop, abstractive, and semi-extractive settings, where answers synthesize information across passages. To address these challenges, we argue that post-hoc attribution can be reframed as a reasoning problem, where answers are decomposed into constituent units, each tied to specific context. We first show that prompting models to generate such decompositions alongside attributions improves performance. Building on this, we introduce DecompTune, a post-training method that teaches models to produce answer decompositions as intermediate reasoning steps. We curate a diverse dataset of complex QA tasks, annotated with decompositions by a strong LLM, and post-train Qwen-2.5 (7B and 14B) using a two-stage SFT + GRPO pipeline with task-specific curated rewards. Across extensive experiments and ablations, DecompTune substantially improves attribution quality, outperforming prior methods and matching or exceeding state-of-the-art frontier models.

Title: Gaperon: A Peppered English-French Generative Language Model Suite

Authors: Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, Éric de la Clergerie, Benoît Sagot, Djamé Seddah
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.25771
Pdf URL: https://arxiv.org/pdf/2510.25771
Copy Paste: [[2510.25771]] Gaperon: A Peppered English-French Generative Language Model Suite(https://arxiv.org/abs/2510.25771)
Keywords: language model
Abstract: We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination -- continuing training on data mixes that include test sets -- recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.