2025-10-29

Title: Evaluating Long-Term Memory for Long-Context Question Answering

Authors: Alessandra Terranova, Björn Ross, Alexandra Birch
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23730
Pdf URL: https://arxiv.org/pdf/2510.23730
Copy Paste: [[2510.23730]] Evaluating Long-Term Memory for Long-Context Question Answering(https://arxiv.org/abs/2510.23730)
Keywords: language model, llm, prompt, retrieval-augmented generation, agent
Abstract: In order for large language models to achieve true conversational continuity and benefit from experiential learning, they need memory. While research has focused on the development of complex memory systems, it remains unclear which types of memory are most effective for long-context conversational tasks. We present a systematic evaluation of memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks that require diverse reasoning strategies. We analyse full-context prompting, semantic memory through retrieval-augmented generation and agentic memory, episodic memory through in-context learning, and procedural memory through prompt optimization. Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability, with small foundation models benefitting most from RAG, and strong instruction-tuned reasoning models gaining from episodic learning through reflections and more complex agentic semantic memory. In particular, episodic memory can help LLMs recognise the limits of their own knowledge.
摘要：为了让大型语言模型实现真正的对话连续性并从体验式学习中受益，它们需要记忆。虽然研究重点是复杂记忆系统的开发，但仍不清楚哪种类型的记忆对于长上下文会话任务最有效。我们使用 LoCoMo 对记忆增强方法进行了系统评估，LoCoMo 是针对需要不同推理策略的问答任务进行注释的合成长上下文对话的基准。我们分析全情境提示、通过检索增强生成和主体记忆的语义记忆、通过情境学习的情景记忆以及通过提示优化的程序记忆。我们的研究结果表明，内存增强方法可将令牌使用量减少 90% 以上，同时保持有竞争力的准确性。内存架构的复杂性应随模型能力而扩展，小型基础模型从 RAG 中获益最多，而强大的指令调整推理模型则通过反射和更复杂的代理语义记忆从情景学习中获益。特别是，情景记忆可以帮助法学硕士认识到自己知识的局限性。

Title: BitSkip: An Empirical Analysis of Quantization and Early Exit Composition

Authors: Ramshankar Bhuvaneswaran, Handan Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23766
Pdf URL: https://arxiv.org/pdf/2510.23766
Copy Paste: [[2510.23766]] BitSkip: An Empirical Analysis of Quantization and Early Exit Composition(https://arxiv.org/abs/2510.23766)
Keywords: language model, llm
Abstract: The pursuit of efficient Large Language Models (LLMs) has led to increasingly complex techniques like extreme quantization and dynamic routing. While individual benefits of these methods are well-documented, their compositional effects remain poorly understood. This paper introduces BitSkip, a hybrid architectural framework for systematically exploring these interactions. Counter-intuitively, our findings reveal that a simple 8-bit quantized model without Hadamard transform (BitSkip-V1) not only outperforms its more complex 4-bit and Hadamard-enhanced counterparts but also competes with the full-precision baseline in quality (perplexity of 1.13 vs 1.19). The introduction of Hadamard transforms, even at 8-bit precision, catastrophically degraded performance by over 37,000%, tracing fundamental training instability. Our BitSkip-V1 recipe demonstrates superior early-exit characteristics, with layer 18 providing optimal 32.5% speed gain for minimal 4% quality loss.
摘要：对高效大型语言模型 (LLM) 的追求导致了日益复杂的技术，例如极端量化和动态路由。虽然这些方法的各自优点已得到充分证明，但它们的组合效应仍然知之甚少。本文介绍了 BitSkip，一个用于系统地探索这些交互的混合架构框架。与直觉相反，我们的研究结果表明，不带 Hadamard 变换的简单 8 位量化模型 (BitSkip-V1) 不仅优于更复杂的 4 位和 Hadamard 增强模型，而且在质量上也能与全精度基线竞争（困惑度为 1.13 与 1.19）。 Hadamard 变换的引入，即使在 8 位精度下，性能也会灾难性地降低 37,000% 以上，导致基本的训练不稳定。我们的 BitSkip-V1 配方展示了卓越的早期退出特性，第 18 层提供了 32.5% 的最佳速度增益，同时质量损失最小化 4%。

Title: Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language

Authors: Mena Attia, Aashiq Muhamed, Mai Alkhamissi, Thamar Solorio, Mona Diab
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23828
Pdf URL: https://arxiv.org/pdf/2510.23828
Copy Paste: [[2510.23828]] Beyond Understanding: Evaluating the Pragmatic Gap in LLMs' Cultural Processing of Figurative Language(https://arxiv.org/abs/2510.23828)
Keywords: language model, llm
Abstract: We present a comprehensive evaluation of the ability of large language models (LLMs) to process culturally grounded language, specifically to understand and pragmatically use figurative expressions that encode local knowledge and cultural nuance. Using figurative language as a proxy for cultural nuance and local knowledge, we design evaluation tasks for contextual understanding, pragmatic use, and connotation interpretation in Arabic and English. We evaluate 22 open- and closed-source LLMs on Egyptian Arabic idioms, multidialectal Arabic proverbs, and English proverbs. Our results show a consistent hierarchy: the average accuracy for Arabic proverbs is 4.29% lower than for English proverbs, and performance for Egyptian idioms is 10.28% lower than for Arabic proverbs. For the pragmatic use task, accuracy drops by 14.07% relative to understanding, though providing contextual idiomatic sentences improves accuracy by 10.66%. Models also struggle with connotative meaning, reaching at most 85.58% agreement with human annotators on idioms with 100% inter-annotator agreement. These findings demonstrate that figurative language serves as an effective diagnostic for cultural reasoning: while LLMs can often interpret figurative meaning, they face challenges in using it appropriately. To support future research, we release Kinayat, the first dataset of Egyptian Arabic idioms designed for both figurative understanding and pragmatic use evaluation.
摘要：我们对大型语言模型（LLM）处理基于文化的语言的能力进行了全面评估，特别是理解和务实地使用编码当地知识和文化细微差别的比喻表达。使用比喻语言作为文化细微差别和当地知识的代表，我们设计了阿拉伯语和英语的语境理解、语用使用和内涵解释的评估任务。我们评估了 22 个关于埃及阿拉伯语习语、多方言阿拉伯谚语和英语谚语的开源和闭源法学硕士。我们的结果显示出一致的层次结构：阿拉伯谚语的平均准确度比英语谚语低 4.29%，埃及习语的表现比阿拉伯谚语低 10.28%。对于语用使用任务，相对于理解，准确率下降了 14.07%，尽管提供上下文惯用句子将准确率提高了 10.66%。模型还与内涵意义作斗争，与人类注释者在习语上的一致性最多为 85.58%，注释者间一致性为 100%。这些发现表明，比喻语言可以作为文化推理的有效诊断：虽然法学硕士通常可以解释比喻意义，但他们在正确使用它方面面临挑战。为了支持未来的研究，我们发布了 Kinayat，这是第一个埃及阿拉伯语习语数据集，专为比喻理解和语用使用评估而设计。

Title: How Pragmatics Shape Articulation: A Computational Case Study in STEM ASL Discourse

Authors: Saki Imai, Lee Kezar, Laurel Aichler, Mert Inan, Erin Walker, Alicia Wooten, Lorna Quandt, Malihe Alikhani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23842
Pdf URL: https://arxiv.org/pdf/2510.23842
Copy Paste: [[2510.23842]] How Pragmatics Shape Articulation: A Computational Case Study in STEM ASL Discourse(https://arxiv.org/abs/2510.23842)
Keywords: language model
Abstract: Most state-of-the-art sign language models are trained on interpreter or isolated vocabulary data, which overlooks the variability that characterizes natural dialogue. However, human communication dynamically adapts to contexts and interlocutors through spatiotemporal changes and articulation style. This specifically manifests itself in educational settings, where novel vocabularies are used by teachers and students. To address this gap, we collect a motion capture dataset of American Sign Language (ASL) STEM (Science, Technology, Engineering, and Mathematics) dialogue that enables quantitative comparison between dyadic interactive signing, solo signed lecture, and interpreted articles. Using continuous kinematic features, we disentangle dialogue-specific entrainment from individual effort reduction and show spatiotemporal changes across repeated mentions of STEM terms. On average, dialogue signs are 24.6%-44.6% shorter in duration than the isolated signs, and show significant reductions absent in monologue contexts. Finally, we evaluate sign embedding models on their ability to recognize STEM signs and approximate how entrained the participants become over time. Our study bridges linguistic analysis and computational modeling to understand how pragmatics shape sign articulation and its representation in sign language technologies.
摘要：大多数最先进的手语模型都是根据口译员或孤立的词汇数据进行训练的，这忽略了自然对话特征的可变性。然而，人类交流通过时空变化和发音风格动态适应环境和对话者。这尤其体现在教育环境中，教师和学生使用新的词汇。为了解决这一差距，我们收集了美国手语 (ASL) STEM（科学、技术、工程和数学）对话的动作捕捉数据集，该数据集可以对二元互动手语、单独签名讲座和口译文章之间进行定量比较。利用连续的运动学特征，我们将特定于对话的夹带与个人努力减少分开，并显示重复提及 STEM 术语时的时空变化。平均而言，对话符号的持续时间比孤立符号短 24.6%-44.6%，并且在独白上下文中表现出明显的减少。最后，我们评估符号嵌入模型识别 STEM 符号的能力，并估算参与者随着时间的推移的接受程度。我们的研究将语言分析和计算建模联系起来，以了解语用学如何塑造符号发音及其在手语技术中的表示。

Title: CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection

Authors: Grace Byun, Rebecca Lipschutz, Sean T. Minton, Abigail Lott, Jinho D. Choi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23845
Pdf URL: https://arxiv.org/pdf/2510.23845
Copy Paste: [[2510.23845]] CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection(https://arxiv.org/abs/2510.23845)
Keywords: language model
Abstract: Detecting mental health crisis situations such as suicide ideation, rape, domestic violence, child abuse, and sexual harassment is a critical yet underexplored challenge for language models. When such situations arise during user-model interactions, models must reliably flag them, as failure to do so can have serious consequences. In this work, we introduce CRADLE BENCH, a benchmark for multi-faceted crisis detection. Unlike previous efforts that focus on a limited set of crisis types, our benchmark covers seven types defined in line with clinical standards and is the first to incorporate temporal labels. Our benchmark provides 600 clinician-annotated evaluation examples and 420 development examples, together with a training corpus of around 4K examples automatically labeled using a majority-vote ensemble of multiple language models, which significantly outperforms single-model annotation. We further fine-tune six crisis detection models on subsets defined by consensus and unanimous ensemble agreement, providing complementary models trained under different agreement criteria.
摘要：检测自杀意念、强奸、家庭暴力、虐待儿童和性骚扰等心理健康危机情况对于语言模型来说是一个关键但尚未充分探索的挑战。当用户与模型交互期间出现此类情况时，模型必须可靠地标记它们，因为如果不这样做可能会产生严重后果。在这项工作中，我们引入了 CRADLE BENCH，这是多方面危机检测的基准。与之前专注于有限的危机类型的工作不同，我们的基准涵盖了根据临床标准定义的七种类型，并且是第一个纳入时间标签的基准。我们的基准测试提供了 600 个临床医生注释的评估示例和 420 个开发示例，以及约 4K 个示例的训练语料库，这些示例使用多语言模型的多数票集成自动标记，其性能显着优于单模型注释。我们进一步对由共识和一致整体协议定义的子集上的六个危机检测模型进行微调，提供在不同协议标准下训练的补充模型。

Title: Temporal Blindness in Multi-Turn LLM Agents: Misaligned Tool Use vs. Human Time Perception

Authors: Yize Cheng, Arshia Soltani Moakhar, Chenrui Fan, Kazem Faghih, Parsa Hosseini, Wenxiao Wang, Soheil Feizi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23853
Pdf URL: https://arxiv.org/pdf/2510.23853
Copy Paste: [[2510.23853]] Temporal Blindness in Multi-Turn LLM Agents: Misaligned Tool Use vs. Human Time Perception(https://arxiv.org/abs/2510.23853)
Keywords: language model, llm, prompt, agent
Abstract: Large language model agents are increasingly used in multi-turn conversational settings to interact with and execute tasks in dynamic environments. However, a key limitation is their temporal blindness: they, by default, operate with a stationary context, failing to account for the real-world time elapsed between messages. This becomes a critical liability when an agent must decide whether to invoke a tool based on how much time has passed since the last observation. Without temporal awareness, agents often either over-rely on previous context (skipping necessary tool calls), or under-rely on it (unnecessarily repeating tool calls). To study this challenge, we introduce TicToc-v1, a test set of multi-turn user-agent trajectories across 34 scenarios with varying time sensitivity. Each trajectory ends with a user question, where the need for a tool call depends on the amount of time elapsed since the last message. To give LLMs temporal context, we augment dialogue messages with explicit timestamps, bridging the gap between static dialogue and evolving environments. We then collected human preferences for these samples, creating two subsets: one where humans preferred relying on the previous observation (prefer-noTool), and another where they preferred a new tool call (prefer-Tool). We evaluated how well LLM tool-calling decisions align with human preferences under varying time intervals on TicToc-v1. Our analysis shows that without time information, most models perform only slightly better than random, with the top alignment rate being just over 60%. While adding timestamps leads to a slight improvement, particularly for larger models, the improvement is modest, peaking at around 65%. We also show that naive, prompt-based alignment has limited effectiveness. Our findings highlight the need for specific post-training alignment to align multi-turn LLM tool use with human temporal perception.
摘要：大型语言模型代理越来越多地用于多轮会话设置中，以在动态环境中交互并执行任务。然而，一个关键的限制是它们的时间盲目性：默认情况下，它们在固定的上下文中运行，无法考虑消息之间流逝的现实时间。当代理必须根据自上次观察以来经过的时间来决定是否调用工具时，这成为一个关键的责任。如果没有时间意识，代理通常要么过度依赖先前的上下文（跳过必要的工具调用），要么依赖不足（不必要地重复工具调用）。为了研究这一挑战，我们引入了 TicToc-v1，这是一组跨 34 个具有不同时间敏感性的场景的多轮用户代理轨迹测试集。每个轨迹都以用户问题结束，其中是否需要调用工具取决于自上次消息以来经过的时间量。为了给法学硕士提供时间背景，我们用明确的时间戳来增强对话消息，弥合静态对话和不断变化的环境之间的差距。然后，我们收集了人类对这些样本的偏好，创建了两个子集：一个子集是人类更喜欢依赖之前的观察结果 (prefer-noTool)，另一个子集是他们更喜欢调用新工具 (prefer-Tool)。我们评估了 TicToc-v1 上不同时间间隔下 LLM 工具调用决策与人类偏好的一致性程度。我们的分析表明，在没有时间信息的情况下，大多数模型的表现仅比随机模型稍好，最高对齐率仅略高于 60%。虽然添加时间戳会带来轻微的改进，特别是对于较大的模型，但改进幅度不大，峰值约为 65%。我们还表明，天真的、基于提示的调整效果有限。我们的研究结果强调需要进行特定的训练后调整，以使多轮法学硕士工具的使用与人类时间感知保持一致。
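
Note: the timestamp augmentation described in the abstract above could look roughly like the following minimal sketch. It assumes each message is a dict with `role`, `content`, and a Unix-epoch `ts` field; the function name `add_timestamps` and the bracketed prefix format are hypothetical and are not taken from the paper.

```python
from datetime import datetime, timezone


def add_timestamps(messages):
    """Return copies of the messages with an explicit UTC timestamp
    prefixed to each message's content (hypothetical format)."""
    stamped = []
    for m in messages:
        # Convert the assumed Unix-epoch field into a readable UTC string.
        stamp = datetime.fromtimestamp(m["ts"], tz=timezone.utc).strftime(
            "%Y-%m-%d %H:%M:%S UTC"
        )
        stamped.append({"role": m["role"], "content": f"[{stamp}] {m['content']}"})
    return stamped
```

With timestamps in the text, the model can in principle compare the time of the last tool observation with the time of the new user question before deciding whether to call the tool again.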

Title: Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs

Authors: Jyotika Singh, Weiyi Sun, Amit Agarwal, Viji Krishnamurthy, Yassine Benajiba, Sujith Ravi, Dan Roth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23854
Pdf URL: https://arxiv.org/pdf/2510.23854
Copy Paste: [[2510.23854]] Can LLMs Narrate Tabular Data? An Evaluation Framework for Natural Language Representations of Text-to-SQL System Outputs(https://arxiv.org/abs/2510.23854)
Keywords: language model, llm, chat, agent
Abstract: In modern industry systems like multi-turn chat agents, Text-to-SQL technology bridges natural language (NL) questions and database (DB) querying. The conversion of tabular DB results into NL representations (NLRs) enables the chat-based interaction. Currently, NLR generation is typically handled by large language models (LLMs), but information loss or errors in presenting tabular results in NL remains largely unexplored. This paper introduces a novel evaluation method - Combo-Eval - for judgment of LLM-generated NLRs that combines the benefits of multiple existing methods, optimizing evaluation fidelity and achieving a significant reduction in LLM calls by 25-61%. Accompanying our method is NLR-BIRD, the first dedicated dataset for NLR benchmarking. Through human evaluations, we demonstrate the superior alignment of Combo-Eval with human judgments, applicable across scenarios with and without ground truth references.
摘要：在多轮聊天代理等现代工业系统中，文本到 SQL 技术连接了自然语言 (NL) 问题和数据库 (DB) 查询。将表格 DB 结果转换为 NL 表示 (NLR) 可以实现基于聊天的交互。目前，NLR 生成通常由大型语言模型 (LLM) 处理，但在 NL 中呈现表格结果时的信息丢失或错误仍然很大程度上未被探索。本文介绍了一种新颖的评估方法 - Combo-Eval - 用于判断 LLM 生成的 NLR，该方法结合了多种现有方法的优点，优化了评估保真度，并将 LLM 调用显着减少了 25-61%。与我们的方法配套的是 NLR-BIRD，这是第一个用于 NLR 基准测试的专用数据集。通过人类评估，我们证明了 Combo-Eval 与人类判断的卓越一致性，适用于有或没有真实参考的场景。

Title: OraPlan-SQL: A Planning-Centric Framework for Complex Bilingual NL2SQL Reasoning

Authors: Marianne Menglin Liu, Sai Ashish Somayajula, Syed Fahad Allam Shah, Sujith Ravi, Dan Roth
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23870
Pdf URL: https://arxiv.org/pdf/2510.23870
Copy Paste: [[2510.23870]] OraPlan-SQL: A Planning-Centric Framework for Complex Bilingual NL2SQL Reasoning(https://arxiv.org/abs/2510.23870)
Keywords: llm, prompt, agent
Abstract: We present OraPlan-SQL, our system for the Archer NL2SQL Evaluation Challenge 2025, a bilingual benchmark requiring complex reasoning such as arithmetic, commonsense, and hypothetical inference. OraPlan-SQL ranked first, exceeding the second-best system by more than 6% in execution accuracy (EX), with 55.0% in English and 56.7% in Chinese, while maintaining over 99% SQL validity (VA). Our system follows an agentic framework with two components: Planner agent that generates stepwise natural language plans, and SQL agent that converts these plans into executable SQL. Since SQL agent reliably adheres to the plan, our refinements focus on the planner. Unlike prior methods that rely on multiple sub-agents for planning and suffer from orchestration overhead, we introduce a feedback-guided meta-prompting strategy to refine a single planner. Failure cases from a held-out set are clustered with human input, and an LLM distills them into corrective guidelines that are integrated into the planner's system prompt, improving generalization without added complexity. For the multilingual scenario, to address transliteration and entity mismatch issues, we incorporate entity-linking guidelines that generate alternative surface forms for entities and explicitly include them in the plan. Finally, we enhance reliability through plan diversification: multiple candidate plans are generated for each query, with the SQL agent producing a query for each plan, and final output selected via majority voting over their executions.
摘要：我们展示了 OraPlan-SQL，这是我们为 2025 年 Archer NL2SQL 评估挑战赛设计的系统，这是一个需要复杂推理（例如算术、常识和假设推理）的双语基准测试。 OraPlan-SQL排名第一，执行准确率（EX）超过第二名系统6%以上，其中英文为55.0%，中文为56.7%，同时保持了99%以上的SQL有效性（VA）。我们的系统遵循由两个组件组成的代理框架：生成逐步自然语言计划的规划器代理，以及将这些计划转换为可执行 SQL 的 SQL 代理。由于 SQL 代理可靠地遵守计划，因此我们的改进集中在计划器上。与依赖多个子代理进行规划并遭受编排开销的现有方法不同，我们引入了反馈引导的元提示策略来完善单个规划器。保留的失败案例与人工输入聚集在一起，法学硕士将它们提炼成纠正指南，并集成到规划者的系统提示中，从而在不增加复杂性的情况下提高概括性。对于多语言场景，为了解决音译和实体不匹配问题，我们结合了实体链接准则，为实体生成替代的表面形式并将其明确包含在计划中。最后，我们通过计划多样化来增强可靠性：为每个查询生成多个候选计划，SQL 代理为每个计划生成一个查询，并通过对其执行的多数投票来选择最终输出。
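
Note: the final selection step in the abstract above (majority voting over the executions of several candidate queries) might be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the function name `select_by_majority`, the `(sql, rows)` input shape, the use of `None` for failed executions, and the order-insensitive comparison of result rows are all hypothetical choices.

```python
from collections import Counter


def select_by_majority(candidates):
    """Pick the SQL whose execution result is the most common.

    candidates: list of (sql, rows) pairs, where rows is a list of row
    tuples, or None if the query failed to execute (assumed convention).
    Returns the first SQL string producing the majority result, or None
    if no candidate executed successfully.
    """
    def result_key(rows):
        # Compare results as order-insensitive, hashable values.
        return tuple(sorted((tuple(r) for r in rows), key=repr))

    valid = [(sql, result_key(rows)) for sql, rows in candidates if rows is not None]
    if not valid:
        return None
    counts = Counter(key for _, key in valid)
    winning_key, _ = counts.most_common(1)[0]
    return next(sql for sql, key in valid if key == winning_key)
```

Voting on execution results rather than on SQL text means two differently written queries that return the same rows count as agreeing.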

Title: Language Models for Longitudinal Clinical Prediction

Authors: Tananun Songdechakraiwut, Michael Lutz
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23884
Pdf URL: https://arxiv.org/pdf/2510.23884
Copy Paste: [[2510.23884]] Language Models for Longitudinal Clinical Prediction(https://arxiv.org/abs/2510.23884)
Keywords: language model
Abstract: We explore a lightweight framework that adapts frozen large language models to analyze longitudinal clinical data. The approach integrates patient history and context within the language model space to generate accurate forecasts without model fine-tuning. Applied to neuropsychological assessments, it achieves accurate and reliable performance even with minimal training data, showing promise for early-stage Alzheimer's monitoring.
摘要：我们探索了一个轻量级框架，该框架采用冻结的大型语言模型来分析纵向临床数据。该方法将患者病史和上下文整合到语言模型空间中，无需模型微调即可生成准确的预测。应用于神经心理学评估时，即使训练数据极少，它也能实现准确可靠的性能，这为早期阿尔茨海默病监测带来了希望。

Title: AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages

Authors: Kosei Uemura, Miaoran Zhang, David Ifeoluwa Adelani
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23896
Pdf URL: https://arxiv.org/pdf/2510.23896
Copy Paste: [[2510.23896]] AfriMTEB and AfriE5: Benchmarking and Adapting Text Embedding Models for African Languages(https://arxiv.org/abs/2510.23896)
Keywords: llm, hallucination, retrieval-augmented generation
Abstract: Text embeddings are an essential building component of several NLP tasks such as retrieval-augmented generation which is crucial for preventing hallucinations in LLMs. Despite the recent release of massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB -- a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.
摘要：文本嵌入是多个 NLP 任务的重要组成部分，例如检索增强生成，这对于防止法学硕士产生幻觉至关重要。尽管最近发布了大规模多语言 MTEB (MMTEB)，但非洲语言的代表性仍然不足，现有任务经常从 FLORES 聚类或 SIB-200 等翻译基准中重新调整用途。在本文中，我们介绍了 AfriMTEB——MMTEB 的区域扩展，涵盖 59 种语言、14 个任务和 38 个数据集，其中包括 6 个新添加的数据集。与许多包含少于 5 种语言的 MMTEB 数据集不同，新增内容涵盖 14 至 56 种非洲语言，并引入了全新的任务，例如仇恨言论检测、意图检测和情感分类，这些都是以前未涵盖的。作为补充，我们推出了 AfriE5，这是通过跨语言对比蒸馏将指令调整的 mE5 模型适应非洲语言的版本。我们的评估表明，AfriE5 实现了最先进的性能，优于 Gemini-Embeddings 和 mE5 等强大的基线。

Title: Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation

Authors: Kaveh Eskandari Miandoab, Mahammed Kamruzzaman, Arshia Gharooni, Gene Louis Kim, Vasanth Sarathy, Ninareh Mehrabi
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.23921
Pdf URL: https://arxiv.org/pdf/2510.23921
Copy Paste: [[2510.23921]] Breaking the Benchmark: Revealing LLM Bias via Minimal Contextual Augmentation(https://arxiv.org/abs/2510.23921)
Keywords: language model, llm
Abstract: Large Language Models have been shown to demonstrate stereotypical biases in their representations and behavior due to the discriminative nature of the data that they have been trained on. Despite significant progress in the development of methods and models that refrain from using stereotypical information in their decision-making, recent work has shown that approaches used for bias alignment are brittle. In this work, we introduce a novel and general augmentation framework that involves three plug-and-play steps and is applicable to a number of fairness evaluation benchmarks. Through application of augmentation to a fairness evaluation dataset (Bias Benchmark for Question Answering (BBQ)), we find that Large Language Models (LLMs), including state-of-the-art open and closed weight models, are susceptible to perturbations to their inputs, showcasing a higher likelihood to behave stereotypically. Furthermore, we find that such models are more likely to have biased behavior in cases where the target demographic belongs to a community less studied by the literature, underlining the need to expand the fairness and safety research to include more diverse communities.
摘要：大型语言模型已被证明在其表示和行为中表现出刻板的偏见，因为它们所训练的数据具有歧视性。尽管在决策过程中避免使用陈规定型信息的方法和模型的开发取得了重大进展，但最近的研究表明，用于偏差调整的方法是脆弱的。在这项工作中，我们介绍了一种新颖且通用的增强框架，该框架涉及三个即插即用步骤，适用于许多公平性评估基准。通过将增强应用于公平性评估数据集（问答偏差基准（BBQ）），我们发现大型语言模型（LLM），包括最先进的开放和封闭权重模型，很容易受到输入扰动，表现出更高的刻板行为可能性。此外，我们发现，当目标人群属于文献研究较少的社区时，此类模型更有可能出现偏见行为，这强调了需要扩大公平性和安全性研究以包括更多样化的社区。

Title: Agent-based Automated Claim Matching with Instruction-following LLMs

Authors: Dina Pisarevskaya, Arkaitz Zubiaga
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23924
Pdf URL: https://arxiv.org/pdf/2510.23924
Copy Paste: [[2510.23924]] Agent-based Automated Claim Matching with Instruction-following LLMs(https://arxiv.org/abs/2510.23924)
Keywords: llm, prompt, agent
Abstract: We present a novel agent-based approach for the automated claim matching task with instruction-following LLMs. We propose a two-step pipeline that first generates prompts with LLMs, to then perform claim matching as a binary classification task with LLMs. We demonstrate that LLM-generated prompts can outperform SOTA with human-generated prompts, and that smaller LLMs can do as well as larger ones in the generation process, allowing to save computational resources. We also demonstrate the effectiveness of using different LLMs for each step of the pipeline, i.e. using an LLM for prompt generation, and another for claim matching. Our investigation into the prompt generation process in turn reveals insights into the LLMs' understanding of claim matching.
摘要：我们提出了一种基于代理的新颖方法，用于通过遵循指令的法学硕士来执行自动索赔匹配任务。我们提出了一个两步管道，首先使用法学硕士生成提示，然后使用法学硕士将索赔匹配作为二元分类任务执行。我们证明了 LLM 生成的提示可以优于人工生成提示的 SOTA，并且较小的 LLM 在生成过程中可以和较大的 LLM 一样执行，从而节省计算资源。我们还展示了在流程的每个步骤中使用不同 LLM 的有效性，即使用 LLM 进行提示生成，使用另一个 LLM 进行索赔匹配。我们对提示生成过程的调查反过来揭示了法学硕士对权利要求匹配的理解的见解。

Title: Auto prompting without training labels: An LLM cascade for product quality assessment in e-commerce catalogs

Authors: Soham Satyadharma, Fatemeh Sheikholeslami, Swati Kaul, Aziz Umit Batur, Suleiman A. Khan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23941
Pdf URL: https://arxiv.org/pdf/2510.23941
Copy Paste: [[2510.23941]] Auto prompting without training labels: An LLM cascade for product quality assessment in e-commerce catalogs(https://arxiv.org/abs/2510.23941)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: We introduce a novel, training-free cascade for auto-prompting Large Language Models (LLMs) to assess product quality in e-commerce. Our system requires no training labels or model fine-tuning, instead automatically generating and refining prompts for evaluating attribute quality across tens of thousands of product category-attribute pairs. Starting from a seed of human-crafted prompts, the cascade progressively optimizes instructions to meet catalog-specific requirements. This approach bridges the gap between general language understanding and domain-specific knowledge at scale in complex industrial catalogs. Our extensive empirical evaluations show the auto-prompt cascade improves precision and recall by 8-10% over traditional chain-of-thought prompting. Notably, it achieves these gains while reducing domain expert effort from 5.1 hours to 3 minutes per attribute - a 99% reduction. Additionally, the cascade generalizes effectively across five languages and multiple quality assessment tasks, consistently maintaining performance gains.
摘要：我们引入了一种新颖的、免培训的级联，用于自动提示大型语言模型（LLM）来评估电子商务中的产品质量。我们的系统不需要训练标签或模型微调，而是自动生成和细化提示，以评估数万个产品类别属性对的属性质量。从人工提示的种子开始，级联逐步优化指令以满足目录特定的要求。这种方法弥合了复杂工业目录中通用语言理解和大规模领域特定知识之间的差距。我们广泛的实证评估表明，与传统的思维链提示相比，自动提示级联将精确度和召回率提高了 8-10%。值得注意的是，它实现了这些收益，同时将每个属性的领域专家工作时间从 5.1 小时减少到 3 分钟 - 减少了 99%。此外，级联可以有效地推广到五种语言和多种质量评估任务，从而持续保持性能提升。
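
Note: the reported 99% reduction in expert effort can be checked directly from the two figures in the abstract above (5.1 hours and 3 minutes per attribute). The helper name `relative_reduction` is a hypothetical name used only for this check.

```python
def relative_reduction(before, after):
    """Fractional reduction going from `before` to `after` (same units)."""
    return (before - after) / before


# 5.1 hours expressed in minutes, compared with 3 minutes per attribute.
reduction = relative_reduction(5.1 * 60, 3)
print(f"{reduction:.1%}")
```

The result is about 0.99, which is consistent with the rounded 99% figure quoted in the abstract.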

Title: Leveraging LLMs for Early Alzheimer's Prediction

Authors: Tananun Songdechakraiwut
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23946
Pdf URL: https://arxiv.org/pdf/2510.23946
Copy Paste: [[2510.23946]] Leveraging LLMs for Early Alzheimer's Prediction(https://arxiv.org/abs/2510.23946)
Keywords: llm
Abstract: We present a connectome-informed LLM framework that encodes dynamic fMRI connectivity as temporal sequences, applies robust normalization, and maps these data into a representation suitable for a frozen pre-trained LLM for clinical prediction. Applied to early Alzheimer's detection, our method achieves sensitive prediction with error rates well below clinically recognized margins, with implications for timely Alzheimer's intervention.
摘要：我们提出了一个基于连接组的 LLM 框架，它将动态 fMRI 连接编码为时间序列，应用稳健的标准化，并将这些数据映射到适合用于临床预测的冷冻预训练 LLM 的表示形式。应用于早期阿尔茨海默氏症检测，我们的方法实现了灵敏的预测，错误率远低于临床公认的界限，这对及时干预阿尔茨海默氏症具有重要意义。

Title: Uncovering the Potential Risks in Unlearning: Danger of English-only Unlearning in Multilingual LLMs

Authors: Kyomin Hwang, Hyeonjin Kim, Seungyeon Kim, Sunghyun Wee, Nojun Kwak
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.23949
Pdf URL: https://arxiv.org/pdf/2510.23949
Copy Paste: [[2510.23949]] Uncovering the Potential Risks in Unlearning: Danger of English-only Unlearning in Multilingual LLMs(https://arxiv.org/abs/2510.23949)
Keywords: llm, prompt
Abstract: There have been a couple of studies showing that attempting to erase multilingual knowledge using only English data is insufficient for multilingual LLMs. However, their analyses remain highly performance-oriented. In this paper, we switch the point of view to evaluation, and address an additional blind spot which reveals itself when the multilingual LLM is fully finetuned with a parallel multilingual dataset before unlearning. Here, language confusion occurs whereby a model responds in a language different from that of the input prompt. Language confusion is a problematic phenomenon in unlearning, causing the standard reference-based metrics to fail. We tackle this phenomenon in three steps: (1) introduce an N-gram-based Language-Mix (N-Mix) score to quantitatively show that language confusion is pervasive and consistent in multilingual LLMs, (2) demonstrate that reference-based metrics result in false negatives when the N-Mix score is high, and (3) suggest the need for a new type of unlearning evaluation that can directly assess the content of the generated sentences. We call this type of metric a semantic-based metric.
摘要：有几项研究表明，对于多语言法学硕士来说，尝试仅使用英语数据来消除多语言知识是不够的。然而，他们的分析仍然高度以绩效为导向。在本文中，我们将观点转向评估，并解决了一个额外的盲点，当多语言法学硕士在取消学习之前使用并行多语言数据集进行完全微调时，该盲点就会暴露出来。在这里，会发生语言混淆，模型以与输入提示不同的语言进行响应。语言混乱是遗忘过程中的一个问题现象，导致标准的基于参考的指标失败。我们分三个步骤解决这一现象：(1) 引入基于 N-gram 的语言混合 (N-Mix) 分数，以定量显示语言混淆在多语言法学硕士中普遍存在且一致；(2) 证明当 N-Mix 分数较高时，基于参考的指标会导致假阴性；(3) 表明需要一种新型的遗忘评估，可以直接评估生成句子的内容。我们将这种类型的指标称为基于语义的指标。

Title: M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems

Authors: Mengzhou Sun, Sendong Zhao, Jianyu Chen, Haochun Wang, Bin Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23995
Pdf URL: https://arxiv.org/pdf/2510.23995
Copy Paste: [[2510.23995]] M-Eval: A Heterogeneity-Based Framework for Multi-evidence Validation in Medical RAG Systems(https://arxiv.org/abs/2510.23995)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Retrieval-augmented Generation (RAG) has demonstrated potential in enhancing medical question-answering systems through the integration of large language models (LLMs) with external medical literature. LLMs can retrieve relevant medical articles to generate more professional responses efficiently. However, current RAG applications still face problems. They generate incorrect information, such as hallucinations, and they fail to use external knowledge correctly. To solve these issues, we propose a new method named M-Eval. This method is inspired by the heterogeneity analysis approach used in Evidence-Based Medicine (EBM). Our approach can check for factual errors in RAG responses using evidence from multiple sources. First, we extract additional medical literature from external knowledge bases. Then, we retrieve the evidence documents generated by the RAG system. We use heterogeneity analysis to check whether the evidence supports different viewpoints in the response. In addition to verifying the accuracy of the response, we also assess the reliability of the evidence provided by the RAG system. Our method shows an improvement of up to 23.31% accuracy across various LLMs. This work can help detect errors in current RAG-based medical systems. It also makes the applications of LLMs more reliable and reduces diagnostic errors.
摘要：检索增强生成（RAG）已显示出通过将大语言模型（LLM）与外部医学文献相集成来增强医学问答系统的潜力。法学硕士可以检索相关的医学文章，以有效地生成更专业的回复。然而，目前的RAG应用仍然面临问题。他们产生不正确的信息，例如幻觉，并且无法正确使用外部知识。为了解决这些问题，我们提出了一种名为 M-Eval 的新方法。该方法的灵感来自于循证医学（EBM）中使用的异质性分析方法。我们的方法可以使用多个来源的证据来检查 RAG 响应中的事实错误。首先，我们从外部知识库中提取额外的医学文献。然后，我们检索RAG系统生成的证据文件。我们使用异质性分析来检查证据是否支持响应中的不同观点。除了验证响应的准确性外，我们还评估 RAG 系统提供的证据的可靠性。我们的方法显示各种法学硕士的准确率提高了高达 23.31%。这项工作可以帮助检测当前基于 RAG 的医疗系统中的错误。它还使法学硕士的应用更加可靠并减少诊断错误。

Title: PICOs-RAG: PICO-supported Query Rewriting for Retrieval-Augmented Generation in Evidence-Based Medicine

Authors: Mengzhou Sun, Sendong Zhao, Jianyu Chen, Bin Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.23998
Pdf URL: https://arxiv.org/pdf/2510.23998
Copy Paste: [[2510.23998]] PICOs-RAG: PICO-supported Query Rewriting for Retrieval-Augmented Generation in Evidence-Based Medicine(https://arxiv.org/abs/2510.23998)
Keywords: language model, retrieval-augmented generation
Abstract: Evidence-based medicine (EBM) research has always been of paramount importance. It is important to find appropriate medical theoretical support for the needs from physicians or patients to reduce the occurrence of medical accidents. This process is often carried out by humans querying relevant literature databases, which lacks objectivity and efficiency. Therefore, researchers utilize retrieval-augmented generation (RAG) to search for evidence and generate responses automatically. However, current RAG methods struggle to handle complex queries in real-world clinical scenarios. For example, when queries lack certain information or use imprecise language, the model may retrieve irrelevant evidence and generate unhelpful answers. To address this issue, we present the PICOs-RAG to expand the user queries into a better format. Our method can expand and normalize the queries into professional ones and use the PICO format, a search strategy tool present in EBM, to extract the most important information used for retrieval. This approach significantly enhances retrieval efficiency and relevance, resulting in up to an 8.8% improvement compared to the baseline evaluated by our method. Thereby the PICOs-RAG improves the performance of the large language models into a helpful and reliable medical assistant in EBM.

Title: META-RAG: Meta-Analysis-Inspired Evidence-Re-Ranking Method for Retrieval-Augmented Generation in Evidence-Based Medicine

Authors: Mengzhou Sun, Sendong Zhao, Jianyu Chen, Haochun Wang, Bin Qin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24003
Pdf URL: https://arxiv.org/pdf/2510.24003
Copy Paste: [[2510.24003]] META-RAG: Meta-Analysis-Inspired Evidence-Re-Ranking Method for Retrieval-Augmented Generation in Evidence-Based Medicine(https://arxiv.org/abs/2510.24003)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Evidence-based medicine (EBM) holds a crucial role in clinical application. Given suitable medical articles, doctors can effectively reduce the incidence of misdiagnoses. Researchers find it efficient to use large language model (LLM) techniques such as RAG for EBM tasks. However, EBM maintains stringent requirements for evidence, and RAG applications in EBM struggle to efficiently distinguish high-quality evidence. Therefore, inspired by the meta-analysis used in EBM, we provide a new method to re-rank and filter the medical evidence. This method applies multiple principles to filter the best evidence for LLMs to use in diagnosis. We employ a combination of several EBM methods to emulate meta-analysis, including reliability analysis, heterogeneity analysis, and extrapolation analysis. These processes allow users to retrieve the best medical evidence for the LLMs. Ultimately, we evaluate these high-quality articles and show an accuracy improvement of up to 11.4% in our experiments. Our method successfully enables RAG to extract higher-quality and more reliable evidence from the PubMed dataset. This work can reduce the infusion of incorrect knowledge into responses and help users receive more effective replies.

Title: TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents

Authors: Yizhu Jiao, Sha Li, Sizhe Zhou, Heng Ji, Jiawei Han
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24014
Pdf URL: https://arxiv.org/pdf/2510.24014
Copy Paste: [[2510.24014]] TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents(https://arxiv.org/abs/2510.24014)
Keywords: language model, llm, hallucination, agent
Abstract: The task of information extraction (IE) is to extract structured knowledge from text. However, it is often not straightforward to utilize IE output due to the mismatch between the IE ontology and the downstream application needs. We propose a new formulation of IE, TEXT2DB, that emphasizes the integration of IE output and the target database (or knowledge base). Given a user instruction, a document set, and a database, our task requires the model to update the database with values from the document set to satisfy the user instruction. This task requires understanding user instructions for what to extract and adapting to the given DB/KB schema for how to extract on the fly. To evaluate this new task, we introduce a new benchmark featuring common demands such as data infilling, row population, and column addition. In addition, we propose an LLM agent framework, OPAL (Observe-Plan-Analyze LLM), which includes an Observer component that interacts with the database, a Planner component that generates a code-based plan with calls to IE models, and an Analyzer component that provides feedback regarding code quality before execution. Experiments show that OPAL can successfully adapt to diverse database schemas by generating different code plans and calling the required IE models. We also highlight difficult cases such as dealing with large databases with complex dependencies and extraction hallucination, which we believe deserve further investigation. Source code: this https URL

Title: Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward

Authors: Hao An, Yang Xu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24020
Pdf URL: https://arxiv.org/pdf/2510.24020
Copy Paste: [[2510.24020]] Teaching LLMs to Abstain via Fine-Grained Semantic Confidence Reward(https://arxiv.org/abs/2510.24020)
Keywords: language model, llm, hallucination
Abstract: Mitigating hallucinations in Large Language Models (LLMs) is critical for their reliable deployment. Existing methods typically fine-tune LLMs to abstain from answering questions beyond their knowledge scope. However, these methods often rely on coarse-grained signals to guide LLMs to abstain, such as overall confidence or uncertainty scores on multiple sampled answers, which may result in an imprecise awareness of the model's own knowledge boundaries. To this end, we propose a novel reinforcement learning framework built on Fine-grained Semantic Confidence Reward (FiSCoRe), which guides LLMs to abstain via sample-specific confidence. Specifically, our method operates by sampling multiple candidate answers and conducting semantic clustering, then training the LLM to retain answers within high-confidence clusters and discard those within low-confidence ones, thereby promoting accurate post-hoc abstention. Additionally, we propose a new metric for evaluating the reliability of abstention fine-tuning tasks more comprehensively. Our method significantly enhances reliability in both in-domain and out-of-distribution benchmarks.
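The sample-cluster-then-keep-or-discard idea in this abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not say how semantic clusters or confidences are computed, so the sketch assumes a cluster's confidence is the fraction of sampled answers that fall into it, and it uses plain string normalization as a stand-in for semantic clustering. The names `normalize`, `cluster_confidence`, `keep_or_abstain`, and the `threshold` value are all hypothetical.

```python
from collections import Counter


def normalize(answer: str) -> str:
    """Stand-in for semantic equivalence (assumption): case- and whitespace-insensitive match."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def cluster_confidence(samples):
    """Group sampled answers into clusters and return each cluster's share of the samples."""
    keys = [normalize(s) for s in samples]
    counts = Counter(keys)
    return {key: count / len(keys) for key, count in counts.items()}


def keep_or_abstain(answer, samples, threshold=0.5):
    """Keep an answer whose cluster confidence reaches the (hypothetical) threshold, else abstain."""
    confidence = cluster_confidence(samples).get(normalize(answer), 0.0)
    decision = "keep" if confidence >= threshold else "abstain"
    return decision, confidence


if __name__ == "__main__":
    sampled = ["Paris", "paris.", "Paris ", "Lyon"]
    print(keep_or_abstain("Paris", sampled))
    print(keep_or_abstain("Lyon", sampled))
```

In a reinforcement learning setup, a per-sample decision like this could serve as the reward signal. How the paper actually turns cluster confidence into a reward is not stated in the abstract.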

Title: SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs

Authors: Haiduo Huang, Jiangcheng Song, Yadong Zhang, Pengju Ren
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24021
Pdf URL: https://arxiv.org/pdf/2510.24021
Copy Paste: [[2510.24021]] SpecKD: Speculative Decoding for Effective Knowledge Distillation of LLMs(https://arxiv.org/abs/2510.24021)
Keywords: language model, llm
Abstract: Knowledge Distillation (KD) has become a cornerstone technique for compressing Large Language Models (LLMs) into smaller, more efficient student models. However, conventional KD approaches typically apply the distillation loss uniformly across all tokens, regardless of the teacher's confidence. This indiscriminate mimicry can introduce noise, as the student is forced to learn from the teacher's uncertain or high-entropy predictions, which may ultimately harm student performance, especially when the teacher is much larger and more powerful. To address this, we propose Speculative Knowledge Distillation (SpecKD), a novel, plug-and-play framework that introduces a dynamic, token-level gating mechanism inspired by the "propose-and-verify" paradigm of speculative decoding. At each step, the student's token proposal is verified against the teacher's distribution; the distillation loss is selectively applied only to "accepted" tokens, while "rejected" tokens are masked out. Extensive experiments on diverse text generation tasks show that SpecKD consistently and significantly outperforms strong KD baselines, leading to more stable training and more capable student models, and achieving state-of-the-art results.
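The token-level gating described in this abstract can be sketched as a masked distillation loss. The abstract does not specify the acceptance rule, so this sketch assumes one: the student's greedy token is "accepted" when the teacher assigns it at least `accept_threshold` probability. The function names, the threshold, and the choice of forward KL as the distillation loss are all assumptions for illustration, not the authors' method.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def gated_distillation_loss(student_logits, teacher_logits, accept_threshold=0.1):
    """Mean KL(teacher || student) over accepted tokens only.

    Both inputs have shape (seq_len, vocab_size). Returns (loss, accepted_mask).
    """
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)

    # Propose: the student's greedy token at each position.
    proposals = p_student.argmax(axis=-1)
    # Verify (assumed rule): the teacher must give the proposal enough probability.
    teacher_prob = p_teacher[np.arange(len(proposals)), proposals]
    accepted = teacher_prob >= accept_threshold

    eps = 1e-12  # avoids log(0)
    per_token_kl = (
        p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps))
    ).sum(axis=-1)

    if not accepted.any():
        return 0.0, accepted  # every token rejected, so nothing to distill
    return float(per_token_kl[accepted].mean()), accepted


if __name__ == "__main__":
    student = np.array([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
    teacher = np.array([[2.0, 0.0, 0.0], [0.0, 0.0, 5.0]])
    loss, mask = gated_distillation_loss(student, teacher)
    print(loss, mask)
```

Rejected positions contribute nothing to the loss, which is the masking behaviour the abstract describes. A real training loop would compute the same mask on framework tensors so that gradients flow only through accepted tokens.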

Title: Pie: A Programmable Serving System for Emerging LLM Applications

Authors: In Gim, Zhiyao Ma, Seung-seob Lee, Lin Zhong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24051
Pdf URL: https://arxiv.org/pdf/2510.24051
Copy Paste: [[2510.24051]] Pie: A Programmable Serving System for Emerging LLM Applications(https://arxiv.org/abs/2510.24051)
Keywords: language model, llm, agent
Abstract: Emerging large language model (LLM) applications involve diverse reasoning strategies and agentic workflows, straining the capabilities of existing serving systems built on a monolithic token generation loop. This paper introduces Pie, a programmable LLM serving system designed for flexibility and efficiency. Pie decomposes the traditional generation loop into fine-grained service handlers exposed via an API and delegates control of the generation process to user-provided programs, called inferlets. This enables applications to implement new KV cache strategies and bespoke generation logic, and to seamlessly integrate computation and I/O, entirely within the application and without requiring modifications to the serving system. Pie executes inferlets using WebAssembly, benefiting from its lightweight sandboxing. Our evaluation shows Pie matches state-of-the-art performance on standard tasks (3-12% latency overhead) while significantly improving latency and throughput (1.3x-3.4x higher) on agentic workflows by enabling application-specific optimizations.

Title: Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation

Authors: Xinwei Wu, Heng Liu, Jiang Zhou, Xiaohu Zhao, Linlong Xu, Longyue Wang, Weihua Luo, Kaifu Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24073
Pdf URL: https://arxiv.org/pdf/2510.24073
Copy Paste: [[2510.24073]] Challenging Multilingual LLMs: A New Taxonomy and Benchmark for Unraveling Hallucination in Translation(https://arxiv.org/abs/2510.24073)
Keywords: language model, llm, hallucination
Abstract: Large Language Models (LLMs) have advanced machine translation but remain vulnerable to hallucinations. Unfortunately, existing MT benchmarks are not capable of exposing failures in multilingual LLMs. To disclose hallucination in multilingual LLMs, we introduce a diagnostic framework with a taxonomy that separates Instruction Detachment from Source Detachment. Guided by this taxonomy, we create HalloMTBench, a multilingual, human-verified benchmark across 11 English-to-X directions. We employed 4 frontier LLMs to generate candidates and scrutinized these candidates with an ensemble of LLM judges and expert validation. In this way, we curated 5,435 high-quality instances. We evaluated 17 LLMs on HalloMTBench. Results reveal distinct "hallucination triggers": unique failure patterns reflecting model scale, source length sensitivity, linguistic biases, and Reinforcement-Learning (RL) amplified language mixing. HalloMTBench offers a forward-looking testbed for diagnosing LLM translation failures. HalloMTBench is available at this https URL.

Title: Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures

Authors: Tyler A. Chang, Catherine Arnett, Abdelrahman Eldesokey, Abdelrahman Sadallah, Abeer Kashar, Abolade Daud, Abosede Grace Olanihun, Adamu Labaran Mohammed, Adeyemi Praise, Adhikarinayum Meerajita Sharma, Aditi Gupta, Afitab Iyigun, Afonso Simplício, Ahmed Essouaied, Aicha Chorana, Akhil Eppa, Akintunde Oladipo, Akshay Ramesh, Aleksei Dorkin, Alfred Malengo Kondoro, Alham Fikri Aji, Ali Eren Çetintaş, Allan Hanbury, Alou Dembele, Alp Niksarli, Álvaro Arroyo, Amin Bajand, Amol Khanna, Ana Chkhaidze, Ana Condez, Andiswa Mkhonto, Andrew Hoblitzell, Andrew Tran, Angelos Poulis, Anirban Majumder, Anna Vacalopoulou, Annette Kuuipolani Kanahele Wong, Annika Simonsen, Anton Kovalev, Ashvanth.S, Ayodeji Joseph Lana, Barkin Kinay, Bashar Alhafni, Benedict Cibalinda Busole, Bernard Ghanem, Bharti Nathani, Biljana Stojanovska Đurić, Bola Agbonile, Bragi Bergsson, Bruce Torres Fischer, Burak Tutar, Burcu Alakuş Çınar, Cade J. Kanoniakapueo Kane, Can Udomcharoenchaikit, Catherine Arnett, Chadi Helwe, Chaithra Reddy Nerella, Chen Cecilia Liu, Chiamaka Glory Nwokolo, Cristina España-Bonet, Cynthia Amol, DaeYeop Lee, Dana Arad, Daniil Dzenhaliou, Daria Pugacheva, Dasol Choi, Daud Abolade, David Liu, David Semedo, Deborah Popoola, Deividas Mataciunas, Delphine Nyaboke, Dhyuthy Krishna Kumar, Diogo Glória-Silva, Diogo Tavares, Divyanshu Goyal, DongGeon Lee, Ebele Nwamaka Anajemba, Egonu Ngozi Grace, Elena Mickel, Elena Tutubalina, Elias Herranen, Emile Anand, Emmanuel Habumuremyi, Emuobonuvie Maria Ajiboye, Eryawan Presma Yulianrifat, Esther Adenuga, Ewa Rudnicka, Faith Olabisi Itiola, Faran Taimoor Butt, Fathima Thekkekara, Fatima Haouari, Filbert Aurelian Tjiaranata, Firas Laakom, Francesca Grasso, Francesco Orabona, Francesco Periti, Gbenga Kayode Solomon, Gia Nghia Ngo, Gloria Udhehdhe-oze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24081
Pdf URL: https://arxiv.org/pdf/2510.24081
Copy Paste: [[2510.24081]] Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures(https://arxiv.org/abs/2510.24081)
Keywords: language model, llm
Abstract: To date, there exist almost no culturally-specific evaluation benchmarks for large language models (LLMs) that cover a large number of languages and cultures. In this paper, we present Global PIQA, a participatory commonsense reasoning benchmark for over 100 languages, constructed by hand by 335 researchers from 65 countries around the world. The 116 language varieties in Global PIQA cover five continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of examples reference local foods, customs, traditions, or other culturally-specific elements. We find that state-of-the-art LLMs perform well on Global PIQA in aggregate, but they exhibit weaker performance in lower-resource languages (up to a 37% accuracy gap, despite random chance at 50%). Open models generally perform worse than proprietary models. Global PIQA highlights that in many languages and cultures, everyday knowledge remains an area for improvement, alongside more widely-discussed capabilities such as complex reasoning and expert knowledge. Beyond its uses for LLM evaluation, we hope that Global PIQA provides a glimpse into the wide diversity of cultures in which human language is embedded.

Title: Reinforcement Learning for Long-Horizon Multi-Turn Search Agents

Authors: Vivek Kalyan, Martin Andrews
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24126
Pdf URL: https://arxiv.org/pdf/2510.24126
Copy Paste: [[2510.24126]] Reinforcement Learning for Long-Horizon Multi-Turn Search Agents(https://arxiv.org/abs/2510.24126)
Keywords: language model, llm, prompt, agent
Abstract: Large Language Model (LLM) agents can leverage multiple turns and tools to solve complex tasks, with prompt-based approaches achieving strong performance. This work demonstrates that Reinforcement Learning (RL) can push capabilities significantly further by learning from experience. Through experiments on a legal document search benchmark, we show that our RL-trained 14-billion-parameter model outperforms frontier-class models (85% vs 78% accuracy). In addition, we explore turn-restricted regimes, during training and at test time, and show that these agents achieve better results if allowed to operate over longer multi-turn horizons.

Title: Beyond Line-Level Filtering for the Pretraining Corpora of LLMs

Authors: Chanwoo Park, Suyoung Park, Yelim Ahn, Jongmin Kim, Jongyeon Park, Jaejin Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24139
Pdf URL: https://arxiv.org/pdf/2510.24139
Copy Paste: [[2510.24139]] Beyond Line-Level Filtering for the Pretraining Corpora of LLMs(https://arxiv.org/abs/2510.24139)
Keywords: language model, llm
Abstract: While traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, these basic methods can sometimes discard valuable content, negatively affecting downstream performance. In this paper, we introduce two methods, pattern-aware line-level deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF), that enhance the conventional filtering techniques. Our approach not only considers line-level signals but also takes into account their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate these proposed methods by training small language models (1B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.

Title: Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean

Authors: Chanwoo Park, Suyoung Park, JiA Kang, Jongyeon Park, Sangho Kim, Hyunji M. Park, Sumin Bae, Mingyu Kang, Jaejin Lee
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24150
Pdf URL: https://arxiv.org/pdf/2510.24150
Copy Paste: [[2510.24150]] Ko-MuSR: A Multistep Soft Reasoning Benchmark for LLMs Capable of Understanding Korean(https://arxiv.org/abs/2510.24150)
Keywords: language model, llm, prompt
Abstract: We present Ko-MuSR, the first benchmark to comprehensively evaluate multistep, soft reasoning in long Korean narratives while minimizing data contamination. Built following MuSR, Ko-MuSR features fully Korean narratives, reasoning chains, and multiple-choice questions verified by human annotators for logical consistency and answerability. Evaluations of four large language models (two multilingual and two Korean-specialized) show that multilingual models outperform Korean-focused ones even in Korean reasoning tasks, indicating cross-lingual generalization of reasoning ability. Carefully designed prompting strategies, which combine few-shot examples, reasoning traces, and task-specific hints, further boost accuracy, approaching human-level performance. Ko-MuSR offers a solid foundation for advancing Korean NLP by enabling systematic evaluation of long-context reasoning and prompting strategies.

Title: MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations

Authors: Aaron Scott, Maike Züfle, Jan Niehues
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24178
Pdf URL: https://arxiv.org/pdf/2510.24178
Copy Paste: [[2510.24178]] MuSaG: A Multimodal German Sarcasm Dataset with Full-Modal Annotations(https://arxiv.org/abs/2510.24178)
Keywords: language model
Abstract: Sarcasm is a complex form of figurative language in which the intended meaning contradicts the literal one. Its prevalence in social media and popular culture poses persistent challenges for natural language understanding, sentiment analysis, and content moderation. With the emergence of multimodal large language models, sarcasm detection extends beyond text and requires integrating cues from audio and vision. We present MuSaG, the first German multimodal sarcasm detection dataset, consisting of 33 minutes of manually selected and human-annotated statements from German television shows. Each instance provides aligned text, audio, and video modalities, annotated separately by humans, enabling evaluation in unimodal and multimodal settings. We benchmark nine open-source and commercial models, spanning text, audio, vision, and multimodal architectures, and compare their performance to human annotations. Our results show that while humans rely heavily on audio in conversational settings, models perform best on text. This highlights a gap in current multimodal models and motivates the use of MuSaG for developing models better suited to realistic scenarios. We release MuSaG publicly to support future research on multimodal sarcasm detection and human-model alignment.

Title: Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment

Authors: Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.24208
Pdf URL: https://arxiv.org/pdf/2510.24208
Copy Paste: [[2510.24208]] Beyond Neural Incompatibility: Easing Cross-Scale Knowledge Transfer in Large Language Models through Latent Semantic Alignment(https://arxiv.org/abs/2510.24208)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) encode vast amounts of knowledge in their massive parameters, where it can be located, traced, and analyzed. Despite advances in neural interpretability, it is still not clear how to transfer knowledge in a fine-grained manner, namely parametric knowledge transfer (PKT). A key problem is enabling effective and efficient knowledge transfer across LLMs of different scales, which is essential for achieving greater flexibility and broader applicability in transferring knowledge between LLMs. Because of neural incompatibility, that is, the architectural and parametric differences between LLMs of varying scales, existing methods that directly reuse layer parameters are severely limited. In this paper, we identify the semantic alignment in latent space as the fundamental prerequisite for LLM cross-scale knowledge transfer. Instead of directly using the layer parameters, our approach takes activations as the medium of layer-wise knowledge transfer. Leveraging the semantics in latent space, our approach is simple and outperforms prior work, better aligning model behaviors across varying scales. Evaluations on four benchmarks demonstrate the efficacy of our method. Further analysis reveals the key factors easing cross-scale knowledge transfer and provides insights into the nature of latent semantic alignment.

Title: HACK: Hallucinations Along Certainty and Knowledge Axes

Authors: Adi Simhi, Jonathan Herzig, Itay Itzhak, Dana Arad, Zorik Gekhman, Roi Reichart, Fazl Barez, Gabriel Stanovsky, Idan Szpektor, Yonatan Belinkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24222
Pdf URL: https://arxiv.org/pdf/2510.24222
Copy Paste: [[2510.24222]] HACK: Hallucinations Along Certainty and Knowledge Axes(https://arxiv.org/abs/2510.24222)
Keywords: llm, hallucination
Abstract: Hallucinations in LLMs present a critical barrier to their reliable usage. Existing research usually categorizes hallucinations by their external properties rather than by the LLMs' underlying internal properties. This external focus overlooks that hallucinations may require tailored mitigation strategies based on their underlying mechanism. We propose a framework for categorizing hallucinations along two axes: knowledge and certainty. Since parametric knowledge and certainty may vary across models, our categorization method involves a model-specific dataset construction process that differentiates between those types of hallucinations. Along the knowledge axis, we distinguish between hallucinations caused by a lack of knowledge and those occurring despite the model having the knowledge of the correct response. To validate our framework along the knowledge axis, we apply steering mitigation, which relies on the existence of parametric knowledge to manipulate model activations. This addresses the lack of existing methods to validate knowledge categorization by showing a significant difference between the two hallucination types. We further analyze the distinct knowledge and hallucination patterns between models, showing that different hallucinations do occur despite shared parametric knowledge. Turning to the certainty axis, we identify a particularly concerning subset of hallucinations where models hallucinate with certainty despite having the correct knowledge internally. We introduce a new evaluation metric to measure the effectiveness of mitigation methods on this subset, revealing that while some methods perform well on average, they fail disproportionately on these critical cases. Our findings highlight the importance of considering both knowledge and certainty in hallucination analysis and call for targeted mitigation approaches that consider the underlying factors of hallucination.

Title: Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?

Authors: Teague McMillan, Gabriele Dominici, Martin Gjoreski, Marc Langheinrich
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24236
Pdf URL: https://arxiv.org/pdf/2510.24236
Copy Paste: [[2510.24236]] Towards Transparent Reasoning: What Drives Faithfulness in Large Language Models?(https://arxiv.org/abs/2510.24236)
Keywords: language model, gpt, llm, prompt
Abstract: Large Language Models (LLMs) often produce explanations that do not faithfully reflect the factors driving their predictions. In healthcare settings, such unfaithfulness is especially problematic: explanations that omit salient clinical cues or mask spurious shortcuts can undermine clinician trust and lead to unsafe decision support. We study how inference and training-time choices shape explanation faithfulness, focusing on factors practitioners can control at deployment. We evaluate three LLMs (GPT-4.1-mini, LLaMA 70B, LLaMA 8B) on two datasets, BBQ (social bias) and MedQA (medical licensing questions), and manipulate the number and type of few-shot examples, prompting strategies, and training procedure. Our results show: (i) both the quantity and quality of few-shot examples significantly impact model faithfulness; (ii) faithfulness is sensitive to prompting design; (iii) the instruction-tuning phase improves measured faithfulness on MedQA. These findings offer insights into strategies for enhancing the interpretability and trustworthiness of LLMs in sensitive domains.

Title: Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations

Authors: Syed Zohaib Hassan, Pål Halvorsen, Miriam S. Johnson, Pierre Lison
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24250
Pdf URL: https://arxiv.org/pdf/2510.24250
Copy Paste: [[2510.24250]] Evaluating LLMs on Generating Age-Appropriate Child-Like Conversations(https://arxiv.org/abs/2510.24250)
Keywords: language model, gpt, llm
Abstract: Large Language Models (LLMs), predominantly trained on adult conversational data, face significant challenges when generating authentic, child-like dialogue for specialized applications. We present a comparative study evaluating five different LLMs (GPT-4, RUTER-LLAMA-2-13b, GPTSW, NorMistral-7b, and NorBloom-7b) to generate age-appropriate Norwegian conversations for children aged 5 and 9 years. Through a blind evaluation by eleven education professionals using both real child interview data and LLM-generated text samples, we assessed authenticity and developmental appropriateness. Our results show that evaluators achieved strong inter-rater reliability (ICC=0.75) and demonstrated higher accuracy in age prediction for younger children (5-year-olds) compared to older children (9-year-olds). While GPT-4 and NorBloom-7b performed relatively well, most models generated language perceived as more linguistically advanced than the target age groups. These findings highlight critical data-related challenges in developing LLM systems for specialized applications involving children, particularly in low-resource languages where comprehensive age-appropriate lexical resources are scarce.
摘要：主要接受成人对话数据训练的大型语言模型 (LLM) 在为专业应用生成真实的、类似儿童的对话时面临着重大挑战。我们提出了一项比较研究，评估了五种不同的法学硕士（GPT-4、RUTER-LLAMA-2-13b、GPTSW、NorMistral-7b 和 NorBloom-7b），为 5 岁和 9 岁的儿童生成适合年龄的挪威语对话。通过十一位教育专业人士使用真实儿童访谈数据和法学硕士生成的文本样本进行盲评估，我们评估了真实性和发展适当性。我们的结果表明，评估者获得了很强的评估者间信度 (ICC=0.75)，并且与年龄较大的儿童（9 岁）相比，年龄较小的儿童（5 岁）的年龄预测具有更高的准确性。虽然 GPT-4 和 NorBloom-7b 表现相对较好，但大多数模型生成的语言被认为比目标年龄组在语言上更先进。这些发现凸显了为涉及儿童的专业应用开发法学硕士系统时与数据相关的关键挑战，特别是在缺乏全面的适合年龄的词汇资源的低资源语言中。

Title: From Memorization to Reasoning in the Spectrum of Loss Curvature

Authors: Jack Merullo, Srihita Vatsavaya, Lucius Bushnaq, Owen Lewis
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.24256
Pdf URL: https://arxiv.org/pdf/2510.24256
Copy Paste: [[2510.24256]] From Memorization to Reasoning in the Spectrum of Loss Curvature(https://arxiv.org/abs/2510.24256)
Keywords: language model
Abstract: We characterize how memorization is represented in transformer models and show that it can be disentangled in the weights of both language models (LMs) and vision transformers (ViTs) using a decomposition based on the loss landscape curvature. This insight is based on prior theoretical and empirical work showing that the curvature for memorized training points is much sharper than for non-memorized points, meaning that ordering weight components from high to low curvature can reveal a distinction without explicit labels. This motivates a weight editing procedure that suppresses recitation of untargeted memorized data far more effectively than a recent unlearning method (BalancedSubnet), while maintaining lower perplexity. Since the basis of curvature has a natural interpretation for shared structure in model weights, we extensively analyze the effect of the editing procedure on downstream tasks in LMs, and find that fact retrieval and arithmetic are specifically and consistently negatively affected, even though open-book fact retrieval and general logical reasoning are conserved. We posit that these tasks rely heavily on specialized directions in weight space rather than general-purpose mechanisms, regardless of whether the individual datapoints are memorized. We support this by showing a correspondence between the activation strength of task data on the low-curvature components that we edit out and the drop in task performance after the edit. Our work enhances the understanding of memorization in neural networks, with practical applications towards removing it, and provides evidence for idiosyncratic, narrowly used structures involved in solving tasks like math and fact retrieval.
摘要：我们描述了 Transformer 模型中记忆的表示方式，并表明可以使用基于损失景观曲率的分解来解开语言模型（LM）和视觉 Transformer（ViT）的权重。这种见解基于先前的理论和经验工作，表明记忆的训练点的曲率比未记忆的训练点的曲率要锐利得多，这意味着从高曲率到低曲率对权重分量进行排序可以在没有明确标签的情况下揭示区别。这激发了权重编辑过程，该过程比最近的遗忘方法（BalancedSubnet）更有效地抑制更多非目标记忆数据的背诵，同时保持较低的困惑度。由于曲率基对模型权重中的共享结构具有自然的解释，因此我们广泛分析了编辑过程对 LM 中下游任务的影响，并发现事实检索和算术受到特定且一致的负面影响，即使开卷事实检索和一般逻辑推理是保守的。我们认为这些任务在很大程度上依赖于权重空间中的专门方向而不是通用机制，无论这些单独的数据点是否被记住。我们通过显示任务数据的激活强度与我们编辑的低曲率分量之间的对应关系来支持这一点，以及编辑后任务性能的下降。我们的工作增强了对神经网络记忆的理解，并通过实际应用消除记忆，并为解决数学和事实检索等任务中涉及的特殊的、狭义使用的结构提供了证据。

Title: Can LLMs Translate Human Instructions into a Reinforcement Learning Agent's Internal Emergent Symbolic Representation?

Authors: Ziqi Ma, Sao Mai Nguyen, Philippe Xu
Subjects: cs.CL, cs.RO
Abstract URL: https://arxiv.org/abs/2510.24259
Pdf URL: https://arxiv.org/pdf/2510.24259
Copy Paste: [[2510.24259]] Can LLMs Translate Human Instructions into a Reinforcement Learning Agent's Internal Emergent Symbolic Representation?(https://arxiv.org/abs/2510.24259)
Keywords: language model, gpt, llm, agent
Abstract: Emergent symbolic representations are critical for enabling developmental learning agents to plan and generalize across tasks. In this work, we investigate whether large language models (LLMs) can translate human natural language instructions into the internal symbolic representations that emerge during hierarchical reinforcement learning. We apply a structured evaluation framework to measure the translation performance of commonly seen LLMs -- GPT, Claude, Deepseek and Grok -- across different internal symbolic partitions generated by a hierarchical reinforcement learning algorithm in the Ant Maze and Ant Fall environments. Our findings reveal that although LLMs demonstrate some ability to translate natural language into a symbolic representation of the environment dynamics, their performance is highly sensitive to partition granularity and task complexity. The results expose limitations in current LLMs' capacity for representation alignment, highlighting the need for further research on robust alignment between language and internal agent representations.
摘要：新兴的符号表示对于使发展学习代理能够跨任务进行规划和概括至关重要。在这项工作中，我们研究大型语言模型（LLM）是否可以将人类自然语言指令翻译成分层强化学习期间出现的内部符号表示。我们应用结构化评估框架来衡量常见的法学硕士（GPT、Claude、Deepseek 和 Grok）在 Ant Maze 和 Ant Fall 环境中由分层强化学习算法生成的不同内部符号分区的翻译性能。我们的研究结果表明，尽管法学硕士表现出一定的将自然语言转化为环境动态的符号表示的能力，但它们的表现对分区粒度和任务复杂性高度敏感。结果暴露了当前法学硕士表示对齐能力的局限性，强调需要进一步研究语言和内部代理表示之间的稳健对齐。

Title: MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference

Authors: Mădălina Zgreabăn, Tejaswini Deoskar, Lasha Abzianidze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24295
Pdf URL: https://arxiv.org/pdf/2510.24295
Copy Paste: [[2510.24295]] MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference(https://arxiv.org/abs/2510.24295)
Keywords: language model
Abstract: In recent years, many generalization benchmarks have shown language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test MERGE (Minimal Expression-Replacements GEneralization), which evaluates the correctness of models' predictions across reasoning-preserving variants of the original problem. Our results show that NLI models perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how the word class of the replacements, word probability, and plausibility influence NLI models' performance.
摘要：近年来，许多泛化基准测试表明语言模型在自然语言推理（NLI）方面缺乏鲁棒性。然而，手动创建新的基准测试成本高昂，而自动生成高质量的基准测试（即使通过修改现有基准测试）也极其困难。在本文中，我们提出了一种通过替换开放类单词来自动生成原始 NLI 问题的高质量变体的方法，同时关键地保留其潜在推理。我们将泛化测试称为 MERGE（最小表达式替换泛化），它评估模型在原始问题的推理保留变体中预测的正确性。我们的结果表明，NLI 模型在变体上的表现要差 4-20%，这表明即使在这种微小改变的问题上，通用性也很低。我们还分析了替换的词类、词概率和合理性如何影响 NLI 模型的性能。
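
Sketch: The evaluation idea in the abstract above (score a model on each original NLI problem and on its reasoning-preserving variants, then compare) can be illustrated roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Problem` layout, the `predict` callable, and the plain accuracy aggregation are all hypothetical, and the abstract does not say how MERGE aggregates correctness across variants.

```python
from typing import Callable, List, Tuple, TypedDict

Pair = Tuple[str, str]  # (premise, hypothesis)


class Problem(TypedDict):
    # Hypothetical layout: one original pair, its word-replaced variants,
    # and a gold label shared by all of them (the reasoning is preserved).
    original: Pair
    variants: List[Pair]
    label: str


def accuracy_gap(
    problems: List[Problem],
    predict: Callable[[str, str], str],
) -> Tuple[float, float]:
    """Return (accuracy on originals, accuracy on variants)."""
    orig_correct = 0
    var_correct = 0
    var_total = 0
    for p in problems:
        # Score the unmodified problem.
        orig_correct += predict(*p["original"]) == p["label"]
        # Score every variant against the same gold label.
        for premise, hypothesis in p["variants"]:
            var_correct += predict(premise, hypothesis) == p["label"]
            var_total += 1
    return orig_correct / len(problems), var_correct / var_total
```

A robust model would show nearly equal values for the two returned accuracies; a gap between them is the kind of drop the abstract reports.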

Title: Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards

Authors: Shangyu Xing, Siyuan Wang, Chenyuan Yang, Xinyu Dai, Xiang Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24302
Pdf URL: https://arxiv.org/pdf/2510.24302
Copy Paste: [[2510.24302]] Lookahead Tree-Based Rollouts for Enhanced Trajectory-Level Exploration in Reinforcement Learning with Verifiable Rewards(https://arxiv.org/abs/2510.24302)
Keywords: language model
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR), particularly with algorithms like Group Relative Policy Optimization (GRPO), has proven highly effective in enhancing the reasoning capabilities of large language models. However, a critical bottleneck in current pipelines lies in the limited diversity of sampled trajectories during group rollouts. Homogeneous trajectories and their associated rewards diminish the return signals for policy updates, thereby hindering effective policy learning. This lack of diversity stems primarily from token-level stochastic sampling, where local variations are likely to collapse into near-identical reasoning paths. To address this limitation, we propose Lookahead Tree-Based Rollouts (LATR), a novel rollout strategy designed to explicitly promote trajectory-level diversity by enforcing branching into different candidate tokens likely to yield distinct continuations. Specifically, LATR iteratively operates in three stages: (1) branching at high-uncertainty generation steps, (2) performing lookahead simulation for each new branch, and (3) pruning branches that exhibit prolonged similarity during simulation. Compared with stochastic sampling, LATR accelerates policy learning by 131% on average and improves final pass@1 performance by 4.2% on both GRPO and Dynamic sAmpling Policy Optimization (DAPO) algorithms across different reasoning tasks. Our code and data are publicly available at this https URL.
摘要：具有可验证奖励的强化学习（RLVR），特别是像组相对策略优化（GRPO）这样的算法，已被证明在增强大型语言模型的推理能力方面非常有效。然而，当前管道的一个关键瓶颈在于组推出期间采样轨迹的多样性有限。同质的轨迹及其相关的奖励会减弱政策更新的回报信号，从而阻碍有效的政策学习。这种多样性的缺乏主要源于令牌级别的随机采样，其中局部变化可能会崩溃为几乎相同的推理路径。为了解决这个限制，我们提出了基于前瞻树的推出（LATR），这是一种新颖的推出策略，旨在通过强制分支到可能产生不同延续的不同候选标记来显式促进轨迹级多样性。具体来说，LATR 分三个阶段迭代运行：(1) 在高不确定性生成步骤中进行分支，(2) 对每个新分支执行前瞻模拟，以及 (3) 修剪在模拟过程中表现出长时间相似性的分支。与随机采样相比，LATR 在不同推理任务中的 GRPO 和动态采样策略优化 (DAPO) 算法上，策略学习平均加速了 131%，最终 pass@1 性能提高了 4.2%。我们的代码和数据可通过此 https URL 公开获取。
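
Sketch: Stages (1) and (3) of the procedure described in the abstract above might look roughly like the following. This is a hedged illustration only: the abstract does not specify the uncertainty measure or the similarity criterion, so the use of Shannon entropy, the `entropy_threshold` and `k` parameters, and the position-wise token comparison in `prune_similar` are all assumptions introduced here. Stage (2), the lookahead simulation itself, requires a language model and is omitted.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def branch_candidates(
    probs: Sequence[float], entropy_threshold: float, k: int
) -> List[int]:
    """Stage 1 (assumed form): branch only where the model is uncertain.

    Returns the single most likely token id at low-entropy steps and the
    top-k token ids at high-entropy steps.
    """
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if entropy(probs) < entropy_threshold:
        return ranked[:1]
    return ranked[:k]


def prune_similar(branches: List[List[int]], min_distinct: int) -> List[List[int]]:
    """Stage 3 (assumed form): drop branches that stay too close to a kept one.

    Each branch is an equal-length window of lookahead token ids; a branch is
    kept only if it differs from every kept branch in at least `min_distinct`
    positions.
    """
    kept: List[List[int]] = []
    for branch in branches:
        if all(
            sum(a != b for a, b in zip(branch, other)) >= min_distinct
            for other in kept
        ):
            kept.append(branch)
    return kept
```

The design point this illustrates is that diversity is enforced at the trajectory level: extra branches are created only where the distribution is flat, and near-duplicate continuations are discarded rather than spent as rollouts.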

Title: Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning

Authors: Zhiheng Xi, Jixuan Huang, Xin Guo, Boyang Hong, Dingwen Yang, Xiaoran Fan, Shuo Li, Zehui Chen, Junjie Ye, Siyu Yuan, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen, Rui Zheng, Tao Gui, Qi Zhang, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24320
Pdf URL: https://arxiv.org/pdf/2510.24320
Copy Paste: [[2510.24320]] Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning(https://arxiv.org/abs/2510.24320)
Keywords: language model, llm
Abstract: Training critiquing language models to assess and provide feedback on model outputs is a promising way to improve LLMs for complex reasoning tasks. However, existing approaches typically rely on stronger supervisors for annotating critique data. To address this, we propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. Our approach operates on a two-player paradigm: the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. We first reveal that relying solely on indirect reward signals from the actor's outputs for RL optimization often leads to unsatisfactory critics: while their helpfulness (i.e., providing constructive feedback) improves, the discriminability (i.e., determining whether a response is high-quality or not) remains poor, resulting in marginal performance gains. To overcome this, Critique-RL adopts a two-stage optimization strategy. In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic's helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements. For example, it achieves a 9.02% gain on in-domain tasks and a 5.70% gain on out-of-domain tasks for Qwen2.5-7B, highlighting its potential.
摘要：训练批判性语言模型以评估模型输出并提供反馈是改进法学硕士复杂推理任务的一种有前途的方法。然而，现有的方法通常依赖于更强大的监督者来注释批评数据。为了解决这个问题，我们提出了 Critique-RL，这是一种在线 RL 方法，用于在没有更强监督的情况下开发批评语言模型。我们的方法以两人模式运行：参与者产生响应，批评者提供反馈，参与者相应地完善响应。我们首先发现，仅仅依靠参与者输出的间接奖励信号来进行强化学习优化，常常会招致不满意的批评：虽然它们的帮助性（即提供建设性反馈）有所提高，但辨别性（即确定响应是否高质量）仍然很差，导致边际性能收益。为了克服这个问题，Critique-RL 采用了两阶段优化策略。在第一阶段，它通过直接的基于规则的奖励信号强化了批评者的辨别力；在第二阶段，它引入了基于演员细化的间接奖励，以提高评论家的帮助性，同时通过适当的正则化保持其可区分性。跨各种任务和模型的广泛实验表明，Critique-RL 带来了显着的性能改进。例如，Qwen2.5-7B 在域内任务上获得了 9.02% 的增益，在域外任务上获得了 5.70% 的增益，凸显了其潜力。

Title: Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants

Authors: Hunzalah Hassan Bhatti, Firoj Alam
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24328
Pdf URL: https://arxiv.org/pdf/2510.24328
Copy Paste: [[2510.24328]] Beyond MCQ: An Open-Ended Arabic Cultural QA Benchmark with Dialect Variants(https://arxiv.org/abs/2510.24328)
Keywords: language model, llm, chain-of-thought
Abstract: Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QAs are parallelly aligned across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.
摘要：大型语言模型 (LLM) 越来越多地用于回答日常问题，但其在文化基础和方言内容上的表现在不同语言中仍然参差不齐。我们提出了一种综合方法，（i）将现代标准阿拉伯语（MSA）多项选择题（MCQ）翻译成英语和几种阿拉伯方言，（ii）将其转换为开放式问题（OEQ），（iii）在MCQ和OEQ设置下对一系列零样本和微调的LLM进行基准测试，以及（iv）生成思想链（CoT）基本原理来微调模型以进行逐步推理。使用这种方法，我们扩展了现有的数据集，其中 QA 跨多种语言并行排列，据我们所知，这使其成为同类中的第一个。我们使用开放和封闭模型进行了广泛的实验。我们的研究结果表明，（i）模型在阿拉伯方言上表现不佳，揭示了基于文化和方言特定知识的持续差距； (ii) 以阿拉伯语为中心的模型在 MCQ 上表现良好，但在 OEQ 上表现不佳； (iii) CoT 提高了判断的正确性，同时产生基于 n-gram 的混合度量。开发的数据集将公开发布，以支持文化和语言包容性评估的进一步研究。

Title: LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability

Authors: Zikai Xiao, Fei Huang, Jianhong Tu, Jianhui Wei, Wen Ma, Yuxuan Zhou, Jian Wu, Bowen Yu, Zuozhu Liu, Junyang Lin
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24345
Pdf URL: https://arxiv.org/pdf/2510.24345
Copy Paste: [[2510.24345]] LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability(https://arxiv.org/abs/2510.24345)
Keywords: language model, llm
Abstract: Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.
摘要：生成长篇、信息丰富且符合事实的输出仍然是大型语言模型 (LLM) 的主要挑战。现有的长格式生成基准通常使用难以验证的指标来评估现实世界的查询，或者使用可简化评估但忽略现实世界的复杂性的综合设置。在本文中，我们介绍了 LongWeave，它平衡了现实世界和可验证的评估与约束验证器评估（CoV-Eval）。 CoV-Eval 通过首先在现实场景中定义可验证的目标来构建任务，然后根据这些目标系统地生成相应的查询、文本材料和约束。这确保了任务既现实又客观可评估，从而能够对模型能力进行严格评估，以满足复杂的现实世界约束。 LongWeave 支持跨七个不同任务的可定制输入/输出长度（高达 64K/8K 令牌）。对 23 个法学硕士的评估表明，随着现实世界的复杂性和输出长度的增加，即使是最先进的模型在长格式生成方面也会遇到重大挑战。

Title: Text Simplification with Sentence Embeddings

Authors: Matthew Shardlow
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24365
Pdf URL: https://arxiv.org/pdf/2510.24365
Copy Paste: [[2510.24365]] Text Simplification with Sentence Embeddings(https://arxiv.org/abs/2510.24365)
Keywords: llm
Abstract: Sentence embeddings can be decoded to give approximations of the original texts used to create them. We explore this effect in the context of text simplification, demonstrating that reconstructed text embeddings preserve complexity levels. We experiment with a small feed forward neural network to effectively learn a transformation between sentence embeddings representing high-complexity and low-complexity texts. We provide comparison to a Seq2Seq and LLM-based approach, showing encouraging results in our much smaller learning setting. Finally, we demonstrate the applicability of our transformation to an unseen simplification dataset (MedEASI), as well as datasets from languages outside the training data (ES,DE). We conclude that learning transformations in sentence embedding space is a promising direction for future research and has potential to unlock the ability to develop small, but powerful models for text simplification and other natural language generation tasks.
摘要：句子嵌入可以被解码以给出用于创建它们的原始文本的近似值。我们在文本简化的背景下探索这种效果，证明重建的文本嵌入保留了复杂性级别。我们使用小型前馈神经网络进行实验，以有效学习表示高复杂性和低复杂性文本的句子嵌入之间的转换。我们提供了与 Seq2Seq 和基于 LLM 的方法的比较，在我们较小的学习环境中显示出令人鼓舞的结果。最后，我们展示了我们的转换对看不见的简化数据集（MedEASI）以及来自训练数据之外的语言的数据集（ES，DE）的适用性。我们的结论是，句子嵌入空间中的学习转换是未来研究的一个有前途的方向，并且有潜力释放开发小型但强大的文本简化和其他自然语言生成任务模型的能力。

Title: SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models

Authors: Ken Gu, Advait Bhat, Mike A Merrill, Robert West, Xin Liu, Daniel McDuff, Tim Althoff
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24427
Pdf URL: https://arxiv.org/pdf/2510.24427
Copy Paste: [[2510.24427]] SynthWorlds: Controlled Parallel Worlds for Disentangling Reasoning and Knowledge in Language Models(https://arxiv.org/abs/2510.24427)
Keywords: language model
Abstract: Evaluating the reasoning ability of language models (LMs) is complicated by their extensive parametric world knowledge, where benchmark performance often reflects factual recall rather than genuine reasoning. Existing datasets and approaches (e.g., temporal filtering, paraphrasing, adversarial substitution) cannot cleanly separate the two. We present SynthWorlds, a framework that disentangles task reasoning complexity from factual knowledge. In SynthWorlds, we construct parallel corpora representing two worlds with identical interconnected structure: a real-mapped world, where models may exploit parametric knowledge, and a synthetic-mapped world, where such knowledge is meaningless. On top of these corpora, we design two mirrored tasks as case studies: multi-hop question answering and page navigation, which maintain equal reasoning difficulty across worlds. Experiments in parametric-only (e.g., closed-book QA) and knowledge-augmented (e.g., retrieval-augmented) LM settings reveal a persistent knowledge advantage gap, defined as the performance boost models gain from memorized parametric world knowledge. Knowledge acquisition and integration mechanisms reduce but do not eliminate this gap, highlighting opportunities for system improvements. Fully automatic and scalable, SynthWorlds provides a controlled environment for evaluating LMs in ways that were previously challenging, enabling precise and testable comparisons of reasoning and memorization.
摘要：评估语言模型 (LM) 的推理能力因其广泛的参数世界知识而变得复杂，其中基准性能通常反映事实回忆而不是真正的推理。现有的数据集和方法（例如时间过滤、释义、对抗性替换）无法清楚地区分两者。我们提出了 SynthWorlds，一个将任务推理复杂性与事实知识分开的框架。在 SynthWorlds 中，我们构建了代表两个具有相同互连结构的世界的并行语料库：一个真实映射的世界，其中模型可以利用参数知识；以及一个合成映射的世界，其中此类知识毫无意义。在这些语料库之上，我们设计了两个镜像任务作为案例研究：多跳问答和页面导航，它们在不同世界中保持相同的推理难度。仅参数化（例如，闭卷 QA）和知识增强（例如，检索增强）LM 设置中的实验揭示了持续的知识优势差距，定义为从记忆的参数化世界知识中获得的性能提升模型。知识获取和整合机制减少但并未消除这一差距，凸显了系统改进的机会。 SynthWorlds 是全自动且可扩展的，它提供了一个受控环境，用于以以前具有挑战性的方式评估 LM，从而实现推理和记忆的精确且可测试的比较。

Title: LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data

Authors: Julian Valline, Cedric Lothritz, Jordi Cabot
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24434
Pdf URL: https://arxiv.org/pdf/2510.24434
Copy Paste: [[2510.24434]] LuxIT: A Luxembourgish Instruction Tuning Dataset from Monolingual Seed Data(https://arxiv.org/abs/2510.24434)
Keywords: language model, llm
Abstract: The effectiveness of instruction-tuned Large Language Models (LLMs) is often limited in low-resource linguistic settings due to a lack of high-quality training data. We introduce LuxIT, a novel, monolingual instruction tuning dataset for Luxembourgish developed to mitigate this challenge. We synthesize the dataset from a corpus of native Luxembourgish texts, utilizing DeepSeek-R1-0528, chosen for its shown proficiency in Luxembourgish. Following generation, we apply a quality assurance process, employing an LLM-as-a-judge approach. To investigate the practical utility of the dataset, we fine-tune several smaller-scale LLMs on LuxIT. Subsequent benchmarking against their base models on Luxembourgish language proficiency examinations, however, yields mixed results, with performance varying significantly across different models. LuxIT represents a critical contribution to Luxembourgish natural language processing and offers a replicable monolingual methodology, though our findings highlight the need for further research to optimize its application.
摘要：由于缺乏高质量的训练数据，指令调整的大型语言模型 (LLM) 的有效性在资源匮乏的语言环境中通常受到限制。我们推出 LuxIT，这是一种新颖的卢森堡语单语言指令调整数据集，旨在缓解这一挑战。我们利用 DeepSeek-R1-0528 从卢森堡本地文本语料库中合成数据集，该数据库因其在卢森堡语方面的熟练程度而被选中。接下来的一代，我们应用质量保证流程，采用法学硕士作为法官的方法。为了研究数据集的实际效用，我们对 LuxIT 上的几个较小规模的法学硕士进行了微调。然而，随后在卢森堡语言能力考试中针对其基本模型进行基准测试，得出的结果好坏参半，不同模型的性能差异很大。 LuxIT 代表了对卢森堡自然语言处理的关键贡献，并提供了可复制的单语言方法，尽管我们的研究结果强调需要进一步研究来优化其应用。

Title: Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content

Authors: Abdullah Mushtaq, Rafay Naeem, Ezieddin Elmahjub, Ibrahim Ghaznavi, Shawqi Al-Maliki, Mohamed Abdallah, Ala Al-Fuqaha, Junaid Qadir
Subjects: cs.CL, cs.AI, cs.CY, cs.MA
Abstract URL: https://arxiv.org/abs/2510.24438
Pdf URL: https://arxiv.org/pdf/2510.24438
Copy Paste: [[2510.24438]] Can LLMs Write Faithfully? An Agent-Based Evaluation of LLM-generated Islamic Content(https://arxiv.org/abs/2510.24438)
Keywords: language model, gpt, llm, prompt, agent
Abstract: Large language models are increasingly used for Islamic guidance, but risk misquoting texts, misapplying jurisprudence, or producing culturally inconsistent responses. We pilot an evaluation of GPT-4o, Ansari AI, and Fanar on prompts from authentic Islamic blogs. Our dual-agent framework uses a quantitative agent for citation verification and six-dimensional scoring (e.g., Structure, Islamic Consistency, Citations) and a qualitative agent for five-dimensional side-by-side comparison (e.g., Tone, Depth, Originality). GPT-4o scored highest in Islamic Accuracy (3.93) and Citation (3.38), Ansari AI followed (3.68, 3.32), and Fanar lagged (2.76, 1.82). Despite relatively strong performance, models still fall short in reliably producing accurate Islamic content and citations -- a paramount requirement in faith-sensitive writing. GPT-4o had the highest mean quantitative score (3.90/5), while Ansari AI led qualitative pairwise wins (116/200). Fanar, though trailing, introduces innovations for Islamic and Arabic contexts. This study underscores the need for community-driven benchmarks centering Muslim perspectives, offering an early step toward more reliable AI in Islamic knowledge and other high-stakes domains such as medicine, law, and journalism.
摘要：大型语言模型越来越多地用于伊斯兰指导，但存在错误引用文本、误用法理或产生文化上不一致的反应的风险。我们根据真实伊斯兰博客的提示对 GPT-4o、Ansari AI 和 Fanar 进行了试点评估。我们的双代理框架使用定量代理进行引文验证和六维评分（例如结构、伊斯兰一致性、引文）和定性代理进行五维并排比较（例如语气、深度、原创性）。 GPT-4o 在伊斯兰准确性 (3.93) 和引用 (3.38) 方面得分最高，Ansari AI 紧随其后 (3.68, 3.32)，Fanar 落后 (2.76, 1.82)。尽管性能相对较强，但模型在可靠地生成准确的伊斯兰内容和引文方面仍然存在不足——这是信仰敏感写作的首要要求。 GPT-4o 的平均定量得分最高 (3.90/5)，而 Ansari AI 在定性配对方面领先 (116/200)。 Fanar 虽然落后，但引入了针对伊斯兰和阿拉伯语环境的创新。这项研究强调了以穆斯林观点为中心的社区驱动基准的必要性，为伊斯兰知识和医学、法律和新闻等其他高风险领域的更可靠的人工智能迈出了早期一步。

Title: SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space

Authors: Viktoriia Zinkovich, Anton Antonov, Andrei Spiridonov, Denis Shepelev, Andrey Moskalenko, Daria Pugacheva, Elena Tutubalina, Andrey Kuznetsov, Vlad Shakhuro
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2510.24446
Pdf URL: https://arxiv.org/pdf/2510.24446
Copy Paste: [[2510.24446]] SPARTA: Evaluating Reasoning Segmentation Robustness through Black-Box Adversarial Paraphrasing in Text Autoencoder Latent Space(https://arxiv.org/abs/2510.24446)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) have shown impressive capabilities in vision-language tasks such as reasoning segmentation, where models generate segmentation masks based on textual queries. While prior work has primarily focused on perturbing image inputs, semantically equivalent textual paraphrases (crucial in real-world applications where users express the same intent in varied ways) remain underexplored. To address this gap, we introduce a novel adversarial paraphrasing task: generating grammatically correct paraphrases that preserve the original query meaning while degrading segmentation performance. To evaluate the quality of adversarial paraphrases, we develop a comprehensive automatic evaluation protocol validated with human studies. Furthermore, we introduce SPARTA, a black-box, sentence-level optimization method that operates in the low-dimensional semantic latent space of a text autoencoder, guided by reinforcement learning. SPARTA achieves significantly higher success rates, outperforming prior methods by up to 2x on both the ReasonSeg and LLMSeg-40k datasets. We use SPARTA and competitive baselines to assess the robustness of advanced reasoning segmentation models. We reveal that they remain vulnerable to adversarial paraphrasing, even under strict semantic and grammatical constraints. All code and data will be released publicly upon acceptance.
摘要：多模态大语言模型 (MLLM) 在视觉语言任务中表现出了令人印象深刻的能力，例如推理分割，其中模型根据文本查询生成分割掩码。虽然之前的工作主要集中在扰动图像输入，但语义等效的文本释义——在用户以不同方式表达相同意图的现实应用中至关重要——仍然没有得到充分探索。为了解决这一差距，我们引入了一种新颖的对抗性释义任务：生成语法正确的释义，保留原始查询含义，同时降低分段性能。为了评估对抗性释义的质量，我们开发了一个经过人类研究验证的综合自动评估协议。此外，我们引入了 SPARTA——一种黑盒、句子级优化方法，在强化学习的指导下，在文本自动编码器的低维语义潜在空间中运行。 SPARTA 的成功率显着提高，在 ReasonSeg 和 LLMSeg-40k 数据集上的性能比之前的方法高出 2 倍。我们使用 SPARTA 和竞争基线来评估高级推理分割模型的稳健性。我们发现，即使在严格的语义和语法限制下，它们仍然容易受到对抗性释义的影响。所有代码和数据将在接受后公开发布。

Title: Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices

Authors: Špela Vintar, Taja Kuzman Pungeršek, Mojca Brglez, Nikola Ljubešić
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24450
Pdf URL: https://arxiv.org/pdf/2510.24450
Copy Paste: [[2510.24450]] Charting the European LLM Benchmarking Landscape: A New Taxonomy and a Set of Best Practices(https://arxiv.org/abs/2510.24450)
Keywords: language model, llm
Abstract: While new benchmarks for large language models (LLMs) are being developed continuously to catch up with the growing capabilities of new models and AI in general, using and evaluating LLMs in non-English languages remains a little-charted landscape. We give a concise overview of recent developments in LLM benchmarking, and then propose a new taxonomy for the categorization of benchmarks that is tailored to multilingual or non-English use scenarios. We further propose a set of best practices and quality standards that could lead to a more coordinated development of benchmarks for European languages. Among other recommendations, we advocate for a higher language and culture sensitivity of evaluation methods.
摘要：虽然大型语言模型 (LLM) 的新基准正在不断开发，以赶上新模型和人工智能不断增长的能力，但在非英语语言中使用和评估 LLM 仍然是一个鲜为人知的领域。我们简要概述了 LLM 基准测试的最新发展，然后提出了一种新的分类法，用于针对多语言或非英语使用场景定制的基准分类。我们进一步提出了一套最佳实践和质量标准，可以促进欧洲语言基准的更加协调的发展。除其他建议外，我们主张评估方法具有更高的语言和文化敏感性。

Title: Iterative Critique-Refine Framework for Enhancing LLM Personalization

Authors: Durga Prasad Maram, Dhruvin Gandhi, Zonghai Yao, Gayathri Akkinapalli, Franck Dernoncourt, Yu Wang, Ryan A. Rossi, Nesreen K. Ahmed
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.24469
Pdf URL: https://arxiv.org/pdf/2510.24469
Copy Paste: [[2510.24469]] Iterative Critique-Refine Framework for Enhancing LLM Personalization(https://arxiv.org/abs/2510.24469)
Keywords: llm
Abstract: Personalized text generation requires models not only to produce coherent text but also to align with a target user's style, tone, and topical focus. Existing retrieval-augmented approaches such as LaMP and PGraphRAG enrich profiles with user and neighbor histories, but they stop at generation and often yield outputs that drift in tone, topic, or style. We present PerFine, a unified, training-free critique-refine framework that enhances personalization through iterative, profile-grounded feedback. In each iteration, an LLM generator produces a draft conditioned on the retrieved profile, and a critic LLM - also conditioned on the same profile - provides structured feedback on tone, vocabulary, sentence structure, and topicality. The generator then revises, while a novel knockout strategy retains the stronger draft across iterations. We further study additional inference-time strategies such as Best-of-N and Topic Extraction to balance quality and efficiency. Across Yelp, Goodreads, and Amazon datasets, PerFine consistently improves personalization over PGraphRAG, with GEval gains of +7-13%, steady improvements over 3-5 refinement iterations, and scalability with increasing critic size. These results highlight that post-hoc, profile-aware feedback offers a powerful paradigm for personalized LLM generation that is both training-free and model-agnostic.
摘要：个性化文本生成要求模型不仅能够生成连贯的文本，而且能够与目标用户的风格、语气和主题焦点保持一致。现有的检索增强方法（例如 LaMP 和 PGraphRAG）通过用户和邻居历史丰富了个人资料，但它们在生成时就停止了，并且经常产生在语气、主题或风格上漂移的输出。我们推出 PerFine，这是一个统一的、免培训的批评-提炼框架，通过迭代、基于个人资料的反馈来增强个性化。在每次迭代中，LLM 生成器都会根据检索到的配置文件生成草稿，而评论家 LLM（也以相同的配置文件为条件）提供有关语气、词汇、句子结构和主题性的结构化反馈。然后生成器进行修改，同时新颖的淘汰策略在迭代中保留更强的草稿。我们进一步研究其他推理时间策略，例如 Best-of-N 和主题提取，以平衡质量和效率。在 Yelp、Goodreads 和 Amazon 数据集上，PerFine 持续改进了 PGraphRAG 的个性化，GEval 提高了 +7-13%，在 3-5 次细化迭代中稳步改进，并且随着评论家规模的增加而扩展。这些结果强调，事后、个人资料感知反馈为个性化 LLM 生成提供了强大的范式，既无需培训，又与模型无关。
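
Sketch: The generate, critique, revise, and knockout loop described in the abstract above could be organized roughly as follows. This is a minimal sketch under stated assumptions, not the PerFine implementation: the four callables (`generate`, `critique`, `revise`, `score`) are hypothetical stand-ins for LLM calls, and the abstract does not say how the knockout step decides which draft is stronger, so a numeric `score` function is assumed here.

```python
from typing import Callable


def critique_refine(
    profile: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    revise: Callable[[str, str, str], str],
    score: Callable[[str, str], float],
    iterations: int = 3,
) -> str:
    """Iteratively refine a draft, keeping whichever draft scores higher."""
    # Initial draft conditioned on the retrieved user profile.
    best = generate(profile)
    for _ in range(iterations):
        # The critic sees the same profile as the generator.
        feedback = critique(best, profile)
        candidate = revise(best, feedback, profile)
        # Knockout (assumed form): the revision replaces the current draft
        # only if it scores strictly higher; otherwise the old draft survives.
        if score(candidate, profile) > score(best, profile):
            best = candidate
    return best
```

Because the loop only calls the four functions it is given, it needs no training and does not depend on which model sits behind each callable, which matches the training-free, model-agnostic framing in the abstract.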

Title: Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems

Authors: Yihan Li, Xiyuan Fu, Ghanshyam Verma, Paul Buitelaar, Mingming Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24476
Pdf URL: https://arxiv.org/pdf/2510.24476
Copy Paste: [[2510.24476]] Mitigating Hallucination in Large Language Models (LLMs): An Application-Oriented Survey on RAG, Reasoning, and Agentic Systems(https://arxiv.org/abs/2510.24476)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Hallucination remains one of the key obstacles to the reliable deployment of large language models (LLMs), particularly in real-world applications. Among various mitigation strategies, Retrieval-Augmented Generation (RAG) and reasoning enhancement have emerged as two of the most effective and widely adopted approaches, marking a shift from merely suppressing hallucinations to balancing creativity and reliability. However, their synergistic potential and underlying mechanisms for hallucination mitigation have not yet been systematically examined. This survey adopts an application-oriented perspective of capability enhancement to analyze how RAG, reasoning enhancement, and their integration in Agentic Systems mitigate hallucinations. We propose a taxonomy distinguishing knowledge-based and logic-based hallucinations, systematically examine how RAG and reasoning address each, and present a unified framework supported by real-world applications, evaluations, and benchmarks.
摘要：幻觉仍然是大型语言模型（LLM）可靠部署的主要障碍之一，特别是在现实世界的应用程序中。在各种缓解策略中，检索增强生成（RAG）和推理增强已成为最有效且广泛采用的两种方法，标志着从仅仅抑制幻觉到平衡创造力和可靠性的转变。然而，它们的协同潜力和减轻幻觉的潜在机制尚未得到系统研究。本调查采用面向应用的能力增强视角来分析 RAG、推理增强及其在 Agentic Systems 中的集成如何减轻幻觉。我们提出了一种区分基于知识的幻觉和基于逻辑的幻觉的分类法，系统地研究 RAG 和推理如何解决每个幻觉，并提出一个由现实世界的应用、评估和基准支持的统一框架。

Title: A word association network methodology for evaluating implicit biases in LLMs compared to humans

Authors: Katherine Abramski, Giulio Rossetti, Massimo Stella
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24488
Pdf URL: https://arxiv.org/pdf/2510.24488
Copy Paste: [[2510.24488]] A word association network methodology for evaluating implicit biases in LLMs compared to humans(https://arxiv.org/abs/2510.24488)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) become increasingly integrated into our lives, their inherent social biases remain a pressing concern. Detecting and evaluating these biases can be challenging because they are often implicit rather than explicit in nature, so developing evaluation methods that assess the implicit knowledge representations of LLMs is essential. We present a novel word association network methodology for evaluating implicit biases in LLMs based on simulating semantic priming within LLM-generated word association networks. Our prompt-based approach taps into the implicit relational structures encoded in LLMs, providing both quantitative and qualitative assessments of bias. Unlike most prompt-based evaluation methods, our method enables direct comparisons between various LLMs and humans, providing a valuable point of reference and offering new insights into the alignment of LLMs with human cognition. To demonstrate the utility of our methodology, we apply it to both humans and several widely used LLMs to investigate social biases related to gender, religion, ethnicity, sexual orientation, and political party. Our results reveal both convergences and divergences between LLM and human biases, providing new perspectives on the potential risks of using LLMs. Our methodology contributes to a systematic, scalable, and generalizable framework for evaluating and comparing biases across multiple LLMs and humans, advancing the goal of transparent and socially responsible language technologies.
摘要：随着大语言模型（LLM）越来越多地融入我们的生活，它们固有的社会偏见仍然是一个紧迫的问题。检测和评估这些偏差可能具有挑战性，因为它们本质上通常是隐性的而不是显性的，因此开发评估法学硕士隐性知识表示的评估方法至关重要。我们提出了一种新颖的单词关联网络方法，用于基于模拟法学硕士生成的单词关联网络中的语义启动来评估法学硕士中的隐性偏差。我们基于提示的方法利用了法学硕士中编码的隐含关系结构，提供了定量和定性的偏见评估。与大多数基于提示的评估方法不同，我们的方法可以在各种法学硕士和人类之间进行直接比较，提供有价值的参考点，并为法学硕士与人类认知的一致性提供新的见解。为了证明我们的方法的实用性，我们将其应用于人类和几个广泛使用的法学硕士，以调查与性别、宗教、种族、性取向和政党相关的社会偏见。我们的结果揭示了法学硕士和人类偏见之间的趋同和分歧，为使用法学硕士的潜在风险提供了新的视角。我们的方法有助于建立一个系统的、可扩展的和可概括的框架，用于评估和比较多个法学硕士和人类之间的偏见，推进透明和对社会负责的语言技术的目标。

Title: CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?

Authors: Qing Zong, Jiayu Liu, Tianshi Zheng, Chunyang Li, Baixuan Xu, Haochen Shi, Weiqi Wang, Zhaowei Wang, Chunkit Chan, Yangqiu Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24505
Pdf URL: https://arxiv.org/pdf/2510.24505
Copy Paste: [[2510.24505]] CritiCal: Can Critique Help LLM Uncertainty or Confidence Calibration?(https://arxiv.org/abs/2510.24505)
Keywords: language model, gpt, llm
Abstract: Accurate confidence calibration in Large Language Models (LLMs) is critical for safe use in high-stakes domains, where clear verbalized confidence enhances user trust. Traditional methods that mimic reference confidence expressions often fail to capture the reasoning needed for accurate confidence assessment. We propose natural language critiques as a solution, ideally suited for confidence calibration, as precise gold confidence labels are hard to obtain and often require multiple generations. This paper studies how natural language critiques can enhance verbalized confidence, addressing: (1) What to critique: uncertainty (question-focused) or confidence (answer-specific)? Analysis shows confidence suits multiple-choice tasks, while uncertainty excels in open-ended scenarios. (2) How to critique: self-critique or critique calibration training? We propose Self-Critique, enabling LLMs to critique and optimize their confidence beyond mere accuracy, and CritiCal, a novel Critique Calibration training method that leverages natural language critiques to improve confidence calibration, moving beyond direct numerical optimization. Experiments show that CritiCal significantly outperforms Self-Critique and other competitive baselines, even surpassing its teacher model, GPT-4o, in complex reasoning tasks. CritiCal also shows robust generalization in out-of-distribution settings, advancing the reliability of LLMs.
摘要：大型语言模型 (LLM) 中的准确置信度校准对于高风险领域的安全使用至关重要，在这些领域中，清晰的语言置信度可以增强用户信任。模仿参考置信度表达式的传统方法通常无法捕获准确置信度评估所需的推理。我们提出自然语言批评作为解决方案，非常适合置信度校准，因为精确的黄金置信度标签很难获得并且通常需要多代。本文研究自然语言批评如何增强言语信心，解决：（1）批评什么：不确定性（以问题为中心）或信心（特定于答案）？分析表明，信心适合多项选择任务，而不确定性则适合开放式场景。（2）如何批评：自我批评还是批评校准训练？我们提出了自我批评，使法学硕士能够批评和优化他们的信心，而不仅仅是准确性，而 CritiCal 是一种新颖的批评校准训练方法，它利用自然语言批评来改进信心校准，超越直接数值优化。实验表明，在复杂的推理任务中，Critical 显着优于 Self-Critique 和其他竞争基线，甚至超越了其教师模型 GPT-4o。 CritiCal 还在分布外环境中显示出强大的泛化能力，从而提高了法学硕士的可靠性。

Title: Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written

Authors: Venkata S Govindarajan, Laura Biester
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24538
Pdf URL: https://arxiv.org/pdf/2510.24538
Copy Paste: [[2510.24538]] Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written(https://arxiv.org/abs/2510.24538)
Keywords: llm, prompt
Abstract: Textual humor is enormously diverse and computational studies need to account for this range, including intentionally bad humor. In this paper, we curate and analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to better understand "bad" humor in English. Standard humor detection models perform poorly on our corpus, and an analysis of literary devices finds that these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction and simile. LLMs prompted to synthesize contest-style sentences imitate the form but exaggerate the effect by over-using certain literary devices, and including far more novel adjective-noun bigrams than human writers. Data, code and analysis are available at this https URL
摘要：文本幽默是多种多样的，计算研究需要考虑到这个范围，包括故意的坏幽默。在本文中，我们策划并分析了布尔沃-利顿小说大赛的小说句子语料库，以更好地理解英语中的“坏”幽默。标准幽默检测模型在我们的语料库上表现不佳，对文学手段的分析发现，这些句子将现有幽默数据集中常见的特征（例如双关语、讽刺）与隐喻、元小说和明喻相结合。促使法学硕士合成竞赛式句子的方法是模仿形式，但通过过度使用某些文学手段来夸大效果，并且包含比人类作家更多的新奇形容词-名词二元组。数据、代码和分析可在此 https URL 获取

Title: Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts

Authors: Seyoung Song, Nawon Kim, Songeun Chae, Kiwoong Park, Jiho Jin, Haneul Yoo, Kyunghyun Cho, Alice Oh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24541
Pdf URL: https://arxiv.org/pdf/2510.24541
Copy Paste: [[2510.24541]] Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts(https://arxiv.org/abs/2510.24541)
Keywords: language model
Abstract: The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 18 million documents and 5 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
摘要：朝鲜语历史的特点是口语和书面形式的差异以及从汉字到韩文字母的关键转变。然而，由于缺乏可访问的历史语料库，这种语言演变在 NLP 中很大程度上仍未得到探索。为了解决这一差距，我们引入了开放韩国历史语料库，这是一个涵盖 1,300 年和 6 种语言的大规模、公开许可的数据集，以及代表性不足的书写系统，例如韩式汉语 (Idu) 和汉字-韩文混合文字。该语料库包含来自 19 个来源的 1800 万份文档和 50 亿个标记，时间跨度从 7 世纪到 2025 年。我们利用这一资源来定量分析主要的语言变化：(1) Idu 使用率在 1860 年代达到顶峰，然后急剧下降； (2)从汉字到朝鲜文的过渡是从 1890 年左右开始的快速转变； (3) 朝鲜的词汇差异导致现代分词器产生的词汇外率高出 51 倍。这项工作通过捕捉韩语的历史，为定量历时分析提供了基础资源。此外，它可以作为大型语言模型的预训练语料库，有可能提高他们对现代韩文和古代书写系统中的中韩词汇的理解。

Title: ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?

Authors: Christine Ye, Sihan Yuan, Suchetha Cooray, Steven Dillmann, Ian L. V. Roque, Dalya Baron, Philipp Frank, Sergio Martin-Alvarez, Nolan Koblischke, Frank J Qu, Diyi Yang, Risa Wechsler, Ioana Ciuca
Subjects: cs.CL, astro-ph.IM
Abstract URL: https://arxiv.org/abs/2510.24591
Pdf URL: https://arxiv.org/pdf/2510.24591
Copy Paste: [[2510.24591]] ReplicationBench: Can AI Agents Replicate Astrophysics Research Papers?(https://arxiv.org/abs/2510.24591)
Keywords: language model, agent
Abstract: Frontier AI agents show increasing promise as scientific research assistants, and may eventually be useful for extended, open-ended research workflows. However, in order to use agents for novel research, we must first assess the underlying faithfulness and correctness of their work. To evaluate agents as research assistants, we introduce ReplicationBench, an evaluation framework that tests whether agents can replicate entire research papers drawn from the astrophysics literature. Astrophysics, where research relies heavily on archival data and computational study while requiring little real-world experimentation, is a particularly useful testbed for AI agents in scientific research. We split each paper into tasks which require agents to replicate the paper's core contributions, including the experimental setup, derivations, data analysis, and codebase. Each task is co-developed with the original paper authors and targets a key scientific result, enabling objective evaluation of both faithfulness (adherence to original methods) and correctness (technical accuracy of results). ReplicationBench is extremely challenging for current frontier language models: even the best-performing language models score under 20%. We analyze ReplicationBench trajectories in collaboration with domain experts and find a rich, diverse set of failure modes for agents in scientific research. ReplicationBench establishes the first benchmark of paper-scale, expert-validated astrophysics research tasks, reveals insights about agent performance generalizable to other domains of data-driven science, and provides a scalable framework for measuring AI agents' reliability in scientific research.
摘要：前沿人工智能代理作为科学研究助手显示出越来越大的前景，并且最终可能对扩展的、开放式的研究工作流程有用。然而，为了使用代理进行新颖的研究，我们必须首先评估他们工作的潜在忠实性和正确性。为了评估代理作为研究助理，我们引入了 ReplicationBench，这是一个评估框架，用于测试代理是否可以复制从天体物理学文献中提取的整篇研究论文。天体物理学的研究严重依赖档案数据和计算研究，同时几乎不需要现实世界的实验，对于科学研究中的人工智能代理来说，这是一个特别有用的测试平台。我们将每篇论文分成多个任务，要求代理复制论文的核心贡献，包括实验设置、推导、数据分析和代码库。每项任务均与原始论文作者共同开发，并针对关键的科学结果，从而能够客观评估忠实性（遵守原始方法）和正确性（结果的技术准确性）。 ReplicationBench 对于当前前沿语言模型来说极具挑战性：即使是性能最好的语言模型得分也低于 20%。我们与领域专家合作分析 ReplicationBench 轨迹，并为科学研究中的代理找到丰富多样的故障模式。 ReplicationBench 建立了第一个纸质规模、经过专家验证的天体物理学研究任务的基准，揭示了可推广到数据驱动科学其他领域的代理性能的见解，并提供了一个可扩展的框架来衡量人工智能代理在科学研究中的可靠性。

Title: ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization

Authors: Guoxin Chen, Jing Wu, Xinjie Chen, Wayne Xin Zhao, Ruihua Song, Chengxi Li, Kai Fan, Dayiheng Liu, Minpeng Liao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24592
Pdf URL: https://arxiv.org/pdf/2510.24592
Copy Paste: [[2510.24592]] ReForm: Reflective Autoformalization with Prospective Bounded Sequence Optimization(https://arxiv.org/abs/2510.24592)
Keywords: language model, llm
Abstract: Autoformalization, which translates natural language mathematics into machine-verifiable formal statements, is critical for using formal mathematical reasoning to solve math problems stated in natural language. While Large Language Models can generate syntactically correct formal statements, they often fail to preserve the original problem's semantic intent. This limitation arises because LLM approaches treat autoformalization as a simplistic translation task, without the mechanisms for self-reflection and iterative refinement that human experts naturally employ. To address these issues, we propose ReForm, a Reflective Autoformalization method that tightly integrates semantic consistency evaluation into the autoformalization process. This enables the model to iteratively generate formal statements, assess their semantic fidelity, and self-correct identified errors through progressive refinement. To effectively train this reflective model, we introduce Prospective Bounded Sequence Optimization (PBSO), which employs different rewards at different sequence positions to ensure that the model develops both accurate autoformalization and correct semantic validations, preventing superficial critiques that would undermine the purpose of reflection. Extensive experiments across four autoformalization benchmarks demonstrate that ReForm achieves an average improvement of 17.2 percentage points over the strongest baselines. To further ensure evaluation reliability, we introduce ConsistencyCheck, a benchmark of 859 expert-annotated items that not only validates LLMs as judges but also reveals that autoformalization is inherently difficult: even human experts produce semantic errors in up to 38.5% of cases.
摘要：自动形式化将自然语言数学转化为机器可验证的形式陈述，对于使用形式数学推理解决自然语言陈述的数学问题至关重要。虽然大型语言模型可以生成语法正确的形式语句，但它们通常无法保留原始问题的语义意图。这种限制源于法学硕士方法将自动形式化视为一种简单化的翻译任务，缺乏人类专家自然采用的自我反思和迭代细化的机制。为了解决这些问题，我们提出了 ReForm，一种反射式自动形式化方法，它将语义一致性评估紧密集成到自动形式化过程中。这使得模型能够迭代地生成正式语句，评估其语义保真度，并通过逐步细化自我纠正已识别的错误。为了有效地训练这种反射模型，我们引入了前瞻性有界序列优化（PBSO），它在不同的序列位置采用不同的奖励，以确保模型开发准确的自动形式化和正确的语义验证，防止会破坏反射目的的肤浅批评。跨越四个自动形式化基准的广泛实验表明，ReForm 比最强基准平均提高了 17.2 个百分点。为了进一步确保评估的可靠性，我们引入了 ConsistencyCheck，这是一个包含 859 个专家注释项目的基准，不仅验证了法学硕士作为法官的能力，而且还揭示了自动形式化本质上是困难的：即使是人类专家也会在高达 38.5% 的情况下产生语义错误。

Title: Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way

Authors: Yicun Yang, Cong Wang, Shaobo Wang, Zichen Wen, Biqing Qi, Hanlin Xu, Linfeng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24605
Pdf URL: https://arxiv.org/pdf/2510.24605
Copy Paste: [[2510.24605]] Diffusion LLM with Native Variable Generation Lengths: Let [EOS] Lead the Way(https://arxiv.org/abs/2510.24605)
Keywords: language model, llm
Abstract: Diffusion-based large language models (dLLMs) have exhibited substantial potential for parallel text generation, which may enable more efficient generation compared to autoregressive models. However, current dLLMs suffer from fixed generation lengths: the generation length has to be set as a hyperparameter before decoding, which hurts both efficiency and flexibility. To solve these problems, we propose to train a diffusion LLM with native variable generation lengths, abbreviated as dLLM-Var. Concretely, we train a model to accurately predict the [EOS] token in the generated text, which enables a dLLM to natively infer in a block diffusion manner while still maintaining global bi-directional (full) attention and high parallelism. Experiments on standard benchmarks demonstrate that our method achieves a 30.1x speedup over traditional dLLM inference paradigms and a 2.4x speedup relative to autoregressive models such as Qwen and Llama. Our method achieves higher accuracy and faster inference, elevating dLLMs beyond mere academic novelty and supporting their practical use in real-world applications. Code and models have been released.
摘要：基于扩散的大语言模型 (dLLM) 在并行文本生成方面表现出了巨大的潜力，与自回归模型相比，这可能会实现更高效的生成。然而，当前的 dLLM 存在固定的生成长度，这表明 dLLM 的生成长度必须在解码为超参数之前确定，从而导致效率和灵活性方面的问题。为了解决这些问题，在这项工作中，我们建议训练具有原生可变生成长度的扩散LLM，缩写为dLLM-Var。具体来说，我们的目标是训练一个模型来准确预测生成文本中的 [EOS] 标记，这使得 dLLM 能够以块扩散方式进行本机推理，同时仍然保持全局双向（完全）注意力和高并行性的能力。标准基准测试的实验表明，我们的方法比传统 dLLM 推理范例实现了 30.1 倍的加速，相对于 Qwen 和 Llama 等自回归模型实现了 2.4 倍的加速。我们的方法实现了更高的准确性和更快的推理速度，使 dLLM 超越了单纯的学术新颖性，并支持它们在现实世界应用中的实际使用。代码和模型已发布。

Title: Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs

Authors: Siheng Xiong, Joe Zou, Faramarz Fekri, Yae Jee Cho
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24606
Pdf URL: https://arxiv.org/pdf/2510.24606
Copy Paste: [[2510.24606]] Long-Context Modeling with Dynamic Hierarchical Sparse Attention for On-Device LLMs(https://arxiv.org/abs/2510.24606)
Keywords: llm
Abstract: The quadratic cost of attention hinders the scalability of long-context LLMs, especially in resource-constrained settings. Existing static sparse methods such as sliding windows or global tokens exploit the sparsity of attention to reduce its cost, but because they are static they adapt poorly to content-dependent variations in attention. While previous work has proposed several dynamic approaches to improve flexibility, they still depend on predefined templates or heuristic mechanisms. Such strategies reduce generality and prune tokens that remain contextually important, limiting their accuracy across diverse tasks. To tackle these bottlenecks of existing methods for long-context modeling, we introduce Dynamic Hierarchical Sparse Attention (DHSA), a data-driven framework that dynamically predicts attention sparsity online without retraining. Our proposed DHSA adaptively segments sequences into variable-length chunks, then computes chunk representations by aggregating the token embeddings within each chunk. To avoid the bias introduced by varying chunk lengths, we apply length-normalized aggregation that scales the averaged embeddings by the square root of the chunk size. Finally, DHSA upsamples the chunk-level similarity scores to token-level similarities to calculate importance scores that determine which token-level interactions should be preserved. Our experiments on Gemma2 with the Needle-in-a-Haystack test and LongBench show that DHSA matches dense attention in accuracy, while reducing prefill latency by 20-60% and peak memory usage by 35%. Compared to other representative baselines such as block sparse attention, DHSA achieves consistently higher accuracy (6-18% relative gains) with comparable or lower cost, offering an efficient and adaptable solution for long-context on-device LLMs.
摘要：注意力的二次成本阻碍了长上下文法学硕士的可扩展性，特别是在资源有限的环境中。现有的静态稀疏方法（例如滑动窗口或全局令牌）利用注意力的稀疏性来降低注意力成本，但由于其静态性，很难适应内容相关的注意力变化。虽然之前的工作提出了几种动态方法来提高灵活性，但它们仍然依赖于预定义的模板或启发式机制。此类策略降低了通用性，并修剪了上下文中仍然重要的标记，从而限制了它们在不同任务中的准确性。为了解决长上下文建模现有方法的这些瓶颈，我们引入了动态分层稀疏注意力（DHSA），这是一种数据驱动的框架，可以在线动态预测注意力稀疏性而无需重新训练。我们提出的 DHSA 自适应地将序列分割成可变长度的块，然后通过聚合每个块内的令牌嵌入来计算块表示。为了避免因改变块长度而引入的偏差，我们应用长度归一化聚合，通过块大小的平方根来缩放平均嵌入。最后，DHSA 将块级相似性分数上采样为令牌级相似性，以计算重要性分数，从而确定应保留哪些令牌级交互。我们在 Gemma2 上进行的大海捞针测试和 LongBench 实验表明，DHSA 在准确性方面与密集注意力相匹配，同时将预填充延迟减少了 20-60%，并将峰值内存使用量减少了 35%。与其他代表性基线（例如块稀疏注意力）相比，DHSA 以相当或更低的成本实现了一致更高的准确性（6-18% 相对增益），为长上下文设备上 LLM 提供了高效且适应性强的解决方案。
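
The chunk aggregation and upsampling steps described in the DHSA abstract above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, chunk boundaries are taken as given rather than predicted, similarity is a plain dot product, and "scales the averaged embeddings by the square root of the chunk size" is assumed to mean multiplying the mean embedding by sqrt(n).

```python
import math


def chunk_representation(token_embeddings):
    """Length-normalized aggregation (assumed form): mean embedding times sqrt(chunk size)."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    mean = [sum(tok[d] for tok in token_embeddings) / n for d in range(dim)]
    scale = math.sqrt(n)
    return [scale * m for m in mean]


def dot(a, b):
    """Plain dot product, used here as the chunk-level similarity."""
    return sum(x * y for x, y in zip(a, b))


def token_level_scores(query_chunks, key_chunks):
    """Score chunk pairs, then upsample by repeating each chunk score over its tokens."""
    q_reps = [chunk_representation(c) for c in query_chunks]
    k_reps = [chunk_representation(c) for c in key_chunks]
    rows = []
    for q_chunk, q_rep in zip(query_chunks, q_reps):
        row = []
        for k_chunk, k_rep in zip(key_chunks, k_reps):
            # Every key token in a chunk inherits that chunk's similarity score.
            row.extend([dot(q_rep, k_rep)] * len(k_chunk))
        # Every query token in a chunk gets its own copy of the same score row.
        rows.extend([list(row) for _ in q_chunk])
    return rows


def top_k_keys(score_row, k):
    """Indices of the k highest-scoring key tokens for one query token, in position order."""
    ranked = sorted(range(len(score_row)), key=lambda j: score_row[j], reverse=True)
    return sorted(ranked[:k])
```

In this sketch, `top_k_keys` stands in for the step that decides which token-level interactions to preserve; a real system would apply the resulting mask inside the attention kernel.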

Title: Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation

Authors: Snegha A (1), Sayambhu Sen (2), Piyush Singh Pasi (2), Abhishek Singhania (2), Preethi Jyothi (1) ((1) Indian Institute of Technology Bombay, (2) Amazon Alexa)
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.24619
Pdf URL: https://arxiv.org/pdf/2510.24619
Copy Paste: [[2510.24619]] Zero-Shot Cross-Lingual Transfer using Prefix-Based Adaptation(https://arxiv.org/abs/2510.24619)
Keywords: language model, llm, prompt
Abstract: With the release of new large language models (LLMs) like Llama and Mistral, zero-shot cross-lingual transfer has become increasingly feasible due to their multilingual pretraining and strong generalization capabilities. However, adapting these decoder-only LLMs to new tasks across languages remains challenging. While parameter-efficient fine-tuning (PeFT) techniques like Low-Rank Adaptation (LoRA) are widely used, prefix-based techniques such as soft prompt tuning, prefix tuning, and Llama Adapter are less explored, especially for zero-shot transfer in decoder-only models. We present a comprehensive study of three prefix-based methods for zero-shot cross-lingual transfer from English to 35+ high- and low-resource languages. Our analysis further explores transfer across linguistic families and scripts, as well as the impact of scaling model sizes from 1B to 24B. With Llama 3.1 8B, prefix methods outperform LoRA-baselines by up to 6% on the Belebele benchmark. Similar improvements were observed with Mistral v0.3 7B as well. Despite using only 1.23M learning parameters with prefix tuning, we achieve consistent improvements across diverse benchmarks. These findings highlight the potential of prefix-based techniques as an effective and scalable alternative to LoRA, particularly in low-resource multilingual settings.
摘要：随着 Llama 和 Mistral 等新的大语言模型 (LLM) 的发布，由于其多语言预训练和强大的泛化能力，零样本跨语言迁移变得越来越可行。然而，使这些仅解码器的法学硕士适应跨语言的新任务仍然具有挑战性。虽然低秩适应 (LoRA) 等参数高效微调 (PeFT) 技术被广泛使用，但基于前缀的技术（例如软提示调整、前缀调整和 Llama Adapter）的探索较少，特别是对于仅解码器模型中的零样本传输。我们对三种基于前缀的方法进行了全面研究，用于从英语到 35 多种高资源和低资源语言的零样本跨语言迁移。我们的分析进一步探讨了跨语言家族和文字的迁移，以及将模型大小从 1B 缩放到 24B 的影响。在 Llama 3.1 8B 中，前缀方法在 Belebele 基准上的性能比 LoRA 基线高出 6%。 Mistral v0.3 7B 也观察到了类似的改进。尽管仅使用 123 万个学习参数和前缀调整，但我们在不同的基准测试中实现了一致的改进。这些发现凸显了基于前缀的技术作为 LoRA 的有效且可扩展的替代方案的潜力，特别是在资源匮乏的多语言环境中。

Title: Relative Scaling Laws for LLMs

Authors: William Held, David Hall, Percy Liang, Diyi Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24626
Pdf URL: https://arxiv.org/pdf/2510.24626
Copy Paste: [[2510.24626]] Relative Scaling Laws for LLMs(https://arxiv.org/abs/2510.24626)
Keywords: language model, llm
Abstract: Scaling laws describe how language models improve with additional data, parameters, and compute. While widely used, they are typically measured on aggregate test sets. Aggregate evaluations yield clean trends but average over heterogeneous subpopulations, obscuring performance disparities. We introduce relative scaling laws, which track how performance gaps between test distributions evolve with scale rather than focusing solely on absolute error. Using 255 decoder-only Transformers trained under matched-compute (IsoFLOP) budgets from $10^{18}$--$10^{20}$ FLOPs on standard pretraining datasets, we find diverse trajectories: academic domains on MMLU converge toward parity; regional English dialects shift depending on population size; and clusters of AI risk behaviours split, with capability- and influence-related risks increasing during pretraining while adversarial risks do not. These results show that although scaling improves overall performance, it is not a universal equalizer. To support further study, we release all model checkpoints from this work to enable practitioners to measure relative alongside traditional scaling laws, in order to better prioritize robustness challenges in light of the bitter lesson.
摘要：缩放法则描述了语言模型如何通过额外的数据、参数和计算来改进。虽然它们被广泛使用，但通常是在聚合测试集上进行测量的。总体评估得出清晰的趋势，但对异质子群体进行平均，从而掩盖了绩效差异。我们引入了相对缩放定律，该定律跟踪测试分布之间的性能差距如何随缩放而变化，而不是仅仅关注绝对误差。使用 255 个仅解码器 Transformer，在标准预训练数据集上以 10^{18}$--$10^{20}$ FLOP 的匹配计算 (IsoFLOP) 预算进行训练，我们发现了不同的轨迹：MMLU 上的学术领域趋向于平等；不同地区的英语方言会根据人口规模而变化；人工智能风险行为集群分裂，在预训练期间与能力和影响力相关的风险增加，而对抗性风险却没有增加。这些结果表明，虽然缩放提高了整体性能，但它并不是通用的均衡器。为了支持进一步的研究，我们发布了这项工作中的所有模型检查点，使从业者能够与传统的缩放法则一起测量相对值，以便根据惨痛的教训更好地优先考虑鲁棒性挑战。

Title: "Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue

Authors: Anh Ngo, Nicolas Rollet, Catherine Pelachaud, Chloe Clavel
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24628
Pdf URL: https://arxiv.org/pdf/2510.24628
Copy Paste: [[2510.24628]] "Mm, Wat?" Detecting Other-initiated Repair Requests in Dialogue(https://arxiv.org/abs/2510.24628)
Keywords: prompt, agent
Abstract: Maintaining mutual understanding is a key component in human-human conversation to avoid conversation breakdowns, in which repair, particularly Other-Initiated Repair (OIR, when one speaker signals trouble and prompts the other to resolve), plays a vital role. However, Conversational Agents (CAs) still fail to recognize user repair initiation, leading to breakdowns or disengagement. This work proposes a multimodal model to automatically detect repair initiation in Dutch dialogues by integrating linguistic and prosodic features grounded in Conversation Analysis. The results show that prosodic cues complement linguistic features and significantly improve the results of pretrained text and audio embeddings, offering insights into how different features interact. Future directions include incorporating visual cues, exploring multilingual and cross-context corpora to assess the robustness and generalizability.
摘要：保持相互理解是人与人对话中避免对话失败的关键组成部分，其中修复，尤其是其他发起的修复（OIR，当一个发言者发出故障信号并提示另一方解决时）起着至关重要的作用。然而，会话代理 (CA) 仍然无法识别用户修复启动，从而导致故障或脱离。这项工作提出了一种多模态模型，通过集成基于对话分析的语言和韵律特征来自动检测荷兰语对话中的修复启动。结果表明，韵律线索补充了语言特征，并显着改善了预训练文本和音频嵌入的结果，提供了对不同特征如何相互作用的见解。未来的方向包括结合视觉线索、探索多语言和跨上下文语料库以评估稳健性和普遍性。

Title: OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning

Authors: Ziyou Hu, Zhengliang Shi, Minghang Zhu, Haitao Li, Teng Sun, Pengjie Ren, Suzan Verberne, Zhaochun Ren
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24636
Pdf URL: https://arxiv.org/pdf/2510.24636
Copy Paste: [[2510.24636]] OpenReward: Learning to Reward Long-form Agentic Tasks via Reinforcement Learning(https://arxiv.org/abs/2510.24636)
Keywords: language model, llm, agent
Abstract: Reward models (RMs) have become essential for aligning large language models (LLMs), serving as scalable proxies for human evaluation in both training and inference. However, existing RMs struggle on knowledge-intensive and long-form tasks, where evaluating correctness requires grounding beyond the model's internal knowledge. This limitation hinders them from reliably discriminating subtle quality differences, especially when external evidence is necessary. To address this, we introduce OpenRM, a tool-augmented long-form reward model that systematically judges open-ended responses by invoking external tools to gather relevant evidence. We train OpenRM with Group Relative Policy Optimization (GRPO) on over 27K synthesized pairwise examples generated through a controllable data synthesis framework. The training objective jointly supervises intermediate tool usage and final outcome accuracy, incentivizing our reward model to learn effective evidence-based judgment strategies. Extensive experiments on three newly-collected datasets and two widely-used benchmarks demonstrate that OpenRM substantially outperforms existing reward modeling approaches. As a further step, we integrate OpenRM into both inference-time response selection and training-time data selection. This yields consistent gains in downstream LLM alignment tasks, highlighting the potential of tool-augmented reward models for scaling reliable long-form evaluation.
摘要：奖励模型 (RM) 已成为协调大型语言模型 (LLM) 的关键，可作为训练和推理中人类评估的可扩展代理。然而，现有的 RM 在知识密集型和长期任务上遇到了困难，其中评估正确性需要超越模型的内部知识。这种限制阻碍了他们可靠地区分细微的质量差异，特别是在需要外部证据的情况下。为了解决这个问题，我们引入了 OpenRM，这是一种工具增强的长形式奖励模型，它通过调用外部工具收集相关证据来系统地判断开放式响应。我们使用组相对策略优化 (GRPO) 在通过可控数据合成框架生成的超过 27K 合成成对示例上训练 OpenRM。训练目标共同监督中间工具的使用和最终结果的准确性，激励我们的奖励模型学习有效的基于证据的判断策略。对三个新收集的数据集和两个广泛使用的基准进行的大量实验表明，OpenRM 的性能大大优于现有的奖励建模方法。更进一步，我们将 OpenRM 集成到推理时间响应选择和训练时间数据选择中。这在下游法学硕士对齐任务中产生了持续的收益，凸显了工具增强奖励模型在扩展可靠的长格式评估方面的潜力。
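
The abstract above states that OpenRM is trained with Group Relative Policy Optimization (GRPO). As a rough illustration of the group-relative part only, the sketch below normalizes each sampled response's reward against the mean and standard deviation of its own group. The function name, the use of the population standard deviation, and the `eps` term are assumptions made for illustration; the paper's actual reward, which jointly scores tool usage and final outcome, is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations vary
    # eps keeps the division defined when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Responses that beat their group's average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.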

Title: Optimizing Retrieval for RAG via Reinforced Contrastive Learning

Authors: Jiawei Zhou, Lei Chen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2510.24652
Pdf URL: https://arxiv.org/pdf/2510.24652
Copy Paste: [[2510.24652]] Optimizing Retrieval for RAG via Reinforced Contrastive Learning(https://arxiv.org/abs/2510.24652)
Keywords: llm, retrieval-augmented generation
Abstract: As retrieval-augmented generation (RAG) becomes increasingly widespread, the role of information retrieval (IR) is shifting from retrieving information for human users to retrieving contextual knowledge for artificial intelligence (AI) systems, where relevance becomes difficult to define or annotate beforehand. To address this challenge, we propose R3, a Retrieval framework optimized for RAG through trial-and-feedback Reinforced contrastive learning. Unlike prior approaches that rely on annotated or synthetic data for supervised fine-tuning, R3 enables the retriever to dynamically explore and optimize relevance within the RAG environment. During training, the retrieved results interact with the environment to produce contrastive signals that automatically guide the retriever's self-improvement. Extensive experiments across diverse tasks demonstrate that R3 improves RAG performance by 5.2% over the original retriever and surpasses state-of-the-art retrievers by 4.9%, while achieving comparable results to LLM-augmented retrieval and RAG systems built on post-trained or instruction-tuned LLMs. It is both efficient and practical, requiring only 4 GPUs and completing training within a single day.
摘要：随着检索增强生成（RAG）变得越来越普遍，信息检索（IR）的作用正在从为人类用户检索信息转变为为人工智能（AI）系统检索上下文知识，其中相关性变得难以预先定义或注释。为了应对这一挑战，我们提出了 R3，这是一种通过试验和反馈强化对比学习针对 RAG 进行优化的检索框架。与之前依赖注释或合成数据进行监督微调的方法不同，R3 使检索器能够动态探索和优化 RAG 环境中的相关性。在训练过程中，检索到的结果与环境相互作用，产生对比信号，自动指导检索器的自我改进。跨不同任务的大量实验表明，R3 的 RAG 性能比原始检索器提高了 5.2%，比最先进的检索器提高了 4.9%，同时实现了与 LLM 增强检索和基于后训练或指令调整的 LLM 构建的 RAG 系统相当的结果。高效实用，仅需4块GPU，一天即可完成训练。

Title: Evolving Diagnostic Agents in a Virtual Clinical Environment

Authors: Pengcheng Qiu, Chaoyi Wu, Junwei Liu, Qiaoyu Zheng, Yusheng Liao, Haowen Wang, Yun Yue, Qianrui Fan, Shuai Zhen, Jian Wang, Jinjie Gu, Yanfeng Wang, Ya Zhang, Weidi Xie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24654
Pdf URL: https://arxiv.org/pdf/2510.24654
Copy Paste: [[2510.24654]] Evolving Diagnostic Agents in a Virtual Clinical Environment(https://arxiv.org/abs/2510.24654)
Keywords: language model, gpt, llm, prompt, agent
Abstract: In this paper, we present a framework for training large language models (LLMs) as diagnostic agents with reinforcement learning, enabling them to manage multi-turn diagnostic processes, adaptively select examinations, and commit to final diagnoses. Unlike instruction-tuned models trained on static case summaries, our method acquires diagnostic strategies through interactive exploration and outcome-based feedback. Our contributions are fourfold: (i) We present DiagGym, a diagnostics world model trained with electronic health records that emits examination outcomes conditioned on patient history and recommended examination, serving as a virtual clinical environment for realistic diagnosis training and evaluation; (ii) We train DiagAgent via end-to-end, multi-turn reinforcement learning to learn diagnostic policies that optimize both information yield and diagnostic accuracy; (iii) We introduce DiagBench, a diagnostic benchmark comprising 750 cases with physician-validated examination recommendations and 99 cases annotated with 973 physician-written rubrics on the diagnosis process; (iv) We demonstrate superior performance across diverse diagnostic settings. DiagAgent significantly outperforms 10 state-of-the-art LLMs, including DeepSeek-v3 and GPT-4o, as well as two prompt-engineered agents. In single-turn settings, DiagAgent achieves 9.34% higher diagnostic accuracy and a 44.03% improvement in examination recommendation hit ratio. In end-to-end settings, it delivers a 15.12% increase in diagnostic accuracy and a 23.09% boost in examination recommendation F1 score. In rubric-based evaluation, it surpasses the next-best model, Claude-sonnet-4, by 7.1% in weighted rubric score. These findings indicate that learning policies in interactive clinical environments confers dynamic and clinically meaningful diagnostic management abilities unattainable through passive training alone.
摘要：在本文中，我们提出了一个框架，用于通过强化学习来训练大型语言模型（LLM）作为诊断代理，使它们能够管理多轮诊断过程，自适应地选择检查并做出最终诊断。与根据静态病例摘要训练的指令调整模型不同，我们的方法通过交互式探索和基于结果的反馈来获取诊断策略。我们的贡献有四个方面：(i) 我们提出 DiagGym，这是一个用电子健康记录训练的诊断世界模型，可根据患者病史和建议的检查给出检查结果，作为用于真实诊断训练和评估的虚拟临床环境； (ii) 我们通过端到端、多轮强化学习来训练 DiagAgent，以学习优化信息产量和诊断准确性的诊断策略； (iii) 我们推出 DiagBench，这是一个诊断基准，包括 750 个病例，其中包含经过医生验证的检查建议，以及 99 个病例，注释有 973 个医生编写的诊断过程细则； (iv) 我们在不同的诊断环境中展示了卓越的性能。 DiagAgent 的性能显着优于 10 个最先进的 LLM，包括 DeepSeek-v3 和 GPT-4o，以及两个经过提示工程设计的代理。在单轮设置中，DiagAgent 的诊断准确率提高了 9.34%，检查推荐命中率提高了 44.03%。在端到端设置中，诊断准确性提高了 15.12%，检查建议 F1 分数提高了 23.09%。在基于评分标准的评估中，它的加权评分超过了次优模型 Claude-sonnet-4 7.1%。这些发现表明，交互式临床环境中的学习策略赋予了动态且具有临床意义的诊断管理能力，这是仅通过被动培训无法实现的。

Title: InteractComp: Evaluating Search Agents With Ambiguous Queries

Authors: Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Yang, Dayi Miao, Zhaoyang Yu, Yifan Wu, Yanfei Zhang, Fengwei Teng, Yingjia Wan, Song Hu, Yude Li, Xin Jin, Conghao Hu, Haoyu Li, Qirui Fu, Tai Zhong, Xinyu Wang, Xiangru Tang, Nan Tang, Chenglin Wu, Yuyu Luo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24668
Pdf URL: https://arxiv.org/pdf/2510.24668
Copy Paste: [[2510.24668]] InteractComp: Evaluating Search Agents With Ambiguous Queries(https://arxiv.org/abs/2510.24668)
Keywords: agent
Abstract: Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at this https URL.
摘要：语言代理在网络搜索和信息检索方面表现出了巨大的潜力。然而，这些搜索代理假设用户查询是完整且明确的，这种假设与现实不同，现实中用户从不完整的查询开始，需要通过交互进行澄清。然而，大多数智能体在搜索过程中缺乏交互机制，现有的基准无法评估这种能力。为了解决这一差距，我们引入了 InteractComp，这是一个基准测试，旨在评估搜索代理是否能够识别查询歧义并在搜索过程中主动交互来解决它。遵循易于验证、交互消除歧义的原则，我们通过目标干扰方法构建了 9 个领域的 210 个专家策划的问题，该方法产生了只有通过交互才能解决的真正模糊性。对 17 个模型的评估显示出惊人的失败：最好的模型仅达到 13.73% 的准确率，尽管有完整的上下文，准确率达到 71.50%，暴露出系统性的过度自信，而不是推理缺陷。强制互动会产生巨大的收益，展示出当前策略无法发挥的潜在能力。纵向分析显示，交互能力在 15 个月内停滞不前，而搜索性能却提高了七倍，揭示了一个关键盲点。这种停滞，加上搜索任务固有的即时反馈，使 InteractComp 成为评估和培训搜索代理交互能力的宝贵资源。该代码可从此 https URL 获取。

Title: Dissecting Role Cognition in Medical LLMs via Neuronal Ablation

Authors: Xun Liang, Huayi Lai, Hanyu Wang, Wentao Zhang, Linfeng Zhang, Yanfang Chen, Feiyu Xiong, Zhiyu Li
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24677
Pdf URL: https://arxiv.org/pdf/2510.24677
Copy Paste: [[2510.24677]] Dissecting Role Cognition in Medical LLMs via Neuronal Ablation(https://arxiv.org/abs/2510.24677)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) have gained significant traction in medical decision support systems, particularly in the context of medical question answering and role-playing simulations. A common practice, Prompt-Based Role Playing (PBRP), instructs models to adopt different clinical roles (e.g., medical students, residents, attending physicians) to simulate varied professional behaviors. However, the impact of such role prompts on model reasoning capabilities remains unclear. This study introduces the RP-Neuron-Activated Evaluation Framework (RPNA) to evaluate whether role prompts induce distinct, role-specific cognitive processes in LLMs or merely modify linguistic style. We test this framework on three medical QA datasets, employing neuron ablation and representation analysis techniques to assess changes in reasoning pathways. Our results demonstrate that role prompts do not significantly enhance the medical reasoning abilities of LLMs. Instead, they primarily affect surface-level linguistic features, with no evidence of distinct reasoning pathways or cognitive differentiation across clinical roles. Despite superficial stylistic changes, the core decision-making mechanisms of LLMs remain uniform across roles, indicating that current PBRP methods fail to replicate the cognitive complexity found in real-world medical practice. This highlights the limitations of role-playing in medical AI and emphasizes the need for models that simulate genuine cognitive processes rather than linguistic imitation. We have released the related code in the following repository: https://github.com/IAAR-Shanghai/RolePlay_LLMDoctor
摘要：大型语言模型 (LLM) 在医疗决策支持系统中获得了巨大的关注，特别是在医疗问答和角色扮演模拟的背景下。一种常见的做法是基于提示的角色扮演（PBRP），指示模型采用不同的临床角色（例如医学生、住院医生、主治医生）来模拟不同的专业行为。然而，此类角色提示对模型推理能力的影响仍不清楚。本研究引入了 RP 神经元激活评估框架（RPNA）来评估角色提示是否会在 LLM 中引发独特的、特定于角色的认知过程，或者仅仅改变语言风格。我们在三个医学 QA 数据集上测试了这个框架，采用神经元消融和表示分析技术来评估推理路径的变化。我们的结果表明，角色提示并不能显着提高 LLM 的医学推理能力。相反，它们主要影响表面水平的语言特征，没有证据表明不同临床角色之间存在明显的推理途径或认知差异。尽管表面上有风格上的变化，但 LLM 的核心决策机制在各个角色中仍然保持一致，这表明当前的 PBRP 方法无法复制现实世界医疗实践中发现的认知复杂性。这凸显了角色扮演在医疗人工智能中的局限性，并强调需要模拟真实认知过程而不是语言模仿的模型。我们已在以下存储库中发布了相关代码：https://github.com/IAAR-Shanghai/RolePlay_LLMDoctor

Title: Repurposing Synthetic Data for Fine-grained Search Agent Supervision

Authors: Yida Zhao, Kuan Li, Xixi Wu, Liwen Zhang, Dingchu Zhang, Baixuan Li, Maojia Song, Zhuo Chen, Chenxi Wang, Xinyu Wang, Kewei Tu, Pengjun Xie, Jingren Zhou, Yong Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24694
Pdf URL: https://arxiv.org/pdf/2510.24694
Copy Paste: [[2510.24694]] Repurposing Synthetic Data for Fine-grained Search Agent Supervision(https://arxiv.org/abs/2510.24694)
Keywords: llm, agent
Abstract: LLM-based search agents are increasingly trained on entity-centric synthetic data to solve complex, knowledge-intensive tasks. However, prevailing training methods like Group Relative Policy Optimization (GRPO) discard this rich entity information, relying instead on sparse, outcome-based rewards. This critical limitation renders them unable to distinguish informative "near-miss" samples (those with substantially correct reasoning but a flawed final answer) from complete failures, thus discarding valuable learning signals. We address this by leveraging the very entities discarded during training. Our empirical analysis reveals a strong positive correlation between the number of ground-truth entities identified during an agent's reasoning process and final answer accuracy. Building on this insight, we introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function. E-GRPO assigns partial rewards to incorrect samples proportional to their entity match rate, enabling the model to effectively learn from these "near-misses". Experiments on diverse question-answering (QA) and deep research benchmarks show that E-GRPO consistently and significantly outperforms the GRPO baseline. Furthermore, our analysis reveals that E-GRPO not only achieves superior accuracy but also induces more efficient reasoning policies that require fewer tool calls, demonstrating a more effective and sample-efficient approach to aligning search agents.
摘要：基于 LLM 的搜索代理越来越多地接受以实体为中心的合成数据的训练，以解决复杂的知识密集型任务。然而，像组相对策略优化（GRPO）这样的流行训练方法放弃了这种丰富的实体信息，而是依赖于稀疏的、基于结果的奖励。这一关键限制使它们无法区分信息丰富的“险些答对”样本（那些推理基本正确但最终答案有缺陷的样本）和完全失败的样本，从而丢弃了有价值的学习信号。我们通过利用训练期间丢弃的实体来解决这个问题。我们的实证分析表明，在智能体推理过程中识别的真实实体数量与最终答案准确性之间存在很强的正相关性。基于这一见解，我们引入了实体感知组相对策略优化（E-GRPO），这是一种新颖的框架，可以制定密集的实体感知奖励函数。 E-GRPO 按照实体匹配率成比例地为错误样本分配部分奖励，使模型能够有效地从这些“险些答对”的样本中学习。各种问答 (QA) 和深度研究基准的实验表明，E-GRPO 始终显着优于 GRPO 基线。此外，我们的分析表明，E-GRPO 不仅实现了卓越的准确性，而且还引入了更有效的推理策略，需要更少的工具调用，展示了一种更有效和样本效率更高的对齐搜索代理的方法。
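
The reward shaping described in the E-GRPO abstract above can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the substring-based entity matching, the `alpha` scaling factor, and all function names are assumptions introduced here, and the advantage step is the usual GRPO-style normalization within one group of rollouts.

```python
from typing import Iterable, List


def entity_match_rate(trajectory_text: str, gold_entities: Iterable[str]) -> float:
    """Fraction of ground-truth entities mentioned in the reasoning text.

    Assumption: case-insensitive substring matching stands in for whatever
    entity matching the paper actually uses.
    """
    entities = [e.lower() for e in gold_entities]
    if not entities:
        return 0.0
    text = trajectory_text.lower()
    hits = sum(1 for e in entities if e in text)
    return hits / len(entities)


def entity_aware_reward(is_correct: bool, match_rate: float, alpha: float = 0.5) -> float:
    """Full reward for a correct answer; partial reward for a near-miss.

    Assumption: `alpha` (hypothetical) caps the partial reward below 1.0 so an
    incorrect sample never outranks a correct one.
    """
    if is_correct:
        return 1.0
    return alpha * match_rate


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of rollouts (mean 0, unit scale)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    gold = ["Marie Curie", "Sorbonne"]
    rollouts = [
        ("Marie Curie taught at the Sorbonne.", True),   # correct answer
        ("Marie Curie worked in Paris.", False),         # near-miss
        ("No relevant facts found.", False),             # complete failure
    ]
    rewards = [entity_aware_reward(ok, entity_match_rate(text, gold)) for text, ok in rollouts]
    print(rewards)
    print(group_relative_advantages(rewards))
```

With an outcome-only reward, the second and third rollouts would both receive 0 and be indistinguishable; with the entity-aware term, the near-miss gets a higher advantage than the complete failure.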

Title: AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis

Authors: Xuanzhong Chen, Zile Qiao, Guoxin Chen, Liangcai Su, Zhen Zhang, Xinyu Wang, Pengjun Xie, Fei Huang, Jingren Zhou, Yong Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24695
Pdf URL: https://arxiv.org/pdf/2510.24695
Copy Paste: [[2510.24695]] AgentFrontier: Expanding the Capability Frontier of LLM Agents with ZPD-Guided Data Synthesis(https://arxiv.org/abs/2510.24695)
Keywords: language model, llm, agent
Abstract: Training large language model agents on tasks at the frontier of their capabilities is key to unlocking advanced reasoning. We introduce a data synthesis approach inspired by the educational theory of the Zone of Proximal Development (ZPD), which defines this frontier as tasks an LLM cannot solve alone but can master with guidance. To operationalize this, we present the AgentFrontier Engine, an automated pipeline that synthesizes high-quality, multidisciplinary data situated precisely within the LLM's ZPD. This engine supports both continued pre-training with knowledge-intensive data and targeted post-training on complex reasoning tasks. From the same framework, we derive the ZPD Exam, a dynamic and automated benchmark designed to evaluate agent capabilities on these frontier tasks. We train the AgentFrontier-30B-A3B model on our synthesized data, which achieves state-of-the-art results on demanding benchmarks like Humanity's Last Exam, even surpassing some leading proprietary agents. Our work demonstrates that a ZPD-guided approach to data synthesis offers a scalable and effective path toward building more capable LLM agents.
摘要：训练大型语言模型代理执行其能力前沿的任务是解锁高级推理的关键。我们引入了一种受最近发展区（ZPD）教育理论启发的数据合成方法，该方法将这一前沿定义为 LLM 无法单独解决但可以在指导下掌握的任务。为了实现这一点，我们推出了 AgentFrontier 引擎，这是一个自动化管道，可以合成精确位于 LLM 的 ZPD 内的高质量、多学科数据。该引擎既支持知识密集型数据的持续预训练，也支持复杂推理任务的有针对性的后训练。从同一框架中，我们衍生出了 ZPD 考试，这是一种动态的自动化基准，旨在评估代理在这些前沿任务上的能力。我们在合成数据上训练 AgentFrontier-30B-A3B 模型，该模型在 Humanity's Last Exam 等严苛基准测试中取得了最先进的结果，甚至超越了一些领先的专有代理。我们的工作表明，ZPD 引导的数据合成方法为构建更有能力的 LLM 代理提供了一条可扩展且有效的途径。
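
The ZPD criterion in the AgentFrontier abstract above (tasks an LLM cannot solve alone but can master with guidance) amounts to a two-condition filter. The sketch below is a minimal illustration under that reading; the `Task` record and its fields are hypothetical, and how the two outcomes are actually measured is not specified in the abstract.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    task_id: str
    solved_alone: bool          # assumption: outcome of an unaided attempt
    solved_with_guidance: bool  # assumption: outcome of an attempt with guidance


def in_zpd(task: Task) -> bool:
    """A task sits in the ZPD if it fails unaided but succeeds with guidance."""
    return (not task.solved_alone) and task.solved_with_guidance


def select_zpd_tasks(tasks: List[Task]) -> List[Task]:
    """Keep only the tasks at the capability frontier."""
    return [t for t in tasks if in_zpd(t)]


if __name__ == "__main__":
    pool = [
        Task("too-easy", solved_alone=True, solved_with_guidance=True),
        Task("frontier", solved_alone=False, solved_with_guidance=True),
        Task("too-hard", solved_alone=False, solved_with_guidance=False),
    ]
    print([t.task_id for t in select_zpd_tasks(pool)])
```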

Title: WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking

Authors: Zhengwei Tao, Haiyang Shen, Baixuan Li, Wenbiao Yin, Jialong Wu, Kuan Li, Zhongwang Zhang, Huifeng Yin, Rui Ye, Liwen Zhang, Xinyu Wang, Pengjun Xie, Jingren Zhou, Yong Jiang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.24697
Pdf URL: https://arxiv.org/pdf/2510.24697
Copy Paste: [[2510.24697]] WebLeaper: Empowering Efficiency and Efficacy in WebAgent via Enabling Info-Rich Seeking(https://arxiv.org/abs/2510.24697)
Keywords: language model, llm, agent
Abstract: Large Language Model (LLM)-based agents have emerged as a transformative approach for open-ended problem solving, with information seeking (IS) being a core capability that enables autonomous reasoning and decision-making. While prior research has largely focused on improving retrieval depth, we observe that current IS agents often suffer from low search efficiency, which in turn constrains overall performance. A key factor underlying this inefficiency is the sparsity of target entities in training tasks, which limits opportunities for agents to learn and generalize efficient search behaviors. To address these challenges, we propose WebLeaper, a framework for constructing high-coverage IS tasks and generating efficient solution trajectories. We formulate IS as a tree-structured reasoning problem, enabling a substantially larger set of target entities to be embedded within a constrained context. Leveraging curated Wikipedia tables, we propose three variants for synthesizing IS tasks, Basic, Union, and Reverse-Union, to systematically increase both IS efficiency and efficacy. Finally, we curate training trajectories by retaining only those that are simultaneously accurate and efficient, ensuring that the model is optimized for both correctness and search performance. Extensive experiments on both basic and comprehensive settings, conducted on five IS benchmarks, BrowserComp, GAIA, xbench-DeepSearch, WideSearch, and Seal-0, demonstrate that our method consistently achieves improvements in both effectiveness and efficiency over strong baselines.
摘要：基于大型语言模型 (LLM) 的代理已成为开放式问题解决的变革性方法，信息搜索 (IS) 是实现自主推理和决策的核心功能。虽然之前的研究主要集中在提高检索深度上，但我们观察到当前的 IS 智能体经常遭受搜索效率低下的困扰，这反过来又限制了整体性能。造成这种低效率的一个关键因素是训练任务中目标实体的稀疏性，这限制了智能体学习和概括有效搜索行为的机会。为了应对这些挑战，我们提出了 WebLeaper，一个用于构建高覆盖率 IS 任务并生成有效解决方案轨迹的框架。我们将 IS 表述为一个树结构推理问题，使更大的目标实体集能够嵌入到受约束的上下文中。利用精选的维基百科表格，我们提出了三种综合 IS 任务的变体：基本、联合和反向联合，以系统地提高 IS 效率和功效。最后，我们通过仅保留那些同时准确且高效的训练轨迹来管理训练轨迹，确保模型的正确性和搜索性能得到优化。在五个 IS 基准（BrowserComp、GAIA、xbench-DeepSearch、WideSearch 和 Seal-0）上进行的基本和综合设置的广泛实验表明，我们的方法在强基准上持续实现了有效性和效率的提高。

Title: ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking

Authors: Baixuan Li, Dingchu Zhang, Jialong Wu, Wenbiao Yin, Zhengwei Tao, Yida Zhao, Liwen Zhang, Haiyang Shen, Runnan Fang, Pengjun Xie, Jingren Zhou, Yong Jiang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24698
Pdf URL: https://arxiv.org/pdf/2510.24698
Copy Paste: [[2510.24698]] ParallelMuse: Agentic Parallel Thinking for Deep Information Seeking(https://arxiv.org/abs/2510.24698)
Keywords: agent
Abstract: Parallel thinking expands exploration breadth, complementing the deep exploration of information-seeking (IS) agents to further enhance problem-solving capability. However, conventional parallel thinking faces two key challenges in this setting: inefficiency from repeatedly rolling out from scratch, and difficulty in integrating long-horizon reasoning trajectories during answer generation, as limited context capacity prevents full consideration of the reasoning process. To address these issues, we propose ParallelMuse, a two-stage paradigm designed for deep IS agents. The first stage, Functionality-Specified Partial Rollout, partitions generated sequences into functional regions and performs uncertainty-guided path reuse and branching to enhance exploration efficiency. The second stage, Compressed Reasoning Aggregation, exploits reasoning redundancy to losslessly compress information relevant to answer derivation and synthesize a coherent final answer. Experiments across multiple open-source agents and benchmarks demonstrate up to 62% performance improvement with a 10-30% reduction in exploratory token consumption.
摘要：并行思维拓展了探索广度，与信息寻求（IS）智能体的深度探索相辅相成，进一步增强解决问题的能力。然而，传统的并行思维在这种情况下面临两个关键挑战：从头开始反复推演导致效率低下，以及在答案生成过程中难以整合长期推理轨迹，因为有限的上下文容量阻碍了对推理过程的充分考虑。为了解决这些问题，我们提出了 ParallelMuse，这是一种专为深度 IS 代理设计的两阶段范例。第一阶段，功能指定的部分推演，将生成的序列划分为功能区域，并执行不确定性引导的路径重用和分支，以提高探索效率。第二阶段是压缩推理聚合，利用推理冗余来无损压缩与答案推导相关的信息并合成连贯的最终答案。跨多个开源代理和基准测试的实验表明，性能提升高达 62%，探索性 token 消耗减少 10-30%。

Title: AgentFold: Long-Horizon Web Agents with Proactive Context Management

Authors: Rui Ye, Zhongwang Zhang, Kuan Li, Huifeng Yin, Zhengwei Tao, Yida Zhao, Liangcai Su, Liwen Zhang, Zile Qiao, Xinyu Wang, Pengjun Xie, Fei Huang, Siheng Chen, Jingren Zhou, Yong Jiang
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.24699
Pdf URL: https://arxiv.org/pdf/2510.24699
Copy Paste: [[2510.24699]] AgentFold: Long-Horizon Web Agents with Proactive Context Management(https://arxiv.org/abs/2510.24699)
Keywords: llm, agent
Abstract: LLM-based web agents show immense promise for information seeking, yet their effectiveness on long-horizon tasks is hindered by a fundamental trade-off in context management. Prevailing ReAct-based agents suffer from context saturation as they accumulate noisy, raw histories, while methods that fixedly summarize the full history at each step risk the irreversible loss of critical details. Addressing these, we introduce AgentFold, a novel agent paradigm centered on proactive context management, inspired by the human cognitive process of retrospective consolidation. AgentFold treats its context as a dynamic cognitive workspace to be actively sculpted, rather than a passive log to be filled. At each step, it learns to execute a 'folding' operation, which manages its historical trajectory at multiple scales: it can perform granular condensations to preserve vital, fine-grained details, or deep consolidations to abstract away entire multi-step sub-tasks. The results on prominent benchmarks are striking: with simple supervised fine-tuning (without continual pre-training or RL), our AgentFold-30B-A3B agent achieves 36.2% on BrowseComp and 47.3% on BrowseComp-ZH. Notably, this performance not only surpasses or matches open-source models of a dramatically larger scale, such as the DeepSeek-V3.1-671B-A37B, but also surpasses leading proprietary agents like OpenAI's o4-mini.
摘要：基于 LLM 的网络代理在信息搜索方面显示出巨大的前景，但它们在长期任务上的有效性受到上下文管理中基本权衡的阻碍。流行的基于 ReAct 的代理会因积累嘈杂的原始历史而遭受上下文饱和的困扰，而在每一步固定总结完整历史的方法可能会导致关键细节的不可逆转的丢失。为了解决这些问题，我们引入了 AgentFold，这是一种以主动上下文管理为中心的新型代理范例，其灵感来自于人类回顾性整合的认知过程。 AgentFold 将其上下文视为需要主动塑造的动态认知工作空间，而不是需要填充的被动日志。在每一步中，它都会学习执行“折叠”操作，该操作在多个尺度上管理其历史轨迹：它可以执行粒度压缩以保留重要的细粒度细节，或进行深度合并以抽象出整个多步骤子任务。著名基准测试的结果是惊人的：通过简单的监督微调（无需持续的预训练或强化学习），我们的 AgentFold-30B-A3B 代理在 BrowseComp 上达到了 36.2%，在 BrowseComp-ZH 上达到了 47.3%。值得注意的是，这种性能不仅超越或匹配更大规模的开源模型，例如 DeepSeek-V3.1-671B-A37B，而且还超越了领先的专有代理，例如 OpenAI 的 o4-mini。
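
The 'folding' operation described in the AgentFold abstract above can be pictured as replacing a span of context entries with a single summary. The sketch below is only a data-structure illustration under that reading: `Block`, `fold`, and the caller-supplied summary strings are hypothetical, whereas in the paper the agent itself learns what to fold and how to summarize it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    start: int    # first original step covered by this block
    end: int      # last original step covered by this block
    summary: str  # text kept in the context for these steps


def fold(context: List[Block], first: int, last: int, summary: str) -> List[Block]:
    """Replace context[first..last] (inclusive list indices) with one block.

    first == last condenses a single entry (granular condensation);
    first < last merges several entries (deep consolidation).
    """
    if not (0 <= first <= last < len(context)):
        raise ValueError("fold range must lie inside the context")
    merged = Block(context[first].start, context[last].end, summary)
    return context[:first] + [merged] + context[last + 1:]


if __name__ == "__main__":
    ctx = [
        Block(1, 1, "searched for the author's homepage"),
        Block(2, 2, "opened homepage, found publication list"),
        Block(3, 3, "opened paper PDF, located the venue"),
    ]
    # Deep consolidation: steps 1-2 collapse into one sub-task summary.
    ctx = fold(ctx, 0, 1, "located the author's publication list")
    for block in ctx:
        print(block.start, block.end, block.summary)
```

Keeping `start` and `end` on each block records which original steps a summary covers, so later folds can merge already-folded blocks without losing track of the trajectory.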

Title: Tongyi DeepResearch Technical Report

Authors: Tongyi DeepResearch Team: Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang
Subjects: cs.CL, cs.AI, cs.IR, cs.LG, cs.MA
Abstract URL: https://arxiv.org/abs/2510.24701
Pdf URL: https://arxiv.org/pdf/2510.24701
Copy Paste: [[2510.24701]] Tongyi DeepResearch Technical Report(https://arxiv.org/abs/2510.24701)
Keywords: language model, agent
Abstract: We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
摘要：我们推出了 Tongyi DeepResearch，这是一种代理大语言模型，专为长期、深度信息寻求的研究任务而设计。为了激发自主深度研究能力，Tongyi DeepResearch 通过端到端训练框架开发，该框架结合了代理中期训练和代理后训练，从而实现了跨复杂任务的可扩展推理和信息搜索。我们设计了一个高度可扩展的数据合成管道，该管道是全自动的，无需依赖昂贵的人工注释，并支持所有训练阶段。通过为每个阶段构建定制环境，我们的系统可以在整个过程中实现稳定一致的交互。 Tongyi DeepResearch 拥有 305 亿个总参数，每个 token 仅激活 33 亿个参数，在一系列代理深度研究基准测试中实现了最先进的性能，包括 Humanity's Last Exam、BrowseComp、BrowseComp-ZH、WebWalkerQA、xbench-DeepSearch、FRAMES 和 xbench-DeepSearch-2510。我们开源模型、框架和完整的解决方案，为社区提供支持。

Title: Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents

Authors: Yueqi Song, Ketan Ramaneti, Zaid Sheikh, Ziru Chen, Boyu Gou, Tianbao Xie, Yiheng Xu, Danyang Zhang, Apurva Gandhi, Fan Yang, Joseph Liu, Tianyue Ou, Zhihao Yuan, Frank Xu, Shuyan Zhou, Xingyao Wang, Xiang Yue, Tao Yu, Huan Sun, Yu Su, Graham Neubig
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.24702
Pdf URL: https://arxiv.org/pdf/2510.24702
Copy Paste: [[2510.24702]] Agent Data Protocol: Unifying Datasets for Diverse, Effective Fine-tuning of LLM Agents(https://arxiv.org/abs/2510.24702)
Keywords: llm, agent
Abstract: Public research results on large-scale supervised finetuning of AI agents remain relatively rare, since the collection of agent training data presents unique challenges. In this work, we argue that the bottleneck is not a lack of underlying data sources, but that a large variety of data is fragmented across heterogeneous formats, tools, and interfaces. To this end, we introduce the agent data protocol (ADP), a light-weight representation language that serves as an "interlingua" between agent datasets in diverse formats and unified agent training pipelines downstream. The design of ADP is expressive enough to capture a large variety of tasks, including API/tool use, browsing, coding, software engineering, and general agentic workflows, while remaining simple to parse and train on without engineering at a per-dataset level. In experiments, we unified a broad collection of 13 existing agent training datasets into ADP format, and converted the standardized ADP data into training-ready formats for multiple agent frameworks. We performed SFT on these data and demonstrated an average performance gain of ~20% over corresponding base models, delivering state-of-the-art or near-SOTA performance on standard coding, browsing, tool use, and research benchmarks, without domain-specific tuning. All code and data are released publicly, in the hope that ADP could help lower the barrier to standardized, scalable, and reproducible agent training.
摘要：关于人工智能代理大规模监督微调的公开研究成果仍然相对较少，因为代理训练数据的收集提出了独特的挑战。在这项工作中，我们认为瓶颈不是缺乏底层数据源，而是大量数据分散在异构格式、工具和接口中。为此，我们引入了代理数据协议（ADP），这是一种轻量级表示语言，充当不同格式的代理数据集和下游统一代理训练管道之间的“国际语言”。 ADP 的设计具有足够的表现力，可以捕获各种各样的任务，包括 API/工具使用、浏览、编码、软件工程和一般代理工作流程，同时保持简单的解析和训练，无需在每个数据集级别进行工程。在实验中，我们将 13 个现有代理训练数据集统一为 ADP 格式，并将标准化的 ADP 数据转换为多个代理框架的训练就绪格式。我们对这些数据执行了 SFT，并展示了比相应基础模型约 20% 的平均性能增益，并在标准编码、浏览、工具使用和研究基准测试上提供了最先进或接近 SOTA 的性能，而无需针对特定领域进行调整。所有代码和数据均公开发布，希望 ADP 能够帮助降低标准化、可扩展和可重复的代理训练的障碍。

Title: ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?

Authors: Shuqing Li, Jiayi Yan, Chenyu Niu, Jen-tse Huang, Yun Peng, Wenxuan Wang, Yepang Liu, Michael R. Lyu
Subjects: cs.CL, cs.AI, cs.HC, cs.SE
Abstract URL: https://arxiv.org/abs/2510.24706
Pdf URL: https://arxiv.org/pdf/2510.24706
Copy Paste: [[2510.24706]] ComboBench: Can LLMs Manipulate Physical Devices to Play Virtual Reality Games?(https://arxiv.org/abs/2510.24706)
Keywords: language model, gpt, llm
Abstract: Virtual Reality (VR) games require players to translate high-level semantic actions into precise device manipulations using controllers and head-mounted displays (HMDs). While humans intuitively perform this translation based on common sense and embodied understanding, whether Large Language Models (LLMs) can effectively replicate this ability remains underexplored. This paper introduces a benchmark, ComboBench, evaluating LLMs' capability to translate semantic actions into VR device manipulation sequences across 262 scenarios from four popular VR games: Half-Life: Alyx, Into the Radius, Moss: Book II, and Vivecraft. We evaluate seven LLMs, including GPT-3.5, GPT-4, GPT-4o, Gemini-1.5-Pro, LLaMA-3-8B, Mixtral-8x7B, and GLM-4-Flash, compared against annotated ground truth and human performance. Our results reveal that while top-performing models like Gemini-1.5-Pro demonstrate strong task decomposition capabilities, they still struggle with procedural reasoning and spatial understanding compared to humans. Performance varies significantly across games, suggesting sensitivity to interaction complexity. Few-shot examples substantially improve performance, indicating potential for targeted enhancement of LLMs' VR manipulation capabilities. We release all materials at this https URL.
摘要：虚拟现实 (VR) 游戏要求玩家使用控制器和头戴式显示器 (HMD) 将高级语义动作转化为精确的设备操作。虽然人类根据常识和具身理解直观地执行这种转化，但大型语言模型（LLM）是否可以有效地复制这种能力仍有待探索。本文介绍了一个基准测试 ComboBench，用于评估 LLM 将语义动作转换为 VR 设备操作序列的能力，涉及四种流行 VR 游戏的 262 个场景：《半条命：Alyx》、《Into the Radius》、《Moss: Book II》和《Vivecraft》。我们评估了 7 个 LLM，包括 GPT-3.5、GPT-4、GPT-4o、Gemini-1.5-Pro、LLaMA-3-8B、Mixtral-8x7B 和 GLM-4-Flash，与带注释的真实情况和人类表现进行比较。我们的结果表明，虽然像 Gemini-1.5-Pro 这样的顶级模型表现出强大的任务分解能力，但与人类相比，它们在程序推理和空间理解方面仍然存在困难。不同游戏的性能差异很大，这表明对交互复杂性的敏感性。少样本示例显着提高了性能，表明有针对性地增强 LLM 的 VR 操作能力的潜力。我们在此 https URL 发布所有材料。