2025-10-02

Title: Direct Token Optimization: A Self-contained Approach to Large Language Model Unlearning

Authors: Hong kyu Lee, Ruixuan Liu, Li Xiong
Subjects: cs.CL, cs.AI, cs.CR
Abstract URL: https://arxiv.org/abs/2510.00125
Pdf URL: https://arxiv.org/pdf/2510.00125
Copy Paste: [[2510.00125]] Direct Token Optimization: A Self-contained Approach to Large Language Model Unlearning(https://arxiv.org/abs/2510.00125)
Keywords: language model, llm
Abstract: Machine unlearning is an emerging technique that removes the influence of a subset of training data (forget set) from a model without full retraining, with applications including privacy protection, content moderation, and model correction. The key challenge lies in ensuring that the model completely forgets the knowledge of the forget set without compromising its overall utility. Existing unlearning methods for large language models (LLMs) often utilize auxiliary language models, retain datasets, or even commercial AI services for effective unlearning and maintaining the model utility. However, dependence on these external resources is often impractical and could potentially introduce additional privacy risks. In this work, we propose direct token optimization (DTO), a novel self-contained unlearning approach for LLMs that directly optimizes token-level objectives and eliminates the need for external resources. Given a sequence to unlearn, we identify two categories of tokens: target tokens, which capture critical knowledge for unlearning, and the remaining non-target tokens, which are crucial for maintaining the model utility. The former are used to optimize the unlearning objective, while the latter serve to preserve the model's performance. The experimental results show that the proposed DTO achieves up to a 16.8x improvement in forget quality over the latest baselines on several benchmark datasets while maintaining a comparable level of model utility.

Title: TAMA: Tool-Augmented Multimodal Agent for Procedural Activity Understanding

Authors: Kimihiro Hasegawa, Wiradee Imrattanatrai, Masaki Asada, Ken Fukuda, Teruko Mitamura
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00161
Pdf URL: https://arxiv.org/pdf/2510.00161
Copy Paste: [[2510.00161]] TAMA: Tool-Augmented Multimodal Agent for Procedural Activity Understanding(https://arxiv.org/abs/2510.00161)
Keywords: language model, gpt, agent
Abstract: Procedural activity assistants potentially support humans in a variety of settings, from our daily lives, e.g., cooking or assembling flat-pack furniture, to professional situations, e.g., manufacturing or biological experiments. Despite their potential use cases, system development tailored to such assistants is still underexplored. In this paper, we propose a novel framework, called TAMA, a Tool-Augmented Multimodal Agent, for procedural activity understanding. TAMA enables interleaved multimodal reasoning by making use of multimedia-returning tools in a training-free setting. Our experimental results on the multimodal procedural QA dataset, ProMQA-Assembly, show that our approach can improve the performance of vision-language models, especially GPT-5 and MiMo-VL. Furthermore, our ablation studies provide empirical support for the effectiveness of two features that characterize our framework: multimedia-returning tools and agentic flexible tool selection. We believe our proposed framework and experimental results facilitate the thinking-with-images paradigm for video and multimodal tasks, as well as the development of procedural activity assistants.

Title: DRBench: A Realistic Benchmark for Enterprise Deep Research

Authors: Amirhossein Abaskohi, Tianyi Chen, Miguel Muñoz-Mármol, Curtis Fox, Amrutha Varshini Ramesh, Étienne Marcotte, Xing Han Lù, Nicolas Chapados, Spandana Gella, Christopher Pal, Alexandre Drouin, Issam H. Laradji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00172
Pdf URL: https://arxiv.org/pdf/2510.00172
Copy Paste: [[2510.00172]] DRBench: A Realistic Benchmark for Enterprise Deep Research(https://arxiv.org/abs/2510.00172)
Keywords: gpt, chat, agent
Abstract: We introduce DRBench, a benchmark for evaluating AI agents on complex, open-ended deep research tasks in enterprise settings. Unlike prior benchmarks that focus on simple questions or web-only queries, DRBench evaluates agents on multi-step queries (for example, "What changes should we make to our product roadmap to ensure compliance with this standard?") that require identifying supporting facts from both the public web and a private company knowledge base. Each task is grounded in realistic user personas and enterprise context, spanning a heterogeneous search space that includes productivity software, cloud file systems, emails, chat conversations, and the open web. Tasks are generated through a carefully designed synthesis pipeline with human-in-the-loop verification, and agents are evaluated on their ability to recall relevant insights, maintain factual accuracy, and produce coherent, well-structured reports. We release 15 deep research tasks across 10 domains, such as Sales, Cybersecurity, and Compliance. We demonstrate the effectiveness of DRBench by evaluating diverse DR agents across open- and closed-source models (such as GPT, Llama, and Qwen) and DR strategies, highlighting their strengths, weaknesses, and the critical path for advancing enterprise deep research. Code is available at this https URL.

Title: PrimeX: A Dataset of Worldview, Opinion, and Explanation

Authors: Rik Koncel-Kedziorski, Brihi Joshi, Tim Paek
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00174
Pdf URL: https://arxiv.org/pdf/2510.00174
Copy Paste: [[2510.00174]] PrimeX: A Dataset of Worldview, Opinion, and Explanation(https://arxiv.org/abs/2510.00174)
Keywords: language model
Abstract: As the adoption of language models advances, so does the need to better represent individual users to the model. Are there aspects of an individual's belief system that a language model can utilize for improved alignment? Following prior research, we investigate this question in the domain of opinion prediction by developing PrimeX, a dataset of public opinion survey data from 858 US residents with two additional sources of belief information: written explanations from the respondents for why they hold specific opinions, and the Primal World Belief survey for assessing respondent worldview. We provide an extensive initial analysis of our data and show the value of belief explanations and worldview for personalizing language models. Our results demonstrate how the additional belief information in PrimeX can benefit both the NLP and psychological research communities, opening up avenues for further study.

Title: Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It

Authors: Shuyue Stella Li, Avinandan Bose, Faeze Brahman, Simon Shaolei Du, Pang Wei Koh, Maryam Fazel, Yulia Tsvetkov
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.00177
Pdf URL: https://arxiv.org/pdf/2510.00177
Copy Paste: [[2510.00177]] Personalized Reasoning: Just-In-Time Personalization and Why LLMs Fail At It(https://arxiv.org/abs/2510.00177)
Keywords: language model, llm
Abstract: Current large language model (LLM) development treats task-solving and preference alignment as separate challenges, optimizing first for objective correctness, then for alignment to aggregated human preferences. This paradigm fails in human-facing applications where solving a problem correctly is insufficient if the response mismatches the user's needs. This challenge intensifies in just-in-time scenarios where no prior user interaction history exists due to cold-start conditions or privacy constraints. LLMs need to identify what they don't know about user preferences, strategically elicit preference values through questioning, then adapt their reasoning processes and responses accordingly -- a complicated chain of cognitive processes which we term personalized reasoning. We introduce PREFDISCO, an evaluation methodology that transforms static benchmarks into interactive personalization tasks using psychologically-grounded personas with sparse preferences. Our framework creates scenarios where identical questions require different reasoning chains depending on user context, as optimal explanation approaches vary by individual expertise and preferences while maintaining factual accuracy. Evaluation of 21 frontier models across 10 tasks reveals 29.0% of naive personalization attempts produce worse preference alignment than generic responses, yet generic responses also fail to serve individual user needs effectively. These findings suggest personalized reasoning requires dedicated development rather than emerging naturally. PREFDISCO establishes personalized reasoning as a measurable research frontier and reveals fundamental limitations in current LLMs' interactive capabilities, providing a foundation for developing systems that can adapt to individual users in education, healthcare, and technical domains where personalization is critical.

Title: BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses

Authors: Xin Xu, Xunzhi He, Churan Zhi, Ruizhe Chen, Julian McAuley, Zexue He
Subjects: cs.CL, cs.AI, cs.CY, cs.LG
Abstract URL: https://arxiv.org/abs/2510.00232
Pdf URL: https://arxiv.org/pdf/2510.00232
Copy Paste: [[2510.00232]] BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses(https://arxiv.org/abs/2510.00232)
Keywords: language model, llm, prompt
Abstract: Existing studies on bias mitigation methods for large language models (LLMs) use diverse baselines and metrics to evaluate debiasing performance, leading to inconsistent comparisons among them. Moreover, their evaluations are mostly based on the comparison between LLMs' probabilities of biased and unbiased contexts, which ignores the gap between such evaluations and real-world use cases where users interact with LLMs by reading model responses and expect fair and safe outputs rather than LLMs' probabilities. To enable consistent evaluation across debiasing methods and bridge this gap, we introduce BiasFreeBench, an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques (covering four prompting-based and four training-based methods) on two test scenarios (multi-choice QA and open-ended multi-turn QA) by reorganizing existing datasets into a unified query-response setting. We further introduce a response-level metric, Bias-Free Score, to measure the extent to which LLM responses are fair, safe, and anti-stereotypical. Debiasing performances are systematically compared and analyzed across key dimensions: the prompting vs. training paradigm, model size, and generalization of different training strategies to unseen bias types. We will publicly release our benchmark, aiming to establish a unified testbed for bias mitigation research.

Title: TASER: Translation Assessment via Systematic Evaluation and Reasoning

Authors: Monishwaran Maheswaran, Marco Carini, Christian Federmann, Tony Diaz
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.00255
Pdf URL: https://arxiv.org/pdf/2510.00255
Copy Paste: [[2510.00255]] TASER: Translation Assessment via Systematic Evaluation and Reasoning(https://arxiv.org/abs/2510.00255)
Keywords: llm, prompt
Abstract: We introduce TASER (Translation Assessment via Systematic Evaluation and Reasoning), a metric that uses Large Reasoning Models (LRMs) for automated translation quality assessment. TASER harnesses the explicit reasoning capabilities of LRMs to conduct systematic, step-by-step evaluation of translation quality. We evaluate TASER on the WMT24 Metrics Shared Task across both reference-based and reference-free scenarios, demonstrating state-of-the-art performance. In system-level evaluation, TASER achieves the highest soft pairwise accuracy in both reference-based and reference-free settings, outperforming all existing metrics. At the segment level, TASER maintains competitive performance with our reference-free variant ranking as the top-performing metric among all reference-free approaches. Our experiments reveal that structured prompting templates yield superior results with LRMs compared to the open-ended approaches that proved optimal for traditional LLMs. We evaluate o3, a large reasoning model from OpenAI, with varying reasoning efforts, providing insights into the relationship between reasoning depth and evaluation quality. The explicit reasoning process in LRMs offers interpretability and visibility, addressing a key limitation of existing automated metrics. Our results demonstrate that Large Reasoning Models show a measurable advancement in translation quality assessment, combining improved accuracy with transparent evaluation across diverse language pairs.

Title: Retrieval-Augmented Generation for Electrocardiogram-Language Models

Authors: Xiaoyu Song, William Han, Tony Chen, Chaojing Duan, Michael A. Rosenberg, Emerson Liu, Ding Zhao
Subjects: cs.CL, cs.AI, cs.MM
Abstract URL: https://arxiv.org/abs/2510.00261
Pdf URL: https://arxiv.org/pdf/2510.00261
Copy Paste: [[2510.00261]] Retrieval-Augmented Generation for Electrocardiogram-Language Models(https://arxiv.org/abs/2510.00261)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Interest in generative Electrocardiogram-Language Models (ELMs) is growing, as they can produce textual responses conditioned on ECG signals and textual queries. Unlike traditional classifiers that output label probabilities, ELMs are more versatile, supporting domain-specific tasks (e.g., waveform analysis, diagnosis, prognosis) as well as general tasks (e.g., open-ended questions, dialogue). Retrieval-Augmented Generation (RAG), widely used in Large Language Models (LLMs) to ground LLM outputs in retrieved knowledge, helps reduce hallucinations and improve natural language generation (NLG). However, despite its promise, no open-source implementation or systematic study of RAG pipeline design for ELMs currently exists. To address this gap, we present the first open-source RAG pipeline for ELMs, along with baselines and ablation studies for NLG. Experiments on three public datasets show that ELMs with RAG consistently improve performance over non-RAG baselines, and highlight key ELM design considerations. Our code is available at: this https URL.

Title: Judging with Confidence: Calibrating Autoraters to Preference Distributions

Authors: Zhuohang Li, Xiaowei Li, Chengyu Huang, Guowang Li, Katayoon Goshvadi, Bo Dai, Dale Schuurmans, Paul Zhou, Hamid Palangi, Yiwen Song, Palash Goyal, Murat Kantarcioglu, Bradley A. Malin, Yuan Xue
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00263
Pdf URL: https://arxiv.org/pdf/2510.00263
Copy Paste: [[2510.00263]] Judging with Confidence: Calibrating Autoraters to Preference Distributions(https://arxiv.org/abs/2510.00263)
Keywords: language model, llm
Abstract: The alignment of large language models (LLMs) with human values increasingly relies on using other LLMs as automated judges, or "autoraters". However, their reliability is limited by a foundational issue: they are trained on discrete preference labels, forcing a single ground truth onto tasks that are often subjective, ambiguous, or nuanced. We argue that a reliable autorater must learn to model the full distribution of preferences defined by a target population. In this paper, we propose a general framework for calibrating probabilistic autoraters to any given preference distribution. We formalize the problem and present two learning methods tailored to different data conditions: 1) a direct supervised fine-tuning for dense, probabilistic labels, and 2) a reinforcement learning approach for sparse, binary labels. Our empirical results show that fine-tuning autoraters with a distribution-matching objective leads to verbalized probability predictions that are better aligned with the target preference distribution, with improved calibration and significantly lower positional bias, all while preserving performance on objective tasks.
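Note: The abstract above does not state which distribution-matching objective is used. As a rough illustration only, the sketch below scores a verbalized probability against a target preference rate with the KL divergence between two Bernoulli distributions, which is one common choice. The function name `bernoulli_kl` and the example numbers are assumptions made for this note, not taken from the paper.

```python
import math

def bernoulli_kl(target: float, predicted: float, eps: float = 1e-9) -> float:
    """KL(target || predicted) for a binary preference 'A is better than B'.

    target:    fraction of the target population preferring A (assumed known).
    predicted: probability verbalized by the autorater.
    Both values are clipped away from 0 and 1 so the logarithms stay finite.
    """
    p = min(max(target, eps), 1 - eps)
    q = min(max(predicted, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Hypothetical case: 70% of annotators prefer response A.
calibrated = bernoulli_kl(0.7, 0.7)      # matches the population exactly
overconfident = bernoulli_kl(0.7, 1.0)   # a hard "A wins" label
hedged = bernoulli_kl(0.7, 0.8)          # close, but slightly off
```

Under this measure a hard label is penalized far more than a slightly miscalibrated probability, which is the intuition behind matching distributions rather than forcing a single ground truth.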

Title: Efficient Layer-wise LLM Fine-tuning for Revision Intention Prediction

Authors: Zhexiong Liu, Diane Litman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.00268
Pdf URL: https://arxiv.org/pdf/2510.00268
Copy Paste: [[2510.00268]] Efficient Layer-wise LLM Fine-tuning for Revision Intention Prediction(https://arxiv.org/abs/2510.00268)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have shown extraordinary success across various text generation tasks; however, their potential for simple yet essential text classification remains underexplored, as LLM pre-training tends to emphasize generation over classification. While LLMs with instruction tuning can transform classification into a generation task, they often struggle to categorize nuanced texts. One such example is text revision, which involves nuanced edits between pairs of texts. Although simply fine-tuning LLMs for revision classification seems plausible, it requires a large amount of revision annotations, which are exceptionally expensive and scarce in the community. To address this issue, we introduce a plug-and-play layer-wise parameter-efficient fine-tuning (PEFT) framework, i.e., IR-Tuning, which fine-tunes a subset of important LLM layers that are dynamically selected based on their gradient norm distribution, while freezing those of redundant layers. Extensive experiments suggest that IR-Tuning surpasses several layer-wise PEFT baselines over diverse text revisions, while achieving fast convergence, low GPU memory consumption, and effectiveness on small revision corpora.
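Note: The layer-selection idea in the abstract above can be illustrated with a small sketch. It assumes per-layer gradient norms have already been measured and simply keeps the `top_k` largest as trainable; the abstract says selection is based on the gradient norm distribution but does not give the exact rule, so the top-k criterion, the function names, and the numbers here are all assumptions.

```python
def select_layers(grad_norms: dict[str, float], top_k: int) -> set[str]:
    """Return the names of the top_k layers with the largest gradient norms."""
    ranked = sorted(grad_norms, key=grad_norms.get, reverse=True)
    return set(ranked[:top_k])

def trainable_flags(grad_norms: dict[str, float], top_k: int) -> dict[str, bool]:
    """Map each layer to True (fine-tune it) or False (freeze it)."""
    chosen = select_layers(grad_norms, top_k)
    return {name: name in chosen for name in grad_norms}

# Hypothetical gradient norms measured on one batch.
norms = {"layer.0": 0.02, "layer.1": 0.31, "layer.2": 0.07, "layer.3": 0.54}
flags = trainable_flags(norms, top_k=2)
```

In a real training loop the flags would be applied to each layer's parameters (for example by toggling gradient tracking) and recomputed periodically, since the abstract describes the selection as dynamic.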

Title: SafePassage: High-Fidelity Information Extraction with Black Box LLMs

Authors: Joe Barrow, Raj Patel, Misha Kharkovski, Ben Davies, Ryan Schmitt
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.00276
Pdf URL: https://arxiv.org/pdf/2510.00276
Copy Paste: [[2510.00276]] SafePassage: High-Fidelity Information Extraction with Black Box LLMs(https://arxiv.org/abs/2510.00276)
Keywords: language model, llm, hallucination
Abstract: Black box large language models (LLMs) make information extraction (IE) easy to configure, but hard to trust. Unlike traditional information extraction pipelines, the information "extracted" is not guaranteed to be grounded in the document. To prevent this, this paper introduces the notion of a "safe passage": context generated by the LLM that is both grounded in the document and consistent with the extracted information. This is operationalized via a three-step pipeline, SafePassage, which consists of: (1) an LLM extractor that generates structured entities and their contexts from a document, (2) a string-based global aligner, and (3) a scoring model. Results show that using these three parts in conjunction reduces hallucinations by up to 85% on information extraction tasks with minimal risk of flagging non-hallucinations. High agreement between the SafePassage pipeline and human judgments of extraction quality means that the pipeline can also be used to evaluate LLMs. Surprisingly, results also show that using a transformer encoder fine-tuned on a small number of task-specific examples can outperform an LLM scoring model at flagging unsafe passages. These annotations can be collected in as little as 1-2 hours.
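Note: To make the grounding check in step (2) concrete, the sketch below measures how much of an LLM-generated context can be matched against the source document. It uses Python's standard `difflib.SequenceMatcher` as a stand-in; this is not the paper's aligner, and the function name `grounding_score` and the sample strings are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def grounding_score(context: str, document: str) -> float:
    """Fraction of the context's characters that fall in ordered matching
    blocks shared with the document (1.0 means fully matched)."""
    if not context:
        return 0.0
    # autojunk=False disables the popularity heuristic so long documents
    # are compared character by character.
    matcher = SequenceMatcher(None, context, document, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(context)

document = "The invoice was paid on 3 May by wire transfer."
grounded = grounding_score("paid on 3 May", document)
altered = grounding_score("paid on 9 June", document)
```

A context copied verbatim from the document scores 1.0, while one with altered details scores lower. Scattered character matches can still inflate the score, which is one reason a separate scoring model, as in step (3), is useful on top of a string-level check.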

Title: o-MEGA: Optimized Methods for Explanation Generation and Analysis

Authors: Ľuboš Kriš, Jaroslav Kopčan, Qiwei Peng, Andrej Ridzik, Marcel Veselý, Martin Tamajka
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.00288
Pdf URL: https://arxiv.org/pdf/2510.00288
Copy Paste: [[2510.00288]] o-MEGA: Optimized Methods for Explanation Generation and Analysis(https://arxiv.org/abs/2510.00288)
Keywords: language model
Abstract: The proliferation of transformer-based language models has revolutionized the NLP domain while simultaneously introducing significant challenges regarding model transparency and trustworthiness. The complexity of achieving explainable systems in this domain is evidenced by the extensive array of explanation methods and evaluation metrics developed by researchers. To address the challenge of selecting optimal explainability approaches, we present o-mega, a hyperparameter optimization tool designed to automatically identify the most effective explainable AI methods and their configurations within the semantic matching domain. We evaluate o-mega on a post-claim matching pipeline using a curated dataset of social media posts paired with refuting claims. Our tool systematically explores different explainable methods and their hyperparameters, demonstrating improved transparency in automated fact-checking systems. As a result, such automated optimization of explanation methods can significantly enhance the interpretability of claim-matching models in critical applications such as misinformation detection, contributing to more trustworthy and transparent AI systems.

Title: CORTEX: Collaborative LLM Agents for High-Stakes Alert Triage

Authors: Bowen Wei, Yuan Shen Tay, Howard Liu, Jinhao Pan, Kun Luo, Ziwei Zhu, Chris Jordan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00311
Pdf URL: https://arxiv.org/pdf/2510.00311
Copy Paste: [[2510.00311]] CORTEX: Collaborative LLM Agents for High-Stakes Alert Triage(https://arxiv.org/abs/2510.00311)
Keywords: llm, agent
Abstract: Security Operations Centers (SOCs) are overwhelmed by tens of thousands of daily alerts, with only a small fraction corresponding to genuine attacks. This overload creates alert fatigue, leading to overlooked threats and analyst burnout. Classical detection pipelines are brittle and context-poor, while recent LLM-based approaches typically rely on a single model to interpret logs, retrieve context, and adjudicate alerts end-to-end -- an approach that struggles with noisy enterprise data and offers limited transparency. We propose CORTEX, a multi-agent LLM architecture for high-stakes alert triage in which specialized agents collaborate over real evidence: a behavior-analysis agent inspects activity sequences, evidence-gathering agents query external systems, and a reasoning agent synthesizes findings into an auditable decision. To support training and evaluation, we release a dataset of fine-grained SOC investigations from production environments, capturing step-by-step analyst actions and linked tool outputs. Across diverse enterprise scenarios, CORTEX substantially reduces false positives and improves investigation quality over state-of-the-art single-agent LLMs.

Title: TokMem: Tokenized Procedural Memory for Large Language Models

Authors: Zijun Wu, Yongchang Hao, Lili Mou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00444
Pdf URL: https://arxiv.org/pdf/2510.00444
Copy Paste: [[2510.00444]] TokMem: Tokenized Procedural Memory for Large Language Models(https://arxiv.org/abs/2510.00444)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Large language models rely heavily on prompts to specify tasks, recall knowledge and guide reasoning. However, this reliance is inefficient as prompts must be re-read at each step, scale poorly across tasks, and lack mechanisms for modular reuse. We introduce TokMem, a tokenized procedural memory that stores recurring procedures as compact, trainable embeddings. Each memory token encodes both an address to a procedure and a control signal that steers generation, enabling targeted behavior with constant-size overhead. To support continual adaptation, TokMem keeps the backbone model frozen, allowing new procedures to be added without interfering with existing ones. We evaluate TokMem on 1,000 tasks for atomic recall, and on function-calling tasks for compositional recall, where it consistently outperforms retrieval-augmented generation while avoiding repeated context overhead, and fine-tuning with far fewer parameters. These results establish TokMem as a scalable and modular alternative to prompt engineering and fine-tuning, offering an explicit procedural memory for LLMs.

Title: LongCodeZip: Compress Long Context for Code Language Models

Authors: Yuling Shi, Yichun Qian, Hongyu Zhang, Beijun Shen, Xiaodong Gu
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2510.00446
Pdf URL: https://arxiv.org/pdf/2510.00446
Copy Paste: [[2510.00446]] LongCodeZip: Compress Long Context for Code Language Models(https://arxiv.org/abs/2510.00446)
Keywords: language model, llm, long context
Abstract: Code generation under long contexts is becoming increasingly critical as Large Language Models (LLMs) are required to reason over extensive information in the codebase. While recent advances enable code LLMs to process long inputs, high API costs and generation latency remain substantial bottlenecks. Existing context pruning techniques, such as LLMLingua, achieve promising results for general text but overlook code-specific structures and dependencies, leading to suboptimal performance in programming tasks. In this paper, we propose LongCodeZip, a novel plug-and-play code compression framework designed specifically for code LLMs. LongCodeZip employs a dual-stage strategy: (1) coarse-grained compression, which identifies and ranks function-level chunks using conditional perplexity with respect to the instruction, retaining only the most relevant functions; and (2) fine-grained compression, which segments retained functions into blocks based on perplexity and selects an optimal subset under an adaptive token budget to maximize relevance. Evaluations across multiple tasks, including code completion, summarization, and question answering, show that LongCodeZip consistently outperforms baseline methods, achieving up to a 5.6x compression ratio without degrading task performance. By effectively reducing context size while preserving essential information, LongCodeZip enables LLMs to better scale to real-world, large-scale code scenarios, advancing the efficiency and capability of code intelligence applications.
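Note: The coarse-grained stage described above can be sketched as follows. The sketch assumes a language model has already produced token log-probabilities for the instruction, once without context and once conditioned on each function-level chunk. It then ranks chunks by how much they lower the instruction's perplexity and keeps them greedily under a fixed token budget. The scoring direction, the greedy selection, the fixed budget, and all names and numbers are assumptions; the abstract describes an adaptive budget and an optimal subset, which this simplification does not reproduce.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def rank_chunks(base_logprobs, conditioned_logprobs):
    """Order chunk names by the perplexity reduction they give the instruction."""
    base = perplexity(base_logprobs)
    gains = {name: base - perplexity(lp) for name, lp in conditioned_logprobs.items()}
    return sorted(gains, key=gains.get, reverse=True)

def keep_within_budget(ranked, token_counts, budget):
    """Greedily keep the highest-ranked chunks that still fit the token budget."""
    kept, used = [], 0
    for name in ranked:
        if used + token_counts[name] <= budget:
            kept.append(name)
            used += token_counts[name]
    return kept

# Hypothetical log-probabilities of a two-token instruction.
base = [-2.0, -2.0]
conditioned = {
    "parse_args": [-0.5, -0.5],
    "main": [-1.0, -1.0],
    "helper": [-2.0, -1.9],
}
sizes = {"parse_args": 120, "main": 300, "helper": 50}

ranked = rank_chunks(base, conditioned)
kept = keep_within_budget(ranked, sizes, budget=200)
```

With these numbers `main` ranks second but is skipped because it would exceed the budget, so the smaller `helper` chunk is kept instead.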

Title: Enhancing Rating Prediction with Off-the-Shelf LLMs Using In-Context User Reviews

Authors: Koki Ryu, Hitomi Yanaka
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00449
Pdf URL: https://arxiv.org/pdf/2510.00449
Copy Paste: [[2510.00449]] Enhancing Rating Prediction with Off-the-Shelf LLMs Using In-Context User Reviews(https://arxiv.org/abs/2510.00449)
Keywords: language model, llm, prompt
Abstract: Personalizing the outputs of large language models (LLMs) to align with individual user preferences is an active research area. However, previous studies have mainly focused on classification or ranking tasks and have not considered Likert-scale rating prediction, a regression task that requires both language and mathematical reasoning to be solved effectively. This task has significant industrial applications, but the utilization of LLMs remains underexplored, particularly regarding the capabilities of off-the-shelf LLMs. This study investigates the performance of off-the-shelf LLMs on rating prediction, providing different in-context information. Through comprehensive experiments with eight models across three datasets, we demonstrate that user-written reviews significantly improve the rating prediction performance of LLMs. This result is comparable to traditional methods like matrix factorization, highlighting the potential of LLMs as a promising solution for the cold-start problem. We also find that the reviews for concrete items are more effective than general preference descriptions that are not based on any specific item. Furthermore, we discover that prompting LLMs to first generate a hypothetical review enhances the rating prediction performance. Our code is available at this https URL.

Title: Agent Fine-tuning through Distillation for Domain-specific LLMs in Microdomains

Authors: Yawen Xue, Masaya Tsunokake, Yuta Koreeda, Ekant Muljibhai Amin, Takashi Sumiyoshi, Yasuhiro Sogawa
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00482
Pdf URL: https://arxiv.org/pdf/2510.00482
Copy Paste: [[2510.00482]] Agent Fine-tuning through Distillation for Domain-specific LLMs in Microdomains(https://arxiv.org/abs/2510.00482)
Keywords: language model, llm, prompt, retrieval-augmented generation, agent
Abstract: Agentic large language models (LLMs) have become prominent for autonomously interacting with external environments and performing multi-step reasoning tasks. Most approaches leverage these capabilities via in-context learning with few-shot prompts, but this often results in lengthy inputs and higher computational costs. Agent fine-tuning offers an alternative by enabling LLMs to internalize procedural reasoning and domain-specific knowledge through training on relevant data and demonstration trajectories. While prior studies have focused on general domains, their effectiveness in specialized technical microdomains remains unclear. This paper explores agent fine-tuning for domain adaptation within Hitachi's JP1 middleware, a microdomain for specialized IT operations. We fine-tuned LLMs using JP1-specific datasets derived from domain manuals and distilled reasoning trajectories generated by LLMs themselves, enhancing decision making accuracy and search efficiency. During inference, we used an agentic prompt with retrieval-augmented generation and introduced a context-answer extractor to improve information relevance. On JP1 certification exam questions, our method achieved a 14% performance improvement over the base model, demonstrating the potential of agent fine-tuning for domain-specific reasoning in complex microdomains.

Title: Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations

Authors: Pengzhou Cheng, Lingzhong Dong, Zeng Wu, Zongru Wu, Xiangru Tang, Chengwei Qin, Zhuosheng Zhang, Gongshen Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00496
Pdf URL: https://arxiv.org/pdf/2510.00496
Copy Paste: [[2510.00496]] Agent-ScanKit: Unraveling Memory and Reasoning of Multimodal Agents via Sensitivity Perturbations(https://arxiv.org/abs/2510.00496)
Keywords: agent
Abstract: Although numerous strategies have recently been proposed to enhance the autonomous interaction capabilities of multimodal agents in graphical user interfaces (GUIs), their reliability remains limited when faced with complex or out-of-domain tasks. This raises a fundamental question: Are existing multimodal agents reasoning spuriously? In this paper, we propose Agent-ScanKit, a systematic probing framework to unravel the memory and reasoning capabilities of multimodal agents under controlled perturbations. Specifically, we introduce three orthogonal probing paradigms: visual-guided, text-guided, and structure-guided, each designed to quantify the contributions of memorization and reasoning without requiring access to model internals. In five publicly available GUI benchmarks involving 18 multimodal agents, the results demonstrate that mechanical memorization often outweighs systematic reasoning. Most of the models function predominantly as retrievers of training-aligned knowledge, exhibiting limited generalization. Our findings underscore the necessity of robust reasoning modeling for multimodal agents in real-world scenarios, offering valuable insights toward the development of reliable multimodal agents.

Title: MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance

Authors: Xingjian Zhao, Zhe Xu, Luozhijie Jin, Yang Wang, Hanfu Chen, Yaozhou Jiang, Ke Chen, Ruixiao Li, Mingshu Chen, Ruiming Wang, Wenbo Zhang, Yiyang Zhang, Donghua Yu, Yang Gao, Xiaogui Yang, Yitian Gong, Yuanfan Xu, Qinyuan Cheng, Zhaoye Fei, Shimin Li, Yaqian Zhou, Xuanjing Huang, Xipeng Qiu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.00499
Pdf URL: https://arxiv.org/pdf/2510.00499
Copy Paste: [[2510.00499]] MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance(https://arxiv.org/abs/2510.00499)
Keywords: language model, llm
Abstract: Spoken dialogue systems often rely on cascaded pipelines that transcribe, process, and resynthesize speech. While effective, this design discards paralinguistic cues and limits expressivity. Recent end-to-end methods reduce latency and better preserve these cues, yet still rely on text intermediates, creating a fundamental bottleneck. We present MOSS-Speech, a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our approach combines a modality-based layer-splitting architecture with a frozen pre-training strategy, preserving the reasoning and knowledge of pretrained text LLMs while adding native speech capabilities. Experiments show that our model achieves state-of-the-art results in spoken question answering and delivers comparable speech-to-speech performance relative to existing text-guided systems, while still maintaining competitive text performance. By narrowing the gap between text-guided and direct speech generation, our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.

Title: Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs

Authors: Yurun Chen, Xavier Hu, Yuhan Liu, Ziqi Wang, Zeyi Liao, Lin Chen, Feng Wei, Yuxi Qian, Bo Zheng, Keting Yin, Shengyu Zhang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.00507
Pdf URL: https://arxiv.org/pdf/2510.00507
Copy Paste: [[2510.00507]] Graph2Eval: Automatic Multimodal Task Generation for Agents via Knowledge Graphs(https://arxiv.org/abs/2510.00507)
Keywords: llm, agent
Abstract: As multimodal LLM-driven agents continue to advance in autonomy and generalization, evaluation based on static datasets can no longer adequately assess their true capabilities in dynamic environments and diverse tasks. Existing LLM-based synthetic data methods are largely designed for LLM training and evaluation, and thus cannot be directly applied to agent tasks that require tool use and interactive capabilities. While recent studies have explored automatic agent task generation with LLMs, most efforts remain limited to text or image analysis, without systematically modeling multi-step interactions in web environments. To address these challenges, we propose Graph2Eval, a knowledge graph-based framework that automatically generates both multimodal document comprehension tasks and web interaction tasks, enabling comprehensive evaluation of agents' reasoning, collaboration, and interactive capabilities. In our approach, knowledge graphs constructed from multi-source external data serve as the task space, where we translate semantic relations into structured multimodal tasks using subgraph sampling, task templates, and meta-paths. A multi-stage filtering pipeline based on node reachability, LLM scoring, and similarity analysis is applied to guarantee the quality and executability of the generated tasks. Furthermore, Graph2Eval supports end-to-end evaluation of multiple agent types (Single-Agent, Multi-Agent, Web Agent) and measures reasoning, collaboration, and interaction capabilities. We instantiate the framework with Graph2Eval-Bench, a curated dataset of 1,319 tasks spanning document comprehension and web interaction scenarios. Experiments show that Graph2Eval efficiently generates tasks that differentiate agent and model performance, revealing gaps in reasoning, collaboration, and web interaction across different settings and offering a new perspective for agent evaluation.

Title: Copy-Paste to Mitigate Large Language Model Hallucinations

Authors: Yongchao Long, Xian Wu, Yingying Zhang, Xianbin Wen, Yuxi Zhou, Shenda Hong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.00508
Pdf URL: https://arxiv.org/pdf/2510.00508
Copy Paste: [[2510.00508]] Copy-Paste to Mitigate Large Language Model Hallucinations(https://arxiv.org/abs/2510.00508)
Keywords: language model, llm, hallucination, prompt, retrieval-augmented generation
Abstract: While Retrieval-Augmented Generation (RAG) enables large language models (LLMs) to generate contextually grounded responses, contextual faithfulness remains challenging as LLMs may not consistently trust provided context, leading to hallucinations that undermine reliability. We observe an inverse correlation between response copying degree and context-unfaithful hallucinations on RAGTruth, suggesting that higher copying degrees reduce hallucinations by fostering genuine contextual belief. We propose CopyPasteLLM, obtained through two-stage high-copying response preference training. We design three prompting methods to enhance copying degree, demonstrating that high-copying responses achieve superior contextual faithfulness and hallucination control. These approaches enable a fully automated pipeline that transforms generated responses into high-copying preference data for training CopyPasteLLM. On FaithEval, ConFiQA and PubMedQA, CopyPasteLLM achieves best performance in both counterfactual and original contexts, remarkably with 12.2% to 24.5% accuracy improvements on FaithEval over the best baseline, while requiring only 365 training samples -- 1/50th of baseline data. To elucidate CopyPasteLLM's effectiveness, we propose the Context-Parameter Copying Capturing algorithm. Interestingly, this reveals that CopyPasteLLM recalibrates reliance on internal parametric knowledge rather than external knowledge during generation. All codes are available at this https URL
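The CopyPasteLLM abstract relies on a "copying degree" of a response relative to its context but does not define it. As a rough illustration only, one possible proxy is the fraction of response word n-grams that also occur in the context. The function name `copy_degree`, the whitespace tokenization, and the choice of trigrams are assumptions made here, not the paper's metric.

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def copy_degree(context, response, n=3):
    """Fraction of response n-grams that appear verbatim in the context.

    Returns 0.0 when the response is shorter than n tokens.
    """
    context_grams = set(ngrams(context.lower().split(), n))
    response_grams = ngrams(response.lower().split(), n)
    if not response_grams:
        return 0.0
    hits = sum(gram in context_grams for gram in response_grams)
    return hits / len(response_grams)


context = "the cat sat on the mat today"
response = "the cat sat on a chair"
print(copy_degree(context, response))
```

A score near 1 means the response is assembled almost entirely from context spans; a score near 0 means it is mostly free-form generation.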

Title: JoyAgent-JDGenie: Technical Report on the GAIA

Authors: Jiarun Liu, Shiyue Xu, Shangkun Liu, Yang Li, Wen Liu, Min Liu, Xiaoqing Zhou, Hanmin Wang, Shilin Jia, zhen Wang, Shaohua Tian, Hanhao Li, Junbo Zhang, Yongli Yu, Peng Cao, Haofen Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00510
Pdf URL: https://arxiv.org/pdf/2510.00510
Copy Paste: [[2510.00510]] JoyAgent-JDGenie: Technical Report on the GAIA(https://arxiv.org/abs/2510.00510)
Keywords: language model, agent
Abstract: Large Language Models are increasingly deployed as autonomous agents for complex real-world tasks, yet existing systems often focus on isolated improvements without a unifying design for robustness and adaptability. We propose a generalist agent architecture that integrates three core components: a collective multi-agent framework combining planning and execution agents with critic model voting, a hierarchical memory system spanning working, semantic, and procedural layers, and a refined tool suite for search, code execution, and multimodal parsing. Evaluated on a comprehensive benchmark, our framework consistently outperforms open-source baselines and approaches the performance of proprietary systems. These results demonstrate the importance of system-level integration and highlight a path toward scalable, resilient, and adaptive AI assistants capable of operating across diverse domains and tasks.

Title: Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

Authors: Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2510.00526
Pdf URL: https://arxiv.org/pdf/2510.00526
Copy Paste: [[2510.00526]] Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum(https://arxiv.org/abs/2510.00526)
Keywords: language model, llm
Abstract: Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. To this end, we study a general family of probability-based objectives and characterize their effectiveness under different conditions. Through comprehensive experiments and extensive ablation studies across 7 model backbones, 14 benchmarks, and 3 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. Our code is available at this https URL.
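To make the contrast between NLL and the prior-leaning objectives $-p$ and $-p^{10}$ concrete, the sketch below compares them per token. It uses only the formulas named in the abstract plus one standard calculus fact: since $d(-p^k)/d\log p = -k\,p^k$ and $d(-\log p)/d\log p = -1$, the objective $-p^k$ scales a token's gradient by $k\,p^k$ relative to NLL. The function names are illustrative, and the thresholded variants mentioned in the abstract are not covered.

```python
import math


def nll(p):
    """Negative log likelihood of a token with model probability p."""
    return -math.log(p)


def neg_prob_power(p, k=1):
    """Probability-based objective -p^k (k=1 gives -p, k=10 gives -p^10)."""
    return -(p ** k)


def grad_scale_vs_nll(p, k=1):
    """Gradient magnitude of -p^k relative to NLL, both taken w.r.t. log p.

    |d(-p^k)/d log p| / |d(-log p)/d log p| = k * p^k
    """
    return k * p ** k


for p in (0.05, 0.5, 0.95):
    print(p, round(nll(p), 3), neg_prob_power(p), grad_scale_vs_nll(p, 10))
```

For a low-probability token (p = 0.05) the relative gradient under $-p^{10}$ is vanishingly small, which is the downweighting of low-probability tokens that the abstract describes, whereas NLL weights every token equally in log-probability space.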

Title: GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness

Authors: Kung-Hsiang Huang, Haoyi Qiu, Yutong Dai, Caiming Xiong, Chien-Sheng Wu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00536
Pdf URL: https://arxiv.org/pdf/2510.00536
Copy Paste: [[2510.00536]] GUI-KV: Efficient GUI Agents via KV Cache with Spatio-Temporal Awareness(https://arxiv.org/abs/2510.00536)
Keywords: language model, agent
Abstract: Graphical user interface (GUI) agents built on vision-language models have emerged as a promising approach to automate human-computer workflows. However, they also face the inefficiency challenge as they process long sequences of high-resolution screenshots and solve long-horizon tasks, making inference slow, costly and memory-bound. While key-value (KV) caching can mitigate this, storing the full cache is prohibitive for image-heavy contexts. Existing cache-compression methods are sub-optimal as they do not account for the spatial and temporal redundancy of GUIs. In this work, we first analyze attention patterns in GUI agent workloads and find that, unlike in natural images, attention sparsity is uniformly high across all transformer layers. This insight motivates a simple uniform budget allocation strategy, which we show empirically outperforms more complex layer-varying schemes. Building on this, we introduce GUI-KV, a plug-and-play KV cache compression method for GUI agents that requires no retraining. GUI-KV combines two novel techniques: (i) spatial saliency guidance, which augments attention scores with the L2 norm of hidden states to better preserve semantically important visual tokens, and (ii) temporal redundancy scoring, which projects previous frames' keys onto the current frame's key subspace to preferentially prune redundant history. Across standard GUI agent benchmarks and models, GUI-KV outperforms competitive KV compression baselines, closely matching full-cache accuracy at modest budgets. Notably, in a 5-screenshot setting on the AgentNetBench benchmark, GUI-KV reduces decoding FLOPs by 38.9% while increasing step accuracy by 4.1% over the full-cache baseline. These results demonstrate that exploiting GUI-specific redundancies enables efficient and reliable agent performance.
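The spatial saliency idea in the GUI-KV abstract (attention scores augmented with the L2 norm of hidden states, followed by pruning to a fixed budget) might look like the toy sketch below. The additive combination, the `alpha` weight, the max-normalization of norms, and the function name `keep_indices` are all assumptions made for illustration; the abstract does not specify how the two signals are combined, and the temporal redundancy scoring is not shown.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def keep_indices(attn_scores, hidden_states, budget, alpha=1.0):
    """Return the positions of the `budget` tokens to keep in the KV cache.

    Each token's score is its attention score plus alpha times its
    hidden-state L2 norm, normalized by the largest norm (assumed form).
    """
    norms = [l2_norm(h) for h in hidden_states]
    max_norm = max(norms) or 1.0  # avoid division by zero if all norms are 0
    scores = [a + alpha * n / max_norm for a, n in zip(attn_scores, norms)]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)


attn = [0.1, 0.4, 0.2, 0.3]
hidden = [[3.0, 4.0], [0.0, 0.0], [0.0, 1.0], [6.0, 8.0]]
print(keep_indices(attn, hidden, budget=2))
print(keep_indices(attn, hidden, budget=2, alpha=0.0))
```

With `alpha=0.0` the selection falls back to attention scores alone, which shows how the norm term can change which tokens survive. Applying the same `budget` to every layer corresponds to the uniform allocation strategy the abstract motivates.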

Title: Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation

Authors: Yubo Xie, Chenkai Wang, Zongyang Ma, Fahui Miao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00567
Pdf URL: https://arxiv.org/pdf/2510.00567
Copy Paste: [[2510.00567]] Are Large Language Models Chronically Online Surfers? A Dataset for Chinese Internet Meme Explanation(https://arxiv.org/abs/2510.00567)
Keywords: language model, llm
Abstract: Large language models (LLMs) are trained on vast amounts of text from the Internet, but do they truly understand the viral content that rapidly spreads online -- commonly known as memes? In this paper, we introduce CHIME, a dataset for CHinese Internet Meme Explanation. The dataset comprises popular phrase-based memes from the Chinese Internet, annotated with detailed information on their meaning, origin, example sentences, types, etc. To evaluate whether LLMs understand these memes, we designed two tasks. In the first task, we assessed the models' ability to explain a given meme, identify its origin, and generate appropriate example sentences. The results show that while LLMs can explain the meanings of some memes, their performance declines significantly for culturally and linguistically nuanced meme types. Additionally, they consistently struggle to provide accurate origins for the memes. In the second task, we created a set of multiple-choice questions (MCQs) requiring LLMs to select the most appropriate meme to fill in a blank within a contextual sentence. While the evaluated models were able to provide correct answers, their performance remains noticeably below human levels. We have made CHIME public and hope it will facilitate future research on computational meme understanding.

Title: ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards

Authors: Shiyu Li, Yang Tang, Yifan Wang, Peiming Li, Xi Chen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00568
Pdf URL: https://arxiv.org/pdf/2510.00568
Copy Paste: [[2510.00568]] ReSeek: A Self-Correcting Framework for Search Agents with Instructive Rewards(https://arxiv.org/abs/2510.00568)
Keywords: language model, llm, agent
Abstract: Search agents powered by Large Language Models (LLMs) have demonstrated significant potential in tackling knowledge-intensive tasks. Reinforcement learning (RL) has emerged as a powerful paradigm for training these agents to perform complex, multi-step reasoning. However, prior RL-based methods often rely on sparse or rule-based rewards, which can lead agents to commit to suboptimal or erroneous reasoning paths without the ability to recover. To address these limitations, we propose ReSeek, a novel self-correcting framework for training search agents. Our framework introduces a self-correction mechanism that empowers the agent to dynamically identify and recover from erroneous search paths during an episode. By invoking a special JUDGE action, the agent can judge the information and re-plan its search strategy. To guide this process, we design a dense, instructive process reward function, which decomposes into a correctness reward for retrieving factual information and a utility reward for finding information genuinely useful for the query. Furthermore, to mitigate the risk of data contamination in existing datasets, we introduce FictionalHot, a new and challenging benchmark with recently curated questions requiring complex reasoning. Extensive experiments show that agents trained with ReSeek, which is intuitively reasonable and practically simple, significantly outperform SOTA baselines in task success rate and path faithfulness.

Title: CoT Vectors: Transferring and Probing the Reasoning Mechanisms of LLMs

Authors: Li Li, Ziyi Wang, Yongliang Wu, Jianfei Cai, Xu Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00579
Pdf URL: https://arxiv.org/pdf/2510.00579
Copy Paste: [[2510.00579]] CoT Vectors: Transferring and Probing the Reasoning Mechanisms of LLMs(https://arxiv.org/abs/2510.00579)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Chain-of-Thought (CoT) prompting has emerged as a powerful approach to enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing implementations, such as in-context learning and fine-tuning, remain costly and inefficient. To improve CoT reasoning at a lower cost, and inspired by the task vector paradigm, we introduce CoT Vectors, compact representations that encode task-general, multi-step reasoning knowledge. Through experiments with Extracted CoT Vectors, we observe pronounced layer-wise instability, manifesting as a U-shaped performance curve that reflects a systematic three-stage reasoning process in LLMs. To address this limitation, we propose Learnable CoT Vectors, optimized under a teacher-student framework to provide more stable and robust guidance. Extensive evaluations across diverse benchmarks and models demonstrate that CoT Vectors not only outperform existing baselines but also achieve performance comparable to parameter-efficient fine-tuning methods, while requiring fewer trainable parameters. Moreover, by treating CoT Vectors as a probe, we uncover how their effectiveness varies due to latent space structure, information density, acquisition mechanisms, and pre-training differences, offering new insights into the functional organization of multi-step reasoning in LLMs. The source code will be released.

Title: MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation

Authors: Jinlan Fu, Shenzhen Huangfu, Hao Fei, Yichong Huang, Xiaoyu Shen, Xipeng Qiu, See-Kiong Ng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00647
Pdf URL: https://arxiv.org/pdf/2510.00647
Copy Paste: [[2510.00647]] MCM-DPO: Multifaceted Cross-Modal Direct Preference Optimization for Alt-text Generation(https://arxiv.org/abs/2510.00647)
Keywords: language model, llm
Abstract: The alt-text generation task produces concise, context-relevant descriptions of images, enabling blind and low-vision users to access online images. Despite the capabilities of large vision-language models, alt-text generation performance remains limited due to noisy user annotations, inconsistent standards, and MLLMs' insensitivity to contextual information. Previous efforts to fine-tune MLLMs using supervised fine-tuning (SFT) have struggled, as SFT relies on accurate target annotations, which are often flawed in user-generated alt-text. To address this, we propose Multi-faceted Cross-modal Direct Preference Optimization (MCM-DPO), which improves alt-text generation by learning to identify better options in preference pairs without requiring precise annotations. MCM-DPO optimizes preferences across single, paired, and multi-preference dimensions, covering textual, visual, and cross-modal factors. In light of the scarcity of high-quality annotated and preference-labeled datasets for alt-text, we constructed two large-scale, high-quality datasets named TAlt and PAlt, sourced from Twitter and Pinterest. These datasets include 202k annotated alt-text samples and 18k preference pairs that cover diverse preference dimensions, aiming to support further research in this domain. Experimental results show that our proposed MCM-DPO method consistently outperforms both DPO and SFT, establishing a new state of the art in alt-text generation. We release the code and data here: this https URL

Title: Facilitating Cognitive Accessibility with LLMs: A Multi-Task Approach to Easy-to-Read Text Generation

Authors: François Ledoyen, Gaël Dias, Jeremie Pantin, Alexis Lechervy, Fabrice Maurel, Youssef Chahir
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.00662
Pdf URL: https://arxiv.org/pdf/2510.00662
Copy Paste: [[2510.00662]] Facilitating Cognitive Accessibility with LLMs: A Multi-Task Approach to Easy-to-Read Text Generation(https://arxiv.org/abs/2510.00662)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Simplifying complex texts is essential for ensuring equitable access to information, especially for individuals with cognitive impairments. The Easy-to-Read (ETR) initiative offers a framework for making content accessible to the neurodivergent population, but the manual creation of such texts remains time-consuming and resource-intensive. In this work, we investigate the potential of large language models (LLMs) to automate the generation of ETR content. To address the scarcity of aligned corpora and the specificity of ETR constraints, we propose a multi-task learning (MTL) approach that trains models jointly on text summarization, text simplification, and ETR generation. We explore two different strategies: multi-task retrieval-augmented generation (RAG) for in-context learning, and MTL-LoRA for parameter-efficient fine-tuning. Our experiments with Mistral-7B and LLaMA-3-8B, based on ETR-fr, a new high-quality dataset, demonstrate the benefits of multi-task setups over single-task baselines across all configurations. Moreover, results show that the RAG-based strategy enables generalization in out-of-domain settings, while MTL-LoRA outperforms all learning strategies within in-domain configurations.

Title: Inclusive Easy-to-Read Generation for Individuals with Cognitive Impairments

Authors: François Ledoyen, Gaël Dias, Alexis Lechervy, Jeremie Pantin, Fabrice Maurel, Youssef Chahir, Elisa Gouzonnat, Mélanie Berthelot, Stanislas Moravac, Armony Altinier, Amy Khairalla
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.00691
Pdf URL: https://arxiv.org/pdf/2510.00691
Copy Paste: [[2510.00691]] Inclusive Easy-to-Read Generation for Individuals with Cognitive Impairments(https://arxiv.org/abs/2510.00691)
Keywords: language model, llm
Abstract: Ensuring accessibility for individuals with cognitive impairments is essential for autonomy, self-determination, and full citizenship. However, manual Easy-to-Read (ETR) text adaptations are slow, costly, and difficult to scale, limiting access to crucial information in healthcare, education, and civic life. AI-driven ETR generation offers a scalable solution but faces key challenges, including dataset scarcity, domain adaptation, and balancing lightweight learning of Large Language Models (LLMs). In this paper, we introduce ETR-fr, the first dataset for ETR text generation fully compliant with European ETR guidelines. We implement parameter-efficient fine-tuning on PLMs and LLMs to establish generative baselines. To ensure high-quality and accessible outputs, we introduce an evaluation framework based on automatic metrics supplemented by human assessments. The latter is conducted using a 36-question evaluation form that is aligned with the guidelines. Overall results show that PLMs perform comparably to LLMs and adapt effectively to out-of-domain texts.

Title: ALARB: An Arabic Legal Argument Reasoning Benchmark

Authors: Harethah Abu Shairah, Somayah AlHarbi, Abdulaziz AlHussein, Sameer Alsabea, Omar Shaqaqi, Hebah AlShamlan, Omar Knio, George Turkiyyah
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.00694
Pdf URL: https://arxiv.org/pdf/2510.00694
Copy Paste: [[2510.00694]] ALARB: An Arabic Legal Argument Reasoning Benchmark(https://arxiv.org/abs/2510.00694)
Keywords: language model, gpt, llm
Abstract: We introduce ALARB, a dataset and suite of tasks designed to evaluate the reasoning capabilities of large language models (LLMs) within the Arabic legal domain. While existing Arabic benchmarks cover some knowledge-intensive tasks such as retrieval and understanding, substantial datasets focusing specifically on multistep reasoning for Arabic LLMs, especially in open-ended contexts, are lacking. The dataset comprises over 13K commercial court cases from Saudi Arabia, with each case including the facts presented, the reasoning of the court, the verdict, as well as the cited clauses extracted from the regulatory documents. We define a set of challenging tasks leveraging this dataset and reflecting the complexity of real-world legal reasoning, including verdict prediction, completion of reasoning chains in multistep legal arguments, and identification of relevant regulations based on case facts. We benchmark a representative selection of current open and closed Arabic LLMs on these tasks and demonstrate the dataset's utility for instruction tuning. Notably, we show that instruction-tuning a modest 12B parameter model using ALARB significantly enhances its performance in verdict prediction and Arabic verdict generation, reaching a level comparable to that of GPT-4o.

Title: Family Matters: Language Transfer and Merging for Adapting Small LLMs to Faroese

Authors: Jenny Kunz, Iben Nyholm Debess, Annika Simonsen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00810
Pdf URL: https://arxiv.org/pdf/2510.00810
Copy Paste: [[2510.00810]] Family Matters: Language Transfer and Merging for Adapting Small LLMs to Faroese(https://arxiv.org/abs/2510.00810)
Keywords: llm
Abstract: We investigate how to adapt small, efficient LLMs to Faroese, a low-resource North Germanic language. Starting from English models, we continue pre-training on related Scandinavian languages, either individually or combined via merging, before fine-tuning on Faroese. We compare full fine-tuning with parameter-efficient tuning using LoRA, evaluating their impact on both linguistic accuracy and text comprehension. Due to the lack of existing Faroese evaluation data, we construct two new minimal-pair benchmarks from adapted and newly collected datasets and complement them with human evaluations by Faroese linguists. Our results demonstrate that transfer from related languages is crucial, though the optimal source language depends on the task: Icelandic enhances linguistic accuracy, whereas Danish boosts comprehension. Similarly, the choice between full fine-tuning and LoRA is task-dependent: LoRA improves linguistic acceptability and slightly increases human evaluation scores on the base model, while full fine-tuning yields stronger comprehension performance and better preserves model capabilities during downstream fine-tuning.

Title: Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-based Machine Translation

Authors: Yanming Sun, Runzhe Zhan, Chi Seng Cheang, Han Wu, Xuebo Liu, Yuyao Niu, Fengying Ye, Kaixin Lan, Lidia S. Chao, Derek F. Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00829
Pdf URL: https://arxiv.org/pdf/2510.00829
Copy Paste: [[2510.00829]] Exposing the Cracks: Vulnerabilities of Retrieval-Augmented LLM-based Machine Translation(https://arxiv.org/abs/2510.00829)
Keywords: llm
Abstract: REtrieval-Augmented LLM-based Machine Translation (REAL-MT) shows promise for knowledge-intensive tasks like idiomatic translation, but its reliability under noisy retrieval contexts remains poorly understood despite this being a common challenge in real-world deployment. To address this gap, we propose a noise synthesis framework and new metrics to evaluate the robustness of REAL-MT systematically. Using this framework, we instantiate REAL-MT with Qwen-series models, including standard LLMs and large reasoning models (LRMs) with enhanced reasoning, and evaluate their performance on idiomatic translation across high-, medium-, and low-resource language pairs under synthesized noise. Our results show that low-resource language pairs, which rely more heavily on retrieved context, degrade more severely under noise than high-resource ones and often produce nonsensical translations. Although LRMs possess enhanced reasoning capabilities, they show no improvement in error correction and are even more susceptible to noise, tending to rationalize incorrect contexts. We find that this stems from an attention shift away from the source idiom to noisy content, while confidence increases despite declining accuracy, indicating poor calibration. To mitigate these issues, we investigate training-free and fine-tuning strategies, which improve robustness at the cost of performance in clean contexts, revealing a fundamental trade-off. Our findings highlight the limitations of current approaches, underscoring the need for self-verifying integration mechanisms.

Title: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs

Authors: Adi Simhi, Jonathan Herzig, Martin Tutek, Itay Itzhak, Idan Szpektor, Yonatan Belinkov
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00857
Pdf URL: https://arxiv.org/pdf/2510.00857
Copy Paste: [[2510.00857]] ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs(https://arxiv.org/abs/2510.00857)
Keywords: language model, llm, agent
Abstract: As large language models (LLMs) evolve from conversational assistants into autonomous agents, evaluating the safety of their actions becomes critical. Prior safety benchmarks have primarily focused on preventing generation of harmful content, such as toxic text. However, they overlook the challenge of agents taking harmful actions when the most effective path to an operational goal conflicts with human safety. To address this gap, we introduce ManagerBench, a benchmark that evaluates LLM decision-making in realistic, human-validated managerial scenarios. Each scenario forces a choice between a pragmatic but harmful action that achieves an operational goal, and a safe action that leads to worse operational performance. A parallel control set, where potential harm is directed only at inanimate objects, measures a model's pragmatism and identifies its tendency to be overly safe. Our findings indicate that the frontier LLMs perform poorly when navigating this safety-pragmatism trade-off. Many consistently choose harmful options to advance their operational goals, while others avoid harm only to become overly safe and ineffective. Critically, we find this misalignment does not stem from an inability to perceive harm, as models' harm assessments align with human judgments, but from flawed prioritization. ManagerBench is a challenging benchmark for a core component of agentic behavior: making safe choices when operational goals and alignment values incentivize conflicting actions. Benchmark & code available at this https URL.

Title: Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs

Authors: Ziliang Wang, Kang An, Xuhui Zheng, Faqiang Qian, Weikun Zhang, Cijun Ouyang, Jialu Cai, Yuhang Wang, Yichao Wu
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2510.00861
Pdf URL: https://arxiv.org/pdf/2510.00861
Copy Paste: [[2510.00861]] Erase to Improve: Erasable Reinforcement Learning for Search-Augmented LLMs(https://arxiv.org/abs/2510.00861)
Keywords: language model, llm
Abstract: While search-augmented large language models (LLMs) exhibit impressive capabilities, their reliability in complex multi-hop reasoning remains limited. This limitation arises from three fundamental challenges: decomposition errors, where tasks are incorrectly broken down; retrieval missing, where key evidence fails to be retrieved; and reasoning errors, where flawed logic propagates through the reasoning chain. A single failure in any of these stages can derail the final answer. We propose Erasable Reinforcement Learning (ERL), a novel framework that transforms fragile reasoning into a robust process. ERL explicitly identifies faulty steps, erases them, and regenerates reasoning in place, preventing defective logic from propagating through the reasoning chain. This targeted correction mechanism turns brittle reasoning into a more resilient process. Models trained with ERL, termed ESearch, achieve substantial improvements on HotpotQA, MuSiQue, 2Wiki, and Bamboogle, with the 3B model achieving +8.48% EM and +11.56% F1, and the 7B model achieving +5.38% EM and +7.22% F1 over previous state-of-the-art (SOTA) results. These findings suggest that erasable reinforcement learning provides a powerful paradigm shift for robust multi-step reasoning in LLMs.
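The EM and F1 gains in this entry are question-answering scores. The abstract does not define them, so the following is only a minimal sketch that assumes the common SQuAD-style convention (lowercase, strip punctuation and English articles, then compare strings or token bags). The function names `normalize`, `exact_match`, and `token_f1` are illustrative and not taken from the paper.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

Under this assumed convention, a prediction that contains the gold answer plus extra tokens scores 0 on EM but receives partial credit on F1, which is why the two numbers are usually reported together.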

Title: HalluGuard: Evidence-Grounded Small Reasoning Models to Mitigate Hallucinations in Retrieval-Augmented Generation

Authors: Loris Bergeron, Ioana Buhnila, Jérôme François, Radu State
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00880
Pdf URL: https://arxiv.org/pdf/2510.00880
Copy Paste: [[2510.00880]] HalluGuard: Evidence-Grounded Small Reasoning Models to Mitigate Hallucinations in Retrieval-Augmented Generation(https://arxiv.org/abs/2510.00880)
Keywords: language model, gpt, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) excel in many NLP tasks but remain prone to hallucinations, limiting trust in real-world applications. We present HalluGuard, a 4B-parameter Small Reasoning Model (SRM) for mitigating hallucinations in Retrieval-Augmented Generation (RAG). HalluGuard classifies document-claim pairs as grounded or hallucinated and produces evidence-grounded justifications for transparency. Our approach combines (i) a domain-agnostic synthetic dataset derived from FineWeb and refined through multi-stage curation and data reformation, (ii) synthetic grounded and hallucinated claims, and (iii) preference-based fine-tuning with Odds Ratio Preference Optimization to distill large-model reasoning into a smaller backbone. On the RAGTruth subset of the LLM-AggreFact benchmark, HalluGuard achieves 84.0% balanced accuracy (BAcc), rivaling specialized models, MiniCheck (7B; 84.0%) and Granite Guardian 3.3 (8B; 82.2%) while using roughly half their parameters. Over the full benchmark it reaches 75.7% BAcc, matching larger general-purpose LLMs such as GPT-4o (75.9%). We will release HalluGuard and datasets under Apache 2.0 upon acceptance.
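The balanced accuracy (BAcc) figures in this entry are easier to interpret with the metric written out. The abstract does not give a formula, so this is a minimal sketch assuming the standard definition, the unweighted mean of per-class recall. The label strings and the helper name `balanced_accuracy` are illustrative only.

```python
def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    recalls = []
    for label in sorted(set(y_true)):
        # Indices of all examples whose true class is `label`.
        indices = [i for i, t in enumerate(y_true) if t == label]
        correct = sum(1 for i in indices if y_pred[i] == label)
        recalls.append(correct / len(indices))
    return sum(recalls) / len(recalls)


# Imbalanced toy data: 8 grounded claims, 2 hallucinated claims.
y_true = ["grounded"] * 8 + ["hallucinated"] * 2
# A degenerate classifier that always answers "grounded".
y_pred = ["grounded"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bacc = balanced_accuracy(y_true, y_pred)
```

On this toy data the always-"grounded" classifier reaches 0.8 plain accuracy but only 0.5 balanced accuracy, which illustrates why a class-balanced metric suits hallucination detection, where the two classes are rarely equally frequent.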

Title: Span-level Detection of AI-generated Scientific Text via Contrastive Learning and Structural Calibration

Authors: Zhen Yin, Shenghua Wang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.00890
Pdf URL: https://arxiv.org/pdf/2510.00890
Copy Paste: [[2510.00890]] Span-level Detection of AI-generated Scientific Text via Contrastive Learning and Structural Calibration(https://arxiv.org/abs/2510.00890)
Keywords: language model, gpt, llm
Abstract: The rapid adoption of large language models (LLMs) in scientific writing raises serious concerns regarding authorship integrity and the reliability of scholarly publications. Existing detection approaches mainly rely on document-level classification or surface-level statistical cues; however, they neglect fine-grained span localization, exhibit weak calibration, and often fail to generalize across disciplines and generators. To address these limitations, we present Sci-SpanDet, a structure-aware framework for detecting AI-generated scholarly texts. The proposed method combines section-conditioned stylistic modeling with multi-level contrastive learning to capture nuanced human-AI differences while mitigating topic dependence, thereby enhancing cross-domain robustness. In addition, it integrates BIO-CRF sequence labeling with pointer-based boundary decoding and confidence calibration to enable precise span-level detection and reliable probability estimates. Extensive experiments on a newly constructed cross-disciplinary dataset of 100,000 annotated samples generated by multiple LLM families (GPT, Qwen, DeepSeek, LLaMA) demonstrate that Sci-SpanDet achieves state-of-the-art performance, with F1(AI) of 80.17, AUROC of 92.63, and Span-F1 of 74.36. Furthermore, it shows strong resilience under adversarial rewriting and maintains balanced accuracy across IMRaD sections and diverse disciplines, substantially surpassing existing baselines. To ensure reproducibility and to foster further research on AI-generated text detection in scholarly documents, the curated dataset and source code will be publicly released upon publication.

Title: Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving

Authors: Shunfeng Zheng, Yudi Zhang, Meng Fang, Zihan Zhang, Zhitan Wu, Mykola Pechenizkiy, Ling Chen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.00919
Pdf URL: https://arxiv.org/pdf/2510.00919
Copy Paste: [[2510.00919]] Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving(https://arxiv.org/abs/2510.00919)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) with foundation models has achieved strong performance across diverse tasks, but their capacity for expert-level reasoning, such as solving Olympiad-level physics problems, remains largely unexplored. Inspired by the way students prepare for competitions by reviewing past problems, we investigate the potential of RAG to enhance physics reasoning in foundation models. We introduce PhoPile, a high-quality multimodal dataset specifically designed for Olympiad-level physics, enabling systematic study of retrieval-based reasoning. PhoPile includes diagrams, graphs, and equations, capturing the inherently multimodal nature of physics problem solving. Using PhoPile, we benchmark RAG-augmented foundation models, covering both large language models (LLMs) and large multimodal models (LMMs) with multiple retrievers. Our results demonstrate that integrating retrieval with physics corpora can improve model performance, while also highlighting challenges that motivate further research in retrieval-augmented physics reasoning.

Title: Making, not Taking, the Best of N

Authors: Ammar Khairi, Daniel D'souza, Marzieh Fadaee, Julia Kreutzer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00931
Pdf URL: https://arxiv.org/pdf/2510.00931
Copy Paste: [[2510.00931]] Making, not Taking, the Best of N(https://arxiv.org/abs/2510.00931)
Keywords: llm
Abstract: Obtaining high-quality generations in modern LLMs has largely been framed as a selection problem: identifying a single winning generation from a diverse pool of N samples, the Best-of-N (BoN). Yet, this approach is inherently zero-sum, discarding diverse and potentially useful information from the pool. Instead, we explore a collaborative setup, where all candidates can potentially contribute to the final winning generation. To this end, we propose Fusion-of-N (FusioN): a method that uses a general LLM judge to synthesize the most informative elements of each sample into a single final answer. We compare FusioN to BoN in two settings: (i) test-time scaling, where we sample and aggregate from a single model at test time, and (ii) synthetic data generation, where we fuse samples from a pool of diverse teachers to improve a student model. We extensively benchmark both setups across 11 languages, 3 diverse tasks and varying model scales. Across the bench, FusioN consistently outperforms BoN, showing versatility and robustness both in test-time scaling and in downstream gains from synthetic data generation. We also perform extensive analysis on FusioN, where it shows surprising strengths and robustness under challenging settings. These results show that we should shift how we think about evaluating and utilizing LLM generations from a monolithic measure of quality, to embracing their polylithic nature. This shift allows us to integrate diverse strengths, unlock latent potential, and achieve improvements that were previously inaccessible through selection alone.

Title: Analyzing Dialectical Biases in LLMs for Knowledge and Reasoning Benchmarks

Authors: Eileen Pan, Anna Seo Gyeong Choi, Maartje ter Hoeve, Skyler Seto, Allison Koenecke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.00962
Pdf URL: https://arxiv.org/pdf/2510.00962
Copy Paste: [[2510.00962]] Analyzing Dialectical Biases in LLMs for Knowledge and Reasoning Benchmarks(https://arxiv.org/abs/2510.00962)
Keywords: language model, llm
Abstract: Large language models (LLMs) are ubiquitous in modern day natural language processing. However, previous work has shown degraded LLM performance for under-represented English dialects. We analyze the effects of typifying "standard" American English language questions as non-"standard" dialectal variants on multiple choice question answering tasks and find up to a 20% reduction in accuracy. Additionally, we investigate the grammatical basis of under-performance in non-"standard" English questions. We find that individual grammatical rules have varied effects on performance, but some are more consequential than others: three specific grammar rules (existential "it", zero copula, and y'all) can explain the majority of performance degradation observed in multiple dialects. We call for future work to investigate bias mitigation methods focused on individual, high-impact grammatical structures.

Title: Syntax-Guided Diffusion Language Models with User-Integrated Personalization

Authors: Ruqian Zhang, Yijiao Zhang, Juan Shen, Zhongyi Zhu, Annie Qu
Subjects: cs.CL, stat.ME
Abstract URL: https://arxiv.org/abs/2510.01028
Pdf URL: https://arxiv.org/pdf/2510.01028
Copy Paste: [[2510.01028]] Syntax-Guided Diffusion Language Models with User-Integrated Personalization(https://arxiv.org/abs/2510.01028)
Keywords: language model
Abstract: Large language models have made revolutionary progress in generating human-like text, yet their outputs often tend to be generic, exhibiting insufficient structural diversity, which limits personalized expression. Recent advances in diffusion models have opened new opportunities for improving language generation beyond the limitations of autoregressive paradigms. In this work, we propose a syntax-guided diffusion language model that integrates structural supervision and personalized conditioning to enhance text quality, diversity, and controllability. We introduce a cascaded framework that generates syntactic guidance before conditional text generation, and further generalize it to a novel noncascaded architecture for better alignment between structure and content. By incorporating syntactic information in the generating process, the proposed model better captures the lexical and structural characteristics of stylistic sentence construction. To enable fine-grained personalization, we develop a shared representation mechanism that facilitates information integration across users, supporting both faithful stylistic generation and generalizable zero-shot inference. Extensive experiments on multiple tasks demonstrate the superiority of our approach in fluency, diversity, and stylistic fidelity. Further qualitative analyses highlight its interpretability and flexibility in learning personalized patterns.

Title: Interpreting Language Models Through Concept Descriptions: A Survey

Authors: Nils Feldhus, Laura Kopf
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.01048
Pdf URL: https://arxiv.org/pdf/2510.01048
Copy Paste: [[2510.01048]] Interpreting Language Models Through Concept Descriptions: A Survey(https://arxiv.org/abs/2510.01048)
Keywords: language model, llm
Abstract: Understanding the decision-making processes of neural networks is a central goal of mechanistic interpretability. In the context of Large Language Models (LLMs), this involves uncovering the underlying mechanisms and identifying the roles of individual model components such as neurons and attention heads, as well as model abstractions such as the learned sparse features extracted by Sparse Autoencoders (SAEs). A rapidly growing line of work tackles this challenge by using powerful generator models to produce open-vocabulary, natural language concept descriptions for these components. In this paper, we provide the first survey of the emerging field of concept descriptions for model components and abstractions. We chart the key methods for generating these descriptions, the evolving landscape of automated and human metrics for evaluating them, and the datasets that underpin this research. Our synthesis reveals a growing demand for more rigorous, causal evaluation. By outlining the state of the art and identifying key challenges, this survey provides a roadmap for future research toward making models more transparent.

Title: Hybrid Dialogue State Tracking for Persian Chatbots: A Language Model-Based Approach

Authors: Samin Mahdipour Aghabagher, Saeedeh Momtazi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01052
Pdf URL: https://arxiv.org/pdf/2510.01052
Copy Paste: [[2510.01052]] Hybrid Dialogue State Tracking for Persian Chatbots: A Language Model-Based Approach(https://arxiv.org/abs/2510.01052)
Keywords: language model, gpt, chat, agent
Abstract: Dialogue State Tracking (DST) is an essential element of conversational AI with the objective of deeply understanding the conversation context and leading it toward answering user requests. Due to high demands for open-domain and multi-turn chatbots, the traditional rule-based DST is not efficient enough, since it cannot provide the required adaptability and coherence for human-like experiences in complex conversations. This study proposes a hybrid DST model that utilizes rule-based methods along with language models, including BERT for slot filling and intent detection, XGBoost for intent validation, GPT for DST, and online agents for real-time answer generation. This model is uniquely designed to be evaluated on a comprehensive Persian multi-turn dialogue dataset and demonstrated significantly improved accuracy and coherence over existing methods in Persian-based chatbots. The results demonstrate how effectively a hybrid approach may improve DST capabilities, paving the way for conversational AI systems that are more customized, adaptable, and human-like.

Title: mR3: Multilingual Rubric-Agnostic Reward Reasoning Models

Authors: David Anugraha, Shou-Yi Hung, Zilu Tang, Annie En-Shiun Lee, Derry Tanti Wijaya, Genta Indra Winata
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.01146
Pdf URL: https://arxiv.org/pdf/2510.01146
Copy Paste: [[2510.01146]] mR3: Multilingual Rubric-Agnostic Reward Reasoning Models(https://arxiv.org/abs/2510.01146)
Keywords: language model, gpt, llm
Abstract: Evaluation using Large Language Model (LLM) judges has been widely adopted in English and shown to be effective for automatic evaluation. However, their performance does not generalize well to non-English settings, and it remains unclear what constitutes effective multilingual training for such judges. In this paper, we introduce mR3, a massively multilingual, rubric-agnostic reward reasoning model trained on 72 languages, achieving the broadest language coverage in reward modeling to date. We present a comprehensive study of data and curriculum selection for training to identify effective strategies and data sources for building high-quality reward models, including the integration of target-language reasoning datasets. Our approach attains state-of-the-art performance on multilingual reward model benchmarks, surpassing much larger models (i.e., GPT-OSS-120B) while being up to 9x smaller, and its effectiveness is further confirmed through extensive ablation studies. Our models, data, and code are available as open source at this https URL.

Title: Pay-Per-Search Models are Abstention Models

Authors: Mustafa Omer Gul, Claire Cardie, Tanya Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01152
Pdf URL: https://arxiv.org/pdf/2510.01152
Copy Paste: [[2510.01152]] Pay-Per-Search Models are Abstention Models(https://arxiv.org/abs/2510.01152)
Keywords: llm
Abstract: LLMs cannot reliably recognize their parametric knowledge boundaries and often hallucinate answers to outside-of-boundary questions. In contrast, humans recognize their limitations and can either seek external help for such questions or abstain. In this paper, we introduce MASH (Modeling Abstention via Selective Help-seeking), a training framework that readily extracts abstentions from LLMs. Our key idea is that any external help-seeking by an LLM, i.e. search tool use, can serve as a proxy for abstention if the external help (search) is appropriately penalized while simultaneously rewarding answer accuracy. MASH operationalizes this idea using reinforcement learning with a pay-per-search reward. We run experiments on three knowledge-intensive QA datasets. Our results show that MASH substantially improves upon the selective help-seeking performance of prior efficient search approaches; on multi-hop datasets, MASH improves answer accuracy by 7.6%. Furthermore, MASH demonstrates strong off-the-shelf abstention -- it can distinguish between unanswerable/answerable questions and selectively generate responses for answerable questions -- showcasing behavior analogous to specialized abstention approaches. We emphasize that contrary to prior abstention methods, MASH does not require pre-determining knowledge boundaries to construct training data. Instead, MASH's abstentions are a by-product of training for the auxiliary selective help-seeking task. Overall, we show that MASH training effectively aligns search tool use with parametric knowledge, which can be successfully leveraged for making abstention decisions.
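Summary: LLMs cannot reliably recognize the boundaries of their parametric knowledge and often hallucinate answers to questions that fall outside those boundaries. In contrast, humans recognize their limitations and can either seek external help for such questions or abstain. In this paper, we introduce MASH (Modeling Abstention via Selective Help-seeking), a training framework that readily extracts abstentions from LLMs. Our key idea is that any external help-seeking by an LLM, i.e., search tool use, can serve as a proxy for abstention if the external help (search) is appropriately penalized while answer accuracy is simultaneously rewarded. MASH operationalizes this idea using reinforcement learning with a pay-per-search reward. We run experiments on three knowledge-intensive QA datasets. Our results show that MASH substantially improves upon the selective help-seeking performance of prior efficient search approaches; on multi-hop datasets, MASH improves answer accuracy by 7.6%. Furthermore, MASH demonstrates strong off-the-shelf abstention: it can distinguish between unanswerable and answerable questions and selectively generate responses for answerable questions, showing behavior analogous to specialized abstention approaches. We emphasize that, contrary to prior abstention methods, MASH does not require pre-determining knowledge boundaries to construct training data. Instead, MASH's abstentions are a by-product of training for the auxiliary selective help-seeking task. Overall, we show that MASH training effectively aligns search tool use with parametric knowledge, which can be successfully leveraged for making abstention decisions.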
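
The pay-per-search idea in the MASH abstract (reward answer accuracy, penalize each search) can be illustrated with a minimal sketch. The abstract does not give the exact reward formula, so the linear penalty, the function name `pay_per_search_reward`, and the default `search_cost` below are all assumptions made for illustration, not the authors' definition.

```python
def pay_per_search_reward(is_correct: bool, num_searches: int,
                          search_cost: float = 0.1) -> float:
    """Hypothetical reward: accuracy bonus minus a fixed price per search call.

    Assumption: the penalty is linear in the number of searches. The source
    abstract only states that search is penalized while accuracy is rewarded.
    """
    if num_searches < 0:
        raise ValueError("num_searches must be non-negative")
    accuracy_reward = 1.0 if is_correct else 0.0
    return accuracy_reward - search_cost * num_searches


# A correct answer from parametric knowledge alone earns the full reward,
# while the same answer obtained after two searches earns less.
no_search = pay_per_search_reward(True, 0)
two_searches = pay_per_search_reward(True, 2)
print(no_search, two_searches)
```

Under a reward of this shape, a policy is pushed to search only when it expects the search to change the answer from wrong to right, which is the sense in which search use can act as a proxy for abstention.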

Title: Backdoor Attacks Against Speech Language Models

Authors: Alexandrine Fortier, Thomas Thebaud, Jesús Villalba, Najim Dehak, Patrick Cardinal
Subjects: cs.CL, cs.CR, cs.SD
Abstract URL: https://arxiv.org/abs/2510.01157
Pdf URL: https://arxiv.org/pdf/2510.01157
Copy Paste: [[2510.01157]] Backdoor Attacks Against Speech Language Models(https://arxiv.org/abs/2510.01157)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) and their multimodal extensions are becoming increasingly popular. One common approach to enable multimodality is to cascade domain-specific encoders with an LLM, making the resulting model inherit vulnerabilities from all of its components. In this work, we present the first systematic study of audio backdoor attacks against speech language models. We demonstrate its effectiveness across four speech encoders and three datasets, covering four tasks: automatic speech recognition (ASR), speech emotion recognition, and gender and age prediction. The attack consistently achieves high success rates, ranging from 90.76% to 99.41%. To better understand how backdoors propagate, we conduct a component-wise analysis to identify the most vulnerable stages of the pipeline. Finally, we propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.
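Summary: Large language models (LLMs) and their multimodal extensions are becoming increasingly popular. A common way to enable multimodality is to cascade domain-specific encoders with an LLM, so that the resulting model inherits vulnerabilities from all of its components. In this work, we present the first systematic study of audio backdoor attacks against speech language models. We demonstrate the attack's effectiveness across four speech encoders and three datasets, covering four tasks: automatic speech recognition (ASR), speech emotion recognition, and gender and age prediction. The attack consistently achieves high success rates, ranging from 90.76% to 99.41%. To better understand how backdoors propagate, we conduct a component-wise analysis to identify the most vulnerable stages of the pipeline. Finally, we propose a fine-tuning-based defense that mitigates the threat of poisoned pretrained encoders.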

Title: Social Welfare Function Leaderboard: When LLM Agents Allocate Social Welfare

Authors: Zhengliang Shi, Ruotian Ma, Jen-tse Huang, Xinbei Ma, Xingyu Chen, Mengru Wang, Qu Yang, Yue Wang, Fanghua Ye, Ziyang Chen, Shanyi Wang, Cixing Li, Wenxuan Wang, Zhaopeng Tu, Xiaolong Li, Zhaochun Ren, Linus
Subjects: cs.CL, cs.AI, cs.CY, cs.HC
Abstract URL: https://arxiv.org/abs/2510.01164
Pdf URL: https://arxiv.org/pdf/2510.01164
Copy Paste: [[2510.01164]] Social Welfare Function Leaderboard: When LLM Agents Allocate Social Welfare(https://arxiv.org/abs/2510.01164)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment where an LLM acts as a sovereign allocator, distributing tasks to a heterogeneous community of recipients. The benchmark is designed to create a persistent trade-off between maximizing collective efficiency (measured by Return on Investment) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state-of-the-art LLMs and present the first leaderboard for social welfare allocation. Our findings reveal three key insights: (i) A model's general conversational ability, as measured by popular leaderboards, is a poor predictor of its allocation skill. (ii) Most LLMs exhibit a strong default utilitarian orientation, prioritizing group productivity at the expense of severe inequality. (iii) Allocation strategies are highly vulnerable, easily perturbed by output-length constraints and social-influence framing. These results highlight the risks of deploying current LLMs as societal decision-makers and underscore the need for specialized benchmarks and targeted alignment for AI governance.
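Summary: Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment in which an LLM acts as a sovereign allocator, distributing tasks to a heterogeneous community of recipients. The benchmark is designed to create a persistent trade-off between maximizing collective efficiency (measured by Return on Investment) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state-of-the-art LLMs and present the first leaderboard for social welfare allocation. Our findings reveal three key insights: (i) a model's general conversational ability, as measured by popular leaderboards, is a poor predictor of its allocation skill; (ii) most LLMs exhibit a strong default utilitarian orientation, prioritizing group productivity at the expense of severe inequality; (iii) allocation strategies are highly vulnerable and easily perturbed by output-length constraints and social-influence framing. These results highlight the risks of deploying current LLMs as societal decision-makers and underscore the need for specialized benchmarks and targeted alignment for AI governance.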
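
The SWF abstract measures distributive fairness with the Gini coefficient. As a reminder of what that number captures, here is a minimal sketch using the standard mean-absolute-difference definition. How the benchmark itself computes the coefficient (and over which quantity) is not stated in the abstract, so the function name `gini` and the example allocations are illustrative assumptions.

```python
def gini(values):
    """Gini coefficient of non-negative values via mean absolute difference.

    Returns 0.0 for perfect equality. For n recipients where one holds
    everything, the value is (n - 1) / n.
    """
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0  # convention chosen here for empty or all-zero input
    if any(v < 0 for v in values):
        raise ValueError("values must be non-negative")
    # Sum of |x_i - x_j| over all ordered pairs, normalized by 2 * n^2 * mean,
    # which simplifies to 2 * n * total.
    abs_diff_sum = sum(abs(a - b) for a in values for b in values)
    return abs_diff_sum / (2 * n * total)


equal_allocation = [5, 5, 5, 5]      # hypothetical allocation, fully equal
skewed_allocation = [0, 0, 0, 20]    # hypothetical allocation, one recipient
print(gini(equal_allocation), gini(skewed_allocation))
```

An allocator that routes every task to the most productive recipient would tend toward the skewed case, which is the efficiency versus fairness tension the benchmark is built around.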

Title: GRAD: Generative Retrieval-Aligned Demonstration Sampler for Efficient Few-Shot Reasoning

Authors: Oussama Gabouj, Kamel Charaf, Ivan Zakazov, Nicolas Baldwin, Robert West
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2510.01165
Pdf URL: https://arxiv.org/pdf/2510.01165
Copy Paste: [[2510.01165]] GRAD: Generative Retrieval-Aligned Demonstration Sampler for Efficient Few-Shot Reasoning(https://arxiv.org/abs/2510.01165)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: Large Language Models (LLMs) achieve strong performance across diverse tasks, but their effectiveness often depends on the quality of the provided context. Retrieval-Augmented Generation (RAG) enriches prompts with external information, but its reliance on static databases constrains adaptability and can result in irrelevant demonstrations. In this work, we propose a Generative Retrieval-Aligned Demonstrator (GRAD), a dynamic demonstration-based approach where an LLM is trained to generate input-specific concise demonstrations. By tailoring demonstrations to each input, our method offers better contextual support than traditional RAG approaches. We demonstrate the superiority of GRAD under budget constraints, where we limit both the number of tokens used per demonstration and the number of tokens used for the final output. Trained solely on a math dataset, GRAD consistently outperforms strong baselines on Qwen2.5-14B across mathematical reasoning and advanced STEM questions, highlighting GRAD's robust generalization to out-of-distribution (OOD) domains such as physics, chemistry, and computer science. Furthermore, we show that demonstrations generated by trained smaller models can effectively guide larger target models, reducing training costs while maintaining competitive accuracy. Overall, this work introduces a scalable demonstration generator model presenting the first step toward a dynamic few-shot learning paradigm in resource-constrained settings. We release the code used for the project.
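Summary: Large language models (LLMs) achieve strong performance across diverse tasks, but their effectiveness often depends on the quality of the provided context. Retrieval-Augmented Generation (RAG) enriches prompts with external information, but its reliance on static databases limits adaptability and can lead to irrelevant demonstrations. In this work, we propose the Generative Retrieval-Aligned Demonstrator (GRAD), a dynamic demonstration-based approach in which an LLM is trained to generate concise, input-specific demonstrations. By tailoring demonstrations to each input, our method offers better contextual support than traditional RAG approaches. We demonstrate the superiority of GRAD under budget constraints, where we limit both the number of tokens used per demonstration and the number of tokens used for the final output. Trained solely on a math dataset, GRAD consistently outperforms strong baselines on Qwen2.5-14B across mathematical reasoning and advanced STEM questions, highlighting GRAD's robust generalization to out-of-distribution (OOD) domains such as physics, chemistry, and computer science. Furthermore, we show that demonstrations generated by trained smaller models can effectively guide larger target models, reducing training costs while maintaining competitive accuracy. Overall, this work introduces a scalable demonstration generator model that represents a first step toward a dynamic few-shot learning paradigm in resource-constrained settings. We release the code used for the project.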

Title: Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity

Authors: Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R. Tomz, Christopher D. Manning, Weiyan Shi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2510.01171
Pdf URL: https://arxiv.org/pdf/2510.01171
Copy Paste: [[2510.01171]] Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity(https://arxiv.org/abs/2510.01171)
Keywords: llm, prompt
Abstract: Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text as a result of well-established findings in cognitive psychology. We formalize this bias theoretically, verify it on preference datasets empirically, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling, a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities"). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy and safety. For instance, in creative writing, VS increases diversity by 1.6-2.1x over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.
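Summary: Post-training alignment often reduces LLM diversity, leading to a phenomenon known as mode collapse. Unlike prior work that attributes this effect to algorithmic limitations, we identify a fundamental, pervasive data-level driver: typicality bias in preference data, whereby annotators systematically favor familiar text, a tendency that is well established in cognitive psychology. We formalize this bias theoretically, verify it empirically on preference datasets, and show that it plays a central role in mode collapse. Motivated by this analysis, we introduce Verbalized Sampling (VS), a simple, training-free prompting strategy to circumvent mode collapse. VS prompts the model to verbalize a probability distribution over a set of responses (e.g., "Generate 5 jokes about coffee and their corresponding probabilities"). Comprehensive experiments show that VS significantly improves performance across creative writing (poems, stories, jokes), dialogue simulation, open-ended QA, and synthetic data generation, without sacrificing factual accuracy or safety. For instance, in creative writing, VS increases diversity by 1.6-2.1x over direct prompting. We further observe an emergent trend that more capable models benefit more from VS. In sum, our work provides a new data-centric perspective on mode collapse and a practical inference-time remedy that helps unlock pre-trained generative diversity.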
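
The Verbalized Sampling abstract describes asking the model for several responses together with their probabilities. The sketch below shows one way such a prompt could be built and how a response could then be drawn from the verbalized distribution. The prompt wording beyond the quoted example, the helper names `build_vs_prompt` and `sample_verbalized`, and the choice to sample in proportion to the stated probabilities are assumptions for illustration. The abstract does not specify how the returned distribution is consumed, and no model call is made here.

```python
import random


def build_vs_prompt(task: str, k: int = 5) -> str:
    """Build a prompt asking for k responses plus verbalized probabilities."""
    return (f"Generate {k} responses to the task below, "
            f"each with its corresponding probability.\nTask: {task}")


def sample_verbalized(candidates, rng=None):
    """Draw one response in proportion to its verbalized probability.

    candidates: list of (response, probability) pairs, e.g. parsed from a
    model reply. Probabilities are renormalized because a model's stated
    values need not sum to one.
    """
    rng = rng or random.Random()
    total = sum(p for _, p in candidates)
    if total <= 0:
        raise ValueError("at least one probability must be positive")
    responses = [r for r, _ in candidates]
    weights = [p / total for _, p in candidates]
    return rng.choices(responses, weights=weights, k=1)[0]


prompt = build_vs_prompt("Tell a joke about coffee", k=5)
# Hypothetical parsed reply; real values would come from the model.
parsed = [("joke A", 0.5), ("joke B", 0.3), ("joke C", 0.2)]
print(prompt)
print(sample_verbalized(parsed, random.Random(0)))
```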

Title: Energy-Regularized Sequential Model Editing on Hyperspheres

Authors: Qingyuan Liu, Jia-Chen Gu, Yunzhi Yao, Hong Wang, Nanyun Peng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2510.01172
Pdf URL: https://arxiv.org/pdf/2510.01172
Copy Paste: [[2510.01172]] Energy-Regularized Sequential Model Editing on Hyperspheres(https://arxiv.org/abs/2510.01172)
Keywords: language model, llm
Abstract: Large language models (LLMs) require constant updates to remain aligned with evolving real-world knowledge. Model editing offers a lightweight alternative to retraining, but sequential editing often destabilizes representations and induces catastrophic forgetting. In this work, we seek to better understand and mitigate performance degradation caused by sequential editing. We hypothesize that hyperspherical uniformity, a property that maintains uniform distribution of neuron weights on a hypersphere, helps the model remain stable and retain prior knowledge while still accommodating new updates. We use Hyperspherical Energy (HE) to quantify neuron uniformity during editing, and examine its correlation with editing performance. Empirical studies across widely used editing methods reveal a strong correlation between HE dynamics and editing performance, with editing failures consistently coinciding with high HE fluctuations. We further theoretically prove that HE dynamics impose a lower bound on the degradation of pretrained knowledge, highlighting why HE stability is crucial for knowledge retention. Motivated by these insights, we propose SPHERE (Sparse Projection for Hyperspherical Energy-Regularized Editing), an HE-driven regularization strategy that stabilizes neuron weight distributions, ultimately preserving prior knowledge while enabling reliable sequential updates. Specifically, SPHERE identifies a sparse space complementary to the principal hyperspherical directions of the pretrained weight matrices and projects new knowledge onto it, attenuating perturbations on the principal directions. Extensive experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE outperforms the best baseline in editing capability by an average of 16.41%, while most faithfully preserving general model performance, thereby offering a principled path toward reliable large-scale knowledge editing.
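Summary: Large language models (LLMs) require constant updates to remain aligned with evolving real-world knowledge. Model editing offers a lightweight alternative to retraining, but sequential editing often destabilizes representations and induces catastrophic forgetting. In this work, we seek to better understand and mitigate the performance degradation caused by sequential editing. We hypothesize that hyperspherical uniformity, a property that keeps neuron weights uniformly distributed on a hypersphere, helps the model remain stable and retain prior knowledge while still accommodating new updates. We use Hyperspherical Energy (HE) to quantify neuron uniformity during editing and examine its correlation with editing performance. Empirical studies across widely used editing methods reveal a strong correlation between HE dynamics and editing performance, with editing failures consistently coinciding with high HE fluctuations. We further prove theoretically that HE dynamics impose a lower bound on the degradation of pretrained knowledge, highlighting why HE stability is crucial for knowledge retention. Motivated by these insights, we propose SPHERE (Sparse Projection for Hyperspherical Energy-Regularized Editing), an HE-driven regularization strategy that stabilizes neuron weight distributions, ultimately preserving prior knowledge while enabling reliable sequential updates. Specifically, SPHERE identifies a sparse space complementary to the principal hyperspherical directions of the pretrained weight matrices and projects new knowledge onto it, attenuating perturbations along the principal directions. Extensive experiments on LLaMA3 (8B) and Qwen2.5 (7B) show that SPHERE outperforms the best baseline in editing capability by an average of 16.41%, while most faithfully preserving general model performance, thereby offering a principled path toward reliable large-scale knowledge editing.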
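
The SPHERE abstract uses Hyperspherical Energy (HE) to quantify how uniformly neuron weights are spread on a hypersphere. A minimal sketch of one common form, the Riesz s-energy over pairwise distances between unit-normalized weight vectors, is given below. The abstract does not state which HE variant or exponent the paper uses, so the formula choice, the function name `hyperspherical_energy`, and the toy weight rows are assumptions for illustration.

```python
import math


def hyperspherical_energy(weights, s=1.0, eps=1e-12):
    """Riesz s-energy of weight vectors projected onto the unit hypersphere.

    weights: list of equal-length, non-zero vectors (one per neuron).
    Lower energy means the directions are more uniformly spread out.
    Assumption: sum over unordered pairs of distance ** (-s), with s > 0.
    """
    unit = []
    for w in weights:
        norm = math.sqrt(sum(x * x for x in w))
        unit.append([x / (norm + eps) for x in w])
    energy = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            dist = math.dist(unit[i], unit[j])
            energy += (dist + eps) ** (-s)  # eps guards identical directions
    return energy


spread = [[1.0, 0.0], [-1.0, 0.0]]      # opposite directions, well separated
clustered = [[1.0, 0.0], [1.0, 0.1]]    # nearly the same direction
print(hyperspherical_energy(spread), hyperspherical_energy(clustered))
```

Tracking a quantity like this before and after each edit is one way to observe the HE fluctuations that the abstract associates with editing failures.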