2025-05-13

Title: ScaleMCP: Dynamic and Auto-Synchronizing Model Context Protocol Tools for LLM Agents

Authors: Elias Lumer, Anmol Gulati, Vamse Kumar Subbiah, Pradeep Honaganahalli Basavaraju, James A. Burke
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.06416
Pdf URL: https://arxiv.org/pdf/2505.06416
Copy Paste: [[2505.06416]] ScaleMCP: Dynamic and Auto-Synchronizing Model Context Protocol Tools for LLM Agents(https://arxiv.org/abs/2505.06416)
Keywords: language model, llm, agent
Abstract: Recent advancements in Large Language Models (LLMs) and the introduction of the Model Context Protocol (MCP) have significantly expanded LLM agents' capability to interact dynamically with external tools and APIs. However, existing tool selection frameworks do not integrate MCP servers, instead relying heavily on error-prone manual updates to monolithic local tool repositories, leading to duplication, inconsistencies, and inefficiencies. Additionally, current approaches abstract tool selection before the LLM agent is invoked, limiting its autonomy and hindering dynamic re-querying capabilities during multi-turn interactions. To address these issues, we introduce ScaleMCP, a novel tool selection approach that dynamically equips LLM agents with an MCP tool retriever, giving agents the autonomy to add tools into their memory, as well as an auto-synchronizing tool storage system pipeline through CRUD (create, read, update, delete) operations with MCP servers as the single source of truth. We also propose a novel embedding strategy, Tool Document Weighted Average (TDWA), designed to selectively emphasize critical components of tool documents (e.g. tool name or synthetic questions) during the embedding process. Comprehensive evaluations conducted on a created dataset of 5,000 financial metric MCP servers, across 10 LLM models, 5 embedding models, and 5 retriever types, demonstrate substantial improvements in tool retrieval and agent invocation performance, emphasizing ScaleMCP's effectiveness in scalable, dynamic tool selection and invocation.
Summary: Recent progress in large language models (LLMs) and the introduction of the Model Context Protocol (MCP) have greatly expanded the ability of LLM agents to interact dynamically with external tools and APIs. Existing tool selection frameworks do not integrate MCP servers and instead depend on error-prone manual updates to monolithic local tool repositories, which causes duplication, inconsistency, and inefficiency. Current approaches also perform tool selection before the agent is invoked, which limits its autonomy and prevents dynamic re-querying in multi-turn interactions. ScaleMCP equips LLM agents with an MCP tool retriever so that agents can add tools to their own memory, and pairs it with an auto-synchronizing tool storage pipeline that uses CRUD (create, read, update, delete) operations with MCP servers as the single source of truth. The authors also propose Tool Document Weighted Average (TDWA), an embedding strategy that selectively emphasizes key components of a tool document, such as the tool name or synthetic questions. Evaluations on a constructed dataset of 5,000 financial metric MCP servers, across 10 LLMs, 5 embedding models, and 5 retriever types, show substantial improvements in tool retrieval and agent invocation performance.
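
The abstract describes TDWA only at a high level: the components of a tool document are embedded and combined with weights that emphasize some of them. A minimal sketch of that idea follows. The component names, the weights, and the `toy_embed` function are illustrative assumptions (a hashed character-trigram vector stands in for a real embedding model), so this is not the authors' implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)

def toy_embed(text, dim=16):
    """Stand-in embedding (assumption): hashed bag of character trigrams."""
    vec = [0.0] * dim
    t = text.lower()
    for i in range(len(t) - 2):
        vec[sum(ord(c) for c in t[i:i + 3]) % dim] += 1.0
    return normalize(vec)

def tdwa_embedding(components, weights, embed=toy_embed):
    """Weighted average of per-component embeddings, re-normalized."""
    total = sum(weights[name] for name in components)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined = None
    for name, text in components.items():
        vec = embed(text)
        w = weights[name] / total  # relative emphasis of this component
        if combined is None:
            combined = [0.0] * len(vec)
        combined = [c + w * x for c, x in zip(combined, vec)]
    return normalize(combined)

# Hypothetical tool document and weights, chosen only for illustration.
tool = {
    "name": "get_net_income",
    "description": "Returns the net income of a company for a fiscal year.",
    "questions": "What was the net income last year?",
}
weights = {"name": 0.5, "description": 0.2, "questions": 0.3}
vector = tdwa_embedding(tool, weights)
```

Putting all the weight on one component reduces the result to that component's own embedding, which is a quick sanity check on any weighting scheme of this kind.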

Title: Is your multimodal large language model a good science tutor?

Authors: Ming Liu, Liwen Wang, Wensheng Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.06418
Pdf URL: https://arxiv.org/pdf/2505.06418
Copy Paste: [[2505.06418]] Is your multimodal large language model a good science tutor?(https://arxiv.org/abs/2505.06418)
Keywords: language model, llm
Abstract: Multimodal large language models (MLLMs) demonstrate impressive performance on scientific reasoning tasks (e.g., ScienceQA). However, most existing benchmarks focus narrowly on the accuracy of the final answer while ignoring other metrics. In particular, when applying MLLMs to educational contexts, the goal is not only correctness but also the ability to teach. In this paper, we propose a framework that evaluates MLLMs as science tutors using a comprehensive educational rubric and a simulated student model that judges the teaching performance of the tutors. Given a list of candidate MLLM science tutors, we use rubric-based student judgments to produce a range of tutor performance scores, identifying both strong and weak tutors. Using the training section of the ScienceQA dataset, we then construct a data set of pairwise comparisons between the outputs of strong and weak tutors. This enables us to apply multiple preference optimization methods to fine-tune an underperforming tutor model (Qwen2-VL-2B) into more effective ones. Our results also show that strong problem-solving skills do not guarantee high-quality tutoring and that performance optimization-guided refinements can yield more educationally aligned tutor models. This approach opens avenues for building MLLMs that serve not only as problem solvers, but as genuinely helpful educational assistants.
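Summary: Multimodal large language models (MLLMs) perform impressively on scientific reasoning tasks such as ScienceQA, but most existing benchmarks look only at final-answer accuracy and ignore other metrics. In educational settings the goal is not only correctness but also the ability to teach. The paper proposes a framework that evaluates MLLMs as science tutors using a comprehensive educational rubric and a simulated student model that judges the tutors' teaching performance. Given a list of candidate MLLM tutors, rubric-based student judgments produce a range of tutor performance scores that separate strong tutors from weak ones. Using the training split of ScienceQA, the authors build a dataset of pairwise comparisons between strong and weak tutor outputs and apply several preference optimization methods to fine-tune an underperforming tutor model (Qwen2-VL-2B) into a more effective one. The results show that strong problem-solving skill does not guarantee high-quality tutoring and that performance optimization-guided refinement can yield tutor models that are better aligned with educational goals.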

Title: xGen-small Technical Report

Authors: Erik Nijkamp, Bo Pang, Egor Pakhomov, Akash Gokul, Jin Qu, Silvio Savarese, Yingbo Zhou, Caiming Xiong
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06496
Pdf URL: https://arxiv.org/pdf/2505.06496
Copy Paste: [[2505.06496]] xGen-small Technical Report(https://arxiv.org/abs/2505.06496)
Keywords: long context
Abstract: We introduce xGen-small, a family of 4B and 9B Transformer decoder models optimized for long-context applications. Our vertically integrated pipeline unites domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension to 128k tokens; and targeted post-training via supervised fine-tuning, preference learning, and online reinforcement learning. xGen-small delivers strong performance across various tasks, especially in math and coding domains, while excelling at long context benchmarks.
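Summary: xGen-small is a family of 4B and 9B Transformer decoder models optimized for long-context applications. Its vertically integrated pipeline combines domain-balanced, frequency-aware data curation; multi-stage pre-training with quality annealing and length extension to 128k tokens; and targeted post-training through supervised fine-tuning, preference learning, and online reinforcement learning. xGen-small performs strongly across a range of tasks, especially in math and coding, and excels on long-context benchmarks.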

Title: REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback

Authors: Aniruddha Roy, Pretam Ray, Abhilash Nandy, Somak Aditya, Pawan Goyal
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.06548
Pdf URL: https://arxiv.org/pdf/2505.06548
Copy Paste: [[2505.06548]] REFINE-AF: A Task-Agnostic Framework to Align Language Models via Self-Generated Instructions using Reinforcement Learning from Automated Feedback(https://arxiv.org/abs/2505.06548)
Keywords: language model, gpt, llm
Abstract: Instruction-based Large Language Models (LLMs) have proven effective in numerous few-shot or zero-shot Natural Language Processing (NLP) tasks. However, creating human-annotated instruction data is time-consuming, expensive, and often limited in quantity and task diversity. Previous research endeavors have attempted to address this challenge by proposing frameworks capable of generating instructions in a semi-automated and task-agnostic manner directly from the model itself. Many of these efforts have relied on large API-only parameter-based models such as GPT-3.5 (175B), which are expensive and subject to limits on the number of queries. This paper explores the performance of three small open-source LLMs, LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B, using a semi-automated framework, thereby reducing the human intervention, effort, and cost required to generate an instruction dataset for fine-tuning LLMs. Furthermore, we demonstrate that incorporating a Reinforcement Learning (RL) based training algorithm into this LLM-based framework leads to further enhancements. Our evaluation of the dataset reveals that these RL-based frameworks achieve substantial improvements in 63-66% of the tasks compared to previous approaches.
Summary: Instruction-based large language models (LLMs) have proven effective on many few-shot and zero-shot natural language processing (NLP) tasks, but human-annotated instruction data is time-consuming and expensive to create and is often limited in quantity and task diversity. Earlier work addressed this with frameworks that generate instructions directly from the model in a semi-automated, task-agnostic way, yet many of those efforts relied on large API-only models such as GPT-3.5 (175B), which are expensive and subject to query limits. This paper studies three small open-source LLMs, LLaMA 2-7B, LLaMA 2-13B, and Mistral 7B, within a semi-automated framework, reducing the human intervention, effort, and cost needed to generate an instruction dataset for fine-tuning. It also shows that adding a reinforcement learning (RL) based training algorithm to the framework brings further gains: the RL-based frameworks achieve substantial improvements on 63-66% of the tasks compared with previous approaches.

Title: MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG

Authors: Woosang Lim, Zekun Li, Gyuwan Kim, Sungyoung Ji, HyeonJung Kim, Kyuri Choi, Jin Hyuk Lim, Kyungpyo Park, William Yang Wang
Subjects: cs.CL, cs.AI, cs.IR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.06569
Pdf URL: https://arxiv.org/pdf/2505.06569
Copy Paste: [[2505.06569]] MacRAG: Compress, Slice, and Scale-up for Multi-Scale Adaptive Context RAG(https://arxiv.org/abs/2505.06569)
Keywords: language model, gpt, llm, long context, retrieval-augmented generation
Abstract: Long-context (LC) Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) hold strong potential for complex multi-hop and large-document tasks. However, existing RAG systems often suffer from imprecise retrieval, incomplete context coverage under constrained context windows, and fragmented information caused by suboptimal context construction. We introduce Multi-scale Adaptive Context RAG (MacRAG), a hierarchical retrieval framework that compresses and partitions documents into coarse-to-fine granularities, then adaptively merges relevant contexts through chunk- and document-level expansions in real time. By starting from the finest-level retrieval and progressively incorporating higher-level and broader context, MacRAG constructs effective query-specific long contexts, optimizing both precision and coverage. Evaluations on the challenging LongBench expansions of HotpotQA, 2WikiMultihopQA, and Musique confirm that MacRAG consistently surpasses baseline RAG pipelines on single- and multi-step generation with Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o. Our results establish MacRAG as an efficient, scalable solution for real-world long-context, multi-hop reasoning. Our code is available at this https URL.
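Summary: Long-context (LC) large language models (LLMs) combined with retrieval-augmented generation (RAG) hold strong potential for complex multi-hop and large-document tasks. Existing RAG systems, however, often suffer from imprecise retrieval, incomplete context coverage under constrained context windows, and fragmented information caused by suboptimal context construction. Multi-scale Adaptive Context RAG (MacRAG) is a hierarchical retrieval framework that compresses and partitions documents into coarse-to-fine granularities and then adaptively merges relevant contexts through chunk-level and document-level expansion in real time. By starting from the finest-level retrieval and progressively adding higher-level, broader context, MacRAG builds effective query-specific long contexts that optimize both precision and coverage. Evaluations on the LongBench expansions of HotpotQA, 2WikiMultihopQA, and Musique confirm that MacRAG consistently surpasses baseline RAG pipelines on single-step and multi-step generation with Llama-3.1-8B, Gemini-1.5-pro, and GPT-4o. The code is available at this https URL.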
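
The fine-to-coarse retrieval flow described above can be illustrated with a toy sketch: match the query against the finest slices, map hits back to their parent chunks, then expand to neighboring chunks in the same document. Everything here is an assumption for illustration (word-overlap scoring, fixed-size splitting, the `neighbors` parameter, no compression step); it is not the MacRAG code.

```python
def split(words, size):
    """Cut a word list into consecutive pieces of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def build_index(docs, chunk_size=12, slice_size=4):
    """Partition each document into chunks (coarse) and slices (fine)."""
    chunks, slices = {}, []
    for doc_id, text in docs.items():
        for c, chunk in enumerate(split(text.lower().split(), chunk_size)):
            chunks[(doc_id, c)] = chunk
            for piece in split(chunk, slice_size):
                slices.append((doc_id, c, piece))  # each slice remembers its parent chunk
    return chunks, slices

def retrieve(query, chunks, slices, top_k=2, neighbors=1):
    """Retrieve at slice level, then expand to parent and neighboring chunks."""
    q = set(query.lower().split())
    # Toy relevance score (assumption): number of query words in the slice.
    ranked = sorted(slices, key=lambda s: len(q & set(s[2])), reverse=True)
    hits = {(d, c) for d, c, words in ranked[:top_k] if q & set(words)}
    expanded = set()
    for d, c in hits:
        for n in range(c - neighbors, c + neighbors + 1):
            if (d, n) in chunks:  # stay inside the same document
                expanded.add((d, n))
    # Merge in document order so the final context reads coherently.
    return [(key, " ".join(chunks[key])) for key in sorted(expanded)]

# Synthetic documents whose "words" are just numbered tokens.
docs = {
    "a": " ".join(f"a{i}" for i in range(30)),
    "b": " ".join(f"b{i}" for i in range(30)),
}
chunks, slices = build_index(docs)
context = retrieve("a13", chunks, slices)
```

Scoring small slices keeps matching precise, while the expansion step restores the surrounding text that a small slice alone would leave out.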

Title: Evaluating LLM-Generated Q&A Test: a Student-Centered Study

Authors: Anna Wróblewska, Bartosz Grabek, Jakub Świstak, Daniel Dan
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2505.06591
Pdf URL: https://arxiv.org/pdf/2505.06591
Copy Paste: [[2505.06591]] Evaluating LLM-Generated Q&A Test: a Student-Centered Study(https://arxiv.org/abs/2505.06591)
Keywords: gpt, llm, chat
Abstract: This research prepares an automatic pipeline for generating reliable question-answer (Q&A) tests using AI chatbots. We automatically generated a GPT-4o-mini-based Q&A test for a Natural Language Processing course and evaluated its psychometric and perceived-quality metrics with students and experts. A mixed-format IRT analysis showed that the generated items exhibit strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality. A uniform DIF check identified two items for review. These findings demonstrate that LLM-generated assessments can match human-authored tests in psychometric performance and user satisfaction, illustrating a scalable approach to AI-assisted assessment development.
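Summary: This research builds an automatic pipeline for generating reliable question-answer (Q&A) tests with AI chatbots. The authors automatically generated a GPT-4o-mini-based Q&A test for a Natural Language Processing course and evaluated its psychometric and perceived-quality metrics with students and experts. A mixed-format IRT analysis showed that the generated items have strong discrimination and appropriate difficulty, while student and expert star ratings reflect high overall quality. A uniform DIF check flagged two items for review. The findings indicate that LLM-generated assessments can match human-authored tests in psychometric performance and user satisfaction, illustrating a scalable approach to AI-assisted assessment development.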

Title: Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation

Authors: Galann Pennec, Zhengyuan Liu, Nicholas Asher, Philippe Muller, Nancy F. Chen
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.06594
Pdf URL: https://arxiv.org/pdf/2505.06594
Copy Paste: [[2505.06594]] Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation(https://arxiv.org/abs/2505.06594)
Keywords: language model
Abstract: Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset, demonstrating superiority against state-of-the-art VLMs such as Gemini 1.5 by generating summaries containing 20% more relevant visual information while requiring 75% less of the video as input.
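Summary: Vision-language models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs such as entire TV show episodes. The paper proposes a zero-shot video-to-text summarization approach that builds its own screenplay representation of an episode, integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, it generates screenplays and names the characters simultaneously in a zero-shot manner, using only audio, video, and transcripts as input. The authors also note that existing summarization metrics can fail to assess the multimodal content of summaries, and they introduce MFactSum, a multimodal metric that evaluates summaries with respect to both the vision and text modalities. On the SummScreen3D dataset, the screenplay summaries outperform state-of-the-art VLMs such as Gemini 1.5, containing 20% more relevant visual information while requiring 75% less of the video as input.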

Title: Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation

Authors: Abbas Bertina, Shahab Beirami, Hossein Biniazian, Elham Esmaeilnia, Soheil Shahi, Mahdi Pirnia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.06599
Pdf URL: https://arxiv.org/pdf/2505.06599
Copy Paste: [[2505.06599]] Bridging the Gap: An Intermediate Language for Enhanced and Cost-Effective Grapheme-to-Phoneme Conversion with Homographs with Multiple Pronunciations Disambiguation(https://arxiv.org/abs/2505.06599)
Keywords: language model, llm, prompt
Abstract: Grapheme-to-phoneme (G2P) conversion for Persian presents unique challenges due to its complex phonological features, particularly homographs and Ezafe, which exist in formal and informal language contexts. This paper introduces an intermediate language specifically designed for Persian language processing that addresses these challenges through a multi-faceted approach. Our methodology combines two key components: Large Language Model (LLM) prompting techniques and a specialized sequence-to-sequence machine transliteration architecture. We developed and implemented a systematic approach for constructing a comprehensive lexical database for disambiguating homographs with multiple pronunciations, often termed polyphones, utilizing formal concept analysis for semantic differentiation. We train our model using two distinct datasets: the LLM-generated dataset for formal and informal Persian and the B-Plus podcasts for informal language variants. The experimental results demonstrate superior performance compared to existing state-of-the-art approaches, particularly in handling the complexities of Persian phoneme conversion. Our model significantly improves Phoneme Error Rate (PER) metrics, establishing a new benchmark for Persian G2P conversion accuracy. This work contributes to the growing research in low-resource language processing, provides a robust solution for Persian text-to-speech systems, and demonstrates its applicability beyond Persian. Specifically, the approach can extend to languages with rich homographic phenomena such as Chinese and Arabic.
Summary: Grapheme-to-phoneme (G2P) conversion for Persian is difficult because of complex phonological features, especially homographs and Ezafe, which occur in both formal and informal language. The paper introduces an intermediate language designed for Persian language processing and tackles these challenges with a multi-faceted approach that combines large language model (LLM) prompting techniques with a specialized sequence-to-sequence machine transliteration architecture. The authors built a comprehensive lexical database for disambiguating homographs with multiple pronunciations (polyphones), using formal concept analysis for semantic differentiation. The model is trained on two datasets: an LLM-generated dataset for formal and informal Persian, and the B-Plus podcasts for informal language variants. Experiments show better performance than existing state-of-the-art approaches and a significantly improved Phoneme Error Rate (PER), setting a new benchmark for Persian G2P accuracy. The work contributes to low-resource language processing, offers a robust solution for Persian text-to-speech systems, and can extend to other languages with rich homographic phenomena, such as Chinese and Arabic.

Title: Boosting Neural Language Inference via Cascaded Interactive Reasoning

Authors: Min Li, Chun Yuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.06607
Pdf URL: https://arxiv.org/pdf/2505.06607
Copy Paste: [[2505.06607]] Boosting Neural Language Inference via Cascaded Interactive Reasoning(https://arxiv.org/abs/2505.06607)
Keywords: language model
Abstract: Natural Language Inference (NLI) focuses on ascertaining the logical relationship (entailment, contradiction, or neutral) between a given premise and hypothesis. This task presents significant challenges due to inherent linguistic features such as diverse phrasing, semantic complexity, and contextual nuances. While Pre-trained Language Models (PLMs) built upon the Transformer architecture have yielded substantial advancements in NLI, prevailing methods predominantly utilize representations from the terminal layer. This reliance on final-layer outputs may overlook valuable information encoded in intermediate layers, potentially limiting the capacity to model intricate semantic interactions effectively. Addressing this gap, we introduce the Cascaded Interactive Reasoning Network (CIRN), a novel architecture designed for deeper semantic comprehension in NLI. CIRN implements a hierarchical feature extraction strategy across multiple network depths, operating within an interactive space where cross-sentence information is continuously integrated. This mechanism aims to mimic a process of progressive reasoning, transitioning from surface-level feature matching to uncovering more profound logical and semantic connections between the premise and hypothesis. By systematically mining latent semantic relationships at various representational levels, CIRN facilitates a more thorough understanding of the input pair. Comprehensive evaluations conducted on several standard NLI benchmark datasets reveal consistent performance gains achieved by CIRN over competitive baseline approaches, demonstrating the efficacy of leveraging multi-level interactive features for complex relational reasoning.
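Summary: Natural Language Inference (NLI) aims to determine the logical relationship (entailment, contradiction, or neutral) between a premise and a hypothesis. The task is hard because of inherent linguistic features such as diverse phrasing, semantic complexity, and contextual nuance. Pre-trained Language Models (PLMs) built on the Transformer architecture have advanced NLI considerably, but prevailing methods mainly use representations from the final layer, which may overlook valuable information encoded in intermediate layers and limit the ability to model intricate semantic interactions. The Cascaded Interactive Reasoning Network (CIRN) is a new architecture for deeper semantic comprehension in NLI. It applies a hierarchical feature extraction strategy across multiple network depths and operates in an interactive space where cross-sentence information is continuously integrated. The mechanism is meant to mimic progressive reasoning, moving from surface-level feature matching to deeper logical and semantic connections between premise and hypothesis. Evaluations on several standard NLI benchmark datasets show consistent gains for CIRN over competitive baselines.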

Title: Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models

Authors: Isaac Gerber
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.06633
Pdf URL: https://arxiv.org/pdf/2505.06633
Copy Paste: [[2505.06633]] Attention Is Not All You Need: The Importance of Feedforward Networks in Transformer Models(https://arxiv.org/abs/2505.06633)
Keywords: language model
Abstract: Decoder-only transformer networks have become incredibly popular for language modeling tasks. State-of-the-art models can have over a hundred transformer blocks, containing billions of trainable parameters, and are trained on trillions of tokens of text. Each transformer block typically consists of a multi-head attention (MHA) mechanism and a two-layer fully connected feedforward network (FFN). In this paper, we examine the importance of the FFN during the model pre-training process through a series of experiments, confirming that the FFN is important to model performance. Furthermore, we show that models using a transformer block configuration with three-layer FFNs and fewer such blocks outperform the standard two-layer configuration, delivering lower training loss with fewer total parameters in less time.
Summary: Decoder-only transformer networks have become extremely popular for language modeling. State-of-the-art models can have more than a hundred transformer blocks with billions of trainable parameters and are trained on trillions of text tokens. Each transformer block typically consists of a multi-head attention (MHA) mechanism and a two-layer fully connected feedforward network (FFN). Through a series of experiments, the paper examines the role of the FFN during pre-training and confirms that it is important to model performance. It also shows that models built from blocks with three-layer FFNs, using fewer such blocks, outperform the standard two-layer configuration, reaching lower training loss with fewer total parameters in less time.

Title: TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models

Authors: Junyi Peng, Takanori Ashihara, Marc Delcroix, Tsubasa Ochiai, Oldrich Plchot, Shoko Araki, Jan Černocký
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.06660
Pdf URL: https://arxiv.org/pdf/2505.06660
Copy Paste: [[2505.06660]] TS-SUPERB: A Target Speech Processing Benchmark for Speech Self-Supervised Learning Models(https://arxiv.org/abs/2505.06660)
Keywords: llm
Abstract: Self-supervised learning (SSL) models have significantly advanced speech processing tasks, and several benchmarks have been proposed to validate their effectiveness. However, previous benchmarks have primarily focused on single-speaker scenarios, with less exploration of target-speaker tasks in noisy, multi-talker conditions -- a more challenging yet practical case. In this paper, we introduce the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB), which includes four widely recognized target-speaker processing tasks that require identifying the target speaker and extracting information from the speech mixture. In our benchmark, the speaker embedding extracted from enrollment speech is used as a clue to condition downstream models. The benchmark result reveals the importance of evaluating SSL models in target speaker scenarios, demonstrating that performance cannot be easily inferred from related single-speaker tasks. Moreover, by using a unified SSL-based target speech encoder, consisting of a speaker encoder and an extractor module, we also investigate joint optimization across TS tasks to leverage mutual information and demonstrate its effectiveness.
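Summary: Self-supervised learning (SSL) models have significantly advanced speech processing, and several benchmarks have been proposed to validate their effectiveness. Previous benchmarks, however, focus mainly on single-speaker scenarios and explore target-speaker tasks in noisy, multi-talker conditions far less, even though that case is more challenging and more practical. The paper introduces the Target-Speaker Speech Processing Universal Performance Benchmark (TS-SUPERB), which covers four widely recognized target-speaker processing tasks that require identifying the target speaker and extracting information from a speech mixture. In the benchmark, a speaker embedding extracted from enrollment speech serves as the clue that conditions the downstream models. The results show the importance of evaluating SSL models in target-speaker scenarios, since performance cannot easily be inferred from related single-speaker tasks. Using a unified SSL-based target speech encoder made up of a speaker encoder and an extractor module, the authors also investigate joint optimization across TS tasks to exploit mutual information and demonstrate its effectiveness.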

Title: From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback

Authors: Zongqi Wang, Tianle Gu, Chen Gong, Xin Tian, Siqi Bao, Yujiu Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.06698
Pdf URL: https://arxiv.org/pdf/2505.06698
Copy Paste: [[2505.06698]] From Rankings to Insights: Evaluation Should Shift Focus from Leaderboard to Feedback(https://arxiv.org/abs/2505.06698)
Keywords: language model, llm
Abstract: Automatic evaluation benchmarks such as MT-Bench, Arena-Hard, and Auto-Arena are seeing growing adoption for the evaluation of Large Language Models (LLMs). Existing research has primarily focused on approximating human-based model rankings using limited data and LLM-as-a-Judge. However, the fundamental premise of these studies, which attempts to replicate human rankings, is flawed. Specifically, these benchmarks typically offer only overall scores, limiting their utility to leaderboard rankings, rather than providing feedback that can guide model optimization and support model profiling. Therefore, we advocate for an evaluation paradigm shift from approximating human-based model rankings to providing feedback with analytical value. To this end, we introduce Feedbacker, an evaluation framework that provides comprehensive and fine-grained results, thereby enabling thorough identification of a model's specific strengths and weaknesses. Such feedback not only supports the targeted optimization of the model but also enhances the understanding of its behavior. Feedbacker comprises three key components: an extensible tree-based query taxonomy builder, an automated query synthesis scheme, and a suite of visualization and analysis tools. Furthermore, we propose a novel LLM-as-a-Judge method: PC2 (Pre-Comparison-derived Criteria) pointwise evaluation. This method derives evaluation criteria by pre-comparing the differences between several auxiliary responses, achieving the accuracy of pairwise evaluation while maintaining the time complexity of pointwise evaluation. Finally, leveraging the evaluation results of 17 mainstream LLMs, we demonstrate the usage of Feedbacker and highlight its effectiveness and potential. Our homepage project is available at this https URL.
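Summary: Automatic evaluation benchmarks such as MT-Bench, Arena-Hard, and Auto-Arena are increasingly used to evaluate large language models (LLMs). Existing research has focused mainly on approximating human-based model rankings using limited data and LLM-as-a-Judge. The authors argue that the basic premise of this work, replicating human rankings, is flawed: such benchmarks typically report only overall scores, which limits them to leaderboard rankings instead of feedback that can guide model optimization and support model profiling. They therefore advocate shifting evaluation from approximating human rankings to providing feedback with analytical value. Feedbacker is an evaluation framework that delivers comprehensive, fine-grained results so that a model's specific strengths and weaknesses can be identified. It has three key components: an extensible tree-based query taxonomy builder, an automated query synthesis scheme, and a suite of visualization and analysis tools. The paper also proposes a new LLM-as-a-Judge method, PC2 (Pre-Comparison-derived Criteria) pointwise evaluation, which derives evaluation criteria by pre-comparing the differences between several auxiliary responses and reaches the accuracy of pairwise evaluation at the time complexity of pointwise evaluation. Using evaluation results for 17 mainstream LLMs, the authors demonstrate how Feedbacker is used and highlight its effectiveness and potential. The project homepage is available at this https URL.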

Title: Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

Authors: Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.06708
Pdf URL: https://arxiv.org/pdf/2505.06708
Copy Paste: [[2505.06708]] Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free(https://arxiv.org/abs/2505.06708)
Keywords: language model
Abstract: Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and also softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we perform a comprehensive comparison over 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates 'attention sink' and enhances long-context extrapolation performance, and we also release related code and models to facilitate future research.
Summary: Gating mechanisms have been widely used, from early models such as LSTMs and Highway Networks to recent state space models, linear attention, and softmax attention, yet the literature rarely examines the specific effects of gating. This work systematically investigates gating-augmented softmax attention variants, comparing 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. The central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. The modification also improves training stability, tolerates larger learning rates, and improves scaling properties. By comparing gating positions and computational variants, the authors attribute the effect to two factors: (1) introducing non-linearity on the low-rank mapping in softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. The sparse gating mechanism also mitigates "attention sink" and improves long-context extrapolation. Related code and models are released to support future research.
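
To make the central modification concrete, here is a minimal single-head sketch in plain Python: the SDPA output is multiplied by a sigmoid gate computed from the same token representation that produces the query. The elementwise gate, the projection name `wg`, and the absence of a causal mask are simplifying assumptions; the abstract says the paper compares several gating positions and variants, and this shows only one possible form.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sdpa(q, k, v):
    """Scaled dot-product attention without masking."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    probs = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(probs, v)

def gated_head(x, wq, wk, wv, wg):
    """One attention head whose output is modulated by a sigmoid gate."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    attn = sdpa(q, k, v)
    # Query-dependent gate: computed per token from that token's own input.
    gate = [[sigmoid(g) for g in row] for row in matmul(x, wg)]
    return [[a * g for a, g in zip(a_row, g_row)] for a_row, g_row in zip(attn, gate)]

# Toy inputs: 3 tokens, 2 dimensions, identity projections (all hypothetical).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
eye = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
plain = sdpa(x, x, x)
half_open = gated_head(x, eye, eye, eye, zeros)  # gate is 0.5 everywhere
mostly_closed = gated_head(x, eye, eye, eye, [[-50.0, 0.0], [0.0, -50.0]])
```

With a zero gate projection every gate value is 0.5, so the output is half the plain attention output. Strongly negative gate logits push individual outputs toward zero, which is the sparse modulation the abstract refers to.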

Title: Utilizing LLMs to Investigate the Disputed Role of Evidence in Electronic Cigarette Health Policy Formation in Australia and the UK

Authors: Damian Curran, Brian Chapman, Mike Conway
Subjects: cs.CL, cs.SI
Abstract URL: https://arxiv.org/abs/2505.06782
Pdf URL: https://arxiv.org/pdf/2505.06782
Copy Paste: [[2505.06782]] Utilizing LLMs to Investigate the Disputed Role of Evidence in Electronic Cigarette Health Policy Formation in Australia and the UK(https://arxiv.org/abs/2505.06782)
Keywords: language model, gpt, llm
Abstract: Australia and the UK have developed contrasting approaches to the regulation of electronic cigarettes, with - broadly speaking - Australia adopting a relatively restrictive approach and the UK adopting a more permissive approach. Notably, these divergent policies were developed from the same broad evidence base. In this paper, to investigate differences in how the two jurisdictions manage and present evidence, we developed and evaluated a Large Language Model-based sentence classifier to perform automated analyses of electronic cigarette-related policy documents drawn from official Australian and UK legislative processes (109 documents in total). Specifically, we utilized GPT-4 to automatically classify sentences based on whether they contained claims that e-cigarettes were broadly helpful or harmful for public health. Our LLM-based classifier achieved an F-score of 0.9. Further, when applying the classifier to our entire sentence-level corpus, we found that Australian legislative documents show a much higher proportion of harmful statements, and a lower proportion of helpful statements compared to the expected values, with the opposite holding for the UK. In conclusion, this work utilized an LLM-based approach to provide evidence to support the contention that - drawing on the same evidence base - Australian ENDS-related policy documents emphasize the harms associated with ENDS products and UK policy documents emphasize the benefits. Further, our approach provides a starting point for using LLM-based methods to investigate the complex relationship between evidence and health policy formation.
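Summary: Australia and the UK have developed contrasting approaches to regulating electronic cigarettes: broadly speaking, Australia has taken a relatively restrictive approach and the UK a more permissive one, even though both policies were developed from the same broad evidence base. To investigate how the two jurisdictions manage and present evidence, the authors developed and evaluated a large language model-based sentence classifier for automated analysis of e-cigarette policy documents drawn from official Australian and UK legislative processes (109 documents in total). They used GPT-4 to classify sentences according to whether they contain claims that e-cigarettes are broadly helpful or harmful for public health; the classifier achieved an F-score of 0.9. Applied to the full sentence-level corpus, it showed that Australian legislative documents contain a much higher proportion of harmful statements and a lower proportion of helpful statements than expected, with the opposite holding for the UK. The work supports the contention that, drawing on the same evidence base, Australian ENDS-related policy documents emphasize the harms of ENDS products while UK policy documents emphasize the benefits, and it offers a starting point for using LLM-based methods to study the relationship between evidence and health policy formation.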

Title: IM-BERT: Enhancing Robustness of BERT through the Implicit Euler Method

Authors: Mihyeon Kim, Juhyoung Park, Youngbin Kim
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06889
Pdf URL: https://arxiv.org/pdf/2505.06889
Copy Paste: [[2505.06889]] IM-BERT: Enhancing Robustness of BERT through the Implicit Euler Method(https://arxiv.org/abs/2505.06889)
Keywords: language model
Abstract: Pre-trained Language Models (PLMs) have achieved remarkable performance on diverse NLP tasks through pre-training and fine-tuning. However, fine-tuning the model with a large number of parameters on limited downstream datasets often leads to vulnerability to adversarial attacks, causing overfitting of the model on standard datasets. To address these issues, we propose IM-BERT from the perspective of a dynamic system by conceptualizing a layer of BERT as a solution of Ordinary Differential Equations (ODEs). Under the situation of initial value perturbation, we analyze the numerical stability of two main numerical ODE solvers: the explicit and implicit Euler approaches. Based on these analyses, we introduce a numerically robust IM-connection incorporating BERT's layers. This strategy enhances the robustness of PLMs against adversarial attacks, even in low-resource scenarios, without introducing additional parameters or adversarial training strategies. Experimental results on the adversarial GLUE (AdvGLUE) dataset validate the robustness of IM-BERT under various conditions. Compared to the original BERT, IM-BERT exhibits a performance improvement of approximately 8.3\%p on the AdvGLUE dataset. Furthermore, in low-resource scenarios, IM-BERT outperforms BERT by achieving 5.9\%p higher accuracy.

Title: EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation

Authors: Xinyi Mou, Chen Qian, Wei Liu, Xuanjing Huang, Zhongyu Wei
Subjects: cs.CL, cs.CY
Abstract URL: https://arxiv.org/abs/2505.06904
Pdf URL: https://arxiv.org/pdf/2505.06904
Copy Paste: [[2505.06904]] EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation(https://arxiv.org/abs/2505.06904)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have demonstrated an impressive ability to role-play humans and replicate complex social dynamics. While large-scale social simulations are gaining increasing attention, they still face significant challenges, particularly regarding high time and computation costs. Existing solutions, such as distributed mechanisms or hybrid agent-based model (ABM) integrations, either fail to address inference costs or compromise accuracy and generalizability. To this end, we propose EcoLANG: Efficient and Effective Agent Communication Language Induction for Social Simulation. EcoLANG operates in two stages: (1) language evolution, where we filter synonymous words and optimize sentence-level rules through natural selection, and (2) language utilization, where agents in social simulations communicate using the evolved language. Experimental results demonstrate that EcoLANG reduces token consumption by over 20%, enhancing efficiency without sacrificing simulation accuracy.

Title: The Distracting Effect: Understanding Irrelevant Passages in RAG

Authors: Chen Amiraz, Florin Cuconasu, Simone Filice, Zohar Karnin
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.06914
Pdf URL: https://arxiv.org/pdf/2505.06914
Copy Paste: [[2505.06914]] The Distracting Effect: Understanding Irrelevant Passages in RAG(https://arxiv.org/abs/2505.06914)
Keywords: llm, retrieval augmented generation
Abstract: A well-known issue with Retrieval Augmented Generation (RAG) is that retrieved passages that are irrelevant to the query sometimes distract the answer-generating LLM, causing it to provide an incorrect response. In this paper, we shed light on this core issue and formulate the distracting effect of a passage w.r.t. a query (and an LLM). We provide a quantifiable measure of the distracting effect of a passage and demonstrate its robustness across LLMs. Our research introduces novel methods for identifying and using hard distracting passages to improve RAG systems. By fine-tuning LLMs with these carefully selected distracting passages, we achieve up to a 7.5% increase in answering accuracy compared to counterparts fine-tuned on conventional RAG datasets. Our contribution is two-fold: first, we move beyond the simple binary classification of irrelevant passages as either completely unrelated vs. distracting, and second, we develop and analyze multiple methods for finding hard distracting passages. To our knowledge, no other research has provided such a comprehensive framework for identifying and utilizing hard distracting passages.

Title: Convert Language Model into a Value-based Strategic Planner

Authors: Xiaoyu Wang, Yue Zhao, Qingqing Gu, Zhonglin Jiang, Xiaokai Chen, Yong Chen, Luo Ji
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.06987
Pdf URL: https://arxiv.org/pdf/2505.06987
Copy Paste: [[2505.06987]] Convert Language Model into a Value-based Strategic Planner(https://arxiv.org/abs/2505.06987)
Keywords: language model, llm
Abstract: Emotional support conversation (ESC) aims to alleviate the emotional distress of individuals through effective conversations. Although large language models (LLMs) have made remarkable progress on ESC, most of these studies might not define the diagram from the state model perspective, therefore providing a suboptimal solution for long-term satisfaction. To address this issue, we leverage Q-learning on LLMs and propose a framework called straQ*. Our framework allows a plug-and-play LLM to bootstrap the planning during ESC, determine the optimal strategy based on long-term returns, and finally guide the LLM to respond. Substantial experiments on ESC datasets suggest that straQ* outperforms many baselines, including direct inference, self-refine, chain of thought, finetuning, and finite state machines.

Title: HAMLET: Healthcare-focused Adaptive Multilingual Learning Embedding-based Topic Modeling

Authors: Hajar Sakai, Sarah S. Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.07157
Pdf URL: https://arxiv.org/pdf/2505.07157
Copy Paste: [[2505.07157]] HAMLET: Healthcare-focused Adaptive Multilingual Learning Embedding-based Topic Modeling(https://arxiv.org/abs/2505.07157)
Keywords: language model, llm
Abstract: Traditional topic models often struggle with contextual nuances and fail to adequately handle polysemy and rare words. This limitation typically results in topics that lack coherence and quality. Large Language Models (LLMs) can mitigate this issue by generating an initial set of topics. However, these raw topics frequently lack refinement and representativeness, which leads to redundancy without lexical similarity and reduced interpretability. This paper introduces HAMLET, a graph-driven architecture for cross-lingual healthcare topic modeling that uses LLMs. The proposed approach leverages neural-enhanced semantic fusion to refine the embeddings of topics generated by the LLM. Instead of relying solely on statistical co-occurrence or human interpretation to extract topics from a document corpus, this method introduces a topic embedding refinement that uses Bidirectional Encoder Representations from Transformers (BERT) and Graph Neural Networks (GNN). After topic generation, a hybrid technique that involves BERT and Sentence-BERT (SBERT) is employed for embedding. The topic representations are further refined using a GNN, which establishes connections between documents, topics, words, similar topics, and similar words. A novel method is introduced to compute similarities. Consequently, the topic embeddings are refined, and the top k topics are extracted. Experiments were conducted using two healthcare datasets, one in English and one in French, from which six sets were derived. The results demonstrate the effectiveness of HAMLET.

Title: Towards Actionable Pedagogical Feedback: A Multi-Perspective Analysis of Mathematics Teaching and Tutoring Dialogue

Authors: Jannatun Naim, Jie Cao, Fareen Tasneem, Jennifer Jacobs, Brent Milne, James Martin, Tamara Sumner
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2505.07161
Pdf URL: https://arxiv.org/pdf/2505.07161
Copy Paste: [[2505.07161]] Towards Actionable Pedagogical Feedback: A Multi-Perspective Analysis of Mathematics Teaching and Tutoring Dialogue(https://arxiv.org/abs/2505.07161)
Keywords: agent
Abstract: Effective feedback is essential for refining instructional practices in mathematics education, and researchers often turn to advanced natural language processing (NLP) models to analyze classroom dialogues from multiple perspectives. However, utterance-level discourse analysis encounters two primary challenges: (1) multifunctionality, where a single utterance may serve multiple purposes that a single tag cannot capture, and (2) the exclusion of many utterances from domain-specific discourse move classifications, leading to their omission in feedback. To address these challenges, we propose a multi-perspective discourse analysis that integrates domain-specific talk moves with dialogue act (using the flattened multi-functional SWBD-MASL schema with 43 tags) and discourse relation (applying Segmented Discourse Representation Theory with 16 relations). Our top-down analysis framework enables a comprehensive understanding of utterances that contain talk moves, as well as utterances that do not contain talk moves. This is applied to two mathematics education datasets: TalkMoves (teaching) and SAGA22 (tutoring). Through distributional unigram analysis, sequential talk move analysis, and multi-view deep dive, we discover meaningful discourse patterns and reveal the vital role of utterances without talk moves, demonstrating that these utterances, far from being mere fillers, serve crucial functions in guiding, acknowledging, and structuring classroom discourse. These insights underscore the importance of incorporating discourse relations and dialogue acts into AI-assisted education systems to enhance feedback and create more responsive learning environments. Our framework may prove helpful not only for providing human educator feedback but also for aiding the development of AI agents that can effectively emulate the roles of both educators and students.

Title: KDH-MLTC: Knowledge Distillation for Healthcare Multi-Label Text Classification

Authors: Hajar Sakai, Sarah S. Lam
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.07162
Pdf URL: https://arxiv.org/pdf/2505.07162
Copy Paste: [[2505.07162]] KDH-MLTC: Knowledge Distillation for Healthcare Multi-Label Text Classification(https://arxiv.org/abs/2505.07162)
Keywords: language model, llm
Abstract: The increasing volume of healthcare textual data requires computationally efficient, yet highly accurate classification approaches able to handle the nuanced and complex nature of medical terminology. This research presents Knowledge Distillation for Healthcare Multi-Label Text Classification (KDH-MLTC), a framework leveraging model compression and Large Language Models (LLMs). The proposed approach addresses conventional healthcare Multi-Label Text Classification (MLTC) challenges by integrating knowledge distillation and sequential fine-tuning, subsequently optimized through Particle Swarm Optimization (PSO) for hyperparameter tuning. KDH-MLTC transfers knowledge from a more complex teacher LLM (i.e., BERT) to a lighter student LLM (i.e., DistilBERT) through sequential training adapted to MLTC that preserves the teacher's learned information while significantly reducing computational requirements. As a result, the classification is enabled to be conducted locally, making it suitable for healthcare textual data characterized by sensitivity and, therefore, ensuring HIPAA compliance. The experiments conducted on three medical literature datasets of different sizes, sampled from the Hallmark of Cancer (HoC) dataset, demonstrate that KDH-MLTC achieves superior performance compared to existing approaches, particularly for the largest dataset, reaching an F1 score of 82.70%. Additionally, statistical validation and an ablation study are carried out, proving the robustness of KDH-MLTC. Furthermore, the PSO-based hyperparameter optimization process allowed the identification of optimal configurations. The proposed approach contributes to healthcare text classification research, balancing efficiency requirements in resource-constrained healthcare settings with satisfactory accuracy demands.

Title: Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs

Authors: Yifan Wei, Xiaoyan Yu, Tengfei Pan, Angsheng Li, Li Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.07184
Pdf URL: https://arxiv.org/pdf/2505.07184
Copy Paste: [[2505.07184]] Structural Entropy Guided Agent for Detecting and Repairing Knowledge Deficiencies in LLMs(https://arxiv.org/abs/2505.07184)
Keywords: language model, llm, agent
Abstract: Large language models (LLMs) have achieved unprecedented performance by leveraging vast pretraining corpora, yet their performance remains suboptimal in knowledge-intensive domains such as medicine and scientific research, where high factual precision is required. While synthetic data provides a promising avenue for augmenting domain knowledge, existing methods frequently generate redundant samples that do not align with the model's true knowledge gaps. To overcome this limitation, we propose a novel Structural Entropy-guided Knowledge Navigator (SENATOR) framework that addresses the intrinsic knowledge deficiencies of LLMs. Our approach employs the Structure Entropy (SE) metric to quantify uncertainty along knowledge graph paths and leverages Monte Carlo Tree Search (MCTS) to selectively explore regions where the model lacks domain-specific knowledge. Guided by these insights, the framework generates targeted synthetic data for supervised fine-tuning, enabling continuous self-improvement. Experimental results on LLaMA-3 and Qwen2 across multiple domain-specific benchmarks show that SENATOR effectively detects and repairs knowledge deficiencies, achieving notable performance improvements. The code and data for our methods and experiments are available at this https URL.

Title: On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Stud

Authors: Hyouin Liu, Zhikuan Zhang
Subjects: cs.CL, cs.SD, eess.AS
Abstract URL: https://arxiv.org/abs/2505.07202
Pdf URL: https://arxiv.org/pdf/2505.07202
Copy Paste: [[2505.07202]] On the Cost and Benefits of Training Context with Utterance or Full Conversation Training: A Comparative Stud(https://arxiv.org/abs/2505.07202)
Keywords: hallucination
Abstract: Modern TTS systems designed for conversations achieve high-quality utterances but often remain inaccessible publicly. Are existing open-source architectures inadequate, or are current training techniques insufficient? This paper investigates prominent models and their underlying behaviors regarding conversational context. Using 20 GPU-hours on an NVIDIA H100, we empirically examine two approaches: context-based utterance-level training versus full conversation training. Results demonstrate that context-based utterance training achieves superior MOS scores (4.3/5.0 vs 3.7/5.0) and reduces training time by 37%, while full conversation approaches suffer from speaker similarity hallucination issues. These findings provide practical guidelines for conversational TTS development, favoring utterance-level training with contextual conditioning for both resource efficiency and output quality.

Title: Benchmarking Ethical and Safety Risks of Healthcare LLMs in China-Toward Systemic Governance under Healthy China 2030

Authors: Mouxiao Bian, Rongzhao Zhang, Chao Ding, Xinwei Peng, Jie Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.07205
Pdf URL: https://arxiv.org/pdf/2505.07205
Copy Paste: [[2505.07205]] Benchmarking Ethical and Safety Risks of Healthcare LLMs in China-Toward Systemic Governance under Healthy China 2030(https://arxiv.org/abs/2505.07205)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are poised to transform healthcare under China's Healthy China 2030 initiative, yet they introduce new ethical and patient-safety challenges. We present a novel 12,000-item Q&A benchmark covering 11 ethics and 9 safety dimensions in medical contexts, to quantitatively evaluate these risks. Using this dataset, we assess state-of-the-art Chinese medical LLMs (e.g., Qwen 2.5-32B, DeepSeek), revealing moderate baseline performance (accuracy 42.7% for Qwen 2.5-32B) and significant improvements after fine-tuning on our data (up to 50.8% accuracy). Results show notable gaps in LLM decision-making on ethics and safety scenarios, reflecting insufficient institutional oversight. We then identify systemic governance shortfalls (including the lack of fine-grained ethical audit protocols, slow adaptation by hospital IRBs, and insufficient evaluation tools) that currently hinder safe LLM deployment. Finally, we propose a practical governance framework for healthcare institutions (embedding LLM auditing teams, enacting data ethics guidelines, and implementing safety simulation pipelines) to proactively manage LLM risks. Our study highlights the urgent need for robust LLM governance in Chinese healthcare, aligning AI innovation with patient safety and ethical standards.

Title: DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation

Authors: Jiashuo Sun, Xianrui Zhong, Sizhe Zhou, Jiawei Han
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07233
Pdf URL: https://arxiv.org/pdf/2505.07233
Copy Paste: [[2505.07233]] DynamicRAG: Leveraging Outputs of Large Language Model as Feedback for Dynamic Reranking in Retrieval-Augmented Generation(https://arxiv.org/abs/2505.07233)
Keywords: language model, llm, retrieval-augmented generation, agent
Abstract: Retrieval-augmented generation (RAG) systems combine large language models (LLMs) with external knowledge retrieval, making them highly effective for knowledge-intensive tasks. A crucial but often under-explored component of these systems is the reranker, which refines retrieved documents to enhance generation quality and explainability. The challenge of selecting the optimal number of documents (k) remains unsolved: too few may omit critical information, while too many introduce noise and inefficiencies. Although recent studies have explored LLM-based rerankers, they primarily leverage internal model knowledge and overlook the rich supervisory signals that LLMs can provide, such as using response quality as feedback for optimizing reranking decisions. In this paper, we propose DynamicRAG, a novel RAG framework where the reranker dynamically adjusts both the order and number of retrieved documents based on the query. We model the reranker as an agent optimized through reinforcement learning (RL), using rewards derived from LLM output quality. Across seven knowledge-intensive datasets, DynamicRAG demonstrates superior performance, achieving state-of-the-art results. The model, data and code are available at this https URL

Title: SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models

Authors: Peichao Lai, Kexuan Zhang, Yi Lin, Linyihan Zhang, Feiyang Ye, Jinhao Yan, Yanwei Xu, Conghui He, Yilei Wang, Wentao Zhang, Bin Cui
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07247
Pdf URL: https://arxiv.org/pdf/2505.07247
Copy Paste: [[2505.07247]] SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models(https://arxiv.org/abs/2505.07247)
Keywords: language model, llm, prompt
Abstract: Subjective Answer Grading (SAG) plays a crucial role in education, standardized testing, and automated assessment systems, particularly for evaluating short-form responses in Short Answer Scoring (SAS). However, existing approaches often produce coarse-grained scores and lack detailed reasoning. Although large language models (LLMs) have demonstrated potential as zero-shot evaluators, they remain susceptible to bias, inconsistencies with human judgment, and limited transparency in scoring decisions. To overcome these limitations, we introduce SAS-Bench, a benchmark specifically designed for LLM-based SAS tasks. SAS-Bench provides fine-grained, step-wise scoring, expert-annotated error categories, and a diverse range of question types derived from real-world subject-specific exams. This benchmark facilitates detailed evaluation of model reasoning processes and explainability. We also release an open-source dataset containing 1,030 questions and 4,109 student responses, each annotated by domain experts. Furthermore, we conduct comprehensive experiments with various LLMs, identifying major challenges in scoring science-related questions and highlighting the effectiveness of few-shot prompting in improving scoring accuracy. Our work offers valuable insights into the development of more robust, fair, and educationally meaningful LLM-based evaluation systems.

Title: No Query, No Access

Authors: Wenqiang Wang, Siyuan Liang, Yangshijie Zhang, Xiaojun Jia, Hao Lin, Xiaochun Cao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07258
Pdf URL: https://arxiv.org/pdf/2505.07258
Copy Paste: [[2505.07258]] No Query, No Access(https://arxiv.org/abs/2505.07258)
Keywords: language model, gpt, llm
Abstract: Textual adversarial attacks mislead NLP models, including Large Language Models (LLMs), by subtly modifying text. While effective, existing attacks often require knowledge of the victim model, extensive queries, or access to training data, limiting real-world feasibility. To overcome these constraints, we introduce the Victim Data-based Adversarial Attack (VDBA), which operates using only victim texts. To prevent access to the victim model, we create a shadow dataset with publicly available pre-trained models and clustering methods as a foundation for developing substitute models. To address the low attack success rate (ASR) due to insufficient information feedback, we propose the hierarchical substitution model design, generating substitute models to mitigate the failure of a single substitute model at the decision boundary. Concurrently, we use diverse adversarial example generation, employing various attack methods to generate and select the adversarial example with better similarity and attack effectiveness. Experiments on the Emotion and SST5 datasets show that VDBA outperforms state-of-the-art methods, achieving an ASR improvement of 52.08% while significantly reducing attack queries to 0. More importantly, we discover that VDBA poses a significant threat to LLMs such as Qwen2 and the GPT family, and achieves the highest ASR of 45.99% even without access to the API, confirming that advanced NLP models still face serious security risks. Our codes can be found at this https URL

Title: On the Robustness of Reward Models for Language Model Alignment

Authors: Jiwoo Hong, Noah Lee, Eunki Kim, Guijin Son, Woojin Chung, Aman Gupta, Shao Tang, James Thorne
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.07271
Pdf URL: https://arxiv.org/pdf/2505.07271
Copy Paste: [[2505.07271]] On the Robustness of Reward Models for Language Model Alignment(https://arxiv.org/abs/2505.07271)
Keywords: language model
Abstract: The Bradley-Terry (BT) model is widely practiced in reward modeling for reinforcement learning with human feedback (RLHF). Despite its effectiveness, reward models (RMs) trained with BT model loss are prone to over-optimization, losing generalizability to unseen input distributions. In this paper, we study the cause of over-optimization in RM training and its downstream effects on the RLHF procedure, accentuating the importance of distributional robustness of RMs in unseen data. First, we show that the excessive dispersion of hidden state norms is the main source of over-optimization. Then, we propose batch-wise sum-to-zero regularization (BSR) to enforce zero-centered reward sum per batch, constraining the rewards with extreme magnitudes. We assess the impact of BSR in improving robustness in RMs through four scenarios of over-optimization, where BSR consistently manifests better robustness. Subsequently, we compare the plain BT model and BSR on RLHF training and empirically show that robust RMs better align the policy to the gold preference model. Finally, we apply BSR to high-quality data and models, which surpasses state-of-the-art RMs in the 8B scale by adding more than 5% in complex preference prediction tasks. By conducting RLOO training with 8B RMs, AlpacaEval 2.0 reduces generation length by 40% while adding a 7% increase in win rate, further highlighting that robustness in RMs induces robustness in RLHF training. We release the code, data, and models: this https URL.
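
Illustrative note (not from the paper): the abstract describes BSR only as enforcing a zero-centered reward sum per batch. One way to read that is a squared penalty on the batch reward sum added to the usual Bradley-Terry loss. The minimal sketch below follows that reading; the penalty form, the `weight` value, and all function names are assumptions for illustration, not the authors' implementation.

```python
import math


def bt_loss(chosen, rejected):
    # Bradley-Terry loss: mean of -log(sigmoid(r_chosen - r_rejected)),
    # written as log(1 + exp(-(r_chosen - r_rejected))).
    pairs = list(zip(chosen, rejected))
    return sum(math.log1p(math.exp(-(c - r))) for c, r in pairs) / len(pairs)


def bsr_penalty(chosen, rejected):
    # Assumed form of the regularizer: squared sum of every reward in the batch.
    # It is zero exactly when the batch rewards sum to zero.
    return (sum(chosen) + sum(rejected)) ** 2


def bt_loss_with_bsr(chosen, rejected, weight=0.01):
    # weight is a hypothetical hyperparameter balancing the two terms.
    return bt_loss(chosen, rejected) + weight * bsr_penalty(chosen, rejected)


if __name__ == "__main__":
    centered = ([1.0, 2.0], [-1.0, -2.0])
    shifted = ([11.0, 12.0], [9.0, 8.0])  # same margins, rewards shifted by +10
    print(bt_loss(*centered), bt_loss(*shifted))
    print(bt_loss_with_bsr(*centered), bt_loss_with_bsr(*shifted))
```

The Bradley-Terry term depends only on reward differences, so shifting every reward by a constant leaves it unchanged. The added penalty is what discourages the reward scale from drifting, which is the intuition behind constraining rewards with extreme magnitudes.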

Title: Semantic Retention and Extreme Compression in LLMs: Can We Have Both?

Authors: Stanislas Laborde, Martin Cousseau, Antoun Yaacoub, Lionel Prevost
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.07289
Pdf URL: https://arxiv.org/pdf/2505.07289
Copy Paste: [[2505.07289]] Semantic Retention and Extreme Compression in LLMs: Can We Have Both?(https://arxiv.org/abs/2505.07289)
Keywords: language model, llm
Abstract: The exponential growth in Large Language Model (LLM) deployment has intensified the need for efficient model compression techniques to reduce computational and memory costs. While pruning and quantization have shown promise, their combined potential remains largely unexplored. In this paper, we examine joint compression and how strategically combining pruning and quantization could yield superior performance-to-compression ratios compared to single-method approaches. Recognizing the challenges in accurately assessing LLM performance, we address key limitations of previous evaluation frameworks and introduce the Semantic Retention Compression Rate (SrCr), a novel metric that quantifies the trade-off between model compression and semantic preservation, facilitating the optimization of pruning-quantization configurations. Experiments demonstrate that our recommended combination achieves, on average, a 20% performance increase compared to an equivalent quantization-only model at the same theoretical compression rate.

Title: AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection

Authors: Kai Hua, Steven Wu, Ge Zhang, Ke Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.07293
Pdf URL: https://arxiv.org/pdf/2505.07293
Copy Paste: [[2505.07293]] AttentionInfluence: Adopting Attention Head Influence for Weak-to-Strong Pretraining Data Selection(https://arxiv.org/abs/2505.07293)
Keywords: language model, llm
Abstract: Recently, there has been growing interest in collecting reasoning-intensive pretraining data to improve LLMs' complex reasoning ability. Prior approaches typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs, often introducing domain-specific biases. Because attention heads are crucial to in-context reasoning, we propose AttentionInfluence, a simple yet effective, training-free method without supervision signal. Our approach enables a small pretrained language model to act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when masking these heads. We apply AttentionInfluence to a 1.3B-parameter dense model to conduct data selection on the SmolLM corpus of 241B tokens, and mix the SmolLM corpus with the selected subset comprising 73B tokens to pretrain a 7B-parameter dense model using 1T training tokens and WSD learning rate scheduling. Our experimental results demonstrate substantial improvements, ranging from 1.4pp to 3.5pp, across several knowledge-intensive and reasoning-heavy benchmarks (i.e., MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval). This demonstrates an effective weak-to-strong scaling property, with small models improving the final performance of larger models, offering a promising and scalable path for reasoning-centric data selection.
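Summary: Interest has grown in collecting reasoning-intensive pretraining data to improve the complex reasoning ability of LLMs. Prior methods typically rely on supervised classifiers to identify such data, which requires labeling by humans or LLMs and often introduces domain-specific bias. Because attention heads are critical for in-context reasoning, we propose AttentionInfluence, a simple but effective training-free method with no supervision signal. Our method lets a small pretrained language model act as a strong data selector through a simple attention head masking operation. Specifically, we identify retrieval heads and compute the loss difference when these heads are masked. We apply AttentionInfluence to a 1.3B-parameter dense model to select data from the 241B-token SmolLM corpus, then mix the SmolLM corpus with the selected 73B-token subset to pretrain a 7B-parameter dense model on 1T training tokens with WSD learning rate scheduling. Experiments show substantial improvements of 1.4pp to 3.5pp on several knowledge-intensive and reasoning-heavy benchmarks (MMLU, MMLU-Pro, AGIEval-en, GSM8K, and HumanEval). This demonstrates an effective weak-to-strong scaling property, in which small models improve the final performance of larger models, and offers a promising, scalable path for reasoning-centric data selection.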
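
The selection step described in this entry (score each sample by the loss difference when retrieval heads are masked, then keep the highest-scoring samples) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the per-sample losses are made-up numbers rather than model outputs, and using the plain difference `masked - base` as the score is an assumption.

```python
from typing import List, Sequence


def attention_influence_scores(
    base_losses: Sequence[float], masked_losses: Sequence[float]
) -> List[float]:
    """Score each sample by how much its loss rises when the heads are masked.

    Assumption: the score is the plain difference masked - base.
    """
    if len(base_losses) != len(masked_losses):
        raise ValueError("need one base loss and one masked loss per sample")
    return [masked - base for base, masked in zip(base_losses, masked_losses)]


def select_top_fraction(scores: Sequence[float], fraction: float) -> List[int]:
    """Return the indices of the highest-scoring samples, in ascending order."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, round(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical per-sample losses from a small reference model.
base = [2.0, 1.5, 3.0, 2.5]      # all heads active
masked = [2.1, 2.5, 3.05, 3.5]   # retrieval heads masked
scores = attention_influence_scores(base, masked)
selected = select_top_fraction(scores, 0.5)
print(selected)
```

Samples whose loss barely changes under masking are treated here as depending little on the masked heads, so they rank low; the selected fraction is a free parameter of the sketch.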

Title: Towards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study

Authors: Baixuan Xu, Chunyang Li, Weiqi Wang, Wei Fan, Tianshi Zheng, Haochen Shi, Tao Fan, Yangqiu Song, Qiang Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07313
Pdf URL: https://arxiv.org/pdf/2505.07313
Copy Paste: [[2505.07313]] Towards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study(https://arxiv.org/abs/2505.07313)
Keywords: llm, agent
Abstract: Designing an effective collaboration structure for multi-agent LLM systems to enhance collective reasoning is crucial yet remains under-explored. In this paper, we systematically investigate how collaborative reasoning performance is affected by three key design dimensions: (1) Expertise-Domain Alignment, (2) Collaboration Paradigm (structured workflow vs. diversity-driven integration), and (3) System Scale. Our findings reveal that expertise alignment benefits are highly domain-contingent, proving most effective for contextual reasoning tasks. Furthermore, collaboration focused on integrating diverse knowledge consistently outperforms rigid task decomposition. Finally, we empirically explore the impact of scaling the multi-agent system with expertise specialization and study the computational trade-off, highlighting the need for more efficient communication protocol design. This work provides concrete guidelines for configuring specialized multi-agent systems and identifies critical architectural trade-offs and bottlenecks for scalable multi-agent reasoning. The code will be made available upon acceptance.
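Summary: Designing an effective collaboration structure for multi-agent LLM systems to strengthen collective reasoning is crucial but remains under-explored. In this paper, we systematically study how collaborative reasoning performance is affected by three key design dimensions: (1) expertise-domain alignment, (2) collaboration paradigm (structured workflow vs. diversity-driven integration), and (3) system scale. Our findings show that the benefits of expertise alignment are highly domain-dependent and are most effective for contextual reasoning tasks. In addition, collaboration focused on integrating diverse knowledge consistently outperforms rigid task decomposition. Finally, we empirically explore the effect of scaling a multi-agent system with expertise specialization and study the computational trade-off, highlighting the need for more efficient communication protocol design. This work provides concrete guidelines for configuring specialized multi-agent systems and identifies key architectural trade-offs and bottlenecks for scalable multi-agent reasoning. The code will be made available upon acceptance.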

Title: QUPID: Quantified Understanding for Enhanced Performance, Insights, and Decisions in Korean Search Engines

Authors: Ohjoon Kwon, Changsu Lee, Jihye Back, Lim Sun Suk, Inho Kang, Donghyeon Jeon
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.07345
Pdf URL: https://arxiv.org/pdf/2505.07345
Copy Paste: [[2505.07345]] QUPID: Quantified Understanding for Enhanced Performance, Insights, and Decisions in Korean Search Engines(https://arxiv.org/abs/2505.07345)
Keywords: language model, llm
Abstract: Large language models (LLMs) have been widely used for relevance assessment in information retrieval. However, our study demonstrates that combining two distinct small language models (SLMs) with different architectures can outperform LLMs in this task. Our approach -- QUPID -- integrates a generative SLM with an embedding-based SLM, achieving higher relevance judgment accuracy while reducing computational costs compared to state-of-the-art LLM solutions. This computational efficiency makes QUPID highly scalable for real-world search systems processing millions of queries daily. In experiments across diverse document types, our method demonstrated consistent performance improvements (Cohen's Kappa of 0.646 versus 0.387 for leading LLMs) while offering 60x faster inference times. Furthermore, when integrated into production search pipelines, QUPID improved nDCG@5 scores by 1.9%. These findings underscore how architectural diversity in model combinations can significantly enhance both search relevance and operational efficiency in information retrieval systems.
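Summary: Large language models (LLMs) have been widely used for relevance assessment in information retrieval. However, our study shows that combining two distinct small language models (SLMs) with different architectures can outperform LLMs on this task. Our approach, QUPID, integrates a generative SLM with an embedding-based SLM, achieving higher relevance judgment accuracy at lower computational cost than state-of-the-art LLM solutions. This computational efficiency makes QUPID highly scalable for real-world search systems that process millions of queries daily. In experiments across diverse document types, our method showed consistent performance improvements (Cohen's Kappa of 0.646 versus 0.387 for leading LLMs) while offering 60x faster inference. Furthermore, when integrated into production search pipelines, QUPID improved nDCG@5 scores by 1.9%. These findings underscore how architectural diversity in model combinations can significantly improve both search relevance and operational efficiency in information retrieval systems.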

Title: Computational Fact-Checking of Online Discourse: Scoring scientific accuracy in climate change related news articles

Authors: Tim Wittenborg, Constantin Sebastian Tremel, Markus Stocker, Sören Auer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.07409
Pdf URL: https://arxiv.org/pdf/2505.07409
Copy Paste: [[2505.07409]] Computational Fact-Checking of Online Discourse: Scoring scientific accuracy in climate change related news articles(https://arxiv.org/abs/2505.07409)
Keywords: llm
Abstract: Democratic societies need reliable information. Misinformation in popular media such as news articles or videos threatens to impair civic discourse. Citizens are, unfortunately, not equipped to verify this content flood consumed daily at increasing rates. This work aims to semi-automatically quantify scientific accuracy of online media. By semantifying media of unknown veracity, their statements can be compared against equally processed trusted sources. We implemented a workflow using LLM-based statement extraction and knowledge graph analysis. Our neurosymbolic system was able to evidently streamline state-of-the-art veracity quantification. Evaluated via expert interviews and a user survey, the tool provides a beneficial veracity indication. This indicator, however, is unable to annotate public media at the required granularity and scale. Further work towards a FAIR (Findable, Accessible, Interoperable, Reusable) ground truth and complementary metrics is required to scientifically support civic discourse.
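Summary: Democratic societies need reliable information. Misinformation in popular media such as news articles or videos can impair civic discourse. Unfortunately, citizens are not equipped to verify the flood of content they consume daily at increasing rates. This work aims to semi-automatically quantify the scientific accuracy of online media. By semantifying media of unknown veracity, their statements can be compared against trusted sources processed in the same way. We implemented a workflow using LLM-based statement extraction and knowledge graph analysis. Our neurosymbolic system was able to evidently streamline state-of-the-art veracity quantification. Evaluated through expert interviews and a user survey, the tool provides a beneficial veracity indication. However, this indicator cannot annotate public media at the required granularity and scale. Further work towards a FAIR (Findable, Accessible, Interoperable, Reusable) ground truth and complementary metrics is required to scientifically support civic discourse.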

Title: ToolACE-DEV: Self-Improving Tool Learning via Decomposition and EVolution

Authors: Xu Huang, Weiwen Liu, Xingshan Zeng, Yuefeng Huang, Xinlong Hao, Yuxian Wang, Yirong Zeng, Chuhan Wu, Yasheng Wang, Ruiming Tang, Defu Lian
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07512
Pdf URL: https://arxiv.org/pdf/2505.07512
Copy Paste: [[2505.07512]] ToolACE-DEV: Self-Improving Tool Learning via Decomposition and EVolution(https://arxiv.org/abs/2505.07512)
Keywords: language model, llm
Abstract: The tool-using capability of large language models (LLMs) enables them to access up-to-date external information and handle complex tasks. Current approaches to enhancing this capability primarily rely on distilling advanced models by data synthesis. However, this method incurs significant costs associated with advanced model usage and often results in data compatibility issues, caused by the large discrepancy in knowledge scope between the advanced model and the target model. To address these challenges, we propose ToolACE-DEV, a self-improving framework for tool learning. First, we decompose the tool-learning objective into sub-tasks that enhance basic tool-making and tool-using abilities. Then, we introduce a self-evolving paradigm that allows lightweight models to self-improve, reducing reliance on advanced LLMs. Extensive experiments validate the effectiveness of our approach across models of varying scales and architectures.
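Summary: The tool-using capability of large language models (LLMs) lets them access up-to-date external information and handle complex tasks. Current approaches to improving this capability mainly rely on distilling advanced models through data synthesis. However, this incurs significant costs for advanced model usage and often leads to data compatibility issues, caused by the large discrepancy in knowledge scope between the advanced model and the target model. To address these challenges, we propose ToolACE-DEV, a self-improving framework for tool learning. First, we decompose the tool-learning objective into sub-tasks that strengthen basic tool-making and tool-using abilities. Then, we introduce a self-evolving paradigm that allows lightweight models to self-improve, reducing reliance on advanced LLMs. Extensive experiments validate the effectiveness of our approach across models of varying scales and architectures.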

Title: SEReDeEP: Hallucination Detection in Retrieval-Augmented Models via Semantic Entropy and Context-Parameter Fusion

Authors: Lei Wang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.07528
Pdf URL: https://arxiv.org/pdf/2505.07528
Copy Paste: [[2505.07528]] SEReDeEP: Hallucination Detection in Retrieval-Augmented Models via Semantic Entropy and Context-Parameter Fusion(https://arxiv.org/abs/2505.07528)
Keywords: hallucination, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) models frequently encounter hallucination phenomena when integrating external information with internal parametric knowledge. Empirical studies demonstrate that the disequilibrium between external contextual information and internal parametric knowledge constitutes a primary factor in hallucination generation. Existing hallucination detection methodologies predominantly emphasize either the external or internal mechanism in isolation, thereby overlooking their synergistic effects. The recently proposed ReDeEP framework decouples these dual mechanisms, identifying two critical contributors to hallucinations: excessive reliance on parametric knowledge encoded in feed-forward networks (FFN) and insufficient utilization of external information by attention mechanisms (particularly copy heads). ReDeEP quantitatively assesses these factors to detect hallucinations and dynamically modulates the contributions of FFNs and copy heads to attenuate their occurrence. Nevertheless, ReDeEP and numerous other hallucination detection approaches operate at the level of logit-level uncertainty estimation or language-level self-consistency evaluation and inadequately address the semantic dimensions of model responses, resulting in inconsistent hallucination assessments in RAG implementations. Building upon ReDeEP's foundation, this paper introduces SEReDeEP, which enhances computational processes through semantic entropy captured via trained linear probes, thereby achieving hallucination assessments that more accurately reflect ground truth evaluations.
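Summary: Retrieval-Augmented Generation (RAG) models frequently hallucinate when integrating external information with internal parametric knowledge. Empirical studies show that the imbalance between external contextual information and internal parametric knowledge is a primary factor in hallucination. Existing hallucination detection methods mostly emphasize either the external or the internal mechanism in isolation, overlooking their synergistic effects. The recently proposed ReDeEP framework decouples these two mechanisms and identifies two key contributors to hallucination: excessive reliance on parametric knowledge encoded in feed-forward networks (FFN) and insufficient use of external information by attention mechanisms (particularly copy heads). ReDeEP quantitatively assesses these factors to detect hallucinations and dynamically modulates the contributions of FFNs and copy heads to reduce their occurrence. Nevertheless, ReDeEP and many other hallucination detection approaches operate at the level of logit-level uncertainty estimation or language-level self-consistency evaluation and do not adequately address the semantic dimensions of model responses, leading to inconsistent hallucination assessments in RAG implementations. Building on ReDeEP, this paper introduces SEReDeEP, which enhances the computational process with semantic entropy captured by trained linear probes, achieving hallucination assessments that more accurately reflect ground truth evaluations.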

Title: A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models

Authors: Junjie Ye, Caishuang Huang, Zhuohan Chen, Wenjie Fu, Chenyuan Yang, Leyi Yang, Yilong Wu, Peng Wang, Meng Zhou, Xiaolong Yang, Tao Gui, Qi Zhang, Zhongchao Shi, Jianping Fan, Xuanjing Huang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07591
Pdf URL: https://arxiv.org/pdf/2505.07591
Copy Paste: [[2505.07591]] A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models(https://arxiv.org/abs/2505.07591)
Keywords: language model, llm, prompt
Abstract: Instruction following evaluates large language models (LLMs) on their ability to generate outputs that adhere to user-defined constraints. However, existing benchmarks often rely on templated constraint prompts, which lack the diversity of real-world usage and limit fine-grained performance assessment. To fill this gap, we propose a multi-dimensional constraint framework encompassing three constraint patterns, four constraint categories, and four difficulty levels. Building on this framework, we develop an automated instruction generation pipeline that performs constraint expansion, conflict detection, and instruction rewriting, yielding 1,200 code-verifiable instruction-following test samples. We evaluate 19 LLMs across seven model families and uncover substantial variation in performance across constraint forms. For instance, average performance drops from 77.67% at Level I to 32.96% at Level IV. Furthermore, we demonstrate the utility of our approach by using it to generate data for reinforcement learning, achieving substantial gains in instruction following without degrading general performance. In-depth analysis indicates that these gains stem primarily from modifications in the parameters of the model's attention modules, which enhance constraint recognition and adherence. Code and data are available at this https URL.
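Summary: Instruction following evaluates large language models (LLMs) on their ability to generate outputs that adhere to user-defined constraints. However, existing benchmarks often rely on templated constraint prompts, which lack the diversity of real-world usage and limit fine-grained performance assessment. To fill this gap, we propose a multi-dimensional constraint framework covering three constraint patterns, four constraint categories, and four difficulty levels. Building on this framework, we develop an automated instruction generation pipeline that performs constraint expansion, conflict detection, and instruction rewriting, producing 1,200 code-verifiable instruction-following test samples. We evaluate 19 LLMs across seven model families and find substantial variation in performance across constraint forms. For example, average performance drops from 77.67% at Level I to 32.96% at Level IV. We also demonstrate the utility of our approach by using it to generate data for reinforcement learning, achieving substantial gains in instruction following without degrading general performance. In-depth analysis indicates that these gains stem mainly from modifications to the parameters of the model's attention modules, which improve constraint recognition and adherence. Code and data are available at this https URL.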

Title: Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent

Authors: Ziyang Huang, Xiaowei Yuan, Yiming Ju, Jun Zhao, Kang Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07596
Pdf URL: https://arxiv.org/pdf/2505.07596
Copy Paste: [[2505.07596]] Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent(https://arxiv.org/abs/2505.07596)
Keywords: language model, llm, hallucination, retrieval-augmented generation, agent
Abstract: Retrieval-augmented generation (RAG) is a common strategy to reduce hallucinations in Large Language Models (LLMs). While reinforcement learning (RL) can enable LLMs to act as search agents by activating retrieval capabilities, existing approaches often underutilize the models' internal knowledge. This can lead to redundant retrievals, potentially harmful knowledge conflicts, and increased inference latency. To address these limitations, an efficient and adaptive search agent capable of discerning optimal retrieval timing and synergistically integrating parametric (internal) and retrieved (external) knowledge is urgently needed. This paper introduces the Reinforced Internal-External Knowledge Synergistic Reasoning Agent (IKEA), which can identify its own knowledge boundary and prioritize the utilization of internal knowledge, resorting to external search only when internal knowledge is deemed insufficient. This is achieved using a novel knowledge-boundary aware reward function and a knowledge-boundary aware training dataset. These are designed for internal-external knowledge synergy oriented RL, incentivizing the model to deliver accurate answers, minimize unnecessary retrievals, and encourage appropriate external searches when its own knowledge is lacking. Evaluations across multiple knowledge reasoning tasks demonstrate that IKEA significantly outperforms baseline methods, reduces retrieval frequency significantly, and exhibits robust generalization capabilities.
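Summary: Retrieval-augmented generation (RAG) is a common strategy for reducing hallucinations in Large Language Models (LLMs). Although reinforcement learning (RL) can enable LLMs to act as search agents by activating retrieval capabilities, existing approaches often underutilize the models' internal knowledge. This can lead to redundant retrievals, potentially harmful knowledge conflicts, and increased inference latency. To address these limitations, an efficient and adaptive search agent that can discern the optimal time to retrieve and synergistically integrate parametric (internal) and retrieved (external) knowledge is urgently needed. This paper introduces the Reinforced Internal-External Knowledge Synergistic Reasoning Agent (IKEA), which can identify its own knowledge boundary and prioritize the use of internal knowledge, resorting to external search only when internal knowledge is deemed insufficient. This is achieved with a novel knowledge-boundary aware reward function and a knowledge-boundary aware training dataset. These are designed for RL oriented towards internal-external knowledge synergy, incentivizing the model to deliver accurate answers, minimize unnecessary retrievals, and perform appropriate external searches when its own knowledge is lacking. Evaluations across multiple knowledge reasoning tasks show that IKEA significantly outperforms baseline methods, significantly reduces retrieval frequency, and exhibits robust generalization.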

Title: Characterizing the Investigative Methods of Fictional Detectives with Large Language Models

Authors: Edirlei Soares de Lima, Marco A. Casanova, Bruno Feijó, Antonio L. Furtado
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07601
Pdf URL: https://arxiv.org/pdf/2505.07601
Copy Paste: [[2505.07601]] Characterizing the Investigative Methods of Fictional Detectives with Large Language Models(https://arxiv.org/abs/2505.07601)
Keywords: language model, llm
Abstract: Detective fiction, a genre defined by its complex narrative structures and character-driven storytelling, presents unique challenges for computational narratology, a research field focused on integrating literary theory into automated narrative generation. While traditional literary studies have offered deep insights into the methods and archetypes of fictional detectives, these analyses often focus on a limited number of characters and lack the scalability needed for the extraction of unique traits that can be used to guide narrative generation methods. In this paper, we present an AI-driven approach for systematically characterizing the investigative methods of fictional detectives. Our multi-phase workflow explores the capabilities of 15 Large Language Models (LLMs) to extract, synthesize, and validate distinctive investigative traits of fictional detectives. This approach was tested on a diverse set of seven iconic detectives - Hercule Poirot, Sherlock Holmes, William Murdoch, Columbo, Father Brown, Miss Marple, and Auguste Dupin - capturing the distinctive investigative styles that define each character. The identified traits were validated against existing literary analyses and further tested in a reverse identification phase, achieving an overall accuracy of 91.43%, demonstrating the method's effectiveness in capturing the distinctive investigative approaches of each detective. This work contributes to the broader field of computational narratology by providing a scalable framework for character analysis, with potential applications in AI-driven interactive storytelling and automated narrative generation.
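Summary: Detective fiction, a genre defined by its complex narrative structures and character-driven storytelling, poses unique challenges for computational narratology, a research field focused on integrating literary theory into automated narrative generation. Although traditional literary studies have offered deep insights into the methods and archetypes of fictional detectives, these analyses often focus on a limited number of characters and lack the scalability needed to extract unique traits that can guide narrative generation methods. In this paper, we present an AI-driven approach for systematically characterizing the investigative methods of fictional detectives. Our multi-phase workflow explores the capabilities of 15 Large Language Models (LLMs) to extract, synthesize, and validate the distinctive investigative traits of fictional detectives. The approach was tested on a diverse set of seven iconic detectives (Hercule Poirot, Sherlock Holmes, William Murdoch, Columbo, Father Brown, Miss Marple, and Auguste Dupin), capturing the distinctive investigative styles that define each character. The identified traits were validated against existing literary analyses and further tested in a reverse identification phase, achieving an overall accuracy of 91.43% and demonstrating the method's effectiveness in capturing each detective's distinctive investigative approach. This work contributes to the broader field of computational narratology by providing a scalable framework for character analysis, with potential applications in AI-driven interactive storytelling and automated narrative generation.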

Title: MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining

Authors: Xiaomi LLM-Core Team: Bingquan Xia, Bowen Shen, Cici, Dawei Zhu, Di Zhang, Gang Wang, Hailin Zhang, Huaqiu Liu, Jiebao Xiao, Jinhao Dong, Liang Zhao, Peidian Li, Peng Wang, Shihua Yu, Shimao Chen, Weikun Wang, Wenhan Ma, Xiangwei Deng, Yi Huang, Yifan Song, Zihan Jiang, Bowen Ye, Can Cai, Chenhong He, Dong Zhang, Duo Zhang, Guoan Wang, Hao Tian, Haochen Zhao, Heng Qu, Hongshen Xu, Jun Shi, Kainan Bao, QingKai Fang, Kang Zhou, Kangyang Zhou, Lei Li, Menghang Zhu, Nuo Chen, Qiantong Wang, Shaohui Liu, Shicheng Li, Shuhao Gu, Shuhuai Ren, Shuo Liu, Sirui Deng, Weiji Zhuang, Weiwei Lv, Wenyu Yang, Xin Zhang, Xing Yong, Xing Zhang, Xingchen Song, Xinzhe Xu, Xu Wang, Yihan Yan, Yu Tu, Yuanyuan Tian, Yudong Wang, Yue Yu, Zhenru Lin, Zhichao Song, Zihao Yue
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.07608
Pdf URL: https://arxiv.org/pdf/2505.07608
Copy Paste: [[2505.07608]] MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining(https://arxiv.org/abs/2505.07608)
Keywords: language model
Abstract: We present MiMo-7B, a large language model born for reasoning tasks, with optimization across both pre-training and post-training stages. During pre-training, we enhance the data preprocessing pipeline and employ a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for enhanced performance and accelerated inference speed. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and employing strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base possesses exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code and general reasoning tasks, surpassing the performance of OpenAI o1-mini. The model checkpoints are available at this https URL.
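Summary: We present MiMo-7B, a large language model built for reasoning tasks and optimized across both the pre-training and post-training stages. During pre-training, we improve the data preprocessing pipeline and use a three-stage data mixing strategy to strengthen the base model's reasoning potential. MiMo-7B-Base is pre-trained on 25 trillion tokens, with an additional Multi-Token Prediction objective for better performance and faster inference. During post-training, we curate a dataset of 130K verifiable mathematics and programming problems for reinforcement learning, integrating a test-difficulty-driven code-reward scheme to alleviate sparse-reward issues and using strategic data resampling to stabilize training. Extensive evaluations show that MiMo-7B-Base has exceptional reasoning potential, outperforming even much larger 32B models. The final RL-tuned model, MiMo-7B-RL, achieves superior performance on mathematics, code, and general reasoning tasks, surpassing OpenAI o1-mini. The model checkpoints are available at this https URL.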

Title: Concept-Level Explainability for Auditing & Steering LLM Responses

Authors: Kenza Amara, Rita Sevastjanova, Mennatallah El-Assady
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.07610
Pdf URL: https://arxiv.org/pdf/2505.07610
Copy Paste: [[2505.07610]] Concept-Level Explainability for Auditing & Steering LLM Responses(https://arxiv.org/abs/2505.07610)
Keywords: language model, llm, prompt
Abstract: As large language models (LLMs) become widely deployed, concerns about their safety and alignment grow. An approach to steer LLM behavior, such as mitigating biases or defending against jailbreaks, is to identify which parts of a prompt influence specific aspects of the model's output. Token-level attribution methods offer a promising solution, but still struggle in text generation, explaining the presence of each token in the output separately, rather than the underlying semantics of the entire LLM response. We introduce ConceptX, a model-agnostic, concept-level explainability method that identifies the concepts, i.e., semantically rich tokens in the prompt, and assigns them importance based on the outputs' semantic similarity. Unlike current token-level methods, ConceptX also offers to preserve context integrity through in-place token replacements and supports flexible explanation goals, e.g., gender bias. ConceptX enables both auditing, by uncovering sources of bias, and steering, by modifying prompts to shift the sentiment or reduce the harmfulness of LLM responses, without requiring retraining. Across three LLMs, ConceptX outperforms token-level methods like TokenSHAP in both faithfulness and human alignment. Steering tasks boost sentiment shift by 0.252 versus 0.131 for random edits and lower attack success rates from 0.463 to 0.242, outperforming attribution and paraphrasing baselines. While prompt engineering and self-explaining methods sometimes yield safer responses, ConceptX offers a transparent and faithful alternative for improving LLM safety and alignment, demonstrating the practical value of attribution-based explainability in guiding LLM behavior.
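Summary: As large language models (LLMs) become widely deployed, concerns about their safety and alignment grow. One approach to steering LLM behavior, such as mitigating biases or defending against jailbreaks, is to identify which parts of a prompt influence specific aspects of the model's output. Token-level attribution methods offer a promising solution but still struggle with text generation, because they explain the presence of each output token separately rather than the underlying semantics of the entire LLM response. We introduce ConceptX, a model-agnostic, concept-level explainability method that identifies the concepts, i.e., semantically rich tokens in the prompt, and assigns them importance based on the semantic similarity of the outputs. Unlike current token-level methods, ConceptX also preserves context integrity through in-place token replacements and supports flexible explanation goals, e.g., gender bias. ConceptX enables both auditing, by uncovering sources of bias, and steering, by modifying prompts to shift the sentiment or reduce the harmfulness of LLM responses, without requiring retraining. Across three LLMs, ConceptX outperforms token-level methods such as TokenSHAP in both faithfulness and human alignment. Steering tasks boost sentiment shift by 0.252 versus 0.131 for random edits and lower attack success rates from 0.463 to 0.242, outperforming attribution and paraphrasing baselines. Although prompt engineering and self-explaining methods sometimes yield safer responses, ConceptX offers a transparent and faithful alternative for improving LLM safety and alignment, demonstrating the practical value of attribution-based explainability in guiding LLM behavior.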

Title: Chronocept: Instilling a Sense of Time in Machines

Authors: Krish Goel, Sanskar Pandey, KS Mahadevan, Harsh Kumar, Vishesh Khadaria
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.07637
Pdf URL: https://arxiv.org/pdf/2505.07637
Copy Paste: [[2505.07637]] Chronocept: Instilling a Sense of Time in Machines(https://arxiv.org/abs/2505.07637)
Keywords: retrieval-augmented generation, agent
Abstract: Human cognition is deeply intertwined with a sense of time, known as Chronoception. This sense allows us to judge how long facts remain valid and when knowledge becomes outdated. Despite progress in vision, language, and motor control, AI still struggles to reason about temporal validity. We introduce Chronocept, the first benchmark to model temporal validity as a continuous probability distribution over time. Using skew-normal curves fitted along semantically decomposed temporal axes, Chronocept captures nuanced patterns of emergence, decay, and peak relevance. It includes two datasets: Benchmark I (atomic facts) and Benchmark II (multi-sentence passages). Annotations show strong inter-annotator agreement (84% and 89%). Our baselines predict curve parameters - location, scale, and skewness - enabling interpretable, generalizable learning and outperforming classification-based approaches. Chronocept fills a foundational gap in AI's temporal reasoning, supporting applications in knowledge grounding, fact-checking, retrieval-augmented generation (RAG), and proactive agents. Code and data are publicly available.
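Summary: Human cognition is deeply intertwined with a sense of time, known as Chronoception. This sense lets us judge how long facts remain valid and when knowledge becomes outdated. Despite progress in vision, language, and motor control, AI still struggles to reason about temporal validity. We introduce Chronocept, the first benchmark to model temporal validity as a continuous probability distribution over time. Using skew-normal curves fitted along semantically decomposed temporal axes, Chronocept captures nuanced patterns of emergence, decay, and peak relevance. It includes two datasets: Benchmark I (atomic facts) and Benchmark II (multi-sentence passages). Annotations show strong inter-annotator agreement (84% and 89%). Our baselines predict the curve parameters (location, scale, and skewness), enabling interpretable, generalizable learning and outperforming classification-based approaches. Chronocept fills a foundational gap in AI's temporal reasoning, supporting applications in knowledge grounding, fact-checking, retrieval-augmented generation (RAG), and proactive agents. Code and data are publicly available.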
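
For readers unfamiliar with the curve family named in this entry, the three predicted parameters (location, scale, skewness) fully determine a skew-normal density. The sketch below evaluates that density with the standard textbook formula; the symbol names `xi`, `omega`, and `alpha` are conventional notation rather than names taken from the paper, and how Chronocept maps real time onto the x-axis is not stated in the abstract, so the axis here is a generic number line.

```python
import math


def skew_normal_pdf(x: float, xi: float, omega: float, alpha: float) -> float:
    """Skew-normal density with location xi, scale omega > 0, and shape alpha.

    f(x) = (2 / omega) * phi(z) * Phi(alpha * z), where z = (x - xi) / omega,
    phi is the standard normal pdf and Phi is the standard normal cdf.
    """
    if omega <= 0:
        raise ValueError("omega must be positive")
    z = (x - xi) / omega
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    big_phi = 0.5 * (1.0 + math.erf(alpha * z / math.sqrt(2.0)))
    return 2.0 / omega * phi * big_phi


# With alpha = 0 the curve reduces to an ordinary normal density.
symmetric_peak = skew_normal_pdf(0.0, xi=0.0, omega=1.0, alpha=0.0)

# With alpha > 0 more mass sits to the right of the location parameter.
right_of_location = skew_normal_pdf(1.0, xi=0.0, omega=1.0, alpha=4.0)
left_of_location = skew_normal_pdf(-1.0, xi=0.0, omega=1.0, alpha=4.0)
print(symmetric_peak, right_of_location, left_of_location)
```

A positive `alpha` gives a sharp onset followed by a slower tail to the right, and a negative `alpha` gives the mirror image, which is one way a single curve can express both emergence and decay.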

Title: JobHop: A Large-Scale Dataset of Career Trajectories

Authors: Iman Johary, Raphael Romero, Alexandru C. Mara, Tijl De Bie
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.07653
Pdf URL: https://arxiv.org/pdf/2505.07653
Copy Paste: [[2505.07653]] JobHop: A Large-Scale Dataset of Career Trajectories(https://arxiv.org/abs/2505.07653)
Keywords: language model, llm
Abstract: Understanding labor market dynamics is essential for policymakers, employers, and job seekers. However, comprehensive datasets that capture real-world career trajectories are scarce. In this paper, we introduce JobHop, a large-scale public dataset derived from anonymized resumes provided by VDAB, the public employment service in Flanders, Belgium. Utilizing Large Language Models (LLMs), we process unstructured resume data to extract structured career information, which is then mapped to standardized ESCO occupation codes using a multi-label classification model. This results in a rich dataset of over 2.3 million work experiences, extracted from and grouped into more than 391,000 user resumes and mapped to standardized ESCO occupation codes, offering valuable insights into real-world occupational transitions. This dataset enables diverse applications, such as analyzing labor market mobility, job stability, and the effects of career breaks on occupational transitions. It also supports career path prediction and other data-driven decision-making processes. To illustrate its potential, we explore key dataset characteristics, including job distributions, career breaks, and job transitions, demonstrating its value for advancing labor market research.
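Summary: Understanding labor market dynamics is essential for policymakers, employers, and job seekers. However, comprehensive datasets that capture real-world career trajectories are scarce. In this paper, we introduce JobHop, a large-scale public dataset derived from anonymized resumes provided by VDAB, the public employment service in Flanders, Belgium. Using Large Language Models (LLMs), we process unstructured resume data to extract structured career information, which is then mapped to standardized ESCO occupation codes with a multi-label classification model. The result is a rich dataset of over 2.3 million work experiences, extracted from and grouped into more than 391,000 user resumes and mapped to standardized ESCO occupation codes, offering valuable insights into real-world occupational transitions. The dataset enables diverse applications, such as analyzing labor market mobility, job stability, and the effects of career breaks on occupational transitions. It also supports career path prediction and other data-driven decision-making processes. To illustrate its potential, we explore key dataset characteristics, including job distributions, career breaks, and job transitions, demonstrating its value for advancing labor market research.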

Title: Benchmarking Retrieval-Augmented Generation for Chemistry

Authors: Xianrui Zhong, Bowen Jin, Siru Ouyang, Yanzhen Shen, Qiao Jin, Yin Fang, Zhiyong Lu, Jiawei Han
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.07671
Pdf URL: https://arxiv.org/pdf/2505.07671
Copy Paste: [[2505.07671]] Benchmarking Retrieval-Augmented Generation for Chemistry(https://arxiv.org/abs/2505.07671)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) has emerged as a powerful framework for enhancing large language models (LLMs) with external knowledge, particularly in scientific domains that demand specialized and dynamic information. Despite its promise, the application of RAG in the chemistry domain remains underexplored, primarily due to the lack of high-quality, domain-specific corpora and well-curated evaluation benchmarks. In this work, we introduce ChemRAG-Bench, a comprehensive benchmark designed to systematically assess the effectiveness of RAG across a diverse set of chemistry-related tasks. The accompanying chemistry corpus integrates heterogeneous knowledge sources, including scientific literature, the PubChem database, PubMed abstracts, textbooks, and Wikipedia entries. In addition, we present ChemRAG-Toolkit, a modular and extensible RAG toolkit that supports five retrieval algorithms and eight LLMs. Using ChemRAG-Toolkit, we demonstrate that RAG yields a substantial performance gain -- achieving an average relative improvement of 17.4% over direct inference methods. We further conduct in-depth analyses on retriever architectures, corpus selection, and the number of retrieved passages, culminating in practical recommendations to guide future research and deployment of RAG systems in the chemistry domain. The code and data are available at this https URL.
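Summary: Retrieval-augmented generation (RAG) has emerged as a powerful framework for enhancing large language models (LLMs) with external knowledge, particularly in scientific domains that demand specialized and dynamic information. Despite its promise, the application of RAG in chemistry remains underexplored, mainly because high-quality, domain-specific corpora and well-curated evaluation benchmarks are lacking. In this work, we introduce ChemRAG-Bench, a comprehensive benchmark designed to systematically assess the effectiveness of RAG across a diverse set of chemistry-related tasks. The accompanying chemistry corpus integrates heterogeneous knowledge sources, including scientific literature, the PubChem database, PubMed abstracts, textbooks, and Wikipedia entries. In addition, we present ChemRAG-Toolkit, a modular and extensible RAG toolkit that supports five retrieval algorithms and eight LLMs. Using ChemRAG-Toolkit, we show that RAG yields a substantial performance gain, with an average relative improvement of 17.4% over direct inference methods. We further conduct in-depth analyses of retriever architectures, corpus selection, and the number of retrieved passages, culminating in practical recommendations to guide future research and deployment of RAG systems in the chemistry domain. The code and data are available at this https URL.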

Title: OnPrem.LLM: A Privacy-Conscious Document Intelligence Toolkit

Authors: Arun S. Maiya
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.07672
Pdf URL: https://arxiv.org/pdf/2505.07672
Copy Paste: [[2505.07672]] OnPrem.LLM: A Privacy-Conscious Document Intelligence Toolkit(https://arxiv.org/abs/2505.07672)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: We present OnPrem.LLM, a Python-based toolkit for applying large language models (LLMs) to sensitive, non-public data in offline or restricted environments. The system is designed for privacy-preserving use cases and provides prebuilt pipelines for document processing and storage, retrieval-augmented generation (RAG), information extraction, summarization, classification, and prompt/output processing with minimal configuration. OnPrem.LLM supports multiple LLM backends -- including this http URL, Ollama, vLLM, and Hugging Face Transformers -- with quantized model support, GPU acceleration, and seamless backend switching. Although designed for fully local execution, OnPrem.LLM also supports integration with a wide range of cloud LLM providers when permitted, enabling hybrid deployments that balance performance with data control. A no-code web interface extends accessibility to non-technical users.
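Summary: We present OnPrem.LLM, a Python-based toolkit for applying large language models (LLMs) to sensitive, non-public data in offline or restricted environments. The system is designed for privacy-preserving use cases and provides prebuilt pipelines for document processing and storage, retrieval-augmented generation (RAG), information extraction, summarization, classification, and prompt/output processing with minimal configuration. OnPrem.LLM supports multiple LLM backends, including this http URL, Ollama, vLLM, and Hugging Face Transformers, with quantized model support, GPU acceleration, and seamless backend switching. Although designed for fully local execution, OnPrem.LLM also supports integration with a wide range of cloud LLM providers when permitted, enabling hybrid deployments that balance performance with data control. A no-code web interface extends accessibility to non-technical users.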

Title: Codifying Character Logic in Role-Playing

Authors: Letian Peng, Jingbo Shang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.07705
Pdf URL: https://arxiv.org/pdf/2505.07705
Copy Paste: [[2505.07705]] Codifying Character Logic in Role-Playing(https://arxiv.org/abs/2505.07705)
Keywords: llm, prompt, agent
Abstract: This paper introduces Codified Profiles for role-playing, a novel approach that represents character logic as structured, executable functions for behavioral decision-making. Each profile defines a set of functions parse_by_scene(scene) that outputs a list of logic-grounded assertions triggered_statements, using both explicit control structures (e.g., if-then-else) and condition checks like check_condition(scene, question), where each question is a semantically meaningful prompt about the scene (e.g., "Is the character in danger?") discriminated by the role-playing LLM as true, false, or unknown. This explicit representation offers three key advantages over traditional prompt-based profiles, which append character descriptions directly into text prompts: (1) Persistence, by enforcing complete and consistent execution of character logic, rather than relying on the model's implicit reasoning; (2) Updatability, through systematic inspection and revision of behavioral logic, which is difficult to track or debug in prompt-only approaches; (3) Controllable Randomness, by supporting stochastic behavior directly within the logic, enabling fine-grained variability that prompting alone struggles to achieve. To validate these advantages, we introduce a new benchmark constructed from 83 characters and 5,141 scenes curated from Fandom, using NLI-based scoring to compare character responses against ground-truth actions. Our experiments demonstrate the significant benefits of codified profiles in improving persistence, updatability, and behavioral diversity. Notably, by offloading a significant portion of reasoning to preprocessing, codified profiles enable even 1B-parameter models to perform high-quality role-playing, providing a scalable and efficient foundation for local deployment of role-play agents.

Title: Spoken Language Understanding on Unseen Tasks With In-Context Learning

Authors: Neeraj Agrawal, Sriram Ganapathy
Subjects: cs.CL, cs.LG, eess.AS
Abstract URL: https://arxiv.org/abs/2505.07731
Pdf URL: https://arxiv.org/pdf/2505.07731
Copy Paste: [[2505.07731]] Spoken Language Understanding on Unseen Tasks With In-Context Learning(https://arxiv.org/abs/2505.07731)
Keywords: language model, llm
Abstract: Spoken language understanding (SLU) tasks involve diverse skills that probe the information extraction, classification and/or generation capabilities of models. In this setting, task-specific training data may not always be available. While traditional task-specific SLU models are unable to cater to such requirements, the speech-text large language models (LLMs) offer a promising alternative with emergent abilities. However, our evaluations indicate that, out of the box, the zero/few-shot performance of prominent open-source speech-text LLMs on SLU tasks is not up to the mark. In this paper, we introduce a novel approach to robust task-agnostic fine-tuning using randomized class labels. With this proposed fine-tuning, we illustrate that the performance of the speech-text LLMs on an unseen task is significantly improved over standard approaches. Critically, the proposed approach avoids the requirement of task-specific data annotations for enabling new tasks in speech-text LLMs.

Title: Must Read: A Systematic Survey of Computational Persuasion

Authors: Nimet Beyza Bozdag, Shuhaib Mehri, Xiaocheng Yang, Hyeonjeong Ha, Zirui Cheng, Esin Durmus, Jiaxuan You, Heng Ji, Gokhan Tur, Dilek Hakkani-Tür
Subjects: cs.CL, cs.AI, cs.CY
Abstract URL: https://arxiv.org/abs/2505.07775
Pdf URL: https://arxiv.org/pdf/2505.07775
Copy Paste: [[2505.07775]] Must Read: A Systematic Survey of Computational Persuasion(https://arxiv.org/abs/2505.07775)
Keywords: language model
Abstract: Persuasion is a fundamental aspect of communication, influencing decision-making across diverse contexts, from everyday conversations to high-stakes scenarios such as politics, marketing, and law. The rise of conversational AI systems has significantly expanded the scope of persuasion, introducing both opportunities and risks. AI-driven persuasion can be leveraged for beneficial applications, but also poses threats through manipulation and unethical influence. Moreover, AI systems are not only persuaders, but also susceptible to persuasion, making them vulnerable to adversarial attacks and bias reinforcement. Despite rapid advancements in AI-generated persuasive content, our understanding of what makes persuasion effective remains limited due to its inherently subjective and context-dependent nature. In this survey, we provide a comprehensive overview of computational persuasion, structured around three key perspectives: (1) AI as a Persuader, which explores AI-generated persuasive content and its applications; (2) AI as a Persuadee, which examines AI's susceptibility to influence and manipulation; and (3) AI as a Persuasion Judge, which analyzes AI's role in evaluating persuasive strategies, detecting manipulation, and ensuring ethical persuasion. We introduce a taxonomy for computational persuasion research and discuss key challenges, including evaluating persuasiveness, mitigating manipulative persuasion, and developing responsible AI-driven persuasive systems. Our survey outlines future research directions to enhance the safety, fairness, and effectiveness of AI-powered persuasion while addressing the risks posed by increasingly capable language models.

Title: Domain Regeneration: How well do LLMs match syntactic properties of text domains?

Authors: Da Ju, Hagen Blix, Adina Williams
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.07784
Pdf URL: https://arxiv.org/pdf/2505.07784
Copy Paste: [[2505.07784]] Domain Regeneration: How well do LLMs match syntactic properties of text domains?(https://arxiv.org/abs/2505.07784)
Keywords: language model, llm, prompt
Abstract: Recent improvements in large language model performance have, in all likelihood, been accompanied by improvements in how well they can approximate the distribution of their training data. In this work, we explore the following question: which properties of text domains do LLMs faithfully approximate, and how well do they do so? Applying observational approaches familiar from corpus linguistics, we prompt a commonly used, open-source LLM to regenerate text from two domains of permissively licensed English text which are often contained in LLM training data -- Wikipedia and news text. This regeneration paradigm allows us to investigate whether LLMs can faithfully match the original human text domains in a fairly semantically-controlled setting. We investigate varying levels of syntactic abstraction, from simpler properties like sentence length and article readability, to more complex and higher-order properties such as dependency tag distribution, parse depth, and parse complexity. We find that the majority of the regenerated distributions show a shifted mean, a lower standard deviation, and a reduction of the long tail, as compared to the human originals.
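The kind of comparison reported above (mean, standard deviation, and long tail of a property's distribution) can be sketched for the simplest property mentioned, sentence length. This is an illustrative assumption, not the authors' pipeline: the regex sentence splitter, whitespace tokenization, the 95th-percentile tail measure, and the function names sentence_lengths and summarize are all hypothetical choices made here.

```python
import re
import statistics
from typing import Dict, List


def sentence_lengths(text: str) -> List[int]:
    """Split on sentence-final punctuation and count whitespace-separated tokens."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return [len(s.split()) for s in sentences]


def summarize(lengths: List[int], tail_quantile: float = 0.95) -> Dict[str, float]:
    """Mean, population standard deviation, and an upper-tail value of a sample."""
    ordered = sorted(lengths)
    # Index of the tail quantile, clamped to the last element for small samples.
    idx = min(len(ordered) - 1, int(tail_quantile * len(ordered)))
    return {
        "mean": statistics.mean(lengths),
        "std": statistics.pstdev(lengths),
        "tail": ordered[idx],
    }


# Toy length samples (made up for illustration, not data from the paper).
human = summarize([5, 8, 12, 40])
regenerated = summarize([9, 10, 11, 12])
print(human)
print(regenerated)
```

Comparing the two summaries field by field shows whether a regenerated sample has a narrower spread or a shorter tail than its human counterpart for the chosen property.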

Title: Learning Dynamics in Continual Pre-Training for Large Language Models

Authors: Xingjin Wang, Howe Tissue, Lu Wang, Linjing Li, Daniel Dajun Zeng
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.07796
Pdf URL: https://arxiv.org/pdf/2505.07796
Copy Paste: [[2505.07796]] Learning Dynamics in Continual Pre-Training for Large Language Models(https://arxiv.org/abs/2505.07796)
Keywords: language model
Abstract: Continual Pre-Training (CPT) has become a popular and effective method to apply strong foundation models to specific downstream tasks. In this work, we explore the learning dynamics throughout the CPT process for large language models. We specifically focus on how general and downstream domain performance evolves at each training step, with domain performance measured via validation losses. We have observed that the CPT loss curve fundamentally characterizes the transition from one curve to another hidden curve, and could be described by decoupling the effects of distribution shift and learning rate annealing. We derive a CPT scaling law that combines the two factors, enabling the prediction of loss at any (continual) training step and across learning rate schedules (LRS) in CPT. Our formulation presents a comprehensive understanding of several critical factors in CPT, including loss potential, peak learning rate, training steps, replay ratio, etc. Moreover, our approach can be adapted to customize training hyper-parameters to different CPT goals such as balancing general and domain-specific performance. Extensive experiments demonstrate that our scaling law holds across various CPT datasets and training hyper-parameters.