2025-05-23

Title: BR-TaxQA-R: A Dataset for Question Answering with References for Brazilian Personal Income Tax Law, including case law

Authors: Juvenal Domingos Júnior, Augusto Faria, E. Seiti de Oliveira, Erick de Brito, Matheus Teotonio, Andre Assumpção, Diedre Carmo, Roberto Lotufo, Jayr Pereira
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15916
Pdf URL: https://arxiv.org/pdf/2505.15916
Copy Paste: [[2505.15916]] BR-TaxQA-R: A Dataset for Question Answering with References for Brazilian Personal Income Tax Law, including case law(https://arxiv.org/abs/2505.15916)
Keywords: gpt, chat, retrieval-augmented generation
Abstract: This paper presents BR-TaxQA-R, a novel dataset designed to support question answering with references in the context of Brazilian personal income tax law. The dataset contains 715 questions from the 2024 official Q&A document published by Brazil's Internal Revenue Service, enriched with statutory norms and administrative rulings from the Conselho Administrativo de Recursos Fiscais (CARF). We implement a Retrieval-Augmented Generation (RAG) pipeline using OpenAI embeddings for searching and GPT-4o-mini for answer generation. We compare different text segmentation strategies and benchmark our system against commercial tools such as ChatGPT and this http URL using RAGAS-based metrics. Results show that our custom RAG pipeline outperforms commercial systems in Response Relevancy, indicating stronger alignment with user queries, while commercial models achieve higher scores in Factual Correctness and fluency. These findings highlight a trade-off between legally grounded generation and linguistic fluency. Crucially, we argue that human expert evaluation remains essential to ensure the legal validity of AI-generated answers in high-stakes domains such as taxation. BR-TaxQA-R is publicly available at this https URL.

Title: Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization

Authors: Aliakbar Nafar, Kristen Brent Venable, Zijun Cui, Parisa Kordjamshidi
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.15918
Pdf URL: https://arxiv.org/pdf/2505.15918
Copy Paste: [[2505.15918]] Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization(https://arxiv.org/abs/2505.15918)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have demonstrated potential as factual knowledge bases; however, their capability to generate probabilistic knowledge about real-world events remains understudied. This paper investigates using probabilistic knowledge inherent in LLMs to derive probability estimates for statements concerning events and their interrelationships captured via a Bayesian Network (BN). Using LLMs in this context allows for the parameterization of BNs, enabling probabilistic modeling within specific domains. Experiments on eighty publicly available Bayesian Networks, from healthcare to finance, demonstrate that querying LLMs about the conditional probabilities of events provides meaningful results when compared to baselines, including random and uniform distributions, as well as approaches based on next-token generation probabilities. We explore how these LLM-derived distributions can serve as expert priors to refine distributions extracted from minimal data, significantly reducing systematic biases. Overall, this work introduces a promising strategy for automatically constructing Bayesian Networks by combining probabilistic knowledge extracted from LLMs with small amounts of real-world data. Additionally, we evaluate several prompting strategies for eliciting probabilistic knowledge from LLMs and establish the first comprehensive baseline for assessing LLM performance in extracting probabilistic knowledge.
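
The idea of using LLM-derived distributions as expert priors for sparse data can be illustrated with a standard Dirichlet-style pseudo-count update. This is only a minimal sketch under stated assumptions, not necessarily the procedure used in the paper: the function name, the `prior_strength` weight, and the example numbers are all hypothetical.

```python
def posterior_mean(prior_probs, counts, prior_strength=10.0):
    """Combine an LLM-elicited prior with observed counts for one CPT row.

    prior_probs: probabilities the LLM gave for each state (should sum to 1).
    counts: how often each state was observed in the small dataset.
    prior_strength: hypothetical pseudo-count weight given to the prior.
    """
    if len(prior_probs) != len(counts):
        raise ValueError("prior and counts must cover the same states")
    total = prior_strength + sum(counts)
    # Each state gets prior pseudo-counts plus its real observations.
    return [(prior_strength * p + n) / total for p, n in zip(prior_probs, counts)]


# Hypothetical CPT row with two states: an LLM prior and only 4 observed cases.
llm_prior = [0.3, 0.7]
observed = [3, 1]
row = posterior_mean(llm_prior, observed)
print(row)
```

With few observations the estimate stays close to the prior; as counts grow, the data dominate. A larger `prior_strength` expresses more trust in the LLM's estimate.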

Title: Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition

Authors: Dong Won Lee, Hae Won Park, Cynthia Breazeal, Louis-Philippe Morency
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15922
Pdf URL: https://arxiv.org/pdf/2505.15922
Copy Paste: [[2505.15922]] Aligning Dialogue Agents with Global Feedback via Large Language Model Reward Decomposition(https://arxiv.org/abs/2505.15922)
Keywords: language model, llm, prompt, agent
Abstract: We propose a large language model based reward decomposition framework for aligning dialogue agents using only a single session-level feedback signal. We leverage the reasoning capabilities of a frozen, pretrained large language model (LLM) to infer fine-grained local implicit rewards by decomposing global, session-level feedback. Our first text-only variant prompts the LLM to perform reward decomposition using only the dialogue transcript. The second multimodal variant incorporates additional behavioral cues, such as pitch, gaze, and facial affect, expressed as natural language descriptions. These inferred turn-level rewards are distilled into a lightweight reward model, which we utilize for RL-based fine-tuning for dialogue generation. We evaluate both text-only and multimodal variants against state-of-the-art reward decomposition methods and demonstrate notable improvements in human evaluations of conversation quality, suggesting that LLMs are strong reward decomposers that obviate the need for manual reward shaping and granular human feedback.

Title: Citation Parsing and Analysis with Language Models

Authors: Parth Sarin, Juan Pablo Alperin
Subjects: cs.CL, cs.DL, cs.SI
Abstract URL: https://arxiv.org/abs/2505.15948
Pdf URL: https://arxiv.org/pdf/2505.15948
Copy Paste: [[2505.15948]] Citation Parsing and Analysis with Language Models(https://arxiv.org/abs/2505.15948)
Keywords: language model
Abstract: A key type of resource needed to address global inequalities in knowledge production and dissemination is a tool that can support journals in understanding how knowledge circulates. The absence of such a tool has resulted in comparatively less information about networks of knowledge sharing in the Global South. In turn, this gap authorizes the exclusion of researchers and scholars from the South in indexing services, reinforcing colonial arrangements that de-center and minoritize those scholars. In order to support citation network tracking on a global scale, we investigate the capacity of open-weight language models to mark up manuscript citations in an indexable format. We assembled a dataset of matched plaintext and annotated citations from preprints and published research papers. Then, we evaluated a number of open-weight language models on the annotation task. We find that, even out of the box, today's language models achieve high levels of accuracy on identifying the constituent components of each citation, outperforming state-of-the-art methods. Moreover, the smallest model we evaluated, Qwen3-0.6B, can parse all fields with high accuracy in $2^5$ passes, suggesting that post-training is likely to be effective in producing small, robust citation parsing models. Such a tool could greatly improve the fidelity of citation networks and thus meaningfully improve research indexing and discovery, as well as further metascientific research.

Title: Training Step-Level Reasoning Verifiers with Formal Verification Tools

Authors: Ryo Kamoi, Yusen Zhang, Nan Zhang, Sarkar Snigdha Sarathi Das, Rui Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15960
Pdf URL: https://arxiv.org/pdf/2505.15960
Copy Paste: [[2505.15960]] Training Step-Level Reasoning Verifiers with Formal Verification Tools(https://arxiv.org/abs/2505.15960)
Keywords: language model, llm
Abstract: Process Reward Models (PRMs), which provide step-by-step feedback on the reasoning generated by Large Language Models (LLMs), are receiving increasing attention. However, two key research gaps remain: collecting accurate step-level error labels for training typically requires costly human annotation, and existing PRMs are limited to math reasoning problems. In response to these gaps, this paper aims to address the challenges of automatic dataset creation and the generalization of PRMs to diverse reasoning tasks. To achieve this goal, we propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 for formal logic and Isabelle for theorem proof, which provide automatic and accurate verification for symbolic tasks. Using this approach, we synthesize a training dataset with error labels on LLM responses for formal logic and theorem proof tasks without human annotation. Although this data synthesis is feasible only for tasks compatible with formal verification, we observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks. Specifically, PRMs trained with FoVer significantly outperform baseline PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs trained on labels annotated by humans or stronger models, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The datasets, models, and code are provided at this https URL.
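
To make the notion of automatically annotated step-level error labels concrete, the sketch below labels each step of a propositional-logic argument by checking whether it follows from the premises and the previously accepted steps. It is an illustrative stand-in only: it uses a brute-force truth table in plain Python rather than Z3 or Isabelle, and the function names and the example problem are hypothetical, not taken from FoVer.

```python
from itertools import product


def entails(premises, conclusion, variables):
    """True if every assignment satisfying all premises satisfies the conclusion."""
    for values in product([False, True], repeat=len(variables)):
        env = dict(zip(variables, values))
        if all(p(env) for p in premises) and not conclusion(env):
            return False
    return True


def label_steps(premises, steps, variables):
    """Label each reasoning step True (valid) or False (error).

    A step is checked against the premises plus all earlier accepted steps.
    """
    known = list(premises)
    labels = []
    for step in steps:
        ok = entails(known, step, variables)
        labels.append(ok)
        if ok:
            known.append(step)
    return labels


# Hypothetical problem with premises "A -> B" and "A".
premises = [lambda e: (not e["A"]) or e["B"], lambda e: e["A"]]
steps = [
    lambda e: e["B"],  # valid: follows by modus ponens
    lambda e: e["C"],  # error: C is not implied by anything known
]
labels = label_steps(premises, steps, ["A", "B", "C"])
print(labels)
```

Truth-table checking is exponential in the number of variables, which is one reason a real pipeline would delegate this check to a dedicated solver.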

Title: Pre-training Large Memory Language Models with Internal and External Knowledge

Authors: Linxi Zhao, Sofian Zalouk, Christian K. Belardi, Justin Lovelace, Jin Peng Zhou, Kilian Q. Weinberger, Yoav Artzi, Jennifer J. Sun
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.15962
Pdf URL: https://arxiv.org/pdf/2505.15962
Copy Paste: [[2505.15962]] Pre-training Large Memory Language Models with Internal and External Knowledge(https://arxiv.org/abs/2505.15962)
Keywords: language model, llm
Abstract: Neural language models are black boxes -- both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We propose a new class of language models, Large Memory Language Models (LMLM) with a pre-training recipe that stores factual knowledge in both internal weights and an external database. Our approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger, knowledge-dense LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases. This work represents a fundamental shift in how language models interact with and manage factual knowledge.
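
Masking retrieved values out of the training loss can be pictured as a per-token mask applied before averaging the negative log-likelihood. The sketch below is a simplified illustration in plain Python, assuming per-token probabilities are already available; the function name and inputs are hypothetical and do not reproduce the paper's training code.

```python
import math


def masked_nll(token_probs, is_retrieved):
    """Mean negative log-likelihood over tokens that are NOT masked out.

    token_probs: probability the model assigned to each target token.
    is_retrieved: True where the token is a factual value returned by the
                  external database and should not contribute to the loss.
    """
    kept = [-math.log(p) for p, m in zip(token_probs, is_retrieved) if not m]
    if not kept:
        return 0.0  # every token was masked, so there is nothing to learn from
    return sum(kept) / len(kept)


# Hypothetical sequence: the third token is a looked-up value the model
# predicted poorly, but it is excluded from the loss.
probs = [0.9, 0.8, 0.01, 0.7]
mask = [False, False, True, False]
loss = masked_nll(probs, mask)
print(loss)
```

Because the masked token never contributes a gradient, the model is not pushed to memorize that value in its weights.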

Title: Explaining Puzzle Solutions in Natural Language: An Exploratory Study on 6x6 Sudoku

Authors: Anirudh Maiya, Razan Alghamdi, Maria Leonor Pacheco, Ashutosh Trivedi, Fabio Somenzi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.15993
Pdf URL: https://arxiv.org/pdf/2505.15993
Copy Paste: [[2505.15993]] Explaining Puzzle Solutions in Natural Language: An Exploratory Study on 6x6 Sudoku(https://arxiv.org/abs/2505.15993)
Keywords: language model, llm
Abstract: The success of Large Language Models (LLMs) in human-AI collaborative decision-making hinges on their ability to provide trustworthy, gradual, and tailored explanations. Solving complex puzzles, such as Sudoku, offers a canonical example of this collaboration, where clear and customized explanations often hold greater importance than the final solution. In this study, we evaluate the performance of five LLMs in solving and explaining 6x6 Sudoku puzzles. While one LLM demonstrates limited success in solving puzzles, none can explain the solution process in a manner that reflects strategic reasoning or intuitive problem-solving. These findings underscore significant challenges that must be addressed before LLMs can become effective partners in human-AI collaborative decision-making.

Title: Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model

Authors: Mehrdad Ghassabi, Pedram Rostami, Hamidreza Baradaran Kashani, Amirhossein Poursina, Zahra Kazemi, Milad Tavakoli
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16000
Pdf URL: https://arxiv.org/pdf/2505.16000
Copy Paste: [[2505.16000]] Leveraging Online Data to Enhance Medical Knowledge in a Small Persian Language Model(https://arxiv.org/abs/2505.16000)
Keywords: language model
Abstract: The rapid advancement of language models has demonstrated the potential of artificial intelligence in the healthcare industry. However, small language models struggle with specialized domains in low-resource languages like Persian. While numerous medical-domain websites exist in Persian, no curated dataset or corpus has been available, making ours the first of its kind. This study explores the enhancement of medical knowledge in a small language model by leveraging accessible online data, including a crawled corpus from medical magazines and a dataset of real doctor-patient QA pairs. We fine-tuned a baseline model using our curated data to improve its medical knowledge. Benchmark evaluations demonstrate that the fine-tuned model achieves improved accuracy in medical question answering and provides better responses compared to its baseline. This work highlights the potential of leveraging open-access online data to enrich small language models in medical fields, providing a novel solution for Persian medical AI applications suitable for resource-constrained environments.

Title: Causal Interventions Reveal Shared Structure Across English Filler-Gap Constructions

Authors: Sasha Boguraev, Christopher Potts, Kyle Mahowald
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16002
Pdf URL: https://arxiv.org/pdf/2505.16002
Copy Paste: [[2505.16002]] Causal Interventions Reveal Shared Structure Across English Filler-Gap Constructions(https://arxiv.org/abs/2505.16002)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have emerged as powerful sources of evidence for linguists seeking to develop theories of syntax. In this paper, we argue that causal interpretability methods, applied to LLMs, can greatly enhance the value of such evidence by helping us characterize the abstract mechanisms that LLMs learn to use. Our empirical focus is a set of English filler-gap dependency constructions (e.g., questions, relative clauses). Linguistic theories largely agree that these constructions share many properties. Using experiments based on Distributed Interchange Interventions, we show that LLMs converge on similar abstract analyses of these constructions. These analyses also reveal previously overlooked factors -- relating to frequency, filler type, and surrounding context -- that could motivate changes to standard linguistic theory. Overall, these results suggest that mechanistic, internal analyses of LLMs can push linguistic theory forward.

Title: SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models

Authors: Roland Daynauth, Christopher Clarke, Krisztian Flautner, Lingjia Tang, Jason Mars
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16003
Pdf URL: https://arxiv.org/pdf/2505.16003
Copy Paste: [[2505.16003]] SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models(https://arxiv.org/abs/2505.16003)
Keywords: language model, gpt, llm
Abstract: The LLM-as-a-Judge paradigm offers a scalable, reference-free approach for evaluating language models. Although several calibration techniques have been proposed to better align these evaluators with human judgment, prior studies focus primarily on narrow, well-structured benchmarks. As a result, it remains unclear whether such calibrations generalize to real-world, open-ended tasks. In this work, we show that SOTA calibrated evaluators often fail in these settings, exhibiting weak or even negative correlation with human judgments. To address this, we propose SLMEval, a novel and efficient calibration method based on entropy maximization over a small amount of human preference data. By estimating a latent distribution over model quality and reweighting evaluator scores accordingly, SLMEval achieves strong correlation with human evaluations across two real-world production use cases and the public benchmark. For example, on one such task, SLMEval achieves a Spearman correlation of 0.57 with human judgments, while G-Eval yields a negative correlation. In addition, SLMEval reduces evaluation costs by 5-30x compared to GPT-4-based calibrated evaluators such as G-Eval.
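
The agreement figure quoted in this abstract is a Spearman rank correlation between evaluator scores and human judgments. For readers unfamiliar with the metric, here is a small self-contained sketch of how it is computed (Pearson correlation over average ranks). The function names are illustrative, and this is the textbook definition rather than the authors' evaluation code.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

A value near 1 means the evaluator orders outputs the same way humans do, near 0 means no monotonic relationship, and a negative value means the evaluator tends to invert the human ordering.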

Title: Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains

Authors: Yash Saxena, Anpur Padia, Mandar S Chaudhary, Kalpa Gunaratna, Srinivasan Parthasarathy, Manas Gaur
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16014
Pdf URL: https://arxiv.org/pdf/2505.16014
Copy Paste: [[2505.16014]] Ranking Free RAG: Replacing Re-ranking with Selection in RAG for Sensitive Domains(https://arxiv.org/abs/2505.16014)
Keywords: llm, retrieval-augmented generation
Abstract: Traditional Retrieval-Augmented Generation (RAG) pipelines rely on similarity-based retrieval and re-ranking, which depend on heuristics such as top-k, and lack explainability, interpretability, and robustness against adversarial content. To address this gap, we propose a novel method, METEORA, that replaces re-ranking in RAG with a rationale-driven selection approach. METEORA operates in two stages. First, a general-purpose LLM is preference-tuned to generate rationales conditioned on the input query using direct preference optimization. These rationales guide the evidence chunk selection engine, which selects relevant chunks in three stages: pairing individual rationales with corresponding retrieved chunks for local relevance, global selection with elbow detection for adaptive cutoff, and context expansion via neighboring chunks. This process eliminates the need for top-k heuristics. The rationales are also used for a consistency check using a Verifier LLM to detect and filter poisoned or misleading content for safe generation. The framework provides explainable and interpretable evidence flow by using rationales consistently across both selection and verification. Our evaluation across six datasets spanning legal, financial, and academic research domains shows that METEORA improves generation accuracy by 33.34% while using approximately 50% fewer chunks than state-of-the-art re-ranking methods. In adversarial settings, METEORA significantly improves the F1 score from 0.10 to 0.44 over the state-of-the-art perplexity-based defense baseline, demonstrating strong resilience to poisoning attacks. Code available at: this https URL
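
An adaptive cutoff by elbow detection, as opposed to a fixed top-k, can be sketched with a simple largest-gap heuristic over sorted relevance scores. This is an assumption made for illustration: the abstract does not specify which elbow criterion METEORA uses, and the function name and scores below are hypothetical.

```python
def elbow_cutoff(scores):
    """Return how many top-scoring chunks to keep.

    Scores are sorted in descending order and the cut is placed at the
    largest drop between consecutive scores (a simple elbow heuristic).
    """
    ranked = sorted(scores, reverse=True)
    if len(ranked) < 2:
        return len(ranked)
    drops = [ranked[i] - ranked[i + 1] for i in range(len(ranked) - 1)]
    # Keep everything before the largest drop.
    return drops.index(max(drops)) + 1


# Hypothetical chunk-relevance scores: three strong matches, then a sharp fall.
chunk_scores = [0.91, 0.88, 0.86, 0.41, 0.39]
keep = elbow_cutoff(chunk_scores)
print(keep)
```

The number of kept chunks then varies per query with the shape of the score curve instead of being fixed in advance.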

Title: NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning

Authors: Wei Liu, Siya Qi, Xinyu Wang, Chen Qian, Yali Du, Yulan He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16022
Pdf URL: https://arxiv.org/pdf/2505.16022
Copy Paste: [[2505.16022]] NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning(https://arxiv.org/abs/2505.16022)
Keywords: language model
Abstract: Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely based on the final answer part of a language model's output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning framework that requires only standard supervised fine-tuning data with no need for an external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms the model of the same size distilled from large reasoning models such as DeepSeek R1 671B by 7.7 percent. Moreover, the flexibility of NOVER enables new possibilities for optimizing large language models, such as inverse incentive training.

Title: Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the Wild

Authors: Sheshera Mysore, Debarati Das, Hancheng Cao, Bahareh Sarrafzadeh
Subjects: cs.CL, cs.HC
Abstract URL: https://arxiv.org/abs/2505.16023
Pdf URL: https://arxiv.org/pdf/2505.16023
Copy Paste: [[2505.16023]] Prototypical Human-AI Collaboration Behaviors from LLM-Assisted Writing in the Wild(https://arxiv.org/abs/2505.16023)
Keywords: language model, llm, prompt, chat
Abstract: As large language models (LLMs) are used in complex writing workflows, users engage in multi-turn interactions to steer generations to better fit their needs. Rather than passively accepting output, users actively refine, explore, and co-construct text. We conduct a large-scale analysis of this collaborative behavior for users engaged in writing tasks in the wild with two popular AI assistants, Bing Copilot and WildChat. Our analysis goes beyond simple task classification or satisfaction estimation common in prior work and instead characterizes how users interact with LLMs through the course of a session. We identify prototypical behaviors in how users interact with LLMs in prompts following their original request. We refer to these as Prototypical Human-AI Collaboration Behaviors (PATHs) and find that a small group of PATHs explain a majority of the variation seen in user-LLM interaction. These PATHs span users revising intents, exploring texts, posing questions, adjusting style or injecting new content. Next, we find statistically significant correlations between specific writing intents and PATHs, revealing how users' intents shape their collaboration behaviors. We conclude by discussing the implications of our findings on LLM alignment.

Title: OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models

Authors: Burak Erinç Çetin, Yıldırım Özen, Elif Naz Demiryılmaz, Kaan Engür, Cagri Toraman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16036
Pdf URL: https://arxiv.org/pdf/2505.16036
Copy Paste: [[2505.16036]] OpenEthics: A Comprehensive Ethical Evaluation of Open-Source Generative Large Language Models(https://arxiv.org/abs/2505.16036)
Keywords: language model, llm
Abstract: Generative large language models present significant potential but also raise critical ethical concerns. Most studies focus on narrow ethical dimensions and cover a limited diversity of languages and models. To address these gaps, we conduct a broad ethical evaluation of 29 recent open-source large language models using a novel data collection including four ethical aspects: robustness, reliability, safety, and fairness. We analyze model behavior in both a commonly used language, English, and a low-resource language, Turkish. Our aim is to provide a comprehensive ethical assessment and guide safer model development by filling existing gaps in evaluation breadth, language coverage, and model diversity. Our experimental results, based on LLM-as-a-Judge, reveal that optimization efforts for many open-source models appear to have prioritized safety and fairness, and demonstrated good robustness, while reliability remains a concern. We demonstrate that ethical evaluation can be effectively conducted independently of the language used. In addition, models with larger parameter counts tend to exhibit better ethical performance, with Gemma and Qwen models demonstrating the most ethical behavior among those evaluated.

Title: Internal and External Impacts of Natural Language Processing Papers

Authors: Yu Zhang
Subjects: cs.CL, cs.DL
Abstract URL: https://arxiv.org/abs/2505.16061
Pdf URL: https://arxiv.org/pdf/2505.16061
Copy Paste: [[2505.16061]] Internal and External Impacts of Natural Language Processing Papers(https://arxiv.org/abs/2505.16061)
Keywords: language model
Abstract: We investigate the impacts of NLP research published in top-tier conferences (i.e., ACL, EMNLP, and NAACL) from 1979 to 2024. By analyzing citations from research articles and external sources such as patents, media, and policy documents, we examine how different NLP topics are consumed both within the academic community and by the broader public. Our findings reveal that language modeling has the widest internal and external influence, while linguistic foundations have lower impacts. We also observe that internal and external impacts generally align, but topics like ethics, bias, and fairness show significant attention in policy documents with far fewer academic citations. Additionally, external domains exhibit distinct preferences, with patents focusing on practical NLP applications and media and policy documents engaging more with the societal implications of NLP models.

Title: Small Language Models in the Real World: Insights from Industrial Text Classification

Authors: Lujun Li, Lama Sleem, Niccolo' Gentile, Geoffrey Nichil, Radu State
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16078
Pdf URL: https://arxiv.org/pdf/2505.16078
Copy Paste: [[2505.16078]] Small Language Models in the Real World: Insights from Industrial Text Classification(https://arxiv.org/abs/2505.16078)
Keywords: language model, gpt, prompt, chat
Abstract: With the emergence of ChatGPT, Transformer models have significantly advanced text classification and related tasks. Decoder-only models such as Llama exhibit strong performance and flexibility, yet they suffer from inefficiency at inference time due to token-by-token generation, and their effectiveness in text classification tasks depends heavily on prompt quality. Moreover, their substantial GPU resource requirements often limit widespread adoption. Thus, the question of whether smaller language models are capable of effectively handling text classification tasks emerges as a topic of significant interest. However, the selection of appropriate models and methodologies remains largely underexplored. In this paper, we conduct a comprehensive evaluation of prompt engineering and supervised fine-tuning methods for transformer-based text classification. Specifically, we focus on practical industrial scenarios, including email classification, legal document categorization, and the classification of extremely long academic texts. We examine the strengths and limitations of smaller models, with particular attention to both their performance and their efficiency in Video Random-Access Memory (VRAM) utilization, thereby providing valuable insights for the local deployment and application of compact models in industrial settings.

Title: BiasLab: Toward Explainable Political Bias Detection with Dual-Axis Annotations and Rationale Indicators

Authors: KMA Solaiman
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16081
Pdf URL: https://arxiv.org/pdf/2505.16081
Copy Paste: [[2505.16081]] BiasLab: Toward Explainable Political Bias Detection with Dual-Axis Annotations and Rationale Indicators(https://arxiv.org/abs/2505.16081)
Keywords: gpt
Abstract: We present BiasLab, a dataset of 300 political news articles annotated for perceived ideological bias. These articles were selected from a curated 900-document pool covering diverse political events and source biases. Each article is labeled by crowdworkers along two independent scales, assessing sentiment toward the Democratic and Republican parties, and enriched with rationale indicators. The annotation pipeline incorporates targeted worker qualification and was refined through pilot-phase analysis. We quantify inter-annotator agreement, analyze misalignment with source-level outlet bias, and organize the resulting labels into interpretable subsets. Additionally, we simulate annotation using schema-constrained GPT-4o, enabling direct comparison to human labels and revealing mirrored asymmetries, especially in misclassifying subtly right-leaning content. We define two modeling tasks: perception drift prediction and rationale type classification, and report baseline performance to illustrate the challenge of explainable bias detection. BiasLab's rich rationale annotations provide actionable interpretations that facilitate explainable modeling of political bias, supporting the development of transparent, socially aware NLP systems. We release the dataset, annotation schema, and modeling code to encourage research on human-in-the-loop interpretability and the evaluation of explanation effectiveness in real-world settings.

Title: Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning

Authors: Gagan Bhatia, Maxime Peyrard, Wei Zhao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16088
Pdf URL: https://arxiv.org/pdf/2505.16088
Copy Paste: [[2505.16088]] Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning(https://arxiv.org/abs/2505.16088)
Keywords: language model, llm
Abstract: Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future regimes; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates like historical and futuristic dates. Further, we find that the larger the model, the faster the emergent date abstraction that heals date fragments is accomplished. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, typically differing from human interpretation (year → month → day).
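To make the fragmentation idea concrete, here is a minimal Python sketch. It does not reproduce the paper's date fragmentation ratio, whose exact definition is not given in the abstract. The function `toy_fragmentation_ratio` is a hypothetical stand-in that reports the share of date components (year, month, day) that do not survive as exactly one token, and the token lists are typed in by hand rather than produced by a real tokenizer.

```python
def token_spans(tokens):
    """Return the (start, end) character span of each token in the joined string."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def toy_fragmentation_ratio(tokens, component_spans):
    """Share of date components that are NOT exactly one token.

    Hypothetical illustration only; not the metric defined in the paper.
    """
    intact = set(token_spans(tokens))
    broken = sum(1 for span in component_spans if span not in intact)
    return broken / len(component_spans)


# "20250312": year 2025 is chars [0, 4), month 03 is [4, 6), day 12 is [6, 8).
components = [(0, 4), (4, 6), (6, 8)]

# The split quoted in the abstract breaks the year and the month.
fragmented = toy_fragmentation_ratio(["202", "503", "12"], components)
# A component-aligned split leaves every component intact.
aligned = toy_fragmentation_ratio(["2025", "03", "12"], components)
print(fragmented, aligned)
```

Under this toy definition, the split from the abstract leaves only the day component intact, while the component-aligned split breaks nothing.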

Title: Continually Self-Improving Language Models for Bariatric Surgery Question--Answering

Authors: Yash Kumar Atri, Thomas H Shin, Thomas Hartvigsen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16102
Pdf URL: https://arxiv.org/pdf/2505.16102
Copy Paste: [[2505.16102]] Continually Self-Improving Language Models for Bariatric Surgery Question--Answering(https://arxiv.org/abs/2505.16102)
Keywords: language model, llm, retrieval-augmented generation
Abstract: While bariatric and metabolic surgery (MBS) is considered the gold standard treatment for severe and morbid obesity, its therapeutic efficacy hinges upon active and longitudinal engagement with multidisciplinary providers, including surgeons, dietitians/nutritionists, psychologists, and endocrinologists. This engagement spans the entire patient journey, from preoperative preparation to long-term postoperative management. However, this process is often hindered by numerous healthcare disparities, such as logistical and access barriers, which impair easy patient access to timely, evidence-based, clinician-endorsed information. To address these gaps, we introduce bRAGgen, a novel adaptive retrieval-augmented generation (RAG)-based model that autonomously integrates real-time medical evidence when response confidence dips below dynamic thresholds. This self-updating architecture ensures that responses remain current and accurate, reducing the risk of misinformation. Additionally, we present bRAGq, a curated dataset of 1,302 bariatric surgery-related questions, validated by an expert bariatric surgeon. bRAGq constitutes the first large-scale, domain-specific benchmark for comprehensive MBS care. In a two-phase evaluation, bRAGgen is benchmarked against state-of-the-art models using both large language model (LLM)-based metrics and expert surgeon review. Across all evaluation dimensions, bRAGgen demonstrates substantially superior performance in generating clinically accurate and relevant responses.

Title: Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models

Authors: Yue Li, Xin Yi, Dongsheng Shi, Gerard de Melo, Xiaoling Wang, Linlin Wang
Subjects: cs.CL, cs.CV, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16104
Pdf URL: https://arxiv.org/pdf/2505.16104
Copy Paste: [[2505.16104]] Hierarchical Safety Realignment: Lightweight Restoration of Safety in Pruned Large Vision-Language Models(https://arxiv.org/abs/2505.16104)
Keywords: language model
Abstract: With the increasing size of Large Vision-Language Models (LVLMs), network pruning techniques aimed at compressing models for deployment in resource-constrained environments have garnered significant attention. However, we observe that pruning often leads to a degradation in safety performance. To address this issue, we present a novel and lightweight approach, termed Hierarchical Safety Realignment (HSR). HSR operates by first quantifying the contribution of each attention head to safety, identifying the most critical ones, and then selectively restoring neurons directly within these attention heads that play a pivotal role in maintaining safety. This process hierarchically realigns the safety of pruned LVLMs, progressing from the attention head level to the neuron level. We validate HSR across various models and pruning strategies, consistently achieving notable improvements in safety performance. To our knowledge, this is the first work explicitly focused on restoring safety in LVLMs post-pruning.

Title: MPL: Multiple Programming Languages with Large Language Models for Information Extraction

Authors: Bo Li, Gexiang Fang, Wei Ye, Zhenghua Xu, Jinglei Zhang, Hao Cheng, Shikun Zhang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16107
Pdf URL: https://arxiv.org/pdf/2505.16107
Copy Paste: [[2505.16107]] MPL: Multiple Programming Languages with Large Language Models for Information Extraction(https://arxiv.org/abs/2505.16107)
Keywords: language model, prompt
Abstract: Recent research in information extraction (IE) focuses on utilizing code-style inputs to enhance structured output generation. The intuition behind this is that the programming languages (PLs) inherently exhibit greater structural organization than natural languages (NLs). This structural advantage makes PLs particularly suited for IE tasks. Nevertheless, existing research primarily focuses on Python for code-style simulation, overlooking the potential of other widely-used PLs (e.g., C++ and Java) during the supervised fine-tuning (SFT) phase. In this research, we propose Multiple Programming Languages with large language models for information extraction (abbreviated as MPL), a novel framework that explores the potential of incorporating different PLs in the SFT phase. Additionally, we introduce function-prompt with virtual running to simulate code-style inputs more effectively and efficiently. Experimental results on a wide range of datasets demonstrate the effectiveness of MPL. Furthermore, we conduct extensive experiments to provide a comprehensive analysis. We have released our code for future research.

Title: Semiotic Reconstruction of Destination Expectation Constructs An LLM-Driven Computational Paradigm for Social Media Tourism Analytics

Authors: Haotian Lan, Yao Gao, Yujun Cheng, Wei Yuan, Kun Wang
Subjects: cs.CL, stat.AP
Abstract URL: https://arxiv.org/abs/2505.16118
Pdf URL: https://arxiv.org/pdf/2505.16118
Copy Paste: [[2505.16118]] Semiotic Reconstruction of Destination Expectation Constructs An LLM-Driven Computational Paradigm for Social Media Tourism Analytics(https://arxiv.org/abs/2505.16118)
Keywords: llm
Abstract: Social media's rise establishes user-generated content (UGC) as pivotal for travel decisions, yet analytical methods lack scalability. This study introduces a dual-method LLM framework: unsupervised expectation extraction from UGC paired with survey-informed supervised fine-tuning. Findings reveal leisure/social expectations drive engagement more than foundational natural/emotional factors. By establishing LLMs as precision tools for expectation quantification, we advance tourism analytics methodology and propose targeted strategies for experience personalization and social travel promotion. The framework's adaptability extends to consumer behavior research, demonstrating computational social science's transformative potential in marketing optimization.

Title: KoBALT: Korean Benchmark For Advanced Linguistic Tasks

Authors: Hyopil Shin, Sangah Lee, Dongjun Jang, Wooseok Song, Jaeyoon Kim, Chaeyoung Oh, Hyemi Jo, Youngchae Ahn, Sihyun Oh, Hyohyeong Chang, Sunkyoung Kim, Jinsik Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16125
Pdf URL: https://arxiv.org/pdf/2505.16125
Copy Paste: [[2505.16125]] KoBALT: Korean Benchmark For Advanced Linguistic Tasks(https://arxiv.org/abs/2505.16125)
Keywords: language model, llm
Abstract: We introduce KoBALT (Korean Benchmark for Advanced Linguistic Tasks), a comprehensive linguistically-motivated benchmark comprising 700 multiple-choice questions spanning 24 phenomena across five linguistic domains: syntax, semantics, pragmatics, phonetics/phonology, and morphology. KoBALT is designed to advance the evaluation of large language models (LLMs) in Korean, a morphologically rich language, by addressing the limitations of conventional benchmarks that often lack linguistic depth and typological grounding. It introduces a suite of expert-curated, linguistically motivated questions with minimal n-gram overlap with standard Korean corpora, substantially mitigating the risk of data contamination and allowing a more robust assessment of true language understanding. Our evaluation of 20 contemporary LLMs reveals significant performance disparities, with the highest-performing model achieving 61% general accuracy but showing substantial variation across linguistic domains - from stronger performance in semantics (66%) to considerable weaknesses in phonology (31%) and morphology (36%). Through human preference evaluation with 95 annotators, we demonstrate a strong correlation between KoBALT scores and human judgments, validating our benchmark's effectiveness as a discriminative measure of Korean language understanding. KoBALT addresses critical gaps in linguistic evaluation for typologically diverse languages and provides a robust framework for assessing genuine linguistic competence in Korean language models.

Title: Veracity Bias and Beyond: Uncovering LLMs' Hidden Beliefs in Problem-Solving Reasoning

Authors: Yue Zhou, Barbara Di Eugenio
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16128
Pdf URL: https://arxiv.org/pdf/2505.16128
Copy Paste: [[2505.16128]] Veracity Bias and Beyond: Uncovering LLMs' Hidden Beliefs in Problem-Solving Reasoning(https://arxiv.org/abs/2505.16128)
Keywords: llm
Abstract: Despite LLMs' explicit alignment against demographic stereotypes, they have been shown to exhibit biases under various social contexts. In this work, we find that LLMs exhibit concerning biases in how they associate solution veracity with demographics. Through experiments across five human value-aligned LLMs on mathematics, coding, commonsense, and writing problems, we reveal two forms of such veracity biases: Attribution Bias, where models disproportionately attribute correct solutions to certain demographic groups, and Evaluation Bias, where models' assessment of identical solutions varies based on perceived demographic authorship. Our results show pervasive biases: LLMs consistently attribute fewer correct solutions and more incorrect ones to African-American groups in math and coding, while Asian authorships are least preferred in writing evaluation. In additional studies, we show LLMs automatically assign racially stereotypical colors to demographic groups in visualization code, suggesting these biases are deeply embedded in models' reasoning processes. Our findings indicate that demographic bias extends beyond surface-level stereotypes and social context provocations, raising concerns about LLMs' deployment in educational and evaluation settings.

Title: LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods

Authors: Hyang Cui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16129
Pdf URL: https://arxiv.org/pdf/2505.16129
Copy Paste: [[2505.16129]] LLMs Are Not Scorers: Rethinking MT Evaluation with Generation-Based Methods(https://arxiv.org/abs/2505.16129)
Keywords: language model, llm, prompt
Abstract: Recent studies have applied large language models (LLMs) to machine translation quality estimation (MTQE) by prompting models to assign numeric scores. Nonetheless, these direct scoring methods tend to show low segment-level correlation with human judgments. In this paper, we propose a generation-based evaluation paradigm that leverages decoder-only LLMs to produce high-quality references, followed by semantic similarity scoring using sentence embeddings. We conduct the most extensive evaluation to date in MTQE, covering 8 LLMs and 8 language pairs. Empirical results show that our method outperforms both intra-LLM direct scoring baselines and external non-LLM reference-free metrics from MTME. These findings demonstrate the strength of generation-based evaluation and support a shift toward hybrid approaches that combine fluent generation with accurate semantic assessment.
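The two-step pipeline described above (generate a reference, then score by embedding similarity) can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: `toy_embed` is a bag-of-words placeholder standing in for a real sentence-embedding model, `fake_llm_translate` is a hard-coded stand-in for a decoder-only LLM call, and `generation_based_score` is a hypothetical name.

```python
import math
from collections import Counter


def toy_embed(sentence):
    """Placeholder embedding: a bag-of-words count vector (NOT a real sentence encoder)."""
    return Counter(sentence.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like counters."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def generation_based_score(source, mt_output, generate_reference, embed=toy_embed):
    """Score an MT output against a reference produced by `generate_reference`."""
    reference = generate_reference(source)
    return cosine(embed(mt_output), embed(reference))


def fake_llm_translate(source):
    """Stand-in for an LLM call; a real system would prompt a model here."""
    return {"Guten Morgen": "good morning"}[source]


score_good = generation_based_score("Guten Morgen", "Good morning", fake_llm_translate)
score_bad = generation_based_score("Guten Morgen", "see you later", fake_llm_translate)
print(score_good, score_bad)
```

The design point is that the LLM is never asked for a numeric score. It only produces text, and the number comes from a separate similarity computation.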

Title: Position of Uncertainty: A Cross-Linguistic Study of Positional Bias in Large Language Models

Authors: Menschikov Mikhail, Alexander Kharitonov, Maiia Kotyga, Vadim Porvatov, Anna Zhukovskaya, David Kagramanyan, Egor Shvetsov, Evgeny Burnaev
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16134
Pdf URL: https://arxiv.org/pdf/2505.16134
Copy Paste: [[2505.16134]] Position of Uncertainty: A Cross-Linguistic Study of Positional Bias in Large Language Models(https://arxiv.org/abs/2505.16134)
Keywords: language model, llm, prompt
Abstract: Large language models exhibit positional bias -- systematic neglect of information at specific context positions -- yet its interplay with linguistic diversity remains poorly understood. We present a cross-linguistic study across five typologically distinct languages (English, Russian, German, Hindi, Vietnamese), examining how positional bias interacts with model uncertainty, syntax, and prompting. Key findings: (1) Positional bias is model-driven, with language-specific variations -- Qwen2.5-7B favors late positions, challenging assumptions of early-token bias; (2) Explicit positional guidance (e.g., correct context is at position X) reduces accuracy across languages, undermining prompt-engineering practices; (3) Aligning context with positional bias increases entropy, yet minimal entropy does not predict accuracy; (4) LLMs differ in how they impose dominant word order in free-word-order languages like Hindi.

Title: Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning

Authors: Shicheng Xu, Liang Pang, Yunchang Zhu, Jia Gu, Zihao Wei, Jingcheng Deng, Feiyang Pan, Huawei Shen, Xueqi Cheng
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16142
Pdf URL: https://arxiv.org/pdf/2505.16142
Copy Paste: [[2505.16142]] Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning(https://arxiv.org/abs/2505.16142)
Keywords: language model, llm
Abstract: Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of smaller Large Language Models (LLMs). However, the reasoning paths generated by teacher models often reflect only surface-level traces of their underlying authentic reasoning. Insights from cognitive neuroscience suggest that authentic reasoning involves a complex interweaving between meta-reasoning (which selects appropriate sub-problems from multiple candidates) and solving (which addresses the sub-problem). This implies authentic reasoning has an implicit multi-branch structure. Supervised fine-tuning collapses this rich structure into a flat sequence of token prediction in the teacher's reasoning path, preventing effective distillation of this structure to students. To address this limitation, we propose RLKD, a reinforcement learning (RL)-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning. RLKD combines this reward with RL, enabling student LLMs to internalize the teacher's implicit multi-branch reasoning structure rather than merely mimicking fixed output paths. Experiments show RLKD surpasses standard SFT-RL pipelines even when trained on 0.1% of data under an RL-only regime, unlocking greater student reasoning potential than SFT-based distillation.

Title: EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios

Authors: Bin Xu, Yu Bai, Huashan Sun, Yiguan Lin, Siming Liu, Xinyue Liang, Yaolin Li, Yang Gao, Heyan Huang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16160
Pdf URL: https://arxiv.org/pdf/2505.16160
Copy Paste: [[2505.16160]] EduBench: A Comprehensive Benchmarking Dataset for Evaluating Large Language Models in Diverse Educational Scenarios(https://arxiv.org/abs/2505.16160)
Keywords: language model
Abstract: As large language models continue to advance, their application in educational contexts remains underexplored and under-optimized. In this paper, we address this gap by introducing the first diverse benchmark tailored for educational scenarios, incorporating synthetic data containing 9 major scenarios and over 4,000 distinct educational contexts. To enable comprehensive assessment, we propose a set of multi-dimensional evaluation metrics that cover 12 critical aspects relevant to both teachers and students. We further apply human annotation to ensure the effectiveness of the model-generated evaluation responses. Additionally, we successfully train a relatively small-scale model on our constructed dataset and demonstrate that it can achieve performance comparable to state-of-the-art large models (e.g., Deepseek V3, Qwen Max) on the test set. Overall, this work provides a practical foundation for the development and evaluation of education-oriented language models. Code and data are released at this https URL.

Title: KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization

Authors: Mingbo Song, Heming Xia, Jun Zhang, Chak Tou Leong, Qiancheng Xu, Wenjie Li, Sujian Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16162
Pdf URL: https://arxiv.org/pdf/2505.16162
Copy Paste: [[2505.16162]] KNN-SSD: Enabling Dynamic Self-Speculative Decoding via Nearest Neighbor Layer Set Optimization(https://arxiv.org/abs/2505.16162)
Keywords: language model, llm
Abstract: Speculative Decoding (SD) has emerged as a widely used paradigm to accelerate the inference of large language models (LLMs) without compromising generation quality. It works by efficiently drafting multiple tokens using a compact model and then verifying them in parallel using the target LLM. Notably, Self-Speculative Decoding proposes skipping certain layers to construct the draft model, which eliminates the need for additional parameters or training. Despite its strengths, we observe in this work that drafting with layer skipping exhibits significant sensitivity to domain shifts, leading to a substantial drop in acceleration performance. To enhance the domain generalizability of this paradigm, we introduce KNN-SSD, an algorithm that leverages K-Nearest Neighbor (KNN) search to match different skipped layers with various domain inputs. We evaluated our algorithm in various models and multiple tasks, observing that its application leads to 1.3x-1.6x speedup in LLM inference.
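The nearest-neighbor matching step can be illustrated with a small Python sketch. It is a toy under loud assumptions: the 2-D "embeddings", the layer indices, the majority vote, and the function name `nearest_layer_set` are all hypothetical, and the abstract does not say how KNN-SSD represents inputs or chooses among neighbors.

```python
import math
from collections import Counter


def nearest_layer_set(query, memory, k=3):
    """Pick the skipped-layer set most common among the k nearest stored inputs.

    `memory` is a list of (embedding, layer_set) pairs; each layer set is a
    tuple of layer indices to skip while drafting.
    """
    by_distance = sorted(memory, key=lambda item: math.dist(query, item[0]))
    votes = Counter(layer_set for _, layer_set in by_distance[:k])
    return votes.most_common(1)[0][0]


# Hypothetical input embeddings and skip sets for two domains.
memory = [
    ((0.0, 0.1), (3, 7, 11)),  # e.g. code-like inputs
    ((0.1, 0.0), (3, 7, 11)),
    ((0.2, 0.1), (3, 7, 11)),
    ((5.0, 5.1), (2, 9, 14)),  # e.g. summarization-like inputs
    ((5.1, 4.9), (2, 9, 14)),
    ((4.9, 5.0), (2, 9, 14)),
]

skip_for_code = nearest_layer_set((0.1, 0.1), memory)
skip_for_summary = nearest_layer_set((5.0, 5.0), memory)
print(skip_for_code, skip_for_summary)
```

A query near the first cluster reuses that cluster's skip set, which is the sense in which the draft configuration adapts to the input domain instead of staying fixed.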

Title: Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task

Authors: Mengyang Qiu, Zoe Brisebois, Siena Sun
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16164
Pdf URL: https://arxiv.org/pdf/2505.16164
Copy Paste: [[2505.16164]] Can LLMs Simulate Human Behavioral Variability? A Case Study in the Phonemic Fluency Task(https://arxiv.org/abs/2505.16164)
Keywords: language model, llm, prompt
Abstract: Large language models (LLMs) are increasingly explored as substitutes for human participants in cognitive tasks, but their ability to simulate human behavioral variability remains unclear. This study examines whether LLMs can approximate individual differences in the phonemic fluency task, where participants generate words beginning with a target letter. We evaluated 34 model configurations, varying prompt specificity, sampling temperature, and model type, and compared outputs to responses from 106 human participants. While some configurations, especially Claude 3.7 Sonnet, matched human averages and lexical preferences, none reproduced the scope of human variability. LLM outputs were consistently less diverse and structurally rigid, and LLM ensembles failed to increase diversity. Network analyses further revealed fundamental differences in retrieval structure between humans and models. These results highlight key limitations in using LLMs to simulate human cognition and behavior.

Title: When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction

Authors: Yuqing Yang, Robin Jia
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16170
Pdf URL: https://arxiv.org/pdf/2505.16170
Copy Paste: [[2505.16170]] When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction(https://arxiv.org/abs/2505.16170)
Keywords: language model, llm
Abstract: Can large language models (LLMs) admit their mistakes when they should know better? In this work, we define the behavior of acknowledging errors in previously generated answers as "retraction" and aim to understand when and why LLMs choose to retract. We first construct model-specific datasets to evaluate whether a model will retract an incorrect answer that contradicts its own parametric knowledge. While LLMs are capable of retraction, they do so only infrequently. We demonstrate that retraction is closely tied to previously identified indicators of models' internal belief: models fail to retract wrong answers that they "believe" to be factually correct. Steering experiments further demonstrate that internal belief causally influences model retraction. In particular, when the model does not believe its answer, this not only encourages the model to attempt to verify the answer, but also alters attention behavior during self-verification. Finally, we demonstrate that simple supervised fine-tuning significantly improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available on this https URL.

Title: Automated Feedback Loops to Protect Text Simplification with Generative AI from Information Loss

Authors: Abhay Kumara Sri Krishna Nandiraju, Gondy Leroy, David Kauchak, Arif Ahmed
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16172
Pdf URL: https://arxiv.org/pdf/2505.16172
Copy Paste: [[2505.16172]] Automated Feedback Loops to Protect Text Simplification with Generative AI from Information Loss(https://arxiv.org/abs/2505.16172)
Keywords: gpt
Abstract: Understanding health information is essential in achieving and maintaining a healthy life. We focus on simplifying health information for better understanding. With the availability of generative AI, the simplification process has become efficient and of reasonable quality; however, the algorithms remove information that may be crucial for comprehension. In this study, we compare generative AI to detect missing information in simplified text, evaluate its importance, and fix the text with the missing information. We collected 50 health information texts and simplified them using gpt-4-0613. We compare five approaches to identify missing elements and regenerate the text by inserting the missing elements. These five approaches involve adding missing entities and missing words in various ways: 1) adding all the missing entities, 2) adding all missing words, 3) adding the top-3 entities ranked by gpt-4-0613, and 4, 5) serving as controls for comparison, adding randomly chosen entities. We use cosine similarity and ROUGE scores to evaluate the semantic similarity and content overlap between the original, simplified, and reconstructed simplified text. We do this for both summaries and full text. Overall, we find that adding missing entities improves the text. Adding all the missing entities resulted in better text regeneration than adding the top-ranked entities or words, or random words. Current tools can identify these entities, but are not valuable in ranking them.
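As a rough illustration of the content-overlap side of this evaluation, the Python sketch below computes a plain ROUGE-1 score (unigram overlap, no stemming) and lists the words a simplification dropped. It is not the study's pipeline: the tokenizer is deliberately naive, the example sentences are made up, and `missing_words` is a hypothetical helper that works on words instead of entities.

```python
from collections import Counter

PUNCT = ".,;:!?()"


def tokenize(text):
    """Lowercase whitespace tokenization with surrounding punctuation stripped."""
    return [w.strip(PUNCT) for w in text.lower().split() if w.strip(PUNCT)]


def rouge1(reference, candidate):
    """Unigram-overlap precision, recall, and F1 (ROUGE-1, no stemming)."""
    ref, cand = Counter(tokenize(reference)), Counter(tokenize(candidate))
    overlap = sum((ref & cand).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def missing_words(original, simplified):
    """Words present in the original but absent from the simplified text."""
    return sorted(set(tokenize(original)) - set(tokenize(simplified)))


# Made-up example texts, for illustration only.
original = "Take one tablet daily with food to reduce nausea."
simplified = "Take one tablet daily."

p, r, f = rouge1(original, simplified)
lost = missing_words(original, simplified)
print(p, r, f, lost)
```

In this example precision is perfect because every simplified word appears in the original, while the low recall and the `lost` list expose what the simplification removed, which is the kind of loss the feedback loop is meant to repair.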
摘要：了解健康信息对于实现和维持健康的生活至关重要。我们专注于简化健康信息以更好地理解。随着生成AI的可用性，简化过程已变得有效且质量合理，但是，算法消除了可能对理解至关重要的信息。在这项研究中，我们比较生成AI以检测简化文本中缺少的信息，评估其重要性，并使用缺少信息来修复文本。我们收集了50个健康信息文本，并使用GPT-4-0613简化了它们。我们比较了五种方法来识别缺失元素并通过插入缺失元素来重新生成文本。这五种方法涉及以各种方式添加缺失的实体和丢失的单词：1）添加所有缺失的实体，2）添加所有缺失的单词，3）添加由GPT-4-4-0613排名的TOP-3 ENTITIT，以及4，5）用作比较的控制，以进行比较，添加随机选择的实体。我们使用余弦相似性和胭脂分数来评估原始，简化和重建的简化文本之间的语义相似性和内容重叠。我们要为此进行摘要和全文。总体而言，我们发现添加缺失实体可以改善文本。添加所有缺失的实体会导致更好的文本再生，这比添加排名最高的实体或单词或随机词更好。当前的工具可以识别这些实体，但对于对它们进行排名并不有价值。
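
Illustrative sketch (not from the paper): the abstract above scores simplified text with cosine similarity and ROUGE and looks for words missing from the simplification. The Python below is a minimal bag-of-words version of those ideas. The function names, the regex tokenizer, and the use of raw word counts (rather than whatever embeddings or ROUGE package the authors used) are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between bag-of-words count vectors of two texts."""
    ca, cb = Counter(tokens(a)), Counter(tokens(b))
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rouge1_recall(reference, candidate):
    """Fraction of reference unigrams that also appear in the candidate."""
    ref, cand = Counter(tokens(reference)), Counter(tokens(candidate))
    overlap = sum(min(n, cand[w]) for w, n in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


def missing_words(original, simplified):
    """Words of the original, in first-seen order, absent from the simplified text."""
    present = set(tokens(simplified))
    missing = []
    for w in tokens(original):
        if w not in present and w not in missing:
            missing.append(w)
    return missing


original = "Take aspirin daily with food"
simplified = "Take aspirin daily"
print(missing_words(original, simplified))
print(round(rouge1_recall(original, simplified), 3))
print(round(cosine_similarity(original, simplified), 3))
```

Re-inserting the words returned by `missing_words` and re-scoring against the original is one simple way to picture the "add all missing words" condition described in the abstract.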

Title: Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge

Authors: Ying Zhang, Benjamin Heinzerling, Dongyuan Li, Ryoma Ishigaki, Yuta Hitomi, Kentaro Inui
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16178
Pdf URL: https://arxiv.org/pdf/2505.16178
Copy Paste: [[2505.16178]] Understanding Fact Recall in Language Models: Why Two-Stage Training Encourages Memorization but Mixed Training Teaches Knowledge(https://arxiv.org/abs/2505.16178)
Keywords: language model
Abstract: Fact recall, the ability of language models (LMs) to retrieve specific factual knowledge, remains a challenging task despite their impressive general capabilities. Common training strategies often struggle to promote robust recall behavior: two-stage training, which first trains a model with fact-storing examples (e.g., factual statements) and then with fact-recalling examples (question-answer pairs), tends to encourage rote memorization rather than generalizable fact retrieval. In contrast, mixed training, which jointly uses both types of examples, has been empirically shown to improve the ability to recall facts, but the underlying mechanisms are still poorly understood. In this work, we investigate how these training strategies affect how model parameters are shaped during training and how these differences relate to their ability to recall facts. We introduce cross-task gradient trace to identify shared parameters, those strongly influenced by both fact-storing and fact-recalling examples. Our analysis on synthetic fact recall datasets with the Llama-3.2B and Pythia-2.8B models reveals that mixed training encourages a larger and more centralized set of shared parameters. These findings suggest that the emergence of such shared parameters may play a key role in enabling LMs to generalize factual knowledge across task formulations.
摘要：事实回想起，尽管具有令人印象深刻的一般能力，但语言模型（LMS）检索特定事实知识的能力仍然是一项具有挑战性的任务。常见的培训策略通常努力通过两阶段培训来促进稳健的回忆行为，该培训首先通过事实存储的例子（例如，事实陈述）训练模型，然后以事实重新调查的例子（问答式）培训，倾向于鼓励死记硬背的记忆，而不是可概括的事实检索。相比之下，共同使用两种示例的混合训练已被经验证明可以提高回忆事实的能力，但是基本机制仍然很少了解。在这项工作中，我们研究了这些培训策略如何影响模型参数在培训期间的形状以及这些差异与回忆事实的能力之间的关系。我们介绍了交叉任务梯度跟踪以识别共享参数，这些参数受到事实存储和事实恢复示例的强烈影响。我们对使用Llama-3.2b和Pythia-2.8b模型的合成事实回忆数据集的分析表明，混合训练可以鼓励更大，更集中的共享参数集。这些发现表明，参数的出现在使LMS能够跨越任务配方概括事实知识方面可能起关键作用。

Title: SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models

Authors: Zirui He, Mingyu Jin, Bo Shen, Ali Payani, Yongfeng Zhang, Mengnan Du
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16188
Pdf URL: https://arxiv.org/pdf/2505.16188
Copy Paste: [[2505.16188]] SAE-SSV: Supervised Steering in Sparse Representation Spaces for Reliable Control of Language Models(https://arxiv.org/abs/2505.16188)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but controlling their behavior reliably remains challenging, especially in open-ended generation settings. This paper introduces a novel supervised steering approach that operates in sparse, interpretable representation spaces. We employ sparse autoencoders (SAEs) to obtain sparse latent representations that aim to disentangle semantic attributes from model activations. Then we train linear classifiers to identify a small subspace of task-relevant dimensions in latent representations. Finally, we learn supervised steering vectors constrained to this subspace, optimized to align with target behaviors. Experiments across sentiment, truthfulness, and political polarity steering tasks with multiple LLMs demonstrate that our supervised steering vectors achieve higher success rates with minimal degradation in generation quality compared to existing methods. Further analysis reveals that a notably small subspace is sufficient for effective steering, enabling more targeted and interpretable interventions.
摘要：大型语言模型（LLM）在自然语言理解和产生中表现出了令人印象深刻的能力，但是控制其行为仍然具有挑战性，尤其是在开放式的一代环境中。本文介绍了一种新颖的监督转向方法，该方法在稀疏，可解释的表示空间中运行。我们采用稀疏的自动编码器（SAE）来获得稀疏的潜在表示，旨在将语义属性从模型激活中解散。然后，我们训练线性分类器，以识别潜在表示中与任务相关的维度的一个小子空间。最后，我们学习受到此子空间的监督转向向量，优化以与目标行为保持一致。与现有方法相比，与现有方法相比，具有多个LLM的情感，真实性和政治极性转向任务的实验表明，我们所监督的转向向量获得了更高的成功率，而发电质量的降低最小。进一步的分析表明，一个尤其小的子空间足以有效转向，实现了更具针对性和可解释的干预措施。
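
Illustrative sketch (not from the paper): the abstract above describes steering vectors constrained to a small subspace of task-relevant latent dimensions. The Python below shows only that constraint in its simplest form: a steering vector is added to a sparse latent code, but only on the selected dimensions. The function name `steer_in_subspace`, the scaling factor `alpha`, and the toy values are hypothetical; how the authors select dimensions and learn the vector is not shown here.

```python
def steer_in_subspace(latent, steering_vector, subspace, alpha=1.0):
    """Add alpha * steering_vector to `latent`, restricted to indices in `subspace`.

    latent, steering_vector: equal-length lists of floats (toy SAE latent space).
    subspace: iterable of dimension indices allowed to change (assumed to come
              from some upstream selection step, e.g. a linear classifier).
    """
    if len(latent) != len(steering_vector):
        raise ValueError("latent and steering_vector must have the same length")
    allowed = set(subspace)
    return [
        h + alpha * v if i in allowed else h
        for i, (h, v) in enumerate(zip(latent, steering_vector))
    ]


latent = [0.0, 1.0, 0.0, 2.0]
vector = [5.0, 5.0, 5.0, 5.0]
# Only dimensions 1 and 3 are steered; dimensions 0 and 2 are left untouched.
print(steer_in_subspace(latent, vector, subspace=[1, 3], alpha=0.5))
```

Restricting the update to a few indices keeps every other latent feature exactly as it was, which is the intuition behind the "more targeted" interventions the abstract mentions.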

Title: An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability

Authors: Daiqing Wu, Dongbao Yang, Sicheng Zhao, Can Ma, Yu Zhou
Subjects: cs.CL, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16193
Pdf URL: https://arxiv.org/pdf/2505.16193
Copy Paste: [[2505.16193]] An Empirical Study on Configuring In-Context Learning Demonstrations for Unleashing MLLMs' Sentimental Perception Capability(https://arxiv.org/abs/2505.16193)
Keywords: language model, llm
Abstract: The advancements in Multimodal Large Language Models (MLLMs) have enabled various multimodal tasks to be addressed under a zero-shot paradigm. This paradigm sidesteps the cost of model fine-tuning, emerging as a dominant trend in practical application. Nevertheless, Multimodal Sentiment Analysis (MSA), a pivotal challenge in the quest for general artificial intelligence, fails to accommodate this convenience. The zero-shot paradigm exhibits undesirable performance on MSA, casting doubt on whether MLLMs can perceive sentiments as competently as supervised models. By extending the zero-shot paradigm to In-Context Learning (ICL) and conducting an in-depth study on configuring demonstrations, we validate that MLLMs indeed possess such capability. Specifically, three key factors that cover demonstrations' retrieval, presentation, and distribution are comprehensively investigated and optimized. A sentimental predictive bias inherent in MLLMs is also discovered and later effectively counteracted. By complementing each other, the devised strategies for the three factors result in average accuracy improvements of 15.9% on six MSA datasets against the zero-shot paradigm and 11.2% against the random ICL baseline.
摘要：多模式大语言模型（MLLM）的进步已经使各种多模式任务在零射击范式下解决。这种范式避开了模型微调的成本，这是实际应用中的主要趋势。然而，多模式情感分析（MSA）是寻求通用人工智能的关键挑战，无法适应这种便利。零射击范式在MSA上表现出不良的表现，对MLLM是否可以将情感视为有监督模型的胜任感。通过将零射击范式扩展到文化学习（ICL）并进行配置演示的深入研究，我们验证了MLLM的确具有这种能力。具体而言，全面研究和优化了涵盖演示的检索，演示和分布的三个关键因素。还发现了MLLM固有的感性预测偏差，然后有效地应对。通过相互补充，针对三个因素设计的策略导致六个MSA数据集的平均准确性提高15.9％，而零弹药范式对随机ICL基线的平均准确性提高，为11.2％。

Title: Large Language Models based ASR Error Correction for Child Conversations

Authors: Anfeng Xu, Tiantian Feng, So Hyun Kim, Somer Bishop, Catherine Lord, Shrikanth Narayanan
Subjects: cs.CL, eess.AS
Abstract URL: https://arxiv.org/abs/2505.16212
Pdf URL: https://arxiv.org/pdf/2505.16212
Copy Paste: [[2505.16212]] Large Language Models based ASR Error Correction for Child Conversations(https://arxiv.org/abs/2505.16212)
Keywords: language model, llm
Abstract: Automatic Speech Recognition (ASR) has recently shown remarkable progress, but accurately transcribing children's speech remains a significant challenge. Recent developments in Large Language Models (LLMs) have shown promise in improving ASR transcriptions. However, their applications in child speech including conversational scenarios are underexplored. In this study, we explore the use of LLMs in correcting ASR errors for conversational child speech. We demonstrate the promises and challenges of LLMs through experiments on two children's conversational speech datasets with both zero-shot and fine-tuned ASR outputs. We find that while LLMs are helpful in correcting zero-shot ASR outputs and fine-tuned CTC-based ASR outputs, it remains challenging for LLMs to improve ASR performance when incorporating contextual information or when using fine-tuned autoregressive ASR (e.g., Whisper) outputs.
摘要：自动语音识别（ASR）最近显示出了显着的进步，但是准确转录儿童的言语仍然是一个重大挑战。大型语言模型（LLM）的最新发展显示了改善ASR抄录的希望。但是，他们在儿童演讲中的应用包括对话情景，并没有充满反感。在这项研究中，我们探讨了LLM在纠正ASR错误对会话儿童语音中的使用。我们通过对两个儿童的对话语音数据集进行了零射击和微调的ASR输出的实验来证明LLMS的承诺和挑战。我们发现，尽管LLM有助于纠正零射击ASR输出和基于CTC的微调ASR输出，但对于合并上下文信息或使用微型自动回调的ASR（例如，Whisper）输出时，LLMS在合并上下文信息时提高ASR性能仍然具有挑战性。

Title: Memorization or Reasoning? Exploring the Idiom Understanding of LLMs

Authors: Jisu Kim, Youngwoo Shin, Uiji Hwang, Jihun Choi, Richeng Xuan, Taeuk Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16216
Pdf URL: https://arxiv.org/pdf/2505.16216
Copy Paste: [[2505.16216]] Memorization or Reasoning? Exploring the Idiom Understanding of LLMs(https://arxiv.org/abs/2505.16216)
Keywords: language model, llm
Abstract: Idioms have long posed a challenge due to their unique linguistic properties, which set them apart from other common expressions. While recent studies have leveraged large language models (LLMs) to handle idioms across various tasks, e.g., idiom-containing sentence generation and idiomatic machine translation, little is known about the underlying mechanisms of idiom processing in LLMs, particularly in multilingual settings. To this end, we introduce MIDAS, a new large-scale dataset of idioms in six languages, each paired with its corresponding meaning. Leveraging this resource, we conduct a comprehensive evaluation of LLMs' idiom processing ability, identifying key factors that influence their performance. Our findings suggest that LLMs rely not only on memorization, but also adopt a hybrid approach that integrates contextual cues and reasoning, especially when processing compositional idioms. This implies that idiom understanding in LLMs emerges from an interplay between internal knowledge retrieval and reasoning-based inference.
摘要：由于其独特的语言特性，成语长期以来提出了挑战，这使它们与其他共同表达式区分开来。尽管最近的研究利用了大型语言模型（LLM）来处理各种任务中的成语，例如，含字句的句子的产生和惯用机器翻译，但对LLMS中习惯处理的基本机制知之甚少，尤其是在多语言环境中。为此，我们介绍了MIDAS，这是一种新的大型习语数据集，其中每种语言都与其相应的含义配对。利用此资源，我们对LLMS的成语处理能力进行了全面评估，确定了影响其性能的关键因素。我们的发现表明，LLM不仅依赖于记忆，而且还采用一种混合方法来整合上下文提示和推理，尤其是在处理构图成语时。这意味着LLMS中的成语理解是由于内部知识检索和基于推理的推论之间的相互作用而出现的。

Title: Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation

Authors: Jiwon Moon, Yerin Hwang, Dongryeol Lee, Taegwan Kang, Yongil Kim, Kyomin Jung
Subjects: cs.CL, cs.SE
Abstract URL: https://arxiv.org/abs/2505.16222
Pdf URL: https://arxiv.org/pdf/2505.16222
Copy Paste: [[2505.16222]] Don't Judge Code by Its Cover: Exploring Biases in LLM Judges for Code Evaluation(https://arxiv.org/abs/2505.16222)
Keywords: language model, llm, prompt
Abstract: With the growing use of large language models (LLMs) as evaluators, their application has expanded to code evaluation tasks, where they assess the correctness of generated code without relying on reference implementations. While this offers scalability and flexibility, it also raises a critical, unresolved question: Can LLM judges fairly and robustly evaluate semantically equivalent code with superficial variations? Functionally correct code often exhibits variations (such as differences in variable names, comments, or formatting) that should not influence its correctness. Yet, whether LLM judges can reliably handle these variations remains unclear. We present the first comprehensive study of this issue, defining six types of potential bias in code evaluation and revealing their systematic impact on LLM judges. Across five programming languages and multiple LLMs, we empirically demonstrate that all tested LLM judges are susceptible to both positive and negative biases, resulting in inflated or unfairly low scores. Moreover, we observe that LLM judges remain vulnerable to these biases even when prompted to generate test cases before scoring, highlighting the need for more robust code evaluation methods.
摘要：随着大型语言模型（LLM）作为评估者的日益增长的使用，其应用程序已扩展到代码评估任务，在那里他们在不依赖参考实现的情况下评估了生成的代码的正确性。尽管这提供了可伸缩性和灵活性，但它也提出了一个关键的，尚未解决的问题：LLM法官是否可以公平，牢固地评估具有表面变化的语义上等效的代码？在功能上正确的代码通常表现出变化，例如可变名称，注释或格式化的差异 - 不应影响其正确性。但是，LLM法官是否可以可靠地处理这些变化尚不清楚。我们介绍了该问题的首次全面研究，定义了代码评估中的六种潜在偏见，并揭示了其对LLM法官的系统影响。在五种编程语言和多种LLM中，我们从经验上证明，所有经过测试的LLM法官都容易受到正偏见和负面偏见的影响，从而导致分数膨胀或不公平。此外，我们观察到，即使提示在得分之前提示生成测试案例，LLM法官仍然容易受到这些偏见的影响，这突出了需要更强大的代码评估方法。

Title: Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning

Authors: Bohao Wu, Qingyun Wang, Yue Guo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16227
Pdf URL: https://arxiv.org/pdf/2505.16227
Copy Paste: [[2505.16227]] Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning(https://arxiv.org/abs/2505.16227)
Keywords: language model, gpt, prompt
Abstract: Personalizing jargon detection and explanation is essential for making technical documents accessible to readers with diverse disciplinary backgrounds. However, tailoring models to individual users typically requires substantial annotation efforts and computational resources due to user-specific finetuning. To address this, we present a systematic study of personalized jargon detection, focusing on methods that are both efficient and scalable for real-world deployment. We explore two personalization strategies: (1) lightweight fine-tuning using Low-Rank Adaptation (LoRA) on open-source models, and (2) personalized prompting, which tailors model behavior at inference time without retraining. To reflect realistic constraints, we also investigate hybrid approaches that combine limited annotated data with unsupervised user background signals. Our personalized LoRA model outperforms GPT-4 by 21.4% in F1 score and exceeds the best performing oracle baseline by 8.3%. Remarkably, our method achieves comparable performance using only 10% of the annotated training data, demonstrating its practicality for resource-constrained settings. Our study is the first to systematically explore efficient, low-resource personalization of jargon detection using open-source language models, offering a practical path toward scalable, user-adaptive NLP systems.
摘要：个性化的术语检测和解释对于使具有多种纪律背景的读者可以访问技术文档至关重要。但是，针对单个用户的量身定制模型通常需要由于用户特异性的固定而导致的大量注释工作和计算资源。为了解决这个问题，我们提出了一项对个性化行话检测的系统研究，重点介绍了对现实世界部署既有效又可扩展的方法。我们探讨了两种个性化策略：（1）使用低级别适应（LORA）在开源模型上轻巧的微调，以及（2）个性化提示，该提示在推理时定制了模型行为而无需保留。为了反映现实的约束，我们还研究了将有限的注释数据与无监督的用户背景信号相结合的混合方法。我们个性化的洛拉（Lora）模型在F1分中优于GPT-4的21.4％，超过了表现最好的Oracle基线8.3％。值得注意的是，我们的方法仅使用10％的带注释的培训数据来实现可比的性能，证明了其对资源受限设置的实用性。我们的研究提供了第一项工作，以系统地探索使用开放式语言模型对行话检测有效，低资源个性化的个性化，从而为通往可扩展的，用户自适应的NLP系统提供了实用的途径。

Title: MuseRAG: Idea Originality Scoring At Scale

Authors: Ali Sarosh Bangash, Krish Veera, Ishfat Abrar Islam, Raiyan Abdul Baten
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16232
Pdf URL: https://arxiv.org/pdf/2505.16232
Copy Paste: [[2505.16232]] MuseRAG: Idea Originality Scoring At Scale(https://arxiv.org/abs/2505.16232)
Keywords: language model, llm, prompt, retrieval-augmented generation
Abstract: An objective, face-valid way to assess the originality of creative ideas is to measure how rare each idea is within a population -- an approach long used in creativity research but difficult to automate at scale. Tabulating response frequencies via manual bucketing of idea rephrasings is labor-intensive, error-prone, and brittle under large corpora. We introduce a fully automated, psychometrically validated pipeline for frequency-based originality scoring. Our method, MuseRAG, combines large language models (LLMs) with an externally orchestrated retrieval-augmented generation (RAG) framework. Given a new idea, the system retrieves semantically similar prior idea buckets and zero-shot prompts the LLM to judge whether the new idea belongs to an existing bucket or forms a new one. The resulting buckets enable computation of frequency-based originality metrics. Across five datasets (N=1143, n_ideas=16294), MuseRAG matches human annotators in idea clustering structure and resolution (AMI = 0.59) and in participant-level scoring (r = 0.89) -- while exhibiting strong convergent and external validity. Our work enables intent-sensitive, human-aligned originality scoring at scale to aid creativity research.
摘要：评估创意创意的一种客观，面部流行的方式是衡量每个人在人群中的罕见 - 一种长期用于创造力研究的方法，但很难自动化自动化。在大型语料库下，通过手动构思的手动限制响应频率是劳动密集型，容易出错和脆弱的。我们引入了一条完全自动化的，心理验证的管道，用于基于频率的原创性评分。我们的方法Muserag将大型语言模型（LLMS）与外部精心策划的检索生成（RAG）框架相结合。考虑到一个新想法，系统检索语义上相似的先前想法存储桶和零射击促使LLM判断新想法是否属于现有存储桶或形成新的想法。最终的存储桶能够计算基于频率的原创性指标。在五个数据集（n = 1143，n_ideas = 16294）中，Muserag在思想聚类结构和分辨率（AMI = 0.59）和参与者级别的评分（r = 0.89）中与人类注释匹配，同时表现出较强的收敛性和外部有效性。我们的工作使意识敏感的人类一致的独创性在大规模上得分，以帮助创造力研究。
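
Illustrative sketch (not from the paper): the abstract above scores originality by how rare an idea's bucket is within a population. Assuming the bucketing step has already happened, the Python below computes one simple frequency-based score. The rarity formula (one minus the bucket's share of all ideas), the per-participant averaging, and the names used are assumptions for illustration, not MuseRAG's actual metrics.

```python
from collections import Counter, defaultdict


def originality_scores(assignments):
    """Compute a mean rarity score per participant.

    assignments: list of (participant_id, bucket_id) pairs, one pair per idea.
    An idea's rarity is 1 - (ideas in its bucket / total ideas), so ideas in
    crowded buckets score low and ideas in singleton buckets score high.
    """
    bucket_counts = Counter(bucket for _, bucket in assignments)
    total = len(assignments)
    per_participant = defaultdict(list)
    for participant, bucket in assignments:
        rarity = 1.0 - bucket_counts[bucket] / total
        per_participant[participant].append(rarity)
    return {p: sum(vals) / len(vals) for p, vals in per_participant.items()}


# Hypothetical bucket labels for ideas about uses of a brick.
ideas = [
    ("p1", "doorstop"),
    ("p2", "doorstop"),
    ("p3", "sculpture"),
    ("p1", "paperweight"),
]
print(originality_scores(ideas))
```

In this toy data the shared "doorstop" bucket lowers the scores of p1 and p2, while p3's singleton bucket scores highest.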

Title: LIFEBench: Evaluating Length Instruction Following in Large Language Models

Authors: Wei Zhang, Zhenhong Zhou, Junfeng Fang, Rongwu Xu, Kun Wang, Yuanhe Zhang, Rui Wang, Ge Zhang, Xinfeng Li, Li Sun, Lingjuan Lyu, Yang Liu, Sen Su
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16234
Pdf URL: https://arxiv.org/pdf/2505.16234
Copy Paste: [[2505.16234]] LIFEBench: Evaluating Length Instruction Following in Large Language Models(https://arxiv.org/abs/2505.16234)
Keywords: language model, llm, long context
Abstract: While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions (e.g., write a 10,000-word novel). Additionally, models often generate far too short outputs, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely-used LLMs and find that most models reasonably follow short-length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length instruction following. Notably, Reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs' length-instruction-following ability, offering critical insights for future progress.
摘要：尽管大型语言模型（LLMS）可以在长上下文输入中解决博士学位级的推理问题，但它们仍然在看似简单的任务中挣扎：遵循明确的长度指令 - e.g。，写一本10,000字的小说。此外，模型通常会产生太短的输出，过早终止甚至拒绝请求。现有基准主要侧重于评估世代质量，但经常忽略几代人是否达到长度限制。为此，我们在评估基准（LifeBench）之后介绍了长度说明，以全面评估LLMS遵循各种任务和广泛指定长度的长度说明的能力。 LifeBench由英语和中文的4个任务类别的10,800个实例组成，涵盖了16到8192个单词的长度约束。我们评估了26个广泛使用的LLM，并发现大多数模型合理地遵循短长度的说明，但急剧下降了一定阈值。令人惊讶的是，实际上，几乎所有模型在实践中都无法达到供应商所宣称的最大输出长度，这进一步证实了我们的评估最多可扩展32K单词。即使是长篇小说LLM，尽管它们扩展了输入输出窗口，但违反直觉无法改善以下长度指导。值得注意的是，推理LLM的表现甚至超过了专业的长文本生成模型，从而实现了最新的长度。总体而言，LifeBench发现了当前LLMS的长度说明的基本限制，为未来的进步提供了关键的见解。
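
Illustrative sketch (not from the paper): the abstract above is about checking whether generated text meets an explicit length target. The Python below shows one simple way to measure that for whitespace-delimited text. The relative-deviation formula, the 10% default tolerance, and the function names are assumptions for illustration and are not LIFEBench's own metrics; Chinese text would need a different word or character counter.

```python
def word_count(text):
    """Count whitespace-delimited words (an assumption; unsuitable for Chinese)."""
    return len(text.split())


def length_deviation(text, target_words):
    """Signed relative deviation from the target length.

    Negative values mean the output is too short, positive values too long.
    """
    if target_words <= 0:
        raise ValueError("target_words must be positive")
    return (word_count(text) - target_words) / target_words


def within_tolerance(text, target_words, tolerance=0.1):
    """True if the output length is within +/- tolerance of the target."""
    return abs(length_deviation(text, target_words)) <= tolerance


output = "one two three four five"
print(length_deviation(output, 10))   # too short: half the requested length
print(within_tolerance(output, 5))    # exactly on target
```

A signed deviation separates the "far too short" failure described in the abstract from overlong outputs, which an absolute error would not.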

Title: Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation

Authors: Derong Xu, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Maolin Wang, Qidong Liu, Xiangyu Zhao, Yichao Wang, Huifeng Guo, Ruiming Tang, Enhong Chen, Tong Xu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16237
Pdf URL: https://arxiv.org/pdf/2505.16237
Copy Paste: [[2505.16237]] Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation(https://arxiv.org/abs/2505.16237)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large language models (LLMs) have demonstrated remarkable capabilities, but still struggle with issues like hallucinations and outdated information. Retrieval-augmented generation (RAG) addresses these issues by grounding LLM outputs in external knowledge with an Information Retrieval (IR) system. Building on this foundation, graph-based RAG systems go a step further by retrieving subgraphs, which preserve the relationships between knowledge entities and provide more comprehensive context. However, graph RAG faces two challenges: (1) Retrieving relevant information introduces irrelevant nodes (especially in dense graph databases, where retrieval usually extends to adjacent nodes), and leads to overly lengthy inputs that hinder efficiency; (2) The representation gap between graph and language during generation with LLMs limits the ability to fully leverage graph structures for enhanced understanding. To address these limitations, we propose Align-GRAG, a novel reasoning-guided dual alignment framework in the post-retrieval phase. It first formulates a subgraph by retrieving nodes and edges. Then an Aligner is proposed to jointly optimize a graph encoder with LLM-summarized reasoning. It achieves dual alignment of graph node and representation by leveraging KL divergence loss and contrastive loss, facilitating efficient pruning of irrelevant knowledge and establishing a unified semantic space. The Generator integrates the aligned graph data with the LLM to produce coherent and accurate answers. Experiments on the GraphQA benchmark across three tasks (including common sense reasoning, scene graph understanding, and knowledge graph reasoning) validate the effectiveness of our method. The code will be available upon acceptance.
摘要：大型语言模型（LLM）表现出了非凡的功能，但仍在幻觉和过时的信息等问题上挣扎。检索增强的生成（RAG）通过使用信息检索（IR）系统将LLM输出接地，从而解决了这些问题。基于该基础的基础，基于图的抹布系统通过检索子图，从而更进一步，从而保留知识实体之间的关系并提供更全面的环境。但是，图形抹布面临两个挑战：（1）检索相关信息引入了无关的节点（尤其是在密集的图形数据库中，检索通常扩展到相邻节点），并导致过度漫长的输入，从而阻碍效率；（2）LLMS生成期间图形和语言之间的表示差距限制了充分利用图形结构以增强理解的能力。为了解决这些局限性，我们提出了Align-Grag，这是一种新颖的推理引导后的双对齐框架，中的反应后短语。它首先通过检索节点和边缘来制定子图。然后，提出了一个对准器，以共同优化使用LLM-Summarized推理的图形编码器。它通过利用KL差异损失和对比度损失来实现图节点的双重对准，并促进了无关紧要的知识的有效修剪并建立统一的语义空间。发电机将对齐的图形数据与LLM集成在一起，以产生相干和准确的答案。跨三个任务（包括常识推理，场景图理解和知识图推理）的GraphQA基准测试实验验证了我们方法的有效性。该代码将在接受后可用。

Title: Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers

Authors: Viet-Anh Nguyen, Shiqian Zhao, Gia Dao, Runyi Hu, Yi Xie, Luu Anh Tuan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16241
Pdf URL: https://arxiv.org/pdf/2505.16241
Copy Paste: [[2505.16241]] Three Minds, One Legend: Jailbreak Large Reasoning Model with Adaptive Stacked Ciphers(https://arxiv.org/abs/2505.16241)
Keywords: language model, gpt, llm
Abstract: Recently, Large Reasoning Models (LRMs) have demonstrated superior logical capabilities compared to traditional Large Language Models (LLMs), gaining significant attention. Despite their impressive performance, the potential for stronger reasoning abilities to introduce more severe security vulnerabilities remains largely underexplored. Existing jailbreak methods often struggle to balance effectiveness with robustness against adaptive safety mechanisms. In this work, we propose SEAL, a novel jailbreak attack that targets LRMs through an adaptive encryption pipeline designed to override their reasoning processes and evade potential adaptive alignment. Specifically, SEAL introduces a stacked encryption approach that combines multiple ciphers to overwhelm the model's reasoning capabilities, effectively bypassing built-in safety mechanisms. To further prevent LRMs from developing countermeasures, we incorporate two dynamic strategies - random and adaptive - that adjust the cipher length, order, and combination. Extensive experiments on real-world reasoning models, including DeepSeek-R1, Claude Sonnet, and OpenAI GPT-o4, validate the effectiveness of our approach. Notably, SEAL achieves an attack success rate of 80.8% on GPT o4-mini, outperforming state-of-the-art baselines by a significant margin of 27.2%. Warning: This paper contains examples of inappropriate, offensive, and harmful content.
摘要：最近，与传统的大型语言模型（LLM）相比，大型推理模型（LRMS）表现出了出色的逻辑能力，并引起了人们的重大关注。尽管其表现令人印象深刻，但具有更强推理能力引入更严重安全漏洞的潜力仍然很大程度上没有被解散。现有的越狱方法通常难以在适应性安全机制上平衡效率与鲁棒性。在这项工作中，我们提出了SEAL，这是一种新颖的越狱攻击，该袭击通过适应性加密管道来针对LRM，旨在覆盖其推理过程并逃避潜在的适应性对准。具体而言，SEAR引入了一种堆叠的加密方法，该方法结合了多个密码，以压倒模型推理功能，有效地绕开了内置的安全机制。为了进一步防止LRM开发对策，我们结合了两种动态策略 - 随机和自适应 - 调整了密码长度，顺序和组合。在包括DeepSeek-R1，Claude Sonnet和Openai GPT-O4在内的实际推理模型上进行了广泛的实验，验证了我们方法的有效性。值得注意的是，SEAL的攻击成功率在GPT O4-Mini上达到了80.8％，超过最先进的基线的攻击成功率显着27.2％。警告：本文包含不适当，令人反感和有害内容的示例。

Title: Diverse, not Short: A Length-Controlled Self-Learning Framework for Improving Response Diversity of Language Models

Authors: Vijeta Deshpande, Debasmita Ghose, John D. Patterson, Roger Beaty, Anna Rumshisky
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16245
Pdf URL: https://arxiv.org/pdf/2505.16245
Copy Paste: [[2505.16245]] Diverse, not Short: A Length-Controlled Self-Learning Framework for Improving Response Diversity of Language Models(https://arxiv.org/abs/2505.16245)
Keywords: language model
Abstract: Diverse language model responses are crucial for creative generation, open-ended tasks, and self-improvement training. We show that common diversity metrics, and even reward models used for preference optimization, systematically bias models toward shorter outputs, limiting expressiveness. To address this, we introduce Diverse, not Short (Diverse-NS), a length-controlled self-learning framework that improves response diversity while maintaining length parity. By generating and filtering preference data that balances diversity, quality, and length, Diverse-NS enables effective training using only 3,000 preference pairs. Applied to LLaMA-3.1-8B and the Olmo-2 family, Diverse-NS substantially enhances lexical and semantic diversity. We show consistent improvement in diversity with minor reduction or gains in response quality on four creative generation tasks: Divergent Associations, Persona Generation, Alternate Uses, and Creative Writing. Surprisingly, experiments with the Olmo-2 model family (7B and 13B) show that smaller models like Olmo-2-7B can serve as effective "diversity teachers" for larger models. By explicitly addressing length bias, our method efficiently pushes models toward more diverse and expressive outputs.
摘要：各种语言模型的响应对于创造性的产生，开放式任务和自我完善培训至关重要。我们表明，常见的多样性指标，甚至用于偏好优化的奖励模型，系统地偏向于较短的产出，从而限制了表现力。为了解决这个问题，我们介绍了多种多样的（不同的NS），这是一个长度控制的自学习框架，可改善响应多样性，同时保持均等。通过生成和过滤的偏好数据，平衡多样性，质量和长度，不同的NS仅使用3,000对偏好对就可以有效培训。应用于Llama-3.1-8B和Olmo-2家族，不同的NS大大提高了词汇和语义多样性。我们在四个创意一代任务上的响应质量略有减少或提高质量方面表现出一致的改善：不同的关联，角色生成，替代用途和创意写作。令人惊讶的是，使用Olmo-2模型家族（7B和13B）进行的实验表明，诸如Olmo-2-7B之类的较小模型可以作为较大模型的有效“多样性教师”。通过明确解决长度偏差，我们的方法有效地将模型推向了更多样化和表达的产出。

Title: Does Localization Inform Unlearning? A Rigorous Examination of Local Parameter Attribution for Knowledge Unlearning in Language Models

Authors: Hwiyeong Lee, Uiji Hwang, Hyelim Lim, Taeuk Kim
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16252
Pdf URL: https://arxiv.org/pdf/2505.16252
Copy Paste: [[2505.16252]] Does Localization Inform Unlearning? A Rigorous Examination of Local Parameter Attribution for Knowledge Unlearning in Language Models(https://arxiv.org/abs/2505.16252)
Keywords: language model, prompt
Abstract: Large language models often retain unintended content, prompting growing interest in knowledge unlearning. Recent approaches emphasize localized unlearning, which restricts parameter updates to specific regions in an effort to remove target knowledge while preserving unrelated general knowledge. However, their effectiveness remains uncertain due to the lack of robust and thorough evaluation of the trade-off between the competing goals of unlearning. In this paper, we begin by revisiting existing localized unlearning approaches. We then conduct controlled experiments to rigorously evaluate whether local parameter updates causally contribute to unlearning. Our findings reveal that the set of parameters that must be modified for effective unlearning is not strictly determined, challenging the core assumption of localized unlearning that parameter locality is inherently indicative of effective knowledge removal.
摘要：大型语言模型通常会保留意外的内容，从而促使人们对知识学习的兴趣日益增加。最近的方法强调了本地化的学习，这将参数更新限制为特定区域，以消除目标知识，同时保留无关的常识。但是，由于缺乏对竞争目标之间的权衡，由于缺乏强大而彻底的评估，它们的有效性仍然不确定。在本文中，我们首先重新审视现有的本地化学习方法。然后，我们进行受控的实验，以严格评估本地参数更新是否有因果关系有助于学习。我们的发现表明，必须修改以进行有效学习的一组参数并非严格确定，这挑战了本地化学习的核心假设，即参数位置本质上是有效删除知识的。

Title: IRONIC: Coherence-Aware Reasoning Chains for Multi-Modal Sarcasm Detection

Authors: Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
Subjects: cs.CL, cs.AI, cs.CV
Abstract URL: https://arxiv.org/abs/2505.16258
Pdf URL: https://arxiv.org/pdf/2505.16258
Copy Paste: [[2505.16258]] IRONIC: Coherence-Aware Reasoning Chains for Multi-Modal Sarcasm Detection(https://arxiv.org/abs/2505.16258)
Keywords: chain-of-thought
Abstract: Interpreting figurative language such as sarcasm across multi-modal inputs presents unique challenges, often requiring task-specific fine-tuning and extensive reasoning steps. However, current Chain-of-Thought approaches do not efficiently leverage the same cognitive processes that enable humans to identify sarcasm. We present IRONIC, an in-context learning framework that leverages Multi-modal Coherence Relations to analyze referential, analogical and pragmatic image-text linkages. Our experiments show that IRONIC achieves state-of-the-art performance on zero-shot Multi-modal Sarcasm Detection across different baselines. This demonstrates the need for incorporating linguistic and cognitive insights into the design of multi-modal reasoning strategies. Our code is available at: this https URL
摘要：在多模式输入中解释诸如讽刺之类的比喻语言提出了独特的挑战，通常需要特定于任务的微调和广泛的推理步骤。但是，当前的思想链方法并不能有效利用使人能够识别讽刺的相同认知过程。我们提出了一个具有讽刺意味的是，这是一种在上下文学习框架中，它利用多模式的连贯关系来分析参考，类比和务实的图像文本链接。我们的实验表明，具有讽刺意味的是，在不同基线的零射击多模式讽刺检测上取得了最新的性能。这表明有必要将语言和认知见解纳入多模式推理策略的设计中。我们的代码可用：此HTTPS URL

Title: Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning

Authors: Jiaru Zou, Yikun Ban, Zihao Li, Yunzhe Qi, Ruizhong Qiu, Ling Yang, Jingrui He
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16270
Pdf URL: https://arxiv.org/pdf/2505.16270
Copy Paste: [[2505.16270]] Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning(https://arxiv.org/abs/2505.16270)
Keywords: language model, llm
Abstract: Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. While standard fine-tuning focuses on minimizing generation loss to optimize model parameters, we take a deeper step by retaining and leveraging the model's own learning signals, analogous to how human learners reflect on past mistakes to improve future performance. We first introduce the concept of Mistake Log to systematically track the model's learning behavior and recurring errors throughout fine-tuning. Treating the original transformer-based model as the Pilot, we correspondingly design a Copilot model to refine the Pilot's inference performance via logits rectification. We name the overall Pilot-Copilot framework the Transformer Copilot, which introduces (i) a novel Copilot model design, (ii) a joint training paradigm where the Copilot continuously learns from the evolving Mistake Log alongside the Pilot, and (iii) a fused inference paradigm where the Copilot rectifies the Pilot's logits for enhanced generation. We provide both theoretical and empirical analyses on our new learning framework. Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks demonstrate that Transformer Copilot consistently improves performance by up to 34.5%, while introducing marginal computational overhead to Pilot models and exhibiting strong scalability and transferability.
摘要：大型语言模型通常通过对特定领域的数据进行监督的微调来适应下游任务。尽管标准的微调集中在最大程度地减少生成损失以优化模型参数，但我们通过保留和利用模型自己的学习信号来更深入地迈出一步，这类似于人类学习者如何反思过去的错误以改善未来的表现。我们首先介绍了错误日志的概念，以系统地跟踪模型的学习行为，并在整个微调过程中重复出现错误。将原始的基于变压器的模型视为飞行员，我们相应地设计了一个副副模型，以通过逻辑纠正来完善飞行员的推理性能。我们命名了整个飞行员 - 局部框架Transformer Copilot，它引入了（i）一种新颖的副本模型设计，（ii）一个联合培训范式，副驾驶员不断从飞行员的不断发展的错误日志中学习，以及（iii）融合的推理范式，在其中铜铜的副本将飞行员的逻辑增强了一代。我们在新的学习框架上提供理论和经验分析。跨越常识性，算术和建议任务的12个基准测试的实验表明，变压器副标士一致地提高了性能高达34.5％，同时将边际计算开销引入了试点模型，并表现出强大的可伸缩性和可传递性和可传递性。
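
Illustrative sketch (not from the paper): the abstract above describes a Copilot that rectifies the Pilot's logits at inference time. The Python below shows the general shape of such a fusion: a correction term is added to the Pilot's logits before the softmax. The additive form, the weight `lam`, and the toy numbers are assumptions for illustration; the paper's actual Copilot design and training are not reproduced here.

```python
import math


def rectify_logits(pilot_logits, copilot_correction, lam=1.0):
    """Return pilot_logits + lam * copilot_correction, element by element."""
    if len(pilot_logits) != len(copilot_correction):
        raise ValueError("logit vectors must have the same length")
    return [p + lam * c for p, c in zip(pilot_logits, copilot_correction)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


pilot = [2.0, 1.0, 0.0]        # Pilot alone prefers token 0
correction = [-1.5, 1.0, 0.0]  # hypothetical Copilot output
fused = rectify_logits(pilot, correction)
print(argmax(pilot), argmax(fused))
print([round(p, 3) for p in softmax(fused)])
```

Setting `lam` to 0 recovers the Pilot's original prediction, so in this sketch the weight controls how strongly the correction can change the output.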

Title: Spontaneous Speech Variables for Evaluating LLMs Cognitive Plausibility

Authors: Sheng-Fu Wang, Laurent Prevot, Jou-an Chi, Ri-Sheng Huang, Shu-Kai Hsieh
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16277
Pdf URL: https://arxiv.org/pdf/2505.16277
Copy Paste: [[2505.16277]] Spontaneous Speech Variables for Evaluating LLMs Cognitive Plausibility(https://arxiv.org/abs/2505.16277)
Keywords: language model, llm
Abstract: The achievements of Large Language Models in Natural Language Processing, especially for high-resource languages, call for a better understanding of their characteristics from a cognitive perspective. Researchers have attempted to evaluate artificial models by testing their ability to predict behavioral (e.g., eye-tracking fixations) and physiological (e.g., brain responses) variables during language processing (e.g., reading/listening). In this paper, we propose using spontaneous speech corpora to derive production variables (speech reductions, prosodic prominences) and applying them in a similar fashion. More precisely, we extract. We then test models trained with a standard procedure on different pretraining datasets (written, spoken, and mixed genres) for their ability to predict these two variables. Our results show that, after some fine-tuning, the models can predict these production variables well above baselines. We also observe that spoken genre training data provides more accurate predictions than written genres. These results contribute to the broader effort of using high-quality speech corpora as benchmarks for LLMs.
摘要：自然语言处理中大型语言模型的成就，尤其是对于高资源语言，呼吁从认知角度更好地理解其特征。研究人员试图通过测试预测行为（例如，眼睛跟踪固定）和生理（例如，大脑反应）变量（例如，阅读/听力）的生理变量（例如，阅读/听力）来评估人工模型。在本文中，我们建议使用自发的语音语料库来得出生产变量（降低语音，韵律突出），并以类似的方式应用它们。更确切地说，我们提取。然后，我们测试了在不同的预科数据集（书面，口语和混合流派）上训练有标准程序的模型，以预测这两个变量。我们的结果表明，经过一些微调，模型可以预测这些生产变量远高于基准。我们还观察到，与书面类型相比，口语类型培训数据提供了更准确的预测。这些结果有助于将高质量的语音语料库作为LLMS的基准进行更广泛的努力。

Title: HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation

Authors: Shijie Zhang, Renhao Li, Songsheng Wang, Philipp Koehn, Min Yang, Derek F. Wong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16281
Pdf URL: https://arxiv.org/pdf/2505.16281
Copy Paste: [[2505.16281]] HiMATE: A Hierarchical Multi-Agent Framework for Machine Translation Evaluation(https://arxiv.org/abs/2505.16281)
Keywords: language model, llm, hallucination, agent
Abstract: The advancement of Large Language Models (LLMs) enables flexible and interpretable automatic evaluations. In the field of machine translation evaluation, utilizing LLMs with translation error annotations based on Multidimensional Quality Metrics (MQM) yields more human-aligned judgments. However, current LLM-based evaluation methods still face challenges in accurately identifying error spans and assessing their severity. In this paper, we propose HiMATE, a Hierarchical Multi-Agent Framework for Machine Translation Evaluation. We argue that existing approaches inadequately exploit the fine-grained structural and semantic information within the MQM hierarchy. To address this, we develop a hierarchical multi-agent system grounded in the MQM error typology, enabling granular evaluation of subtype errors. Two key strategies are incorporated to further mitigate systemic hallucinations within the framework: the utilization of the model's self-reflection capability and the facilitation of agent discussion involving asymmetric information. Empirically, HiMATE outperforms competitive baselines across different datasets in conducting human-aligned evaluations. Further analyses underscore its significant advantage in error span detection and severity assessment, achieving an average F1-score improvement of 89% over the best-performing baseline. We make our code and data publicly available at this https URL.
摘要：大型语言模型（LLMS）的进步可以灵活且可解释的自动评估。在机器翻译评估领域，利用基于多维质量指标（MQM）的翻译误差注释的LLM会产生更多的人类分配判断。但是，当前基于LLM的评估方法仍然面临着准确识别错误跨度并评估其严重程度的挑战。在本文中，我们提出了Himate，这是用于机器翻译评估的层次多代理框架。我们认为，现有的方法不足以利用MQM层次结构内的细粒结构和语义信息。为了解决这个问题，我们开发了一个基于MQM误差类型的层次多代理系统，从而实现了亚型错误的粒状评估。纳入了两种关键策略，以进一步减轻框架内的系统性幻觉：模型的自我反射能力的利用以及涉及不对称信息的代理讨论的促进。从经验上讲，Himate在进行人类一致的评估方面优于不同数据集的竞争基准。进一步的分析强调了其在误差跨检测和严重性评估中的显着优势，比表现最佳的基线的平均F1分数提高了89％。我们在此HTTPS URL上公开提供代码和数据。

Title: Augmenting LLM Reasoning with Dynamic Notes Writing for Complex QA

Authors: Rishabh Maheshwary, Masoud Hashemi, Khyati Mahajan, Shiva Krishna Reddy Malay, Sai Rajeswar, Sathwik Tejaswi Madhusudhan, Spandana Gella, Vikas Yadav
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16293
Pdf URL: https://arxiv.org/pdf/2505.16293
Copy Paste: [[2505.16293]] Augmenting LLM Reasoning with Dynamic Notes Writing for Complex QA(https://arxiv.org/abs/2505.16293)
Keywords: language model, llm
Abstract: Iterative RAG for multi-hop question answering faces challenges with lengthy contexts and the buildup of irrelevant information. This hinders a model's capacity to process and reason over retrieved content and limits performance. While recent methods focus on compressing retrieved information, they are either restricted to single-round RAG, require finetuning or lack scalability in iterative RAG. To address these challenges, we propose Notes Writing, a method that generates concise and relevant notes from retrieved documents at each step, thereby reducing noise and retaining only essential information. This indirectly increases the effective context length of Large Language Models (LLMs), enabling them to reason and plan more effectively while processing larger volumes of input text. Notes Writing is framework agnostic and can be integrated with different iterative RAG methods. We demonstrate its effectiveness with three iterative RAG methods, across two models and four evaluation datasets. Notes writing yields an average improvement of 15.6 percentage points overall, with minimal increase in output tokens.
摘要：多跳问题回答的迭代抹布面临着冗长的环境和无关信息的挑战。这阻碍了模型处理和推理的能力，而不是检索到的内容和限制性能。尽管最近的方法着重于压缩检索到的信息，但它们要么仅限于单一抹布，需要填充或缺乏迭代抹布的可扩展性。为了应对这些挑战，我们提出了笔记写作，该方法在每个步骤中从检索到的文档中生成简洁且相关的笔记，从而减少噪声并仅保留基本信息。这间接增加了大语言模型（LLM）的有效上下文长度，使他们能够在处理大量输入文本的同时更有效地进行推理和计划。笔记写作是框架不可知论的，可以与不同的迭代抹布方法集成。我们在两个模型和四个评估数据集中使用三种迭代抹布方法证明了它的有效性。笔记写作的平均提高总数为15.6个百分点，而产出令牌的增加最小。

Title: ToDi: Token-wise Distillation via Fine-Grained Divergence Control

Authors: Seongryong Jung, Suwan Yoon, DongGeon Kim, Hwanhee Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16297
Pdf URL: https://arxiv.org/pdf/2505.16297
Copy Paste: [[2505.16297]] ToDi: Token-wise Distillation via Fine-Grained Divergence Control(https://arxiv.org/abs/2505.16297)
Keywords: language model, llm
Abstract: Large language models (LLMs) offer impressive performance but are impractical for resource-constrained deployment due to high latency and energy consumption. Knowledge distillation (KD) addresses this by transferring knowledge from a large teacher to a smaller student model. However, conventional KD, notably approaches like Forward KL (FKL) and Reverse KL (RKL), apply uniform divergence loss across the entire vocabulary, neglecting token-level prediction discrepancies. By investigating these representative divergences via gradient analysis, we reveal that FKL boosts underestimated tokens, while RKL suppresses overestimated ones, showing their complementary roles. Based on this observation, we propose Token-wise Distillation (ToDi), a novel method that adaptively combines FKL and RKL per token using a sigmoid-based weighting function derived from the teacher-student probability log-ratio. ToDi dynamically emphasizes the appropriate divergence for each token, enabling precise distribution alignment. We demonstrate that ToDi consistently outperforms recent distillation baselines using uniform or less granular strategies across instruction-following benchmarks. Extensive ablation studies and efficiency analysis further validate ToDi's effectiveness and practicality.
摘要：大型语言模型（LLM）提供了令人印象深刻的性能，但由于高潜伏期和能源消耗而对资源受限的部署不切实际。知识蒸馏（KD）通过将知识从大师转移到较小的学生模型来解决这一问题。但是，传统的KD，特别是诸如前kl（FKL）和反向KL（RKL）之类的方法，在整个词汇量中施加统一的差异损失，忽略令牌级别的预测差异。通过通过梯度分析研究这些代表性差异，我们揭示了FKL的提升被低估了令牌，而RKL抑制了被高估的令牌，显示了它们的互补作用。基于此观察结果，我们提出了令牌蒸馏（TODI），这是一种新颖的方法，它使用从教师学生概率日志中得出的基于Sigmoid的加权函数适应性地结合了fkl和rkl。 Todi动态强调了每个令牌的适当差异，从而实现了精确的分布对准。我们证明，Todi始终在跨指令遵循的基准中使用统一或更少的颗粒状策略来胜过最近的蒸馏基线。广泛的消融研究和效率分析进一步验证了托迪的有效性和实用性。
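The per-token combination described in the ToDi abstract can be illustrated with a minimal sketch. It assumes the weight on the forward-KL term is a plain sigmoid of the teacher-student log-ratio and that the reverse-KL term receives the complementary weight; any scaling or temperature the authors apply inside the sigmoid is not stated in the abstract and is omitted here. The function name `todi_style_loss` and the `eps` clamp are hypothetical choices made for illustration, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def todi_style_loss(teacher_probs, student_probs, eps=1e-12):
    """Token-wise mix of forward-KL and reverse-KL terms (illustrative sketch).

    teacher_probs, student_probs: probabilities over the same vocabulary.
    Assumption: the FKL weight is sigmoid(log(p / q)) and the RKL weight
    is its complement; this is not taken from the paper's code.
    """
    total = 0.0
    for p, q in zip(teacher_probs, student_probs):
        p = max(p, eps)  # clamp to avoid log(0)
        q = max(q, eps)
        log_ratio = math.log(p / q)
        w = sigmoid(log_ratio)      # larger when the student underestimates the token
        fkl_term = p * log_ratio    # forward-KL contribution of this token
        rkl_term = q * -log_ratio   # reverse-KL contribution of this token
        total += w * fkl_term + (1.0 - w) * rkl_term
    return total
```

Under this assumed weighting, tokens the student underestimates (teacher probability above student probability) lean on the forward-KL term, and overestimated tokens lean on the reverse-KL term, which matches the complementary roles the abstract describes.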

Title: INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling

Authors: Haochen Shi, Tianshi Zheng, Weiqi Wang, Baixuan Xu, Chunyang Li, Chunkit Chan, Tao Fan, Yangqiu Song, Qiang Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16303
Pdf URL: https://arxiv.org/pdf/2505.16303
Copy Paste: [[2505.16303]] INFERENCEDYNAMICS: Efficient Routing Across LLMs through Structured Capability and Knowledge Profiling(https://arxiv.org/abs/2505.16303)
Keywords: language model, llm
Abstract: Large Language Model (LLM) routing is a pivotal technique for navigating a diverse landscape of LLMs, aiming to select the best-performing LLMs tailored to the domains of user queries, while managing computational resources. However, current routing approaches often face limitations in scalability when dealing with a large pool of specialized LLMs, or in their adaptability to extending model scope and evolving capability domains. To overcome those challenges, we propose InferenceDynamics, a flexible and scalable multi-dimensional routing framework that models the capability and knowledge of models. We operate it on our comprehensive dataset RouteMix, and demonstrate its effectiveness and generalizability in group-level routing using modern benchmarks including MMLU-Pro, GPQA, BigGenBench, and LiveBench, showcasing its ability to identify and leverage top-performing models for given tasks, leading to superior outcomes with efficient resource utilization. The broader adoption of InferenceDynamics can empower users to harness the full specialized potential of the LLM ecosystem, and our code will be made publicly available to encourage further research.
摘要：大型语言模型（LLM）路由是一种关键技术，用于导航LLM的多样化景观，旨在选择针对用户查询域的最佳表现LLM，同时管理计算资源。但是，当前的路由方法在处理大量专业LLM或扩展模型范围和不断发展的能力域的适应性时，通常会面临可伸缩性的限制。为了克服这些挑战，我们提出了推论动力学，这是一个灵活且可扩展的多维路由框架，通过对模型的能力和知识进行建模。我们在全面的数据集RouteMix上进行操作，并使用现代基准（包括MMLU-Pro，GPQA，BigGenBench和LiveBench）在群体级路由中证明其有效性和概括性，并展示了其识别和利用具有优势资源的卓越成果的能力，从而展示了其能力识别和利用表现出色的模型。推理动态的广泛采用可以使用户能够利用LLM生态系统的全部专业潜力，并将公开使用我们的代码以鼓励进一步的研究。

Title: PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models

Authors: Chenzhuo Zhao, Ziqian Liu, Xingda Wang, Junting Lu, Chaoyi Ruan
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16307
Pdf URL: https://arxiv.org/pdf/2505.16307
Copy Paste: [[2505.16307]] PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models(https://arxiv.org/abs/2505.16307)
Keywords: language model, llm, prompt
Abstract: Prompt optimization offers a practical and broadly applicable alternative to fine-tuning for improving large language model (LLM) performance. However, existing methods often rely on costly output generation, self-critiquing abilities, or human-annotated preferences, which limit their scalability, especially for smaller or non-instruction-tuned models. We introduce PMPO (Probabilistic Metric Prompt Optimization), a unified framework that refines prompts using token-level cross-entropy loss as a direct, lightweight evaluation signal. PMPO identifies low-quality prompt segments by masking and measuring their impact on loss, then rewrites and selects improved variants by minimizing loss over positive and negative examples. Unlike prior methods, it requires no output sampling or human evaluation during optimization, relying only on forward passes and log-likelihoods. PMPO supports both supervised and preference-based tasks through a closely aligned loss-based evaluation strategy. Experiments show that PMPO consistently outperforms prior methods across model sizes and tasks: it achieves the highest average accuracy on BBH, performs strongly on GSM8K and AQUA-RAT, and improves AlpacaEval 2.0 win rates by over 19 points. These results highlight PMPO's effectiveness, efficiency, and broad applicability.
摘要：迅速优化提供了一种实用且广泛的替代方案，可用于改善大型语言模型（LLM）性能。但是，现有方法通常依赖于昂贵的产出，自我征收能力或人为宣传的偏好，这限制了它们的可扩展性，尤其是对于较小或非实体调整的模型。我们介绍了PMPO（概率度量提示优化），这是一个统一的框架，可完善使用令牌级的跨层损失作为直接，轻量级评估信号的提示。 PMPO通过掩盖和衡量其对损失的影响来识别低质量的提示段，然后通过将损失降到最小化的正面和负面示例，从而改写并选择改进的变体。与先前的方法不同，它在优化过程中不需要输出采样或人类评估，仅依靠向前的通行证和对数可能。 PMPO通过基于损失的评估策略来支持受监督和基于偏好的任务。实验表明，PMPO始终胜过模型尺寸和任务的先前方法：它达到了BBH的最高平均精度，在GSM8K和Aqua-Rat上的表现强大，并提高了Alpacaeval 2.0胜率，超过19分。这些结果突出了PMPO的有效性，效率和广泛的适用性。

Title: SC4ANM: Identifying Optimal Section Combinations for Automated Novelty Prediction in Academic Papers

Authors: Wenqing Wu, Chengzhi Zhang, Tong Bao, Yi Zhao
Subjects: cs.CL, cs.AI, cs.DL
Abstract URL: https://arxiv.org/abs/2505.16330
Pdf URL: https://arxiv.org/pdf/2505.16330
Copy Paste: [[2505.16330]] SC4ANM: Identifying Optimal Section Combinations for Automated Novelty Prediction in Academic Papers(https://arxiv.org/abs/2505.16330)
Keywords: language model, llm
Abstract: Novelty is a core component of academic papers, and there are multiple perspectives on the assessment of novelty. Existing methods often focus on word or entity combinations, which provide limited insights. The content related to a paper's novelty is typically distributed across different core sections, e.g., Introduction, Methodology and Results. Therefore, exploring the optimal combination of sections for evaluating the novelty of a paper is important for advancing automated novelty assessment. In this paper, we utilize different combinations of sections from academic papers as inputs to drive language models to predict novelty scores. We then analyze the results to determine the optimal section combinations for novelty score prediction. We first employ natural language processing techniques to identify the sectional structure of academic papers, categorizing them into introduction, methods, results, and discussion (IMRaD). Subsequently, we use different combinations of these sections (e.g., introduction and methods) as inputs for pretrained language models (PLMs) and large language models (LLMs), employing novelty scores provided by human expert reviewers as ground truth labels to obtain prediction results. The results indicate that using introduction, results and discussion is most appropriate for assessing the novelty of a paper, while the use of the entire text does not yield significant results. Furthermore, based on the results of the PLMs and LLMs, the introduction and results appear to be the most important sections for the task of novelty score prediction. The code and dataset for this paper can be accessed at this https URL.
摘要：新颖性是学术论文的核心组成部分，关于新颖性评估有多种观点。现有的方法通常集中在单词或实体组合上，这些组合提供有限的见解。与论文新颖性有关的内容通常分布在不同的核心部分，例如简介，方法论和结果。因此，探索评估纸张新颖性的部分的最佳组合对于推进自动化新颖性评估很重要。在本文中，我们利用从学术论文的各节组合作为输入来推动语言模型来预测新颖得分。然后，我们分析结果，以确定新颖性评分预测的最佳部分组合。我们首先采用自然语言处理技术来识别学术论文的分段结构，将其分类为引言，方法，结果和讨论（IMRAD）。随后，我们使用这些部分的不同组合（例如，引言和方法）作为预审前的语言模型（PLM）和大语言模型（LLMS）的输入，采用人类专家审阅者提供的新颖分数作为基础真实标签来获得预测结果。结果表明，使用介绍，结果和讨论最适合评估论文的新颖性，而整个文本的使用则不会产生显着的结果。此外，根据PLM和LLM的结果，引入和结果似乎是新颖得分预测任务的最重要部分。可以在此HTTPS URL上访问本文的代码和数据集。

Title: Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance

Authors: Taeyoon Kwon, Dongwook Choi, Sunghwan Kim, Hyojun Kim, Seungjun Moon, Beong-woo Kwak, Kuan-Hao Huang, Jinyoung Yeo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16348
Pdf URL: https://arxiv.org/pdf/2505.16348
Copy Paste: [[2505.16348]] Embodied Agents Meet Personalization: Exploring Memory Utilization for Personalized Assistance(https://arxiv.org/abs/2505.16348)
Keywords: language model, gpt, llm, agent
Abstract: Embodied agents empowered by large language models (LLMs) have shown strong performance in household object rearrangement tasks. However, these tasks primarily focus on single-turn interactions with simplified instructions, which do not truly reflect the challenges of providing meaningful assistance to users. To provide personalized assistance, embodied agents must understand the unique semantics that users assign to the physical world (e.g., favorite cup, breakfast routine) by leveraging prior interaction history to interpret dynamic, real-world instructions. Yet, the effectiveness of embodied agents in utilizing memory for personalized assistance remains largely underexplored. To address this gap, we present MEMENTO, a personalized embodied agent evaluation framework designed to comprehensively assess memory utilization capabilities to provide personalized assistance. Our framework consists of a two-stage memory evaluation process design that enables quantifying the impact of memory utilization on task performance. This process enables the evaluation of agents' understanding of personalized knowledge in object rearrangement tasks by focusing on its role in goal interpretation: (1) the ability to identify target objects based on personal meaning (object semantics), and (2) the ability to infer object-location configurations from consistent user patterns, such as routines (user patterns). Our experiments across various LLMs reveal significant limitations in memory utilization, with even frontier models like GPT-4o experiencing a 30.5% performance drop when required to reference multiple memories, particularly in tasks involving user patterns. These findings, along with our detailed analyses and case studies, provide valuable insights for future research in developing more effective personalized embodied agents. Project website: this https URL
摘要：由大语言模型（LLM）授权的体现的代理在家庭对象重新安排任务中表现出很强的表现。但是，这些任务主要集中于与简化说明的单转交互，这并不能真正反映出向用户提供有意义帮助的挑战。为了提供个性化的帮助，具体的代理必须通过利用先前的互动历史来解释动态的现实世界说明来理解用户将用户分配给物理世界（例如，喜欢的杯子，早餐例程）的独特语义。然而，体现代理在利用内存进行个性化援助方面的有效性仍然很大程度上没有被倍增。为了解决这一差距，我们提出了Memento，这是一个个性化体现的代理评估框架，旨在全面评估内存利用功能以提供个性化的帮助。我们的框架由一个两阶段的内存评估过程设计组成，该过程能够量化内存利用对任务性能的影响。该过程使代理人通过重点关注其在目标解释中的作用来评估代理对对象重新安排任务中的个性化知识的理解：（1）能够根据个人含义（对象语义）识别目标对象的能力，以及（2）从一致的用户模式（例如例程）（例如例程）中推断对象置换配置的能力。我们在各种LLMS上进行的实验揭示了内存利用率的显着局限性，即使是GPT-4O等边境模型，在需要参考多个记忆的情况下，尤其是在涉及用户模式的任务中，诸如GPT-4O的绩效下降30.5％。这些发现，加上我们的详细分析和案例研究，为未来开发更有效的个性化体现代理的研究提供了宝贵的见解。项目网站：此HTTPS URL

Title: Ask, Retrieve, Summarize: A Modular Pipeline for Scientific Literature Summarization

Authors: Pierre Achkar, Tim Gollub, Martin Potthast
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16349
Pdf URL: https://arxiv.org/pdf/2505.16349
Copy Paste: [[2505.16349]] Ask, Retrieve, Summarize: A Modular Pipeline for Scientific Literature Summarization(https://arxiv.org/abs/2505.16349)
Keywords: retrieval-augmented generation
Abstract: The exponential growth of scientific publications has made it increasingly difficult for researchers to stay updated and synthesize knowledge effectively. This paper presents XSum, a modular pipeline for multi-document summarization (MDS) in the scientific domain using Retrieval-Augmented Generation (RAG). The pipeline includes two core components: a question-generation module and an editor module. The question-generation module dynamically generates questions adapted to the input papers, ensuring the retrieval of relevant and accurate information. The editor module synthesizes the retrieved content into coherent and well-structured summaries that adhere to academic standards for proper citation. Evaluated on the SurveySum dataset, XSum demonstrates strong performance, achieving considerable improvements in metrics such as CheckEval, G-Eval and Ref-F1 compared to existing approaches. This work provides a transparent, adaptable framework for scientific summarization with potential applications in a wide range of domains. Code available at this https URL
摘要：科学出版物的指数增长使研究人员越来越难以有效地保持更新和合成知识。本文介绍了XSUM，这是使用检索型生成（RAG）在科学领域中多文件摘要（MDS）的模块化管道。该管道包括两个核心组件：问题生成模块和一个编辑器模块。问题生成模块动态生成适合输入论文的问题，以确保检索相关和准确的信息。编辑器模块将检索到的内容综合为结实且结构良好的摘要，这些摘要符合学术标准以进行适当的引用。 XSUM在监视数据集中进行了评估，与现有方法相比，诸如Checkeval，g-eval和Ref-F1之类的指标（例如Checkeval，g-eval和Ref-F1）的表现出色。这项工作为科学总结提供了一个透明，适应性的框架，并在各种领域中进行了潜在的应用。此https URL可用代码

Title: PaTH Attention: Position Encoding via Accumulating Householder Transformations

Authors: Songlin Yang, Yikang Shen, Kaiyue Wen, Shawn Tan, Mayank Mishra, Liliang Ren, Rameswar Panda, Yoon Kim
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16381
Pdf URL: https://arxiv.org/pdf/2505.16381
Copy Paste: [[2505.16381]] PaTH Attention: Position Encoding via Accumulating Householder Transformations(https://arxiv.org/abs/2505.16381)
Keywords: language model, llm
Abstract: The attention mechanism is a core primitive in modern large language models (LLMs) and AI more broadly. Since attention by itself is permutation-invariant, position encoding is essential for modeling structured domains such as language. Rotary position encoding (RoPE) has emerged as the de facto standard approach for position encoding and is part of many modern LLMs. However, in RoPE the key/query transformation between two elements in a sequence is only a function of their relative position and otherwise independent of the actual input. This limits the expressivity of RoPE-based transformers. This paper describes PaTH, a flexible data-dependent position encoding scheme based on accumulated products of Householder(like) transformations, where each transformation is data-dependent, i.e., a function of the input. We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices, and implement a FlashAttention-style blockwise algorithm that minimizes I/O cost. Across both targeted synthetic benchmarks and moderate-scale real-world language modeling experiments, we find that PaTH demonstrates superior performance compared to RoPE and other recent baselines.
摘要：注意机制是现代大型语言模型（LLM）和AI的核心原始性。由于注意力本身是置换不变的，因此编码的位置编码对于建模结构化域（例如语言）至关重要。旋转位置编码（ROPE）已成为现代编码的事实上的标准方法，并且是许多现代LLM的一部分。但是，在绳索中，序列中两个元素之间的密钥/查询转换只是其相对位置的函数，而其他独立于实际输入。这限制了基于绳索的变压器的表现力。本文描述了基于居民转换的累积产品的灵活数据依赖性位置方案的路径，其中每个转换均取决于数据依赖性，即输入的函数。我们得出了一种有效的并行算法，用于通过利用居民矩阵产品的紧凑代表来训练，并实施一种闪存式的块型算法，从而最大程度地减少I/O成本。在两个目标的合成基准和中等规模的现实语言建模实验中，我们发现与绳索和其他最近基线相比，路径表现出卓越的性能。

Title: Semantic Pivots Enable Cross-Lingual Transfer in Large Language Models

Authors: Kaiyu He, Tong Zhou, Yubo Chen, Delai Qiu, Shengping Liu, Kang Liu, Jun Zhao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16385
Pdf URL: https://arxiv.org/pdf/2505.16385
Copy Paste: [[2505.16385]] Semantic Pivots Enable Cross-Lingual Transfer in Large Language Models(https://arxiv.org/abs/2505.16385)
Keywords: language model, llm
Abstract: Large language models (LLMs) demonstrate remarkable ability in cross-lingual tasks. Understanding how LLMs acquire this ability is crucial for their interpretability. To quantify the cross-lingual ability of LLMs accurately, we propose a Word-Level Cross-Lingual Translation Task. To find how LLMs learn cross-lingual ability, we trace the outputs of LLMs' intermediate layers in the word translation task. We identify and distinguish two distinct behaviors in the forward pass of LLMs: co-occurrence behavior and semantic pivot behavior. We attribute LLMs' two distinct behaviors to the co-occurrence frequency of words and find the semantic pivot from the pre-training dataset. Finally, to apply our findings to improve the cross-lingual ability of LLMs, we reconstruct a semantic pivot-aware pre-training dataset using documents with a high proportion of semantic pivots. Our experiments validate the effectiveness of our approach in enhancing cross-lingual ability. Our research contributes insights into the interpretability of LLMs and offers a method for improving LLMs' cross-lingual ability.
摘要：大型语言模型（LLMS）在跨语言任务中表现出了显着的能力。了解LLM如何获得此能力对于它们的可解释性至关重要。为了准确量化LLM的跨语性能力，我们提出了一个单词级的跨语性翻译任务。为了找到LLM如何学习跨语言能力，我们在单词翻译任务中追踪LLMS中间层的输出。我们在LLM的正向通过中识别并区分了两种不同的行为：同时出现的行为和语义枢轴行为。我们将LLMS的两种不同行为归因于单词的共发生频率，并从训练数据集中找到语义枢轴。最后，将我们的发现应用于提高LLM的跨语性能力，我们使用具有很高比例的语义枢轴的文档重建语义透视感知的预训练数据集。我们的实验验证了我们方法在增强跨语性能力方面的有效性。我们的研究有助于洞察LLM的可解释性，并提供了提高LLMS跨语性能力的方法。

Title: Resource for Error Analysis in Text Simplification: New Taxonomy and Test Collection

Authors: Benjamin Vendeville, Liana Ermakova, Pierre De Loor
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16392
Pdf URL: https://arxiv.org/pdf/2505.16392
Copy Paste: [[2505.16392]] Resource for Error Analysis in Text Simplification: New Taxonomy and Test Collection(https://arxiv.org/abs/2505.16392)
Keywords: language model, llm
Abstract: The general public often encounters complex texts but does not have the time or expertise to fully understand them, leading to the spread of misinformation. Automatic Text Simplification (ATS) helps make information more accessible, but its evaluation methods have not kept up with advances in text generation, especially with Large Language Models (LLMs). In particular, recent studies have shown that current ATS metrics do not correlate with the presence of errors. Manual inspections have further revealed a variety of errors, underscoring the need for a more nuanced evaluation framework, which is currently lacking. This resource paper addresses this gap by introducing a test collection for detecting and classifying errors in simplified texts. First, we propose a taxonomy of errors, with a formal focus on information distortion. Next, we introduce a parallel dataset of automatically simplified scientific texts. This dataset has been human-annotated with labels based on our proposed taxonomy. Finally, we analyze the quality of the dataset, and we study the performance of existing models to detect and classify errors from that taxonomy. These contributions give researchers the tools to better evaluate errors in ATS, develop more reliable models, and ultimately improve the quality of automatically simplified texts.
摘要：公众经常遇到复杂的文本，但没有时间或专业知识来充分理解它们，从而导致错误信息传播。自动文本简化（ATS）有助于使信息更容易访问，但是其评估方法并没有跟上文本生成的进步，尤其是在大型语言模型（LLMS）方面。特别是，最近的研究表明，当前的ATS指标与错误的存在无关。手动检查进一步揭示了各种错误，强调了目前缺乏更细微的评估框架的需求。该资源论文通过引入用于检测和分类简化文本中错误的测试集来解决此差距。首先，我们提出了错误的分类法，并正式关注信息失真。接下来，我们介绍了自动简化科学文本的并行数据集。根据我们建议的分类法，该数据集已通过标签对人进行了注释。最后，我们分析了数据集的质量，并研究了现有模型的性能，以检测和分类该分类法的错误。这些贡献为研究人员提供了更好地评估ATS中错误，开发更可靠的模型并最终提高自动简化文本质量的工具。

Title: From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs

Authors: Muhammad Farid Adilazuarda, Chen Cecilia Liu, Iryna Gurevych, Alham Fikri Aji
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16408
Pdf URL: https://arxiv.org/pdf/2505.16408
Copy Paste: [[2505.16408]] From Surveys to Narratives: Rethinking Cultural Value Adaptation in LLMs(https://arxiv.org/abs/2505.16408)
Keywords: language model, llm
Abstract: Adapting cultural values in Large Language Models (LLMs) presents significant challenges, particularly due to biases and limited training data. Prior work primarily aligns LLMs with different cultural values using World Values Survey (WVS) data. However, it remains unclear whether this approach effectively captures cultural nuances or produces distinct cultural representations for various downstream tasks. In this paper, we systematically investigate WVS-based training for cultural value adaptation and find that relying solely on survey data can homogenize cultural norms and interfere with factual knowledge. To investigate these issues, we augment WVS with encyclopedic and scenario-based cultural narratives from Wikipedia and NormAd. While these narratives may have variable effects on downstream tasks, they consistently improve cultural distinctiveness more than survey data alone. Our work highlights the inherent complexity of aligning cultural values with the goal of guiding task-specific behavior.
摘要：在大型语言模型（LLM）中调整文化价值提出了重大挑战，尤其是由于偏见和培训数据有限。先前的工作主要使用世界价值调查（WVS）数据使LLM与不同的文化价值对齐。但是，目前尚不清楚这种方法是否有效地捕捉了文化上的细微差别或为各种下游任务产生不同的文化代表。在本文中，我们系统地研究了基于WVS的文化价值适应培训，并发现仅依靠调查数据可以匀浆文化规范并干扰事实知识。为了调查这些问题，我们通过Wikipedia和Normad的基于百科全书和场景的文化叙事来增强WVS。尽管这些叙述可能会对下游任务产生可变影响，但与单独的调查数据相比，它们始终如一地提高文化独特性。我们的工作突出了将文化价值观与指导特定于任务行为的目标保持一致的固有复杂性。

Title: Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning

Authors: Guanting Dong, Yifei Chen, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Yutao Zhu, Hangyu Mao, Guorui Zhou, Zhicheng Dou, Ji-Rong Wen
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16410
Pdf URL: https://arxiv.org/pdf/2505.16410
Copy Paste: [[2505.16410]] Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning(https://arxiv.org/abs/2505.16410)
Keywords: language model, llm, prompt
Abstract: Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging the RL algorithm to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at this https URL.
摘要：最近，大型语言模型（LLMS）通过大规模增强学习（RL）表现出了显着的推理能力。但是，利用RL算法来授权LLMS中有效的多工具协作推理能力仍然是一个悬而未决的挑战。在本文中，我们介绍了Tool-Star，这是一种基于RL的框架，旨在使LLMS在逐步推理期间自主调用多个外部工具。工具星将六种类型的工具集成在一起，并将系统设计纳入数据综合和培训。为了解决工具使用数据的稀缺性，我们提出了一个通用工具集成的推理数据合成管道，该管道将工具集成的提示与基于提示的采样结合起来，以自动且可扩展地生成工具使用轨迹。随后的质量归一化和困难的分类过程过滤了低质量的样本，并将数据集从易于到硬化组织。此外，我们提出了一个两阶段的培训框架，以通过以下方式增强多工具协作推理：（1）冷启动微型调整，它指导LLMS通过工具 - 发动机反馈来探索推理模式；（2）具有分层奖励设计的多工具自我批评RL算法，从而增强了奖励理解并促进有效的工具协作。实验性分析对10多种挑战性推理基准进行了强调刀具明星的有效性和效率。该代码可在此HTTPS URL上找到。

Title: Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation

Authors: Ruizhe Li, Chen Chen, Yuchen Hu, Yanjun Gao, Xi Wang, Emine Yilmaz
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16415
Pdf URL: https://arxiv.org/pdf/2505.16415
Copy Paste: [[2505.16415]] Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation(https://arxiv.org/abs/2505.16415)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments (context attribution) remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs at different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models.
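Summary: Retrieval-Augmented Generation (RAG) combines large language models (LLMs) with external context to improve the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments (context attribution) remains challenging because current methods are computationally intensive and often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), which identifies essential context sentences efficiently and accurately without additional fine-tuning or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs at different scales show higher accuracy and significant gains in computational efficiency over the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insight into the internal workings of RAG models.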
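
For readers unfamiliar with the quantity named in this abstract, the following is a minimal sketch of the Jensen-Shannon divergence and of how it could rank context sentences. The abstract does not describe the scoring procedure, so the setup here is an assumption: each sentence is scored by the divergence between the model's output distribution with the full context and with that sentence removed. The toy distributions and the names `full_context`, `ablated`, and `sentence_0`/`sentence_1` are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats; terms where p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: the mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical next-token distributions over a 3-token vocabulary.
full_context = [0.7, 0.2, 0.1]
ablated = {
    "sentence_0": [0.68, 0.21, 0.11],  # assumed: removing it barely shifts the output
    "sentence_1": [0.2, 0.5, 0.3],     # assumed: removing it shifts the output a lot
}

# Assumption: a larger divergence means the removed sentence mattered more.
scores = {name: js_divergence(full_context, dist) for name, dist in ablated.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

The divergence is symmetric and, with natural logarithms, bounded above by ln 2, which makes scores comparable across sentences.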

Title: WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning

Authors: Zhepei Wei, Wenlin Yao, Yao Liu, Weizhi Zhang, Qin Lu, Liang Qiu, Changlong Yu, Puyang Xu, Chao Zhang, Bing Yin, Hyokun Yun, Lihong Li
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16421
Pdf URL: https://arxiv.org/pdf/2505.16421
Copy Paste: [[2505.16421]] WebAgent-R1: Training Web Agents via End-to-End Multi-Turn Reinforcement Learning(https://arxiv.org/abs/2505.16421)
Keywords: language model, llm, prompt, chain-of-thought, agent
Abstract: While reinforcement learning (RL) has demonstrated remarkable success in enhancing large language models (LLMs), it has primarily focused on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interactions remains challenging due to the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, entirely guided by binary rewards depending on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, boosting the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and Llama-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, namely WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and provide insights on incorporating long chain-of-thought (CoT) reasoning in web agents.
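Summary: While reinforcement learning (RL) has been remarkably successful at enhancing large language models (LLMs), it has focused mainly on single-turn tasks such as solving math problems. Training effective web agents for multi-turn interaction remains challenging because of the complexity of long-horizon decision-making across dynamic web interfaces. In this work, we present WebAgent-R1, a simple yet effective end-to-end multi-turn RL framework for training web agents. It learns directly from online interactions with web environments by asynchronously generating diverse trajectories, guided entirely by binary rewards that depend on task success. Experiments on the WebArena-Lite benchmark demonstrate the effectiveness of WebAgent-R1, raising the task success rate of Qwen-2.5-3B from 6.1% to 33.9% and of Llama-3.1-8B from 8.5% to 44.8%, significantly outperforming existing state-of-the-art methods and strong proprietary models such as OpenAI o3. In-depth analyses reveal the effectiveness of the thinking-based prompting strategy and of test-time scaling through increased interactions for web tasks. We further investigate different RL initialization policies by introducing two variants, WebAgent-R1-Zero and WebAgent-R1-CoT, which highlight the importance of the warm-up training stage (i.e., behavior cloning) and offer insights on incorporating long chain-of-thought (CoT) reasoning in web agents.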

Title: Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems

Authors: Song Jin, Juntian Zhang, Yuhan Liu, Xun Zhang, Yufei Zhang, Guojun Yin, Fei Jiang, Wei Lin, Rui Yan
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16429
Pdf URL: https://arxiv.org/pdf/2505.16429
Copy Paste: [[2505.16429]] Beyond Static Testbeds: An Interaction-Centric Agent Simulation Platform for Dynamic Recommender Systems(https://arxiv.org/abs/2505.16429)
Keywords: llm, chain-of-thought, agent
Abstract: Evaluating and iterating upon recommender systems is crucial, yet traditional A/B testing is resource-intensive, and offline methods struggle with dynamic user-platform interactions. While agent-based simulation is promising, existing platforms often lack a mechanism for user actions to dynamically reshape the environment. To bridge this gap, we introduce RecInter, a novel agent-based simulation platform for recommender systems featuring a robust interaction mechanism. In RecInter platform, simulated user actions (e.g., likes, reviews, purchases) dynamically update item attributes in real-time, and introduced Merchant Agents can reply, fostering a more realistic and evolving ecosystem. High-fidelity simulation is ensured through Multidimensional User Profiling module, Advanced Agent Architecture, and LLM fine-tuned on Chain-of-Thought (CoT) enriched interaction data. Our platform achieves significantly improved simulation credibility and successfully replicates emergent phenomena like Brand Loyalty and the Matthew Effect. Experiments demonstrate that this interaction mechanism is pivotal for simulating realistic system evolution, establishing our platform as a credible testbed for recommender systems research.
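Summary: Evaluating and iterating on recommender systems is crucial, yet traditional A/B testing is resource-intensive and offline methods struggle with dynamic user-platform interactions. Although agent-based simulation is promising, existing platforms often lack a mechanism by which user actions dynamically reshape the environment. To bridge this gap, we introduce RecInter, a novel agent-based simulation platform for recommender systems that features a robust interaction mechanism. In the RecInter platform, simulated user actions (e.g., likes, reviews, purchases) dynamically update item attributes in real time, and newly introduced Merchant Agents can reply, fostering a more realistic and evolving ecosystem. High-fidelity simulation is ensured through a Multidimensional User Profiling module, an Advanced Agent Architecture, and an LLM fine-tuned on interaction data enriched with Chain-of-Thought (CoT). Our platform achieves significantly improved simulation credibility and successfully replicates emergent phenomena such as Brand Loyalty and the Matthew Effect. Experiments show that this interaction mechanism is pivotal for simulating realistic system evolution, establishing our platform as a credible testbed for recommender systems research.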

Title: University of Indonesia at SemEval-2025 Task 11: Evaluating State-of-the-Art Encoders for Multi-Label Emotion Detection

Authors: Ikhlasul Akmal Hanif, Eryawan Presma Yulianrifat, Jaycent Gunawan Ongris, Eduardus Tjitrahardja, Muhammad Falensi Azmi, Rahmat Bryan Naufal, Alfan Farizki Wicaksono
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16460
Pdf URL: https://arxiv.org/pdf/2505.16460
Copy Paste: [[2505.16460]] University of Indonesia at SemEval-2025 Task 11: Evaluating State-of-the-Art Encoders for Multi-Label Emotion Detection(https://arxiv.org/abs/2505.16460)
Keywords: prompt
Abstract: This paper presents our approach for SemEval 2025 Task 11 Track A, focusing on multilabel emotion classification across 28 languages. We explore two main strategies: fully fine-tuning transformer models and classifier-only training, evaluating different settings such as fine-tuning strategies, model architectures, loss functions, encoders, and classifiers. Our findings suggest that training a classifier on top of prompt-based encoders such as mE5 and BGE yields significantly better results than fully fine-tuning XLMR and mBERT. Our best-performing model on the final leaderboard is an ensemble combining multiple BGE models, where CatBoost serves as the classifier, with different configurations. This ensemble achieves an average F1-macro score of 56.58 across all languages.
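Summary: This paper presents our approach to SemEval 2025 Task 11 Track A, which focuses on multilabel emotion classification across 28 languages. We explore two main strategies, full fine-tuning of transformer models and classifier-only training, and evaluate different settings such as fine-tuning strategies, model architectures, loss functions, encoders, and classifiers. Our findings suggest that training a classifier on top of prompt-based encoders such as mE5 and BGE yields significantly better results than fully fine-tuning XLMR and mBERT. Our best-performing model on the final leaderboard is an ensemble of multiple BGE models with different configurations, in which CatBoost serves as the classifier. This ensemble achieves an average F1-macro score of 56.58 across all languages.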
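
As a reminder of how the F1-macro metric reported in this abstract is defined for multilabel data, here is a small worked sketch. The labels and predictions are made-up toy values, not data from the paper, and the helper names `f1` and `macro_f1` are hypothetical. Macro averaging computes one F1 per label and takes the unweighted mean, so rare emotions count as much as frequent ones.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when the label never occurs or is never predicted."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1 over binary indicator rows."""
    n_labels = len(y_true[0])
    per_label = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        per_label.append(f1(tp, fp, fn))
    return sum(per_label) / n_labels

# Toy example: 3 texts, 2 emotion labels (columns), values are hypothetical.
y_true = [[1, 0], [1, 1], [0, 1]]
y_pred = [[1, 0], [0, 1], [0, 1]]

# Label 0: tp=1, fp=0, fn=1 -> F1 = 2/3. Label 1: tp=2, fp=0, fn=0 -> F1 = 1.
score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.8333
```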

Title: Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization

Authors: Vera Neplenbroek, Arianna Bisazza, Raquel Fernández
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16467
Pdf URL: https://arxiv.org/pdf/2505.16467
Copy Paste: [[2505.16467]] Reading Between the Prompts: How Stereotypes Shape LLM's Implicit Personalization(https://arxiv.org/abs/2505.16467)
Keywords: language model, llm, prompt
Abstract: Generative Large Language Models (LLMs) infer user's demographic information from subtle cues in the conversation -- a phenomenon called implicit personalization. Prior work has shown that such inferences can lead to lower quality responses for users assumed to be from minority groups, even when no demographic information is explicitly provided. In this work, we systematically explore how LLMs respond to stereotypical cues using controlled synthetic conversations, by analyzing the models' latent user representations through both model internals and generated answers to targeted user questions. Our findings reveal that LLMs do infer demographic attributes based on these stereotypical signals, which for a number of groups even persists when the user explicitly identifies with a different demographic group. Finally, we show that this form of stereotype-driven implicit personalization can be effectively mitigated by intervening on the model's internal representations using a trained linear probe to steer them toward the explicitly stated identity. Our results highlight the need for greater transparency and control in how LLMs represent user identity.
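Summary: Generative large language models (LLMs) infer a user's demographic information from subtle cues in the conversation, a phenomenon called implicit personalization. Prior work has shown that such inferences can lead to lower-quality responses for users assumed to be from minority groups, even when no demographic information is explicitly provided. In this work, we systematically explore how LLMs respond to stereotypical cues using controlled synthetic conversations, analyzing the models' latent user representations through both model internals and generated answers to targeted user questions. Our findings reveal that LLMs do infer demographic attributes from these stereotypical signals, and that for a number of groups the inference persists even when the user explicitly identifies with a different demographic group. Finally, we show that this form of stereotype-driven implicit personalization can be effectively mitigated by intervening on the model's internal representations, using a trained linear probe to steer them toward the explicitly stated identity. Our results highlight the need for greater transparency and control in how LLMs represent user identity.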

Title: Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

Authors: Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16483
Pdf URL: https://arxiv.org/pdf/2505.16483
Copy Paste: [[2505.16483]] Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning(https://arxiv.org/abs/2505.16483)
Keywords: language model, gpt, llm
Abstract: Teaching large language models (LLMs) to be faithful in the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to improve the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.
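Summary: Teaching large language models (LLMs) to be faithful to the provided context is crucial for building reliable information-seeking systems. We therefore propose a systematic framework, CANOE, to improve the faithfulness of LLMs in both short-form and long-form generation tasks without human annotation. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality, easily verifiable training data without human annotation. We also propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from the synthesized short-form QA data while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO removes the need to manually label preference data for training reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.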

Title: LLaMAs Have Feelings Too: Unveiling Sentiment and Emotion Representations in LLaMA Models Through Probing

Authors: Dario Di Palma, Alessandro De Bellis, Giovanni Servedio, Vito Walter Anelli, Fedelucio Narducci, Tommaso Di Noia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16491
Pdf URL: https://arxiv.org/pdf/2505.16491
Copy Paste: [[2505.16491]] LLaMAs Have Feelings Too: Unveiling Sentiment and Emotion Representations in LLaMA Models Through Probing(https://arxiv.org/abs/2505.16491)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) have rapidly become central to NLP, demonstrating their ability to adapt to various tasks through prompting techniques, including sentiment analysis. However, we still have a limited understanding of how these models capture sentiment-related information. This study probes the hidden layers of Llama models to pinpoint where sentiment features are most represented and to assess how this affects sentiment analysis. Using probe classifiers, we analyze sentiment encoding across layers and scales, identifying the layers and pooling methods that best capture sentiment signals. Our results show that sentiment information is most concentrated in mid-layers for binary polarity tasks, with detection accuracy increasing up to 14% over prompting techniques. Additionally, we find that in decoder-only models, the last token is not consistently the most informative for sentiment encoding. Finally, this approach enables sentiment tasks to be performed with memory requirements reduced by an average of 57%. These insights contribute to a broader understanding of sentiment in LLMs, suggesting layer-specific probing as an effective approach for sentiment tasks beyond prompting, with potential to enhance model utility and reduce memory requirements.
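Summary: Large language models (LLMs) have rapidly become central to NLP, demonstrating their ability to adapt to various tasks, including sentiment analysis, through prompting techniques. However, we still have a limited understanding of how these models capture sentiment-related information. This study probes the hidden layers of Llama models to pinpoint where sentiment features are most strongly represented and to assess how this affects sentiment analysis. Using probe classifiers, we analyze sentiment encoding across layers and scales, identifying the layers and pooling methods that best capture sentiment signals. Our results show that sentiment information is most concentrated in the mid-layers for binary polarity tasks, with detection accuracy increasing by up to 14% over prompting techniques. Additionally, we find that in decoder-only models the last token is not consistently the most informative for sentiment encoding. Finally, this approach allows sentiment tasks to be performed with memory requirements reduced by an average of 57%. These insights contribute to a broader understanding of sentiment in LLMs and suggest layer-specific probing as an effective approach for sentiment tasks beyond prompting, with the potential to enhance model utility and reduce memory requirements.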

Title: Sparse Activation Editing for Reliable Instruction Following in Narratives

Authors: Runcong Zhao, Chengyu Cao, Qinglin Zhu, Xiucheng Lv, Shun Shao, Lin Gui, Ruifeng Xu, Yulan He
Subjects: cs.CL, cs.AI, cs.HC
Abstract URL: https://arxiv.org/abs/2505.16505
Pdf URL: https://arxiv.org/pdf/2505.16505
Copy Paste: [[2505.16505]] Sparse Activation Editing for Reliable Instruction Following in Narratives(https://arxiv.org/abs/2505.16505)
Keywords: language model
Abstract: Complex narrative contexts often challenge language models' ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To thoroughly evaluate our method, we introduce FreeInstruct, a diverse and realistic benchmark of 1,212 examples that highlights the challenges of instruction following in narrative-rich settings. While initially motivated by complex narratives, Concise-SAE demonstrates state-of-the-art instruction adherence across varied tasks without compromising generation quality.
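Summary: Complex narrative contexts often challenge language models' ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To evaluate our method thoroughly, we introduce FreeInstruct, a diverse and realistic benchmark of 1,212 examples that highlights the challenges of instruction following in narrative-rich settings. Although initially motivated by complex narratives, Concise-SAE demonstrates state-of-the-art instruction adherence across varied tasks without compromising generation quality.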

Title: CUB: Benchmarking Context Utilisation Techniques for Language Models

Authors: Lovisa Hagström, Youna Kim, Haeun Yu, Sang-goo Lee, Richard Johansson, Hyunsoo Cho, Isabelle Augenstein
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16518
Pdf URL: https://arxiv.org/pdf/2505.16518
Copy Paste: [[2505.16518]] CUB: Benchmarking Context Utilisation Techniques for Language Models(https://arxiv.org/abs/2505.16518)
Keywords: language model, retrieval-augmented generation
Abstract: Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) that encourage or suppress context utilisation have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark) to help practitioners within retrieval-augmented generation (RAG) identify the best CMT for their needs. CUB allows for rigorous testing on three distinct context types, observed to capture key challenges in realistic context utilisation scenarios. With this benchmark, we evaluate seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. Our results show that most of the existing CMTs struggle to handle the full set of types of contexts that may be encountered in real-world retrieval-augmented scenarios. Moreover, we find that many CMTs display an inflated performance on simple synthesised datasets, compared to more realistic datasets with naturally occurring samples. Altogether, our results show the need for holistic tests of CMTs and the development of CMTs that can handle multiple context types.
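Summary: Incorporating external knowledge is crucial for knowledge-intensive tasks such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory, or be distracted by irrelevant contexts. Although many context utilisation manipulation techniques (CMTs) that encourage or suppress context utilisation have recently been proposed to alleviate these issues, few have been compared systematically. In this paper, we develop CUB (Context Utilisation Benchmark) to help practitioners in retrieval-augmented generation (RAG) identify the best CMT for their needs. CUB allows rigorous testing on three distinct context types that were observed to capture key challenges in realistic context utilisation scenarios. With this benchmark, we evaluate seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. Our results show that most existing CMTs struggle to handle the full range of context types that may be encountered in real-world retrieval-augmented scenarios. Moreover, we find that many CMTs show inflated performance on simple synthesised datasets compared with more realistic datasets containing naturally occurring samples. Altogether, our results show the need for holistic tests of CMTs and for the development of CMTs that can handle multiple context types.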

Title: Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs

Authors: Giovanni Servedio, Alessandro De Bellis, Dario Di Palma, Vito Walter Anelli, Tommaso Di Noia
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16520
Pdf URL: https://arxiv.org/pdf/2505.16520
Copy Paste: [[2505.16520]] Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs(https://arxiv.org/abs/2505.16520)
Keywords: language model, llm, hallucination
Abstract: Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness. However, these studies often rely on synthetic datasets that lack realism, which limits generalization when evaluating the factual accuracy of text generated by the model itself. In this paper, we challenge the findings of previous work by investigating truthfulness encoding capabilities, leading to the generation of a more realistic and challenging dataset. Specifically, we extend previous work by introducing: (1) a strategy for sampling plausible true-false factoid sentences from tabular data and (2) a procedure for generating realistic, LLM-dependent true-false datasets from Question Answering collections. Our analysis of two open-source LLMs reveals that while the findings from previous studies are partially validated, generalization to LLM-generated datasets remains challenging. This study lays the groundwork for future research on factuality in LLMs and offers practical guidelines for more effective evaluation.
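Summary: Factual hallucinations are a major challenge for large language models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when LLMs generate false statements, their internal states encode information about truthfulness. However, these studies often rely on synthetic datasets that lack realism, which limits generalization when evaluating the factual accuracy of text generated by the model itself. In this paper, we challenge the findings of previous work by investigating truthfulness-encoding capabilities, which leads to the generation of a more realistic and challenging dataset. Specifically, we extend previous work by introducing (1) a strategy for sampling plausible true-false factoid sentences from tabular data and (2) a procedure for generating realistic, LLM-dependent true-false datasets from Question Answering collections. Our analysis of two open-source LLMs reveals that, while the findings of previous studies are partially validated, generalization to LLM-generated datasets remains challenging. This study lays the groundwork for future research on factuality in LLMs and offers practical guidelines for more effective evaluation.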

Title: Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing

Authors: Zhouhao Sun, Zhiyuan Kan, Xiao Ding, Li Du, Yang Zhao, Bing Qin, Ting Liu
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16522
Pdf URL: https://arxiv.org/pdf/2505.16522
Copy Paste: [[2505.16522]] Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing(https://arxiv.org/abs/2505.16522)
Keywords: language model, llm
Abstract: Despite significant progress, recent studies have indicated that current large language models (LLMs) may still utilize bias during inference, leading to the poor generalizability of LLMs. Some benchmarks are proposed to investigate the generalizability of LLMs, with each piece of data typically containing one type of controlled bias. However, a single piece of data may contain multiple types of biases in practical applications. To bridge this gap, we propose a multi-bias benchmark where each piece of data contains five types of biases. The evaluations conducted on this benchmark reveal that the performance of existing LLMs and debiasing methods is unsatisfying, highlighting the challenge of eliminating multiple types of biases simultaneously. To overcome this challenge, we propose a causal effect estimation-guided multi-bias elimination method (CMBE). This method first estimates the causal effect of multiple types of biases simultaneously. Subsequently, we eliminate the causal effect of biases from the total causal effect exerted by both the semantic information and biases during inference. Experimental results show that CMBE can effectively eliminate multiple types of bias simultaneously to enhance the generalizability of LLMs.
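Summary: Despite significant progress, recent studies indicate that current large language models (LLMs) may still exploit biases during inference, which leads to poor generalizability. Some benchmarks have been proposed to investigate the generalizability of LLMs, with each piece of data typically containing one type of controlled bias. In practical applications, however, a single piece of data may contain multiple types of bias. To bridge this gap, we propose a multi-bias benchmark in which each piece of data contains five types of bias. Evaluations on this benchmark reveal that the performance of existing LLMs and debiasing methods is unsatisfactory, highlighting the challenge of eliminating multiple types of bias simultaneously. To overcome this challenge, we propose a causal effect estimation-guided multi-bias elimination method (CMBE). The method first estimates the causal effect of multiple types of bias simultaneously. It then removes the causal effect of the biases from the total causal effect exerted by both the semantic information and the biases during inference. Experimental results show that CMBE can effectively eliminate multiple types of bias simultaneously and thereby enhance the generalizability of LLMs.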

Title: EnSToM: Enhancing Dialogue Systems with Entropy-Scaled Steering Vectors for Topic Maintenance

Authors: Heejae Suh, Yejin Jeon, Deokhyung Kang, Taehee Park, Yejin Min, Gary Geunbae Lee
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16526
Pdf URL: https://arxiv.org/pdf/2505.16526
Copy Paste: [[2505.16526]] EnSToM: Enhancing Dialogue Systems with Entropy-Scaled Steering Vectors for Topic Maintenance(https://arxiv.org/abs/2505.16526)
Keywords: language model, llm, chat
Abstract: Small large language models (sLLMs) offer the advantage of being lightweight and efficient, which makes them suitable for resource-constrained environments. However, sLLMs often struggle to maintain topic consistency in task-oriented dialogue systems, which is critical for scenarios such as service chatbots. Specifically, it is important to ensure that the model denies off-topic or malicious inputs and adheres to its intended functionality so as to prevent potential misuse and uphold reliability. Towards this, existing activation engineering approaches have been proposed to manipulate internal activations during inference. While these methods are effective in certain scenarios, our preliminary experiments reveal their limitations in ensuring topic adherence. Therefore, to address this, we propose a novel approach termed Entropy-scaled Steering vectors for Topic Maintenance (EnSToM). EnSToM dynamically adjusts the steering intensity based on input uncertainty, which allows the model to handle off-topic distractors effectively while preserving on-topic accuracy. Our experiments demonstrate that EnSToM achieves significant performance gain with a relatively small data size compared to fine-tuning approaches. By improving topic adherence without compromising efficiency, our approach provides a robust solution for enhancing sLLM-based dialogue systems.
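Summary: Small large language models (sLLMs) have the advantage of being lightweight and efficient, which makes them suitable for resource-constrained environments. However, sLLMs often struggle to maintain topic consistency in task-oriented dialogue systems, which is critical for scenarios such as service chatbots. Specifically, it is important to ensure that the model rejects off-topic or malicious inputs and adheres to its intended functionality, so as to prevent potential misuse and uphold reliability. To this end, existing activation engineering approaches manipulate internal activations during inference. Although these methods are effective in certain scenarios, our preliminary experiments reveal their limitations in ensuring topic adherence. To address this, we propose a novel approach termed Entropy-scaled Steering vectors for Topic Maintenance (EnSToM). EnSToM dynamically adjusts the steering intensity based on input uncertainty, which allows the model to handle off-topic distractors effectively while preserving on-topic accuracy. Our experiments demonstrate that EnSToM achieves significant performance gains with a relatively small data size compared with fine-tuning approaches. By improving topic adherence without compromising efficiency, our approach provides a robust solution for enhancing sLLM-based dialogue systems.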
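
To make the idea of entropy-scaled steering concrete, here is a rough sketch. The abstract says only that steering intensity is adjusted based on input uncertainty; it does not give the functional form or the direction of the scaling. This sketch assumes the simplest possible reading: the steering vector is added to a hidden state with a coefficient proportional to the normalized Shannon entropy of some probability distribution. The function `steer`, the parameter `base_strength`, and all numbers are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def steer(hidden, steering_vector, probs, base_strength=1.0):
    """Add the steering vector, scaled by normalized entropy in [0, 1].

    Assumption: higher uncertainty means stronger steering. The real
    method may use a different (even opposite) mapping.
    """
    max_entropy = math.log(len(probs))  # entropy of the uniform distribution
    scale = base_strength * entropy(probs) / max_entropy
    return [h + scale * v for h, v in zip(hidden, steering_vector)]

hidden = [1.0, 2.0]             # toy 2-dimensional activation
steering_vector = [0.5, -0.5]   # toy "stay on topic" direction

confident = steer(hidden, steering_vector, [1.0, 0.0, 0.0, 0.0])
uncertain = steer(hidden, steering_vector, [0.25, 0.25, 0.25, 0.25])
print(confident, uncertain)
```

With a one-hot distribution the entropy is zero and the activation is left unchanged; with a uniform distribution the full steering vector is applied.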

Title: Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models

Authors: Ercong Nie, Helmut Schmid, Hinrich Schütze
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16538
Pdf URL: https://arxiv.org/pdf/2505.16538
Copy Paste: [[2505.16538]] Mechanistic Understanding and Mitigation of Language Confusion in English-Centric Large Language Models(https://arxiv.org/abs/2505.16538)
Keywords: language model, llm
Abstract: Language confusion -- where large language models (LLMs) generate unintended languages against the user's need -- remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs) -- specific positions where language switches occur -- are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified via comparative analysis with multilingual-tuned models, substantially mitigates confusion without harming general competence or fluency. Our approach matches multilingual alignment in confusion reduction for most languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling.
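Summary: Language confusion, where large language models (LLMs) generate unintended languages against the user's need, remains a critical challenge, especially for English-centric models. We present the first mechanistic interpretability (MI) study of language confusion, combining behavioral benchmarking with neuron-level analysis. Using the Language Confusion Benchmark (LCB), we show that confusion points (CPs), the specific positions where language switches occur, are central to this phenomenon. Through layer-wise analysis with TunedLens and targeted neuron attribution, we reveal that transition failures in the final layers drive confusion. We further demonstrate that editing a small set of critical neurons, identified through comparative analysis with multilingual-tuned models, substantially mitigates confusion without harming general competence or fluency. Our approach matches multilingual alignment in confusion reduction for most languages and yields cleaner, higher-quality outputs. These findings provide new insights into the internal dynamics of LLMs and highlight neuron-level interventions as a promising direction for robust, interpretable multilingual language modeling.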

Title: Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains

Authors: Wenhui Tan, Jiaze Li, Jianzhong Ju, Zhenbo Luo, Jian Luan, Ruihua Song
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16552
Pdf URL: https://arxiv.org/pdf/2505.16552
Copy Paste: [[2505.16552]] Think Silently, Think Fast: Dynamic Latent Compression of LLM Reasoning Chains(https://arxiv.org/abs/2505.16552)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR extends beyond next-token prediction by incorporating an auxiliary next compressed embedding prediction objective. This process merges embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL) that leverages the latent head's non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to: i) perform reasoning at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and ii) dynamically adjust reasoning speed at inference time by simply prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared to explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while dramatically reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.
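Summary: Large language models (LLMs) achieve superior performance through Chain-of-Thought (CoT) reasoning, but these token-level reasoning chains are computationally expensive and inefficient. In this paper, we introduce Compressed Latent Reasoning (CoLaR), a novel framework that dynamically compresses reasoning processes in latent space through a two-stage training approach. First, during supervised fine-tuning, CoLaR goes beyond next-token prediction by incorporating an auxiliary next compressed embedding prediction objective. This process merges the embeddings of consecutive tokens using a compression factor randomly sampled from a predefined range, and trains a specialized latent head to predict the distributions of subsequent compressed embeddings. Second, we enhance CoLaR through reinforcement learning (RL), which leverages the latent head's non-deterministic nature to explore diverse reasoning paths and exploit more compact ones. This approach enables CoLaR to (i) reason at a dense latent level (i.e., silently), substantially reducing reasoning chain length, and (ii) dynamically adjust reasoning speed at inference time simply by prompting the desired compression factor. Extensive experiments across four mathematical reasoning datasets demonstrate that CoLaR achieves 14.1% higher accuracy than latent-based baseline methods at comparable compression ratios, and reduces reasoning chain length by 53.3% with only 4.8% performance degradation compared with the explicit CoT method. Moreover, when applied to more challenging mathematical reasoning tasks, our RL-enhanced CoLaR demonstrates performance gains of up to 5.4% while reducing latent reasoning chain length by 82.8%. The code and models will be released upon acceptance.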
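
The embedding-merging step mentioned in this abstract can be pictured with a short sketch. The abstract states that embeddings of consecutive tokens are merged using a compression factor sampled from a predefined range, but it does not say how they are merged. Mean pooling is used here purely as an assumption, and the function name `merge_embeddings`, the range 1 to 4, and the toy vectors are hypothetical.

```python
import random

def merge_embeddings(embeddings, factor):
    """Mean-pool each run of `factor` consecutive embeddings.

    Assumption: merging is an element-wise mean; the final group may be
    shorter than `factor` when the sequence length is not a multiple of it.
    """
    merged = []
    for start in range(0, len(embeddings), factor):
        group = embeddings[start:start + factor]
        dim = len(group[0])
        merged.append([sum(vec[d] for vec in group) / len(group) for d in range(dim)])
    return merged

# Five toy token embeddings of dimension 2.
tokens = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0]]

compressed = merge_embeddings(tokens, factor=2)
print(compressed)  # [[2.0, 1.0], [6.0, 5.0], [9.0, 8.0]]

# During training the factor would be drawn at random from a fixed range
# (the range 1..4 is an assumed placeholder, not a value from the paper).
rng = random.Random(0)
sampled_factor = rng.randint(1, 4)
print(len(merge_embeddings(tokens, sampled_factor)))
```

A factor of 2 turns five token positions into three latent positions, which is the sense in which a larger factor shortens the reasoning chain.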

Title: ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

Authors: Dongwon Noh, Donghyeok Koh, Junghun Yuk, Gyuwan Kim, Jaeyong Lee, Kyungtae Lim, Cheoneum Park
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16566
Pdf URL: https://arxiv.org/pdf/2505.16566
Copy Paste: [[2505.16566]] ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts(https://arxiv.org/abs/2505.16566)
Keywords: language model, llm
Abstract: Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. ScholarBench targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, ScholarBench evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions that are aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, this benchmark operates as an English-Korean bilingual dataset, facilitating simultaneous evaluation for linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English, with even state-of-the-art models like o3-mini achieving an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.
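Summary: Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. ScholarBench targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, ScholarBench evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions aligned with the characteristic research methodologies and discourse structures of each domain. In addition, the benchmark operates as an English-Korean bilingual dataset, enabling simultaneous evaluation of the linguistic capabilities of LLMs in both languages. The benchmark comprises 5,031 examples in Korean and 5,309 in English, and even state-of-the-art models such as o3-mini achieve an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.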

Title: URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training

Authors: Dongyang Fan, Vinko Sabolčec, Martin Jaggi
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16570
Pdf URL: https://arxiv.org/pdf/2505.16570
Copy Paste: [[2505.16570]] URLs Help, Topics Guide: Understanding Metadata Utility in LLM Training(https://arxiv.org/abs/2505.16570)
Keywords: language model, llm, prompt
Abstract: Large Language Models (LLMs) are commonly pretrained on vast corpora of text without utilizing contextual metadata such as source, quality, or topic, leading to a context-free learning paradigm. While recent studies suggest that adding metadata like URL information as context (i.e., auxiliary inputs not used in the loss calculation) can improve training efficiency and downstream performance, they offer limited understanding of which types of metadata are truly effective and under what conditions. In this work, we conduct a systematic evaluation and find that not all metadata types contribute equally. Only URL context speeds up training, whereas quality scores and topic/format domain information offer no clear benefit. Furthermore, the improved downstream performances of URL conditioning emerge only when longer prompts are used at inference time. In addition, we demonstrate that context-aware pretraining enables more controllable generation than context-free pretraining, in a classifier-free guidance fashion. Although topic and format metadata do not accelerate training, they are effective for steering outputs, offering human-interpretable control over generation.

Title: EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions

Authors: Spencer Hong, Meng Luo, Xinyi Wan
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16576
Pdf URL: https://arxiv.org/pdf/2505.16576
Copy Paste: [[2505.16576]] EMULATE: A Multi-Agent Framework for Determining the Veracity of Atomic Claims by Emulating Human Actions(https://arxiv.org/abs/2505.16576)
Keywords: language model, agent
Abstract: Determining the veracity of atomic claims is an imperative component of many recently proposed fact-checking systems. Many approaches tackle this problem by first retrieving evidence by querying a search engine and then performing classification by providing the evidence set and atomic claim to a large language model, but this process deviates from what a human would do in order to perform the task. Recent work attempted to address this issue by proposing iterative evidence retrieval, allowing for evidence to be collected several times and only when necessary. Continuing along this line of research, we propose a novel claim verification system, called EMULATE, which is designed to better emulate human actions through the use of a multi-agent framework where each agent performs a small part of the larger task, such as ranking search results according to predefined criteria or evaluating webpage content. Extensive experiments on several benchmarks show clear improvements over prior work, demonstrating the efficacy of our new multi-agent framework.

Title: O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering

Authors: Jianbiao Mei, Tao Hu, Daocheng Fu, Licheng Wen, Xuemeng Yang, Rong Wu, Pinlong Cai, Xing Gao, Yu Yang, Chengjun Xie, Botian Shi, Yong Liu, Yu Qiao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16582
Pdf URL: https://arxiv.org/pdf/2505.16582
Copy Paste: [[2505.16582]] O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering(https://arxiv.org/abs/2505.16582)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-end problems. Open-ended questions, which are characterized by lacking a standard answer or providing non-unique and diverse answers, remain underexplored. To bridge this gap, we present O$^2$-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O$^2$-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling the external world knowledge from the model's sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt different answer generation strategies. Furthermore, to evaluate performance on complex open-ended tasks, we construct O$^2$-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O$^2$-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O$^2$-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly-sized models, while performing on par with much larger ones.

Title: Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering

Authors: Bowen Jiang, Runchuan Zhu, Jiang Wu, Zinco Jiang, Yifan He, Junyuan Gao, Jia Yu, Rui Min, Yinfan Wang, Haote Yang, Songyang Zhang, Dahua Lin, Lijun Wu, Conghui He
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16591
Pdf URL: https://arxiv.org/pdf/2505.16591
Copy Paste: [[2505.16591]] Evaluating Large Language Model with Knowledge Oriented Language Specific Simple Question Answering(https://arxiv.org/abs/2505.16591)
Keywords: language model, llm
Abstract: We introduce KoLasSimpleQA, the first benchmark evaluating the multilingual factual ability of Large Language Models (LLMs). Inspired by existing research, we created the question set with features such as single knowledge point coverage, absolute objectivity, unique answers, and temporal stability. These questions enable efficient evaluation using the LLM-as-judge paradigm, testing both the LLMs' factual memory and self-awareness ("know what they don't know"). KoLasSimpleQA expands existing research in two key dimensions: (1) Breadth (Multilingual Coverage): It includes 9 languages, supporting global applicability evaluation. (2) Depth (Dual Domain Design): It covers both the general domain (global facts) and the language-specific domain (such as history, culture, and regional traditions) for a comprehensive assessment of multilingual capabilities. We evaluated mainstream LLMs, including traditional LLMs and emerging Large Reasoning Models. Results show significant performance differences between the two domains, particularly in performance metrics, ranking, calibration, and robustness. This highlights the need for targeted evaluation and optimization in multilingual contexts. We hope KoLasSimpleQA will help the research community better identify LLM capability boundaries in multilingual contexts and provide guidance for model optimization. We will release KoLasSimpleQA at this https URL.

Title: What Media Frames Reveal About Stance: A Dataset and Study about Memes in Climate Change Discourse

Authors: Shijia Zhou, Siyao Peng, Simon Luebke, Jörg Haßler, Mario Haim, Saif M. Mohammad, Barbara Plank
Subjects: cs.CL, cs.MM
Abstract URL: https://arxiv.org/abs/2505.16592
Pdf URL: https://arxiv.org/pdf/2505.16592
Copy Paste: [[2505.16592]] What Media Frames Reveal About Stance: A Dataset and Study about Memes in Climate Change Discourse(https://arxiv.org/abs/2505.16592)
Keywords: llm
Abstract: Media framing refers to the emphasis on specific aspects of perceived reality to shape how an issue is defined and understood. Its primary purpose is to shape public perceptions often in alignment with the authors' opinions and stances. However, the interaction between stance and media frame remains largely unexplored. In this work, we apply an interdisciplinary approach to conceptualize and computationally explore this interaction with internet memes on climate change. We curate CLIMATEMEMES, the first dataset of climate-change memes annotated with both stance and media frames, inspired by research in communication science. CLIMATEMEMES includes 1,184 memes sourced from 47 subreddits, enabling analysis of frame prominence over time and communities, and sheds light on the framing preferences of different stance holders. We propose two meme understanding tasks: stance detection and media frame detection. We evaluate LLaVA-NeXT and Molmo in various setups, and report the corresponding results on their LLM backbone. Human captions consistently enhance performance. Synthetic captions and human-corrected OCR also help occasionally. Our findings highlight that VLMs perform well on stance, but struggle on frames, where LLMs outperform VLMs. Finally, we analyze VLMs' limitations in handling nuanced frames and stance expressions on climate change internet memes.

Title: From Generic Empathy to Personalized Emotional Support: A Self-Evolution Framework for User Preference Alignment

Authors: Jing Ye, Lu Xiang, Yaping Zhang, Chengqing Zong
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16610
Pdf URL: https://arxiv.org/pdf/2505.16610
Copy Paste: [[2505.16610]] From Generic Empathy to Personalized Emotional Support: A Self-Evolution Framework for User Preference Alignment(https://arxiv.org/abs/2505.16610)
Keywords: language model, llm
Abstract: Effective emotional support hinges on understanding users' emotions and needs to provide meaningful comfort during multi-turn interactions. Large Language Models (LLMs) show great potential for expressing empathy; however, they often deliver generic and one-size-fits-all responses that fail to address users' specific needs. To tackle this issue, we propose a self-evolution framework designed to help LLMs improve their responses to better align with users' implicit preferences concerning user profiles (personalities), emotional states, and specific situations. Our framework consists of two distinct phases: (1) Emotional Support Experience Acquisition, where LLMs are fine-tuned on limited emotional support conversation data to provide basic support, and (2) Self-Improvement for Personalized Emotional Support, where LLMs leverage self-reflection and self-refinement to generate personalized responses. Through iterative direct preference optimization between the pre- and post-refined responses, our model generates responses that reflect a better understanding of the user's implicit preferences. Extensive experiments and evaluations demonstrate that our method significantly enhances the model's performance in emotional support, reducing unhelpful responses and minimizing discrepancies between user preferences and model outputs.

Title: Steering Large Language Models for Machine Translation Personalization

Authors: Daniel Scalena, Gabriele Sarti, Arianna Bisazza, Elisabetta Fersini, Malvina Nissim
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16612
Pdf URL: https://arxiv.org/pdf/2505.16612
Copy Paste: [[2505.16612]] Steering Large Language Models for Machine Translation Personalization(https://arxiv.org/abs/2505.16612)
Keywords: language model, llm, prompt
Abstract: High-quality machine translation systems based on large language models (LLMs) have simplified the production of personalized translations reflecting specific stylistic constraints. However, these systems still struggle in settings where stylistic requirements are less explicit and might be harder to convey via prompting. We explore various strategies for personalizing LLM-generated translations in low-resource settings, focusing on the challenging literary translation domain. We explore prompting strategies and inference-time interventions for steering model generations towards a personalized style, and propose a contrastive framework exploiting latent concepts extracted from sparse autoencoders to identify salient personalization properties. Our results show that steering achieves strong personalization while preserving translation quality. We further examine the impact of steering on LLM representations, finding that model layers with a relevant impact for personalization are impacted similarly by multi-shot prompting and our steering method, suggesting a similar mechanism at play.

Title: SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation

Authors: Wenjie Yang, Mao Zheng, Mingyang Song, Zheng Li
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16637
Pdf URL: https://arxiv.org/pdf/2505.16637
Copy Paste: [[2505.16637]] SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation(https://arxiv.org/abs/2505.16637)
Keywords: language model, gpt, llm
Abstract: Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Training with SSR using 13K monolingual examples and Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct in English $\leftrightarrow$ Chinese translation tasks from WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English $\leftrightarrow$ Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro. Our analysis highlights the effectiveness of the self-rewarding mechanism compared to the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings provide valuable insight into the potential of self-improving RL methods. We have publicly released our code, data and models.

Title: Collaboration among Multiple Large Language Models for Medical Question Answering

Authors: Kexin Shang, Chia-Hsuan Chang, Christopher C. Yang
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16648
Pdf URL: https://arxiv.org/pdf/2505.16648
Copy Paste: [[2505.16648]] Collaboration among Multiple Large Language Models for Medical Question Answering(https://arxiv.org/abs/2505.16648)
Keywords: language model, llm
Abstract: Empowered by a vast internal knowledge reservoir, the new generation of large language models (LLMs) demonstrates untapped potential to tackle medical tasks. However, insufficient effort has been made towards drawing a synergistic effect from the expertise and background of multiple LLMs. In this study, we propose a multi-LLM collaboration framework tailored to a medical multiple-choice questions dataset. Through post-hoc analysis on 3 pre-trained LLM participants, our framework is shown to boost the reasoning ability of all LLMs as well as alleviate their divergence among questions. We also measure an LLM's confidence when it is confronted with opposing opinions from other LLMs and observe a concurrence between an LLM's confidence and its prediction accuracy.

Title: A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP

Authors: Issey Sukeda, Takuro Fujii, Kosei Buma, Shunsuke Sasaki, Shinnosuke Ono
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16661
Pdf URL: https://arxiv.org/pdf/2505.16661
Copy Paste: [[2505.16661]] A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP(https://arxiv.org/abs/2505.16661)
Keywords: language model, gpt, llm
Abstract: We present a Japanese domain-specific language model for the pharmaceutical field, developed through continual pretraining on 2 billion Japanese pharmaceutical tokens and 8 billion English biomedical tokens. To enable rigorous evaluation, we introduce three new benchmarks: YakugakuQA, based on national pharmacist licensing exams; NayoseQA, which tests cross-lingual synonym and terminology normalization; and SogoCheck, a novel task designed to assess consistency reasoning between paired statements. We evaluate our model against both open-source medical LLMs and commercial models, including GPT-4o. Results show that our domain-specific model outperforms existing open models and achieves competitive performance with commercial ones, particularly on terminology-heavy and knowledge-based tasks. Interestingly, even GPT-4o performs poorly on SogoCheck, suggesting that cross-sentence consistency reasoning remains an open challenge. Our benchmark suite offers a broader diagnostic lens for pharmaceutical NLP, covering factual recall, lexical variation, and logical consistency. This work demonstrates the feasibility of building practical, secure, and cost-effective language models for Japanese domain-specific applications, and provides reusable evaluation resources for future research in pharmaceutical and healthcare NLP. Our model, codes, and datasets are released at this https URL.

Title: Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence

Authors: Gouki Minegishi, Hiroki Furuta, Shohei Taniguchi, Yusuke Iwasawa, Yutaka Matsuo
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16694
Pdf URL: https://arxiv.org/pdf/2505.16694
Copy Paste: [[2505.16694]] Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence(https://arxiv.org/abs/2505.16694)
Keywords: language model
Abstract: Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through a sudden jump in accuracy, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training. Specifically, we extend the copy task from previous research into an In-Context Meta Learning setting, where models must infer a task from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phase change in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis leads to a deeper understanding of the source of the transformer's ICL ability.

Title: Locate-then-Merge: Neuron-Level Parameter Fusion for Mitigating Catastrophic Forgetting in Multimodal LLMs

Authors: Zeping Yu, Sophia Ananiadou
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16703
Pdf URL: https://arxiv.org/pdf/2505.16703
Copy Paste: [[2505.16703]] Locate-then-Merge: Neuron-Level Parameter Fusion for Mitigating Catastrophic Forgetting in Multimodal LLMs(https://arxiv.org/abs/2505.16703)
Keywords: language model, llm, hallucination
Abstract: Although multimodal large language models (MLLMs) have achieved impressive performance, the multimodal instruction tuning stage often causes catastrophic forgetting of the base LLM's language ability, even in strong models like Llama3. To address this, we propose Locate-then-Merge, a training-free parameter fusion framework that first locates important parameters and then selectively merges them. We further introduce Neuron-Fusion, a neuron-level strategy that preserves the influence of neurons with large parameter shifts--neurons likely responsible for newly acquired visual capabilities--while attenuating the influence of neurons with smaller changes that likely encode general-purpose language skills. This design enables better retention of visual adaptation while mitigating language degradation. Experiments on 13 benchmarks across both language and visual tasks show that Neuron-Fusion consistently outperforms existing model merging methods. Further analysis reveals that our method effectively reduces context hallucination in generation.
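The neuron-level idea described in the abstract (keep the update for neurons whose parameters shifted a lot, attenuate it for the rest) can be sketched in a few lines. This is an illustrative sketch only, not the paper's Neuron-Fusion procedure: the function `fuse_neurons`, the L2-norm shift measure, and the `keep_fraction` and `damp` parameters are assumptions introduced here.

```python
from typing import List

Matrix = List[List[float]]  # one row per neuron (hypothetical layout)


def fuse_neurons(base: Matrix, tuned: Matrix,
                 keep_fraction: float = 0.5, damp: float = 0.0) -> Matrix:
    """Keep the full update for the most-shifted neurons; scale the rest by `damp`."""
    deltas = [[t - b for b, t in zip(b_row, t_row)]
              for b_row, t_row in zip(base, tuned)]
    # Assumed shift measure: L2 norm of each neuron's parameter change.
    shifts = [sum(d * d for d in row) ** 0.5 for row in deltas]
    n_keep = max(1, round(keep_fraction * len(shifts)))
    ranked = sorted(range(len(shifts)), key=lambda i: shifts[i], reverse=True)
    keep = set(ranked[:n_keep])
    merged = []
    for i, (b_row, d_row) in enumerate(zip(base, deltas)):
        scale = 1.0 if i in keep else damp
        merged.append([b + scale * d for b, d in zip(b_row, d_row)])
    return merged


# Toy weights: neuron 0 barely moved during tuning, neuron 1 moved a lot.
base = [[1.0, 1.0], [2.0, 2.0]]
tuned = [[1.1, 1.0], [4.0, 5.0]]
merged = fuse_neurons(base, tuned, keep_fraction=0.5, damp=0.0)
```

With `damp=0.0` the small-shift neuron reverts fully to the base weights, while the large-shift neuron keeps its tuned values; intermediate `damp` values would interpolate between the two.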

Title: Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification

Authors: Himanshu Beniwal, Youngwoo Kim, Maarten Sap, Soham Dan, Thomas Hartvigsen
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16722
Pdf URL: https://arxiv.org/pdf/2505.16722
Copy Paste: [[2505.16722]] Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification(https://arxiv.org/abs/2505.16722)
Keywords: language model, llm
Abstract: As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 504 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at this https URL.

Title: TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning

Authors: Florentin Beck, William Rudman, Carsten Eickhoff
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16743
Pdf URL: https://arxiv.org/pdf/2505.16743
Copy Paste: [[2505.16743]] TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning(https://arxiv.org/abs/2505.16743)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: this https URL
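To make "varying sparsity ratios to individual output dimensions (rows)" concrete, the sketch below prunes each row of a weight matrix to its own ratio. It is a simplified illustration, not TRIM itself: the per-row ratios are supplied by hand rather than found by the iterative, metric-driven adjustment the abstract describes, plain magnitude is used as the pruning criterion as a stand-in, and the names `prune_row` and `prune_rowwise` are hypothetical.

```python
from typing import List


def prune_row(row: List[float], sparsity: float) -> List[float]:
    """Zero out the smallest-magnitude entries of one row."""
    n_prune = int(sparsity * len(row))
    # Indices sorted by ascending magnitude; the first n_prune are dropped.
    order = sorted(range(len(row)), key=lambda j: abs(row[j]))
    drop = set(order[:n_prune])
    return [0.0 if j in drop else w for j, w in enumerate(row)]


def prune_rowwise(weights: List[List[float]],
                  row_sparsities: List[float]) -> List[List[float]]:
    """Apply a separate sparsity ratio to each output row."""
    return [prune_row(r, s) for r, s in zip(weights, row_sparsities)]


W = [[0.1, -2.0, 0.3, 4.0],
     [1.0, -1.5, 0.2, 0.05]]
# Hand-picked ratios averaging 50% sparsity for the layer as a whole.
pruned = prune_rowwise(W, [0.75, 0.25])
```

Here the layer still reaches 50% sparsity overall, but the budget is split unevenly: the first row loses three of its four weights and the second loses only one.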

Title: IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models

Authors: Yiming Gao, Bin Wang, Chengwei Wei, Shuo Sun, AiTi Aw
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16774
Pdf URL: https://arxiv.org/pdf/2505.16774
Copy Paste: [[2505.16774]] IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models(https://arxiv.org/abs/2505.16774)
Keywords: language model, llm
Abstract: Large language models (LLMs) have demonstrated strong instruction-following capabilities in text-based tasks. However, this ability often deteriorates in multimodal models after alignment with non-text modalities such as images or audio. While several recent efforts have investigated instruction-following performance in text and vision-language models, instruction-following in audio-based large language models remains largely unexplored. To bridge this gap, we introduce IFEval-Audio, a novel evaluation dataset designed to assess the ability to follow instructions in an audio LLM. IFEval-Audio contains 280 audio-instruction-answer triples across six diverse dimensions: Content, Capitalization, Symbol, List Structure, Length, and Format. Each example pairs an audio input with a text instruction, requiring the model to generate an output that follows a specified structure. We benchmark state-of-the-art audio LLMs on their ability to follow audio-involved instructions. The dataset is released publicly to support future research in this emerging area.

Title: Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning

Authors: Xinghao Chen, Anhao Zhao, Heming Xia, Xuan Lu, Hanlin Wang, Yanjun Chen, Wei Zhang, Jian Wang, Wenjie Li, Xiaoyu Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16782
Pdf URL: https://arxiv.org/pdf/2505.16782
Copy Paste: [[2505.16782]] Reasoning Beyond Language: A Comprehensive Survey on Latent Chain-of-Thought Reasoning(https://arxiv.org/abs/2505.16782)
Keywords: language model, llm, prompt, chain-of-thought
Abstract: Large Language Models (LLMs) have achieved impressive performance on complex reasoning tasks with Chain-of-Thought (CoT) prompting. However, conventional CoT relies on reasoning steps explicitly verbalized in natural language, introducing inefficiencies and limiting its applicability to abstract reasoning. To address this, there has been growing research interest in latent CoT reasoning, where inference occurs within latent spaces. By decoupling reasoning from language, latent reasoning promises richer cognitive representations and more flexible, faster inference. Researchers have explored various directions in this promising field, including training methodologies, structural innovations, and internal reasoning mechanisms. This paper presents a comprehensive overview and analysis of this reasoning paradigm. We begin by proposing a unified taxonomy from four perspectives: token-wise strategies, internal mechanisms, analysis, and applications. We then provide in-depth discussions and comparative analyses of representative methods, highlighting their design patterns, strengths, and open challenges. We aim to provide a structured foundation for advancing this emerging direction in LLM reasoning. The relevant papers will be regularly updated at this https URL.

Title: Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability

Authors: Punya Syon Pandey, Samuel Simko, Kellin Pelrine, Zhijing Jin
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16789
Pdf URL: https://arxiv.org/pdf/2505.16789
Copy Paste: [[2505.16789]] Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability(https://arxiv.org/abs/2505.16789)
Keywords: language model
Abstract: As large language models gain popularity, their vulnerability to adversarial attacks remains a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Misalignment, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity within our experimental datasets. We then evaluate the adversarial performance of these fine-tuned models and assess how dataset factors correlate with attack success rates. Lastly, we explore potential causal links, offering new insights into adversarial defense strategies and highlighting the crucial role of dataset design in preserving model alignment. Our code is available at this https URL.

Title: Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation

Authors: Changbing Yang, Garrett Nicolai
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16800
Pdf URL: https://arxiv.org/pdf/2505.16800
Copy Paste: [[2505.16800]] Learning Beyond Limits: Multitask Learning and Synthetic Data for Low-Resource Canonical Morpheme Segmentation(https://arxiv.org/abs/2505.16800)
Keywords: language model, llm
Abstract: We introduce a transformer-based morpheme segmentation system that augments a low-resource training signal through multitask learning and LLM-generated synthetic data. Our framework jointly predicts morphological segments and glosses from orthographic input, leveraging shared linguistic representations obtained through a common documentary process to enhance model generalization. To further address data scarcity, we integrate synthetic training data generated by large language models (LLMs) using in-context learning. Experimental results on the SIGMORPHON 2023 dataset show that our approach significantly improves word-level segmentation accuracy and morpheme-level F1-score across multiple low-resource languages.

Title: Two-way Evidence self-Alignment based Dual-Gated Reasoning Enhancement

Authors: Kexin Zhang, Junlan Chen, Daifeng Li, Yuxuan Zhang, Yangyang Feng, Bowen Deng, Weixu Chen
Subjects: cs.CL, cs.IR
Abstract URL: https://arxiv.org/abs/2505.16806
Pdf URL: https://arxiv.org/pdf/2505.16806
Copy Paste: [[2505.16806]] Two-way Evidence self-Alignment based Dual-Gated Reasoning Enhancement(https://arxiv.org/abs/2505.16806)
Keywords: language model, llm
Abstract: Large language models (LLMs) encounter difficulties in knowledge-intensive multi-step reasoning (KIMSR) tasks. One challenge is how to effectively extract and represent rationale evidence. The current methods often extract semantically relevant but logically irrelevant evidence, resulting in flawed reasoning and inaccurate responses. We propose a two-way evidence self-alignment (TW-ESA) module, which utilizes the mutual alignment between strict reasoning and LLM reasoning to enhance its understanding of the causal logic of evidence, thereby addressing the first challenge. Another challenge is how to utilize the rationale evidence and LLM's intrinsic knowledge for accurate reasoning when the evidence contains uncertainty. We propose a dual-gated reasoning enhancement (DGR) module to gradually fuse useful knowledge of LLM within strict reasoning, which can enable the model to perform accurate reasoning by focusing on causal elements in the evidence and exhibit greater robustness. The two modules are collaboratively trained in a unified framework ESA-DGR. Extensive experiments on three diverse and challenging KIMSR datasets reveal that ESA-DGR significantly surpasses state-of-the-art LLM-based fine-tuning methods, with remarkable average improvements of 4% in exact match (EM) and 5% in F1 score. The implementation code is available at this https URL.

Title: Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

Authors: Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Haibo Hu, Minxin Du
Subjects: cs.CL, cs.AI, cs.CR, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16831
Pdf URL: https://arxiv.org/pdf/2505.16831
Copy Paste: [[2505.16831]] Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs(https://arxiv.org/abs/2505.16831)
Keywords: language model, llm
Abstract: Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: this https URL.
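Centered kernel alignment, one of the representation-level measures named in the abstract above, can be illustrated with a minimal sketch. It assumes the linear variant of CKA and uses small hand-written activation matrices; the abstract does not say which variant or which layers the authors use, so the function names and the toy data here are illustrative assumptions only.

```python
import math

def center_columns(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]

def cross_gram_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for row-major matrices a (n x p) and b (n x q)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA between two activation matrices with the same number of rows.

    Assumes neither matrix is constant across rows (otherwise the denominator is zero).
    """
    xc, yc = center_columns(x), center_columns(y)
    numerator = cross_gram_sq_norm(xc, yc)
    denominator = math.sqrt(cross_gram_sq_norm(xc, xc) * cross_gram_sq_norm(yc, yc))
    return numerator / denominator

# Hypothetical activations: 4 inputs, 2 hidden features each.
before = [[1.0, 2.0], [3.0, 5.0], [4.0, 1.0], [0.0, 2.0]]
# A 90-degree rotation of the same features; linear CKA ignores orthogonal transforms.
rotated = [[-b, a] for a, b in before]
# A different representation of the same 4 inputs.
after = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0], [5.0, 2.0]]

same_score = linear_cka(before, rotated)
changed_score = linear_cka(before, after)
```

A score near 1 indicates that the two sets of activations share the same structure up to rotation and scaling, while lower scores indicate that the representation itself has changed.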

Title: SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis

Authors: Shuang Sun, Huatong Song, Yuhao Wang, Ruiyang Ren, Jinhao Jiang, Junjie Zhang, Fei Bai, Jia Deng, Wayne Xin Zhao, Zheng Liu, Lei Fang, Zhongyuan Wang, Ji-Rong Wen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.16834
Pdf URL: https://arxiv.org/pdf/2505.16834
Copy Paste: [[2505.16834]] SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis(https://arxiv.org/abs/2505.16834)
Keywords: language model, llm, retrieval-augmented generation
Abstract: Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they lack high-quality training trajectories, suffer from distributional mismatches in simulated environments, or incur prohibitive computational costs in real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that supervised fine-tuning (SFT) on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at this https URL.

Title: R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search

Authors: Yibo Wang, Li Shen, Huanjin Yao, Tiansheng Huang, Rui Liu, Naiqiang Tan, Jiaxing Huang, Kai Zhang, Dacheng Tao
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16838
Pdf URL: https://arxiv.org/pdf/2505.16838
Copy Paste: [[2505.16838]] R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search(https://arxiv.org/abs/2505.16838)
Keywords: language model, llm, chain-of-thought
Abstract: Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving, yet its extension to Long-CoT introduces substantial computational overhead due to increased token length. Existing compression approaches -- instance-level and token-level -- either sacrifice essential local reasoning signals like reflection or yield incoherent outputs. To address these limitations, we propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence. Our method segments Long-CoT into manageable chunks, applies LLM-driven inner-chunk compression, and employs an inter-chunk search mechanism to select a short and coherent sequence. Experiments on Qwen2.5-Instruct models across MATH500, AIME24, and GPQA-Diamond demonstrate that R1-Compress significantly reduces token usage while maintaining comparable reasoning accuracy. On MATH500, R1-Compress achieves an accuracy of 92.4%, with only a 0.6% drop compared to the Long-CoT baseline, while reducing token usage by about 20%. Source code will be available at this https URL.
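The chunk-then-search idea in the abstract above can be sketched roughly as follows. This is not the authors' algorithm: it assumes candidate compressions for each chunk are already available (standing in for the LLM-driven step), uses a simple greedy search, and takes the coherence scorer as a caller-supplied placeholder. The names `select_compressed_chain`, `coherence`, and `min_coherence` are hypothetical.

```python
from typing import Callable, List

def select_compressed_chain(
    candidates_per_chunk: List[List[str]],
    coherence: Callable[[str, str], float],
    min_coherence: float = 0.5,
) -> List[str]:
    """Greedily pick, for each chunk, the shortest candidate that stays coherent
    with the chunks already chosen (assumed search strategy, for illustration)."""
    chosen: List[str] = []
    for candidates in candidates_per_chunk:
        prefix = " ".join(chosen)
        scored = [(coherence(prefix, c), c) for c in candidates]
        acceptable = [c for score, c in scored if score >= min_coherence]
        if acceptable:
            # Among coherent candidates, prefer the one with the fewest words.
            best = min(acceptable, key=lambda c: len(c.split()))
        else:
            # Fall back to the most coherent candidate when none passes the threshold.
            best = max(scored, key=lambda pair: pair[0])[1]
        chosen.append(best)
    return chosen

# Toy scorer (placeholder): candidates starting with "??" count as incoherent.
def toy_coherence(prefix: str, candidate: str) -> float:
    return 0.0 if candidate.startswith("??") else 1.0

candidates = [
    ["First add two and three to get five.", "2+3=5."],
    ["?? 10", "Then 5*2=10."],
]
compressed = select_compressed_chain(candidates, toy_coherence)
```

In practice the scorer would come from a language model rather than a string check; the sketch only shows how a length objective and a coherence constraint can be combined chunk by chunk.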

Title: Understanding and Analyzing Inappropriately Targeting Language in Online Discourse: A Comparative Annotation Study

Authors: Baran Barbarestani, Isa Maks, Piek Vossen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16847
Pdf URL: https://arxiv.org/pdf/2505.16847
Copy Paste: [[2505.16847]] Understanding and Analyzing Inappropriately Targeting Language in Online Discourse: A Comparative Annotation Study(https://arxiv.org/abs/2505.16847)
Keywords: gpt, chat
Abstract: This paper introduces a method for detecting inappropriately targeting language in online conversations by integrating crowd and expert annotations with ChatGPT. We focus on English conversation threads from Reddit, examining comments that target individuals or groups. Our approach involves a comprehensive annotation framework that labels a diverse data set for various target categories and specific target words within the conversational context. We perform a comparative analysis of annotations from human experts, crowd annotators, and ChatGPT, revealing strengths and limitations of each method in recognizing both explicit hate speech and subtler discriminatory language. Our findings highlight the significant role of contextual factors in identifying hate speech and uncover new categories of targeting, such as social belief and body image. We also address the challenges and subjective judgments involved in annotation and the limitations of ChatGPT in grasping nuanced language. This study provides insights for improving automated content moderation strategies to enhance online safety and inclusivity.

Title: MPO: Multilingual Safety Alignment via Reward Gap Optimization

Authors: Weixiang Zhao, Yulin Hu, Yang Deng, Tongtong Wu, Wenxuan Zhang, Jiahe Guo, An Zhang, Yanyan Zhao, Bing Qin, Tat-Seng Chua, Ting Liu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16869
Pdf URL: https://arxiv.org/pdf/2505.16869
Copy Paste: [[2505.16869]] MPO: Multilingual Safety Alignment via Reward Gap Optimization(https://arxiv.org/abs/2505.16869)
Keywords: language model, llm
Abstract: Large language models (LLMs) have become increasingly central to AI applications worldwide, necessitating robust multilingual safety alignment to ensure secure deployment across diverse linguistic contexts. Existing preference learning methods for safety alignment, such as RLHF and DPO, are primarily monolingual and struggle with noisy multilingual data. To address these limitations, we introduce Multilingual reward gaP Optimization (MPO), a novel approach that leverages the well-aligned safety capabilities of the dominant language (English) to improve safety alignment across multiple languages. MPO directly minimizes the reward gap difference between the dominant language and target languages, effectively transferring safety capabilities while preserving the original strengths of the dominant language. Extensive experiments on three LLMs, LLaMA-3.1, Gemma-2 and Qwen2.5, validate MPO's efficacy in multilingual safety alignment without degrading general multilingual utility.

Title: CASTILLO: Characterizing Response Length Distributions of Large Language Models

Authors: Daniel F. Perez-Ramirez, Dejan Kostic, Magnus Boman
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16881
Pdf URL: https://arxiv.org/pdf/2505.16881
Copy Paste: [[2505.16881]] CASTILLO: Characterizing Response Length Distributions of Large Language Models(https://arxiv.org/abs/2505.16881)
Keywords: language model, llm, prompt
Abstract: Efficiently managing compute resources for Large Language Model (LLM) inference remains challenging due to the inherently stochastic and variable lengths of autoregressive text generation. Accurately estimating response lengths in advance enables proactive resource allocation, yet existing approaches either bias text generation towards certain lengths or rely on assumptions that ignore model- and prompt-specific variability. We introduce CASTILLO, a dataset characterizing response length distributions across 13 widely-used open-source LLMs evaluated on seven distinct instruction-following corpora. For each $\langle$prompt, model$\rangle$ sample pair, we generate 10 independent completions using fixed decoding hyper-parameters, record the token length of each response, and publish summary statistics (mean, std-dev, percentiles), along with the shortest and longest completions, and the exact generation settings. Our analysis reveals significant inter- and intra-model variability in response lengths (even under identical generation settings), as well as model-specific behaviors and occurrences of partial text degeneration in only subsets of responses. CASTILLO enables the development of predictive models for proactive scheduling and provides a systematic framework for analyzing model-specific generation behaviors. We publicly release the dataset and code to foster research at the intersection of generative language modeling and systems.
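The per-sample summary statistics described in the abstract above (mean, standard deviation, percentiles, shortest and longest completions) can be computed along these lines. This is a minimal sketch using Python's standard library; the field names, the choice of quartiles as the percentiles, and the ten example lengths are assumptions for illustration and are not taken from the dataset.

```python
import statistics
from typing import Dict, List

def summarize_lengths(token_lengths: List[int]) -> Dict[str, float]:
    """Summary statistics over the token lengths of repeated completions."""
    ordered = sorted(token_lengths)
    # Quartile cut points; "inclusive" treats the sample as containing its own min and max.
    p25, p50, p75 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "max": ordered[-1],
        "p25": p25,
        "p50": p50,
        "p75": p75,
    }

# Ten hypothetical response lengths for one (prompt, model) pair.
lengths = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
stats = summarize_lengths(lengths)
```

A scheduler could, for example, reserve memory based on an upper percentile rather than the mean when the spread is large.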

Title: Shadows in the Attention: Contextual Perturbation and Representation Drift in the Dynamics of Hallucination in LLMs

Authors: Zeyu Wei, Shuo Wang, Xiaohui Rong, Xuemin Liu, He Li
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16894
Pdf URL: https://arxiv.org/pdf/2505.16894
Copy Paste: [[2505.16894]] Shadows in the Attention: Contextual Perturbation and Representation Drift in the Dynamics of Hallucination in LLMs(https://arxiv.org/abs/2505.16894)
Keywords: language model, llm, hallucination
Abstract: Hallucinations -- plausible yet erroneous outputs -- remain a critical barrier to reliable deployment of large language models (LLMs). We present the first systematic study linking hallucination incidence to internal-state drift induced by incremental context injection. Using TruthfulQA, we construct two 16-round "titration" tracks per question: one appends relevant but partially flawed snippets, the other injects deliberately misleading content. Across six open-source LLMs, we track overt hallucination rates with a tri-perspective detector and covert dynamics via cosine, entropy, JS and Spearman drifts of hidden states and attention maps. Results reveal (1) monotonic growth of hallucination frequency and representation drift that plateaus after 5--7 rounds; (2) relevant context drives deeper semantic assimilation, producing high-confidence "self-consistent" hallucinations, whereas irrelevant context induces topic-drift errors anchored by attention re-routing; and (3) convergence of JS-Drift ($\sim0.69$) and Spearman-Drift ($\sim0$) marks an "attention-locking" threshold beyond which hallucinations solidify and become resistant to correction. Correlation analyses expose a seesaw between assimilation capacity and attention diffusion, clarifying size-dependent error modes. These findings supply empirical foundations for intrinsic hallucination prediction and context-aware mitigation mechanisms.

Title: Power-Law Decay Loss for Large Language Model Finetuning: Focusing on Information Sparsity to Enhance Generation Quality

Authors: Jintian Shao, Hongyi Huang, Jiayi Wu, Beiwen Zhang, ZhiYu Wu, You Shan, MingKai Zheng
Subjects: cs.CL, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16900
Pdf URL: https://arxiv.org/pdf/2505.16900
Copy Paste: [[2505.16900]] Power-Law Decay Loss for Large Language Model Finetuning: Focusing on Information Sparsity to Enhance Generation Quality(https://arxiv.org/abs/2505.16900)
Keywords: language model
Abstract: During the finetuning stage of text generation tasks, standard cross-entropy loss treats all tokens equally. This can lead models to overemphasize high-frequency, low-information tokens, neglecting lower-frequency tokens crucial for specificity and informativeness in generated content. This paper introduces a novel loss function, Power-Law Decay Loss (PDL), specifically designed to optimize the finetuning process for text generation. The core motivation for PDL stems from observations in information theory and linguistics: the informativeness of a token is often inversely proportional to its frequency of occurrence. PDL re-weights the contribution of each token in the standard cross-entropy loss based on its frequency in the training corpus, following a power-law decay. Specifically, the weights for high-frequency tokens are reduced, while low-frequency, information-dense tokens are assigned higher weights. This mechanism guides the model during finetuning to focus more on learning and generating tokens that convey specific and unique information, thereby enhancing the quality, diversity, and informativeness of the generated text. We theoretically elaborate on the motivation and construction of PDL and discuss its potential applications and advantages across various text generation finetuning tasks, such as abstractive summarization, dialogue systems, and style transfer.
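The frequency-based re-weighting described in the abstract above can be sketched as follows. The abstract does not give the exact formula, so the weight form `1 / (freq + eps) ** alpha`, the parameter names `alpha` and `eps`, and the toy corpus are all assumptions chosen only to show a power-law decay in token frequency.

```python
import math
from collections import Counter
from typing import Dict, List

def power_law_weights(corpus_tokens: List[str], alpha: float = 1.0, eps: float = 1.0) -> Dict[str, float]:
    """Map each token to a weight that decays as a power law of its corpus frequency."""
    counts = Counter(corpus_tokens)
    return {tok: 1.0 / (count + eps) ** alpha for tok, count in counts.items()}

def weighted_cross_entropy(target_tokens: List[str], target_probs: List[float], weights: Dict[str, float]) -> float:
    """Average of -w(t) * log p(t) over the target sequence."""
    total = 0.0
    for tok, prob in zip(target_tokens, target_probs):
        total += -weights[tok] * math.log(prob)
    return total / len(target_tokens)

# Toy corpus: "the" is frequent, "quasar" is rare.
weights = power_law_weights(["the", "the", "the", "quasar"])
# The model assigns probability 0.5 to each target token.
loss = weighted_cross_entropy(["the", "quasar"], [0.5, 0.5], weights)
```

With these settings the rare token contributes twice as much to the loss as the frequent one for the same predicted probability, which is the qualitative behavior the abstract describes.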

Title: UNCLE: Uncertainty Expressions in Long-Form Generation

Authors: Ruihan Yang, Caiqi Zhang, Zhisong Zhang, Xinting Huang, Dong Yu, Nigel Collier, Deqing Yang
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16922
Pdf URL: https://arxiv.org/pdf/2505.16922
Copy Paste: [[2505.16922]] UNCLE: Uncertainty Expressions in Long-Form Generation(https://arxiv.org/abs/2505.16922)
Keywords: language model, llm, hallucination, prompt
Abstract: Large Language Models (LLMs) are prone to hallucination, particularly in long-form generations. A promising direction to mitigate hallucination is to teach LLMs to express uncertainty explicitly when they lack sufficient knowledge. However, existing work lacks direct and fair evaluation of LLMs' ability to express uncertainty effectively in long-form generation. To address this gap, we first introduce UNCLE, a benchmark designed to evaluate uncertainty expression in both long- and short-form question answering (QA). UNCLE spans five domains and comprises 4k long-form QA instances and over 20k short-form QA pairs. Our dataset is the first to directly bridge short- and long-form QA with paired questions and gold-standard answers. Along with the benchmark, we propose a suite of new metrics to assess the models' capabilities to selectively express uncertainty. Using UNCLE, we then demonstrate that current models fail to convey uncertainty appropriately in long-form generation. We further explore both prompt-based and training-based methods to improve models' performance, with the training-based methods yielding greater gains. Further analysis of alignment gaps between short- and long-form uncertainty expression highlights promising directions for future research using UNCLE.

Title: Latent Principle Discovery for Language Model Self-Improvement

Authors: Keshav Ramji, Tahira Naseem, Ramón Fernandez Astudillo
Subjects: cs.CL, cs.AI, cs.LG
Abstract URL: https://arxiv.org/abs/2505.16927
Pdf URL: https://arxiv.org/pdf/2505.16927
Copy Paste: [[2505.16927]] Latent Principle Discovery for Language Model Self-Improvement(https://arxiv.org/abs/2505.16927)
Keywords: language model
Abstract: When language model (LM) users aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes guiding model reasoning towards human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. Specifically, we employ an approximation of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We also show that clustering the principles yields interpretable and diverse model-generated constitutions while retaining model performance. The gains our method achieves highlight the potential of automated, principle-driven post-training recipes toward continual self-improvement.

Title: In-Context Watermarks for Large Language Models

Authors: Yepeng Liu, Xuandong Zhao, Christopher Kruegel, Dawn Song, Yuheng Bu
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16934
Pdf URL: https://arxiv.org/pdf/2505.16934
Copy Paste: [[2505.16934]] In-Context Watermarks for Large Language Models(https://arxiv.org/abs/2505.16934)
Keywords: language model, llm, prompt
Abstract: The growing use of large language models (LLMs) for sensitive applications has highlighted the need for effective watermarking techniques to ensure the provenance and accountability of AI-generated text. However, most existing watermarking methods require access to the decoding process, limiting their applicability in real-world settings. One illustrative example is the use of LLMs by dishonest reviewers in the context of academic peer review, where conference organizers have no access to the model used but still need to detect AI-generated reviews. Motivated by this gap, we introduce In-Context Watermarking (ICW), which embeds watermarks into generated text solely through prompt engineering, leveraging LLMs' in-context learning and instruction-following abilities. We investigate four ICW strategies at different levels of granularity, each paired with a tailored detection method. We further examine the Indirect Prompt Injection (IPI) setting as a specific case study, in which watermarking is covertly triggered by modifying input documents such as academic manuscripts. Our experiments validate the feasibility of ICW as a model-agnostic, practical watermarking approach. Moreover, our findings suggest that as LLMs become more capable, ICW offers a promising direction for scalable and accessible content attribution.

Title: On Multilingual Encoder Language Model Compression for Low-Resource Languages

Authors: Daniil Gurgurov, Michal Gregor, Josef van Genabith, Simon Ostermann
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16956
Pdf URL: https://arxiv.org/pdf/2505.16956
Copy Paste: [[2505.16956]] On Multilingual Encoder Language Model Compression for Low-Resource Languages(https://arxiv.org/abs/2505.16956)
Keywords: language model
Abstract: In this paper, we combine two-step knowledge distillation, structured pruning, truncation, and vocabulary trimming for the extreme compression of multilingual encoder-only language models for low-resource languages. Our novel approach systematically combines existing techniques and takes them to the extreme, reducing layer depth, feed-forward hidden size, and intermediate layer embedding size to create significantly smaller monolingual models while retaining essential language-specific knowledge. We achieve compression rates of up to 92% with only a marginal performance drop of 2-10% in four downstream tasks, including sentiment analysis, topic classification, named entity recognition, and part-of-speech tagging, across three low-resource languages. Notably, the performance degradation correlates with the amount of language-specific data in the teacher model, with larger datasets resulting in smaller performance losses. Additionally, we conduct extensive ablation studies to identify best practices for multilingual model compression using these techniques.

Title: VeriFastScore: Speeding up long-form factuality evaluation

Authors: Rishanth Rajendhran, Amir Zadeh, Matthew Sarte, Chuan Li, Mohit Iyyer
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16973
Pdf URL: https://arxiv.org/pdf/2505.16973
Copy Paste: [[2505.16973]] VeriFastScore: Speeding up long-form factuality evaluation(https://arxiv.org/abs/2505.16973)
Keywords: llm, prompt
Abstract: Metrics like FactScore and VeriScore that evaluate long-form factuality operate by decomposing an input response into atomic claims and then individually verifying each claim. While effective and interpretable, these methods incur numerous LLM calls and can take upwards of 100 seconds to evaluate a single response, limiting their practicality in large-scale evaluation and training scenarios. To address this, we propose VeriFastScore, which leverages synthetic data to fine-tune Llama3.1 8B for simultaneously extracting and verifying all verifiable claims within a given text based on evidence from Google Search. We show that this task cannot be solved via few-shot prompting with closed LLMs due to its complexity: the model receives ~4K tokens of evidence on average and needs to concurrently decompose claims, judge their verifiability, and verify them against noisy evidence. However, our fine-tuned VeriFastScore model demonstrates strong correlation with the original VeriScore pipeline at both the example level (r=0.80) and system level (r=0.94) while achieving an overall speedup of 6.6x (9.9x excluding evidence retrieval) over VeriScore. To facilitate future factuality research, we publicly release our VeriFastScore model and synthetic datasets.

Title: LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding

Authors: Junlong Tong, Jinlan Fu, Zixuan Lin, Yingqi Fan, Anhao Zhao, Hui Su, Xiaoyu Shen
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16983
Pdf URL: https://arxiv.org/pdf/2505.16983
Copy Paste: [[2505.16983]] LLM as Effective Streaming Processor: Bridging Streaming-Batch Mismatches with Group Position Encoding(https://arxiv.org/abs/2505.16983)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) are primarily designed for batch processing. Existing methods for adapting LLMs to streaming rely either on expensive re-encoding or on specialized architectures with limited scalability. This work identifies three key mismatches in adapting batch-oriented LLMs to streaming: (1) input-attention, (2) output-attention, and (3) position-ID mismatches. While it is commonly assumed that the latter two mismatches require frequent re-encoding, our analysis reveals that only the input-attention mismatch significantly impacts performance, indicating that re-encoding outputs is largely unnecessary. To better understand this discrepancy with the common assumption, we provide the first comprehensive analysis of the impact of position encoding on LLMs in streaming, showing that preserving relative positions within source and target contexts is more critical than maintaining absolute order. Motivated by the above analysis, we introduce a group position encoding paradigm built on batch architectures to enhance consistency between streaming and batch modes. Extensive experiments on cross-lingual and cross-modal tasks demonstrate that our method outperforms existing approaches. Our method requires no architectural modifications and exhibits strong generalization in both streaming and batch modes. The code is available at this https URL.
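The abstract does not spell out how group position encoding assigns position IDs, so the following is only one possible reading of "preserving relative positions within source and target contexts": each group keeps its own running counter, so the order inside a group is preserved even when source and target tokens arrive interleaved. The function names, the `"src"`/`"tgt"` labels, and the optional per-group start offsets are assumptions for illustration.

```python
def absolute_position_ids(groups):
    """Streaming-order positions: every token takes the next global index."""
    return list(range(len(groups)))


def group_position_ids(groups, start=None):
    """Number tokens independently inside each group (e.g. 'src' and 'tgt').

    groups: sequence of group labels, one per token, in arrival order
    start: optional dict giving the first position id for a group (default 0)
    """
    start = dict(start or {})
    counters = {}
    ids = []
    for label in groups:
        position = counters.get(label, start.get(label, 0))
        ids.append(position)
        counters[label] = position + 1
    return ids


# Source and target tokens interleaved, as they would arrive in a streaming setting.
stream = ["src", "src", "tgt", "src", "tgt", "tgt"]
print(absolute_position_ids(stream))
print(group_position_ids(stream))
print(group_position_ids(stream, start={"tgt": 100}))
```

Under this reading, the third source token sits at position 2 within its group whether or not target tokens were emitted before it, which is the kind of within-group consistency between streaming and batch modes that the abstract emphasizes.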

Title: T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

Authors: Amartya Chakraborty, Paresh Dashore, Nadia Bathaee, Anmol Jain, Anirban Das, Shi-Xiong Zhang, Sambit Sahu, Milind Naphade, Genta Indra Winata
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16986
Pdf URL: https://arxiv.org/pdf/2505.16986
Copy Paste: [[2505.16986]] T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning(https://arxiv.org/abs/2505.16986)
Keywords: language model, llm, agent
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls, particularly in multi-turn conversations, remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning, such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting their ability to plan and reason in complex, tool-dependent scenarios.
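To make the "recompute or reuse cached results" decision concrete, here is a minimal sketch of a tool-call cache keyed by tool name and arguments, with an optional freshness limit. The `ToolCache` class, its `call` method, and the `get_weather` tool are hypothetical; the abstract does not describe how T1's caching mechanism is implemented.

```python
import time


class ToolCache:
    """Reuse a previous tool result when the same call was made recently."""

    def __init__(self, max_age_seconds=None, clock=time.monotonic):
        self.max_age_seconds = max_age_seconds  # None means entries never expire
        self.clock = clock
        self._store = {}

    def call(self, tool_name, tool_fn, **kwargs):
        """Return (result, reused); reused is True when the cache was hit."""
        # Keyword values must be hashable for this simple key scheme.
        key = (tool_name, tuple(sorted(kwargs.items())))
        now = self.clock()
        entry = self._store.get(key)
        if entry is not None:
            result, stored_at = entry
            fresh = (
                self.max_age_seconds is None
                or now - stored_at <= self.max_age_seconds
            )
            if fresh:
                return result, True
        # Cache miss or stale entry: recompute and remember the new result.
        result = tool_fn(**kwargs)
        self._store[key] = (result, now)
        return result, False


def get_weather(city):
    """Stand-in for an expensive external tool call."""
    return f"forecast for {city}"


cache = ToolCache(max_age_seconds=60)
print(cache.call("get_weather", get_weather, city="Lisbon"))
print(cache.call("get_weather", get_weather, city="Lisbon"))
```

Passing the clock in as a parameter keeps the expiry logic easy to test without sleeping.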

Title: MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

Authors: Rui Ye, Keduan Huang, Qimin Wu, Yuzhu Cai, Tian Jin, Xianghe Pang, Xiangrui Liu, Jiaqi Su, Chen Qian, Bohan Tang, Kaiqu Liang, Jiaao Chen, Yue Hu, Zhenfei Yin, Rongye Shi, Bo An, Yang Gao, Wenjun Wu, Lei Bai, Siheng Chen
Subjects: cs.CL, cs.AI, cs.MA
Abstract URL: https://arxiv.org/abs/2505.16988
Pdf URL: https://arxiv.org/pdf/2505.16988
Copy Paste: [[2505.16988]] MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems(https://arxiv.org/abs/2505.16988)
Keywords: llm, agent
Abstract: LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and invite contributions from the broader open-source community.

Title: DecoupledESC: Enhancing Emotional Support Generation via Strategy-Response Decoupled Preference Optimization

Authors: Chao Zhang, Xin Shi, Xueqiao Zhang, Yifan Zhu, Yi Yang, Yawei Luo
Subjects: cs.CL
Abstract URL: https://arxiv.org/abs/2505.16995
Pdf URL: https://arxiv.org/pdf/2505.16995
Copy Paste: [[2505.16995]] DecoupledESC: Enhancing Emotional Support Generation via Strategy-Response Decoupled Preference Optimization(https://arxiv.org/abs/2505.16995)
Keywords: language model, llm
Abstract: Recent advances in Emotional Support Conversation (ESC) have improved emotional support generation by fine-tuning Large Language Models (LLMs) via Supervised Fine-Tuning (SFT). However, common psychological errors still persist. While Direct Preference Optimization (DPO) shows promise in reducing such errors through pairwise preference learning, its effectiveness in ESC tasks is limited by two key challenges: (1) Entangled data structure: Existing ESC data inherently entangles psychological strategies and response content, making it difficult to construct high-quality preference pairs; and (2) Optimization ambiguity: Applying vanilla DPO to such entangled pairwise data leads to ambiguous training objectives. To address these issues, we introduce Inferential Preference Mining (IPM) to construct high-quality preference data, forming the IPM-PrefDial dataset. Building upon this data, we propose a Decoupled ESC framework inspired by Gross's Extended Process Model of Emotion Regulation, which decomposes the ESC task into two sequential subtasks: strategy planning and empathic response generation. Each was trained via SFT and subsequently enhanced by DPO to align with the psychological preference. Extensive experiments demonstrate that our Decoupled ESC framework outperforms joint optimization baselines, reducing preference bias and improving response quality.
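For readers unfamiliar with the pairwise objective mentioned above, the sketch below computes the standard (vanilla) DPO loss for a single preference pair from sequence log-probabilities. It illustrates the general DPO formulation only, not the decoupled strategy/response training described in the abstract; the function name, argument names, and the default `beta` value are assumptions.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Vanilla DPO loss for one (chosen, rejected) pair.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # loss = -log(sigmoid(margin)), written in a numerically stable form
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is zero and the loss is log(2).
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
# Raising the chosen response's likelihood relative to the reference lowers the loss.
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

The loss depends only on how much the policy shifts probability toward the chosen response relative to the reference model, so the quality of the preference pairs directly shapes the training signal.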

Title: Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?

Authors: Jin Jiang, Jianing Wang, Yuchen Yan, Yang Liu, Jianhua Zhu, Mengdi Zhang, Xunliang Cai, Liangcai Gao
Subjects: cs.CL, cs.AI
Abstract URL: https://arxiv.org/abs/2505.16998
Pdf URL: https://arxiv.org/pdf/2505.16998
Copy Paste: [[2505.16998]] Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?(https://arxiv.org/abs/2505.16998)
Keywords: language model, llm
Abstract: Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. From the perspective of three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data with PoT format achieves the best generalization performance across other languages. Additionally, we curate the formal-relative training data to further enhance small language models, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our code and reports are available at this https URL.

Title: R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning

Authors: Huatong Song, Jinhao Jiang, Wenqing Tian, Zhipeng Chen, Yuhuan Wu, Jiahao Zhao, Yingqian Min, Wayne Xin Zhao, Lei Fang, Ji-Rong Wen
Subjects: cs.CL, cs.AI, cs.IR
Abstract URL: https://arxiv.org/abs/2505.17005
Pdf URL: https://arxiv.org/pdf/2505.17005
Copy Paste: [[2505.17005]] R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning(https://arxiv.org/abs/2505.17005)
Keywords: language model, llm, hallucination, retrieval-augmented generation
Abstract: Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods are often costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT Cold-start phase for preliminary format learning, followed by RL for Dynamic Knowledge Acquisition. The RL stage uses outcome supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and an external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at this https URL.